[PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
@ 2025-06-02 18:40 Yeoreum Yun
  2025-06-03  2:01 ` David Wang
  2025-06-03 14:00 ` Leo Yan
  0 siblings, 2 replies; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-02 18:40 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, Yeoreum Yun, David Wang

commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
changes the event->state update before list_del_event().
This change prevents calling perf_cgroup_event_disable() as a result,
cpuctx->cgrp can't be cleared properly and point to dangling point of cgroup.

Because of this problem, some machin meets the below panic[0]:

863.881960] sysved_call_function_sing le+0x4c/0xc0
863.881301] asm_sysvec_call_function_single+0x16/0x20
869.881344] RIP: 0633:0x7f9alcea3367
663.681373] Code: 00 66 99 b8 ff ff ff ff c3 66 ....
863.881524] RSP: 002b:00007fffa526fcf8 EFLAGS: 00000246
869.881567] RAX: 0000562060c962d0 RBX: 0000000000000002 RCX: 00007f9a1cff1c60
863.881625] RDX: 00007f9a0c000030 RSI: 00007f9alcff1c60 RDI: 00007f9a1ca91c20
863.081682] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f9a1d6217a0
869.881740] R10: 00007f9alca91c10 R11: 0000000000000246 R12: 00007f9a1d70c020
869.881798] R13: 00007fffa5270030 R14: 00007fffa526fd00 R15: 0000000000000000
863.881860] </TASK>
863.881876) Modules linked in: snd_seq_dummy (E) snd_hrtimer (E)...
...
863.887142] button (E)
863.912127] CR2: ffffe4afcc079650
863.914593] --- [ end trace 0000000000000000 1--
864.042750] RIP: 0010:ctx_sched_out+0x1ce/0x210
864.045214] Code: 89 c6 4c 8b b9 de 00 00 00 48 ...
864.050343] RSP: 0000:ffffaa4ec0f3fe60 EFLAGS: 00010086
864.052929] RAX: 0000000000000002 RBX: ffff8e8eeed2a580 RCX: ffff8e8bded9bf00
864.055518] RDX: 000000c92340b051 RSI: 000000c92340b051 RDI: ffff
864.058093] RBP: 0000000000000000 R08: 0000000000000002 R09: 00
864.060654] R10: 0000000000000000 R11: 0000000000000000 R12: 000
864.063183] R13: ffff8e8eeed2a580 R14: 0000000000000007 R15: ffffe4afcc079650
864.065729] FS: 00007f9a1ca91940 (0000) GS:ffff8e8f6b1c3000(0000) knIGS:0000000000000000
864.068312] CS: 0010 DS: 0000 ES: 0000 CRO: 0000000080050033
864.070898] CR2: ffffe4afcc079650 CR3: 00000001136d8000 CR4: 0000000000350ef0
864.673523] Kernel panic - not syncing: Fatal exception in interrupt
864.076410] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xff
864.205401] --- [ end Kernel panic - not syncing: Fatal exception in interrupt ]---

To address this call the perf_cgroup_event_disable() properly before
list_del_event() in __perf_remove_from_context().

Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Tested-by: David Wang <00107082@163.com>
---
 kernel/events/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f34c99f8ce8f..909b9d5a65c1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2498,6 +2498,10 @@ __perf_remove_from_context(struct perf_event *event,
 		state = PERF_EVENT_STATE_DEAD;
 	}
 	event_sched_out(event, ctx);
+
+	if (event->state > PERF_EVENT_STATE_OFF)
+		perf_cgroup_event_disable(event, ctx);
+
 	perf_event_set_state(event, min(event->state, state));
 
 	if (flags & DETACH_GROUP)
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re:[PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-02 18:40 [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Yeoreum Yun
@ 2025-06-03  2:01 ` David Wang
  2025-06-03  4:46   ` [PATCH " Yeoreum Yun
  2025-06-03 14:00 ` Leo Yan
  1 sibling, 1 reply; 37+ messages in thread
From: David Wang @ 2025-06-03  2:01 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 02:40:49, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>changes the event->state update before list_del_event().
>This change prevents calling perf_cgroup_event_disable() as a result,
>cpuctx->cgrp can't be cleared properly and point to dangling point of cgroup.
>
>Because of this problem, some machin meets the below panic[0]:
>
>863.881960] sysved_call_function_sing le+0x4c/0xc0
>863.881301] asm_sysvec_call_function_single+0x16/0x20
>869.881344] RIP: 0633:0x7f9alcea3367
>663.681373] Code: 00 66 99 b8 ff ff ff ff c3 66 ....
>863.881524] RSP: 002b:00007fffa526fcf8 EFLAGS: 00000246
>869.881567] RAX: 0000562060c962d0 RBX: 0000000000000002 RCX: 00007f9a1cff1c60
>863.881625] RDX: 00007f9a0c000030 RSI: 00007f9alcff1c60 RDI: 00007f9a1ca91c20
>863.081682] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f9a1d6217a0
>869.881740] R10: 00007f9alca91c10 R11: 0000000000000246 R12: 00007f9a1d70c020
>869.881798] R13: 00007fffa5270030 R14: 00007fffa526fd00 R15: 0000000000000000
>863.881860] </TASK>
>863.881876) Modules linked in: snd_seq_dummy (E) snd_hrtimer (E)...
>...
>863.887142] button (E)
>863.912127] CR2: ffffe4afcc079650
>863.914593] --- [ end trace 0000000000000000 1--
>864.042750] RIP: 0010:ctx_sched_out+0x1ce/0x210
>864.045214] Code: 89 c6 4c 8b b9 de 00 00 00 48 ...
>864.050343] RSP: 0000:ffffaa4ec0f3fe60 EFLAGS: 00010086
>864.052929] RAX: 0000000000000002 RBX: ffff8e8eeed2a580 RCX: ffff8e8bded9bf00
>864.055518] RDX: 000000c92340b051 RSI: 000000c92340b051 RDI: ffff
>864.058093] RBP: 0000000000000000 R08: 0000000000000002 R09: 00
>864.060654] R10: 0000000000000000 R11: 0000000000000000 R12: 000
>864.063183] R13: ffff8e8eeed2a580 R14: 0000000000000007 R15: ffffe4afcc079650
>864.065729] FS: 00007f9a1ca91940 (0000) GS:ffff8e8f6b1c3000(0000) knIGS:0000000000000000
>864.068312] CS: 0010 DS: 0000 ES: 0000 CRO: 0000000080050033
>864.070898] CR2: ffffe4afcc079650 CR3: 00000001136d8000 CR4: 0000000000350ef0
>864.673523] Kernel panic - not syncing: Fatal exception in interrupt
>864.076410] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xff
>864.205401] --- [ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
>To address this call the perf_cgroup_event_disable() properly before
>list_del_event() in __perf_remove_from_context().
>
>Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
>Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>Tested-by: David Wang <00107082@163.com>
>---
> kernel/events/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
>diff --git a/kernel/events/core.c b/kernel/events/core.c
>index f34c99f8ce8f..909b9d5a65c1 100644
>--- a/kernel/events/core.c
>+++ b/kernel/events/core.c
>@@ -2498,6 +2498,10 @@ __perf_remove_from_context(struct perf_event *event,
> 		state = PERF_EVENT_STATE_DEAD;
> 	}
> 	event_sched_out(event, ctx);
>+
>+	if (event->state > PERF_EVENT_STATE_OFF)
>+		perf_cgroup_event_disable(event, ctx);
>+
> 	perf_event_set_state(event, min(event->state, state));
> 
> 	if (flags & DETACH_GROUP)
>-- 
>LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}

I think this patch is no better than my patch in the original report
https://lore.kernel.org/all/20250601173603.3920-1-00107082@163.com/


This patch is more aggressive,  it add more changes to original logic, same practice 
as in the offending commit.  would raise more concerns about hidden side-effect.

For example, this code  in list_del_event should raise concern about this patch
 2099          * We can have double detach due to exit/hot-unplug + close.
 2100          */
 2101         if (!(event->attach_state & PERF_ATTACH_CONTEXT))
 2102                 return;


Thanks
David

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  2:01 ` David Wang
@ 2025-06-03  4:46   ` Yeoreum Yun
  2025-06-03  5:44     ` David Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03  4:46 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

Hi Davia,

> I think this patch is no better than my patch in the original report
> https://lore.kernel.org/all/20250601173603.3920-1-00107082@163.com/
>
> This patch is more aggressive,  it add more changes to original logic, same practice
> as in the offending commit.  would raise more concerns about hidden side-effect.
>
> For example, this code  in list_del_event should raise concern about this patch
> 2099          * We can have double detach due to exit/hot-unplug + close.
>  2100          */
>  2101         if (!(event->attach_state & PERF_ATTACH_CONTEXT))
>  2102                 return;

attach_state doesn't related for event->state change.
if one event already cleared PERF_ATTACH_CONTEXT, that event is called
via list_del_event()

Also, your patch couldn't solve a problem describe in
commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
for INCATIVE event's total_enable_time.

Thanks.

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  4:46   ` [PATCH " Yeoreum Yun
@ 2025-06-03  5:44     ` David Wang
  2025-06-03  6:34       ` Yeoreum Yun
  0 siblings, 1 reply; 37+ messages in thread
From: David Wang @ 2025-06-03  5:44 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel

At 2025-06-03 12:46:03, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi Davia,
>
>> I think this patch is no better than my patch in the original report
>> https://lore.kernel.org/all/20250601173603.3920-1-00107082@163.com/
>>
>> This patch is more aggressive,  it add more changes to original logic, same practice
>> as in the offending commit.  would raise more concerns about hidden side-effect.
>>
>> For example, this code  in list_del_event should raise concern about this patch
>> 2099          * We can have double detach due to exit/hot-unplug + close.
>>  2100          */
>>  2101         if (!(event->attach_state & PERF_ATTACH_CONTEXT))
>>  2102                 return;
>
>attach_state doesn't related for event->state change.
>if one event already cleared PERF_ATTACH_CONTEXT, that event is called
>via list_del_event()

Maybe this concern could be clarified, what about other subtle impacts.
The change should be thorough reviewed, if you want to push it further.

It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
It is  just very hard to capture a bug happens in rare situation.

And your patch has a global impact, it changes behavior unnecessarily.

>
>Also, your patch couldn't solve a problem describe in
>commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>for INCATIVE event's total_enable_time.

I do not think so.
Correct me if I am making silly  mistakes,
The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?

David

>
>Thanks.
>
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  5:44     ` David Wang
@ 2025-06-03  6:34       ` Yeoreum Yun
  2025-06-03  6:39         ` Yeoreum Yun
                           ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03  6:34 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

Hi David,
> > attach_state doesn't related for event->state change.
> > if one event already cleared PERF_ATTACH_CONTEXT, that event is called
> > via list_del_event()
>
> Maybe this concern could be clarified, what about other subtle impacts.
> The change should be thorough reviewed, if you want to push it further.
>
> It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
> It is  just very hard to capture a bug happens in rare situation.
>
> And your patch has a global impact, it changes behavior unnecessarily.

TBH, this patch just change of time of "event->state" while doing,
As my bad miss the disable cgorup perf.
I think there seems no other side effect for chaning state while in
removing event.
But, Let's wait for other people's review.

> >
> > Also, your patch couldn't solve a problem describe in
> > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > for INCATIVE event's total_enable_time.
>
> I do not think so.
> Correct me if I am making silly  mistakes,
> The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
> calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
> If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
> I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
> Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?

Sorry for my bad explaination without detail.
Think about cpu specific event and closed by task.
If there is specific child cpu event specified in cpu 0.
  1. cpu 0 -> active
  2. scheulded to cpu1 -> inactive
  3. close the cpu event from parent -> inactive close

Can be failed to count total_enable_time.

Thanks.
--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:34       ` Yeoreum Yun
@ 2025-06-03  6:39         ` Yeoreum Yun
  2025-06-03  6:47           ` David Wang
  2025-06-03  6:42         ` David Wang
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03  6:39 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

> > > attach_state doesn't related for event->state change.
> > > if one event already cleared PERF_ATTACH_CONTEXT, that event is called
> > > via list_del_event()
> >
> > Maybe this concern could be clarified, what about other subtle impacts.
> > The change should be thorough reviewed, if you want to push it further.
> >
> > It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
> > It is  just very hard to capture a bug happens in rare situation.
> >
> > And your patch has a global impact, it changes behavior unnecessarily.
>
> TBH, this patch just change of time of "event->state" while doing,
> As my bad miss the disable cgorup perf.
> I think there seems no other side effect for chaning state while in
> removing event.
> But, Let's wait for other people's review.
>
> > >
> > > Also, your patch couldn't solve a problem describe in
> > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > > for INCATIVE event's total_enable_time.
> >
> > I do not think so.
> > Correct me if I am making silly  mistakes,
> > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
> > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
> > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
> > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
> > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>
> Sorry for my bad explaination without detail.
> Think about cpu specific event and closed by task.
> If there is specific child cpu event specified in cpu 0.
>   1. cpu 0 -> active
>   2. scheulded to cpu1 -> inactive
>   3. close the cpu event from parent -> inactive close
>
> Can be failed to count total_enable_time.
>
> Thanks.

And also, considering the your patch, for DETACH_EXIT case,
If it changes the state before list_del_event() that wouldn't disable
related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
as patch describe...

> --
> Sincerely,
> Yeoreum Yun

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:34       ` Yeoreum Yun
  2025-06-03  6:39         ` Yeoreum Yun
@ 2025-06-03  6:42         ` David Wang
  2025-06-03  7:16           ` Yeoreum Yun
  2025-06-03  6:54         ` David Wang
  2025-06-03  9:20         ` Yeoreum Yun
  3 siblings, 1 reply; 37+ messages in thread
From: David Wang @ 2025-06-03  6:42 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 14:34:59, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>> > attach_state doesn't related for event->state change.
>> > if one event already cleared PERF_ATTACH_CONTEXT, that event is called
>> > via list_del_event()
>>
>> Maybe this concern could be clarified, what about other subtle impacts.
>> The change should be thorough reviewed, if you want to push it further.
>>
>> It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
>> It is  just very hard to capture a bug happens in rare situation.
>>
>> And your patch has a global impact, it changes behavior unnecessarily.
>
>TBH, this patch just change of time of "event->state" while doing,
>As my bad miss the disable cgorup perf.
>I think there seems no other side effect for chaning state while in
>removing event.
>But, Let's wait for other people's review.
>
>> >
>> > Also, your patch couldn't solve a problem describe in
>> > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>> > for INCATIVE event's total_enable_time.
>>
>> I do not think so.
>> Correct me if I am making silly  mistakes,
>> The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
>> calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
>> If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
>> I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
>> Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>
>Sorry for my bad explaination without detail.
>Think about cpu specific event and closed by task.
>If there is specific child cpu event specified in cpu 0.
>  1. cpu 0 -> active
>  2. scheulded to cpu1 -> inactive
>  3. close the cpu event from parent -> inactive close
>
>Can be failed to count total_enable_time.

Is this explaining the purpose of commit a3c3c6667 ?
I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)

>
>Thanks.
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:39         ` Yeoreum Yun
@ 2025-06-03  6:47           ` David Wang
  0 siblings, 0 replies; 37+ messages in thread
From: David Wang @ 2025-06-03  6:47 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 14:39:33, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>> > > attach_state doesn't related for event->state change.
>> > > if one event already cleared PERF_ATTACH_CONTEXT, that event is called
>> > > via list_del_event()
>> >
>> > Maybe this concern could be clarified, what about other subtle impacts.
>> > The change should be thorough reviewed, if you want to push it further.
>> >
>> > It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
>> > It is  just very hard to capture a bug happens in rare situation.
>> >
>> > And your patch has a global impact, it changes behavior unnecessarily.
>>
>> TBH, this patch just change of time of "event->state" while doing,
>> As my bad miss the disable cgorup perf.
>> I think there seems no other side effect for chaning state while in
>> removing event.
>> But, Let's wait for other people's review.
>>
>> > >
>> > > Also, your patch couldn't solve a problem describe in
>> > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>> > > for INCATIVE event's total_enable_time.
>> >
>> > I do not think so.
>> > Correct me if I am making silly  mistakes,
>> > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
>> > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
>> > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
>> > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
>> > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>>
>> Sorry for my bad explaination without detail.
>> Think about cpu specific event and closed by task.
>> If there is specific child cpu event specified in cpu 0.
>>   1. cpu 0 -> active
>>   2. scheulded to cpu1 -> inactive
>>   3. close the cpu event from parent -> inactive close
>>
>> Can be failed to count total_enable_time.
>>
>> Thanks.
>
>And also, considering the your patch, for DETACH_EXIT case,
>If it changes the state before list_del_event() that wouldn't disable
>related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
>as patch describe...

No, I don't think so.
change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
And no such kernel panic happened  before commit a3c3c6667.

>
>> --
>> Sincerely,
>> Yeoreum Yun
>
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:34       ` Yeoreum Yun
  2025-06-03  6:39         ` Yeoreum Yun
  2025-06-03  6:42         ` David Wang
@ 2025-06-03  6:54         ` David Wang
  2025-06-03  9:20         ` Yeoreum Yun
  3 siblings, 0 replies; 37+ messages in thread
From: David Wang @ 2025-06-03  6:54 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 14:34:59, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>> > attach_state doesn't related for event->state change.
>> > if one event already cleared PERF_ATTACH_CONTEXT, that event is called
>> > via list_del_event()
>>
>> Maybe this concern could be clarified, what about other subtle impacts.
>> The change should be thorough reviewed, if you want to push it further.
>>
>> It takes me more than a month to figure out a procedure to reproduce the kernel panic bug,
>> It is  just very hard to capture a bug happens in rare situation.
>>
>> And your patch has a global impact, it changes behavior unnecessarily.
>
>TBH, this patch just change of time of "event->state" while doing,
>As my bad miss the disable cgorup perf.
>I think there seems no other side effect for chaning state while in
>removing event.
>But, Let's wait for other people's review.
>

The *risk* is unnecessary, we can confine the scope of DETACH_EXIT.
Why keep making global changes when confined changes are available?


David

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:42         ` David Wang
@ 2025-06-03  7:16           ` Yeoreum Yun
  2025-06-03  7:31             ` David Wang
  2025-06-03  8:15             ` David Wang
  0 siblings, 2 replies; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03  7:16 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

Hi David,

> >
> > >
> > > Also, your patch couldn't solve a problem describe in
> > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > > for INCATIVE event's total_enable_time.
> >
> > I do not think so.
> > Correct me if I am making silly  mistakes,
> > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
> > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
> > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
> > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
> > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
> >
> > Sorry for my bad explaination without detail.
> > Think about cpu specific event and closed by task.
> > If there is specific child cpu event specified in cpu 0.
> > 1. cpu 0 -> active
> > 2. scheulded to cpu1 -> inactive
> > 3. close the cpu event from parent -> inactive close
> >
> > Can be failed to count total_enable_time.
>
> Is this explaining the purpose of commit a3c3c6667 ?
> I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)

In commit a3c3c6667, I explain the specific case but not with above
case. But the commit's purpose is "account total_enable_time" properly.

> > And also, considering the your patch, for DETACH_EXIT case,
> > If it changes the state before list_del_event() that wouldn't disable
> > related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
> > as patch describe...
> No, I don't think so.
> change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
> And no such kernel panic happened  before commit a3c3c6667.

That's why list_del_event() handle the perf_cgroup_disable() before the
commit a3c3c6667. However because of *my mistake*, I've forget to
perf_cgroup_disable() properly before change the event state.
Yes, your patch can make avoid the panic since as soon as exit,
the event->cgrp switched.

However, as I said, the INACTIVE event could be failed to count
total_enable_time.

So, set event should be occured before list_del_event().
And since it's event->state change on remove.
It shouldn't have any side effect the state change isn't cause of your
panic. But missed perf_cgroup_disable().

Thanks.
--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  7:16           ` Yeoreum Yun
@ 2025-06-03  7:31             ` David Wang
  2025-06-03  8:15             ` David Wang
  1 sibling, 0 replies; 37+ messages in thread
From: David Wang @ 2025-06-03  7:31 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 15:16:36, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>
>> >
>> > >
>> > > Also, your patch couldn't solve a problem describe in
>> > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>> > > for INCATIVE event's total_enable_time.
>> >
>> > I do not think so.
>> > Correct me if I am making silly  mistakes,
>> > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
>> > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
>> > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
>> > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
>> > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>> >
>> > Sorry for my bad explaination without detail.
>> > Think about cpu specific event and closed by task.
>> > If there is specific child cpu event specified in cpu 0.
>> > 1. cpu 0 -> active
>> > 2. scheulded to cpu1 -> inactive
>> > 3. close the cpu event from parent -> inactive close
>> >
>> > Can be failed to count total_enable_time.
>>
>> Is this explaining the purpose of commit a3c3c6667 ?
>> I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)
>
>In commit a3c3c6667, I explain the specific case but not with above
>case. But the commit's purpose is "account total_enable_time" properly.
>
>> > And also, considering the your patch, for DETACH_EXIT case,
>> > If it changes the state before list_del_event() that wouldn't disable
>> > related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
>> > as patch describe...
>> No, I don't think so.
>> change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
>> And no such kernel panic happened  before commit a3c3c6667.
>
>That's why list_del_event() handle the perf_cgroup_disable() before the
>commit a3c3c6667. However because of *my mistake*, I've forget to
>perf_cgroup_disable() properly before change the event state.
>Yes, your patch can make avoid the panic since as soon as exit,
>the event->cgrp switched.
>
>However, as I said, the INACTIVE event could be failed to count
>total_enable_time.

This time, you mean my patch, right?
Still have trouble understand this. My patch dose not change the logic along  DETACH_EXIT at all.
Could you elaborate on it more? What change my patch made to break commit  a3c3c6667?
(Hope you are not still talking the purpose of commit a3c3c6667)

>
>So, set event should be occured before list_del_event().
>And since it's event->state change on remove.
>It shouldn't have any side effect the state change isn't cause of your
>panic. But missed perf_cgroup_disable().

I may understand it wrong, but I think the kernel panic is caused by missing 
perf_cgroup_event_disable(), 
when reboot/shutdown, kernel were waiting cgroup ctx reference drop to 0, for ever, and hence the panic.




>
>Thanks.
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  7:16           ` Yeoreum Yun
  2025-06-03  7:31             ` David Wang
@ 2025-06-03  8:15             ` David Wang
  1 sibling, 0 replies; 37+ messages in thread
From: David Wang @ 2025-06-03  8:15 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 15:16:36, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>
>> >
>> > >
>> > > Also, your patch couldn't solve a problem describe in
>> > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>> > > for INCATIVE event's total_enable_time.
>> >
>> > I do not think so.
>> > Correct me if I am making silly  mistakes,
>> > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
>> > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
>> > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
>> > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
>> > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>> >
>> > Sorry for my bad explaination without detail.
>> > Think about cpu specific event and closed by task.
>> > If there is specific child cpu event specified in cpu 0.
>> > 1. cpu 0 -> active
>> > 2. scheulded to cpu1 -> inactive
>> > 3. close the cpu event from parent -> inactive close
>> >
>> > Can be failed to count total_enable_time.
>>
>> Is this explaining the purpose of commit a3c3c6667 ?
>> I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)
>
>In commit a3c3c6667, I explain the specific case but not with above
>case. But the commit's purpose is "account total_enable_time" properly.
>
>> > And also, considering the your patch, for DETACH_EXIT case,
>> > If it changes the state before list_del_event() that wouldn't disable
>> > related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
>> > as patch describe...
>> No, I don't think so.
>> change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
>> And no such kernel panic happened  before commit a3c3c6667.

Oh! I was wrong, before commit a3c3c6667, "change state" happened *after* list_del_event()
>
>That's why list_del_event() handle the perf_cgroup_disable() before the
>commit a3c3c6667. However because of *my mistake*, I've forget to
>perf_cgroup_disable() properly before change the event state.
>Yes, your patch can make avoid the panic since as soon as exit,
>the event->cgrp switched.

 I cannot agree with the reasoning, 
The panic dose not happened when exit, it happened when reboot/shutdown.
(I close perf_event_open before reboot)





>
>However, as I said, the INACTIVE event could be failed to count
>total_enable_time.
>
>So, set event should be occured before list_del_event().
>And since it's event->state change on remove.
>It shouldn't have any side effect the state change isn't cause of your
>panic. But missed perf_cgroup_disable().

Any procedure to bring out the impact of this missed perf_cgroup_disable()?
My system seems all normal, where should I check it?

But to fix it,  isn't following change less aggressive?
        event_sched_out(event, ctx);
-       perf_event_set_state(event, min(event->state, state));
        if (flags & DETACH_GROUP)
                perf_group_detach(event);
        if (flags & DETACH_CHILD)
                perf_child_detach(event);
        list_del_event(event, ctx);
+       perf_event_set_state(event, min(event->state, state));
 

David


>
>Thanks.
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  6:34       ` Yeoreum Yun
                           ` (2 preceding siblings ...)
  2025-06-03  6:54         ` David Wang
@ 2025-06-03  9:20         ` Yeoreum Yun
  2025-06-03 10:08           ` David Wang
  3 siblings, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03  9:20 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

Hi David,

> > > > >
> > > > > Also, your patch couldn't solve a problem describe in
> > > > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > > > > for INCATIVE event's total_enable_time.
> > > >
> > > > I do not think so.
> > > > Correct me if I am making silly  mistakes,
> > > > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
> > > > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
> > > > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
> > > > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
> > > > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
> > > >
> > > > Sorry for my bad explaination without detail.
> > > > Think about cpu specific event and closed by task.
> > > > If there is specific child cpu event specified in cpu 0.
> > > > 1. cpu 0 -> active
> > > > 2. scheulded to cpu1 -> inactive
> > > > 3. close the cpu event from parent -> inactive close
> > > >
> > > > Can be failed to count total_enable_time.
> > >
> > > Is this explaining the purpose of commit a3c3c6667 ?
> > > I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)
> >
> > In commit a3c3c6667, I explain the specific case but not with above
> > case. But the commit's purpose is "account total_enable_time" properly.
> >
> > > > And also, considering the your patch, for DETACH_EXIT case,
> > > > If it changes the state before list_del_event() that wouldn't disable
> > > > related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
> > > > as patch describe...
> > > No, I don't think so.
> > > change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
> > > And no such kernel panic happened  before commit a3c3c6667.
>
> Oh! I was wrong, before commit a3c3c6667, "change state" happened *after* list_del_event()
> >
> > That's why list_del_event() handle the perf_cgroup_disable() before the
> > commit a3c3c6667. However because of *my mistake*, I've forget to
> > perf_cgroup_disable() properly before change the event state.
> > Yes, your patch can make avoid the panic since as soon as exit,
> > the event->cgrp switched.
>
> I cannot agree with the reasoning,
> The panic dose not happened when exit, it happened when reboot/shutdown.
> (I close perf_event_open before reboot)
> >
> > However, as I said, the INACTIVE event could be failed to count
> >total_enable_time.
> >
> > So, set event should be occured before list_del_event().
> >And since it's event->state change on remove.
> >It shouldn't have any side effect the state change isn't cause of your
> > panic. But missed perf_cgroup_disable().
>
> Any procedure to bring out the impact of this missed perf_cgroup_disable()?
> My system seems all normal, where should I check it?

Here is possible senario:
  1. perf event open with cgroup.
  2. perf event open with cpu event (no cgroup).
  3. above task sets the cpuctx->cgrp the same to (1).
  3. close (1) events.
     here, perf_cgroup_event_disable() isn't called,
     cpuctx->cgrp still point the cgroup.
  4. by other task, the cgroup and is destroied.
  5. close (2) events.
     here, it is last event, in __perf_remove_from_context()
     and last event, it calls update_cgrp_time_from_cpuctx(),
     And this refers invalid pointer.

> But to fix it,  isn't following change less aggressive?
>         event_sched_out(event, ctx);
> -       perf_event_set_state(event, min(event->state, state));
>         if (flags & DETACH_GROUP)
>                 perf_group_detach(event);
>         if (flags & DETACH_CHILD)
>                perf_child_detach(event);
>         list_del_event(event, ctx);
> +       perf_event_set_state(event, min(event->state, state));

If perf_child_detach() is called first and perf_event_set_state() call,
since the parent is removed in perf_child_detatced,
It would be failed to account the total_enable_time which caculating
child_event's enable_time too.

Thanks

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03  9:20         ` Yeoreum Yun
@ 2025-06-03 10:08           ` David Wang
  2025-06-03 13:41             ` Yeoreum Yun
  0 siblings, 1 reply; 37+ messages in thread
From: David Wang @ 2025-06-03 10:08 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 17:20:04, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>
>> > > > >
>> > > > > Also, your patch couldn't solve a problem describe in
>> > > > > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
>> > > > > for INCATIVE event's total_enable_time.
>> > > >
>> > > > I do not think so.
>> > > > Correct me if I am making silly  mistakes,
>> > > > The patch, https://lore.kernel.org/lkml/20250603032651.3988-1-00107082@163.com/
>> > > > calls perf_event_set_state() based on DETACH_EXIT flag, which cover the INACTIVE state, right?
>> > > > If DETACH_EXIT is not used for this purpose? Then why should it exist at the first place?
>> > > > I think I does not revert the purpose of commit a3c3c6667.....But I could be wrong
>> > > > Would you show a call path where DETACH_EXIT is not set, but the changes in commit a3c3c6667 is still needed?
>> > > >
>> > > > Sorry for my bad explaination without detail.
>> > > > Think about cpu specific event and closed by task.
>> > > > If there is specific child cpu event specified in cpu 0.
>> > > > 1. cpu 0 -> active
>> > > > 2. scheulded to cpu1 -> inactive
>> > > > 3. close the cpu event from parent -> inactive close
>> > > >
>> > > > Can be failed to count total_enable_time.
>> > >
>> > > Is this explaining the purpose of commit a3c3c6667 ?
>> > > I am not arguing with it. And I also not suggest reverting it. (it is just that reverting it can fix the kernel panic.)
>> >
>> > In commit a3c3c6667, I explain the specific case but not with above
>> > case. But the commit's purpose is "account total_enable_time" properly.
>> >
>> > > > And also, considering the your patch, for DETACH_EXIT case,
>> > > > If it changes the state before list_del_event() that wouldn't disable
>> > > > related to the cgroup. So it would make cpuctx->cgrp pointer could be dangled
>> > > > as patch describe...
>> > > No, I don't think so.
>> > > change state before list_del_event(), this is the same behavior before commit a3c3c6667, right?
>> > > And no such kernel panic happened  before commit a3c3c6667.
>>
>> Oh! I was wrong, before commit a3c3c6667, "change state" happened *after* list_del_event()
>> >
>> > That's why list_del_event() handle the perf_cgroup_disable() before the
>> > commit a3c3c6667. However because of *my mistake*, I've forget to
>> > perf_cgroup_disable() properly before change the event state.
>> > Yes, your patch can make avoid the panic since as soon as exit,
>> > the event->cgrp switched.
>>
>> I cannot agree with the reasoning,
>> The panic dose not happened when exit, it happened when reboot/shutdown.
>> (I close perf_event_open before reboot)
>> >
>> > However, as I said, the INACTIVE event could be failed to count
>> >total_enable_time.
>> >
>> > So, set event should be occured before list_del_event().
>> >And since it's event->state change on remove.
>> >It shouldn't have any side effect the state change isn't cause of your
>> > panic. But missed perf_cgroup_disable().
>>
>> Any procedure to bring out the impact of this missed perf_cgroup_disable()?
>> My system seems all normal, where should I check it?
>
>Here is possible senario:
>  1. perf event open with cgroup.
>  2. perf event open with cpu event (no cgroup).
>  3. above task sets the cpuctx->cgrp the same to (1).
>  3. close (1) events.
>     here, perf_cgroup_event_disable() isn't called,
>     cpuctx->cgrp still point the cgroup.
>  4. by other task, the cgroup and is destroied.
>  5. close (2) events.
>     here, it is last event, in __perf_remove_from_context()
>     and last event, it calls update_cgrp_time_from_cpuctx(),
>     And this refers invalid pointer.

... seems too complicated to me to give it a try.
Would not destroying cgroup trigger some reaction to cpuctx? 
There has to be some connection between cgroup lifecycle and perf cpuctx->cgrp.
Have you try it out? or it is just theoretical? 


>
>> But to fix it,  isn't following change less aggressive?
>>         event_sched_out(event, ctx);
>> -       perf_event_set_state(event, min(event->state, state));
>>         if (flags & DETACH_GROUP)
>>                 perf_group_detach(event);
>>         if (flags & DETACH_CHILD)
>>                perf_child_detach(event);
>>         list_del_event(event, ctx);
>> +       perf_event_set_state(event, min(event->state, state));
>
>If perf_child_detach() is called first and perf_event_set_state() call,
>since the parent is removed in perf_child_detatced,
>It would be failed to account the total_enable_time which caculating
>child_event's enable_time too.

Thanks for clarifying this, 
So the whole point  in commit a3c3c6667 is to make  perf_event_set_state() happens before perf_child_detach(), right?
I feel I got lost somewhere when I rush to this suggestion. But I still don't understand why my patchv1  breaks commit 
a3c3c6667, really confused.


>
>Thanks
>
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 10:08           ` David Wang
@ 2025-06-03 13:41             ` Yeoreum Yun
  2025-06-03 14:02               ` David Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03 13:41 UTC (permalink / raw)
  To: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
  Cc: linux-perf-users, linux-kernel, David Wang

Hi David,

> >
> > > But to fix it,  isn't following change less aggressive?
> > >         event_sched_out(event, ctx);
> > > -       perf_event_set_state(event, min(event->state, state));
> > >         if (flags & DETACH_GROUP)
> > >                 perf_group_detach(event);
> > >         if (flags & DETACH_CHILD)
> > >                perf_child_detach(event);
> > >         list_del_event(event, ctx);
> > > +       perf_event_set_state(event, min(event->state, state));
> >
> > If perf_child_detach() is called first and perf_event_set_state() call,
> > since the parent is removed in perf_child_detatced,
> > It would be failed to account the total_enable_time which caculating
> > child_event's enable_time too.
>
> Thanks for clarifying this,
> So the whole point  in commit a3c3c6667 is to make  perf_event_set_state() happens before perf_child_detach(), right?
> I feel I got lost somewhere when I rush to this suggestion. But I still don't understand why my patchv1  breaks commit
> a3c3c6667, really confused.

I explained this in:
   https://lore.kernel.org/all/5d17f1d7.666d.197348b78d1.Coremail.00107082@163.com/

>> If there is specific child cpu event specified in cpu 0.
>>   1. cpu 0 -> active
>>   2. scheulded to cpu1 -> inactive
>>   3. close the cpu event from parent -> inactive close
>>
>> Can be failed to count total_enable_time.


Consider one event which attached to taskctx with specific cpu.
In case of your original patch is for only "DETACH_EXIT" case.
Here what I mean, the event is "closed".
In this case, based on your patch, it doesn't call the perf_event_set_state()
before list_del_event(), but perf_event_set_state() is called after list_del_event().

Thanks

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-02 18:40 [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Yeoreum Yun
  2025-06-03  2:01 ` David Wang
@ 2025-06-03 14:00 ` Leo Yan
  2025-06-03 14:44   ` Peter Zijlstra
  2025-06-03 15:05   ` Yeoreum Yun
  1 sibling, 2 replies; 37+ messages in thread
From: Leo Yan @ 2025-06-03 14:00 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

Hi Levi,

On Mon, Jun 02, 2025 at 07:40:49PM +0100, Yeoreum Yun wrote:
> commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> changes the event->state update before list_del_event().
> This change prevents calling perf_cgroup_event_disable() as a result,
> cpuctx->cgrp can't be cleared properly and point to dangling point of cgroup.
> 
> Because of this problem, some machin meets the below panic[0]:
> 
> 863.881960] sysved_call_function_sing le+0x4c/0xc0
> 863.881301] asm_sysvec_call_function_single+0x16/0x20
> 869.881344] RIP: 0633:0x7f9alcea3367
> 663.681373] Code: 00 66 99 b8 ff ff ff ff c3 66 ....
> 863.881524] RSP: 002b:00007fffa526fcf8 EFLAGS: 00000246
> 869.881567] RAX: 0000562060c962d0 RBX: 0000000000000002 RCX: 00007f9a1cff1c60
> 863.881625] RDX: 00007f9a0c000030 RSI: 00007f9alcff1c60 RDI: 00007f9a1ca91c20
> 863.081682] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f9a1d6217a0
> 869.881740] R10: 00007f9alca91c10 R11: 0000000000000246 R12: 00007f9a1d70c020
> 869.881798] R13: 00007fffa5270030 R14: 00007fffa526fd00 R15: 0000000000000000
> 863.881860] </TASK>
> 863.881876) Modules linked in: snd_seq_dummy (E) snd_hrtimer (E)...
> ...
> 863.887142] button (E)
> 863.912127] CR2: ffffe4afcc079650
> 863.914593] --- [ end trace 0000000000000000 1--
> 864.042750] RIP: 0010:ctx_sched_out+0x1ce/0x210
> 864.045214] Code: 89 c6 4c 8b b9 de 00 00 00 48 ...
> 864.050343] RSP: 0000:ffffaa4ec0f3fe60 EFLAGS: 00010086
> 864.052929] RAX: 0000000000000002 RBX: ffff8e8eeed2a580 RCX: ffff8e8bded9bf00
> 864.055518] RDX: 000000c92340b051 RSI: 000000c92340b051 RDI: ffff
> 864.058093] RBP: 0000000000000000 R08: 0000000000000002 R09: 00
> 864.060654] R10: 0000000000000000 R11: 0000000000000000 R12: 000
> 864.063183] R13: ffff8e8eeed2a580 R14: 0000000000000007 R15: ffffe4afcc079650
> 864.065729] FS: 00007f9a1ca91940 (0000) GS:ffff8e8f6b1c3000(0000) knIGS:0000000000000000
> 864.068312] CS: 0010 DS: 0000 ES: 0000 CRO: 0000000080050033
> 864.070898] CR2: ffffe4afcc079650 CR3: 00000001136d8000 CR4: 0000000000350ef0
> 864.673523] Kernel panic - not syncing: Fatal exception in interrupt
> 864.076410] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xff
> 864.205401] --- [ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> 
> To address this call the perf_cgroup_event_disable() properly before
> list_del_event() in __perf_remove_from_context().
> 
> Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
> Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> Tested-by: David Wang <00107082@163.com>
> ---
>  kernel/events/core.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f34c99f8ce8f..909b9d5a65c1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2498,6 +2498,10 @@ __perf_remove_from_context(struct perf_event *event,
>  		state = PERF_EVENT_STATE_DEAD;
>  	}
>  	event_sched_out(event, ctx);
> +
> +	if (event->state > PERF_EVENT_STATE_OFF)
> +		perf_cgroup_event_disable(event, ctx);
> +

As we discussed, seems to me, the issue is caused by an ambigous state
machine transition:

When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
not transite the state to PERF_EVENT_STATE_OFF. As a result, the
list_del_event() function skips to clean up cgroup pointer for non OFF
states. This is different from the code prior to the commit a3c3c6667,
which transits states EXIT -> INACTIVE -> OFF.

My suggestion is not reliable. Roughly read code, except for the
PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
up the cgroup pointer.  The reason is I don't see other places to
clean up the cgroup pointer for these error cases:

  PERF_EVENT_STATE_REVOKED
  PERF_EVENT_STATE_DEAD

Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
cgroup as this has already been handled in merge_sched_in().

So a correct condition would be:

    if (event->state > PERF_EVENT_STATE_OFF ||
        event->state <= PERF_EVENT_STATE_EXIT)
        perf_cgroup_event_disable(event, ctx);

And we need to remove the perf_cgroup_event_disable() from
list_del_event() to avoid duplicate code.

Perhaps a better approach for code consolidation would be to modify
the conditions in list_del_event() to ensure the cgroup pointer is
cleaned up in error cases. However, I'm not confident that this is the
correct direction, so I would wait for suggestions from the maintainers.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 13:41             ` Yeoreum Yun
@ 2025-06-03 14:02               ` David Wang
  0 siblings, 0 replies; 37+ messages in thread
From: David Wang @ 2025-06-03 14:02 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: peterz, mingo, mingo, acme, namhyung, leo.yan, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel


At 2025-06-03 21:41:30, "Yeoreum Yun" <yeoreum.yun@arm.com> wrote:
>Hi David,
>
>> >
>> > > But to fix it,  isn't following change less aggressive?
>> > >         event_sched_out(event, ctx);
>> > > -       perf_event_set_state(event, min(event->state, state));
>> > >         if (flags & DETACH_GROUP)
>> > >                 perf_group_detach(event);
>> > >         if (flags & DETACH_CHILD)
>> > >                perf_child_detach(event);
>> > >         list_del_event(event, ctx);
>> > > +       perf_event_set_state(event, min(event->state, state));
>> >
>> > If perf_child_detach() is called first and perf_event_set_state() call,
>> > since the parent is removed in perf_child_detatced,
>> > It would be failed to account the total_enable_time which caculating
>> > child_event's enable_time too.
>>
>> Thanks for clarifying this,
>> So the whole point  in commit a3c3c6667 is to make  perf_event_set_state() happens before perf_child_detach(), right?
>> I feel I got lost somewhere when I rush to this suggestion. But I still don't understand why my patchv1  breaks commit
>> a3c3c6667, really confused.
>
>I explained this in:
>   https://lore.kernel.org/all/5d17f1d7.666d.197348b78d1.Coremail.00107082@163.com/
>
>>> If there is specific child cpu event specified in cpu 0.
>>>   1. cpu 0 -> active
>>>   2. scheulded to cpu1 -> inactive
>>>   3. close the cpu event from parent -> inactive close
>>>
>>> Can be failed to count total_enable_time.
>
>
>Consider one event which attached to taskctx with specific cpu.
>In case of your original patch is for only "DETACH_EXIT" case.
>Here what I mean, the event is "closed".
>In this case, based on your patch, it doesn't call the perf_event_set_state()
>before list_del_event(), but perf_event_set_state() is called after list_del_event().

Do you mean in this case, the event is not passed to perf_event_exit_event()?
Because in my mind, as long as a event reach perf_event_exit_event, DETACH_EXIT flag would always be set.
perf_event_exit_event()
   ---> perf_remove_from_context(event, detach_flags | DETACH_EXIT);  <---
            ---> __perf_remove_from_context
                    ----> perf_event_set_state  (DETACH_EXIT is always set in this call path)
                    ----> list_del_event

So I am still confused, even with cpu switch, the DETACH_EXIT flag is still there.
Could you explain it with a callchain?

Thanks
David

>
>Thanks
>
>--
>Sincerely,
>Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 14:00 ` Leo Yan
@ 2025-06-03 14:44   ` Peter Zijlstra
  2025-06-03 15:17     ` Yeoreum Yun
  2025-06-04  8:03     ` Peter Zijlstra
  2025-06-03 15:05   ` Yeoreum Yun
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-03 14:44 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Tue, Jun 03, 2025 at 03:00:40PM +0100, Leo Yan wrote:

> > +	if (event->state > PERF_EVENT_STATE_OFF)
> > +		perf_cgroup_event_disable(event, ctx);
> > +
> 
> As we discussed, seems to me, the issue is caused by an ambigous state
> machine transition:
> 
> When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
> not transite the state to PERF_EVENT_STATE_OFF. As a result, the
> list_del_event() function skips to clean up cgroup pointer for non OFF
> states. This is different from the code prior to the commit a3c3c6667,
> which transits states EXIT -> INACTIVE -> OFF.

Right.

> My suggestion is not reliable. Roughly read code, except for the
> PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> up the cgroup pointer.  The reason is I don't see other places to
> clean up the cgroup pointer for these error cases:
> 
>   PERF_EVENT_STATE_REVOKED
>   PERF_EVENT_STATE_DEAD

Those should be done here; on the first transition into these states.

> Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> cgroup as this has already been handled in merge_sched_in().
> 
> So a correct condition would be:
> 
>     if (event->state > PERF_EVENT_STATE_OFF ||
>         event->state <= PERF_EVENT_STATE_EXIT)
>         perf_cgroup_event_disable(event, ctx);

I'm too tired to get my head straight. I'll look tomorrow.

> And we need to remove the perf_cgroup_event_disable() from
> list_del_event() to avoid duplicate code.
> 
> Perhaps a better approach for code consolidation would be to modify
> the conditions in list_del_event() to ensure the cgroup pointer is
> cleaned up in error cases. However, I'm not confident that this is the
> correct direction, so I would wait for suggestions from the maintainers.

Probably easier to keep here in __perf_remove_from_context() where we
have prev and next state available.

Anyway, I currently have the below, but I'll update once I've had sleep.

---
Subject: perf: Fix dangling cgroup pointer in cpuctx
From: Yeoreum Yun <yeoreum.yun@arm.com>
Date: Mon, 2 Jun 2025 19:40:49 +0100

From: Yeoreum Yun <yeoreum.yun@arm.com>

Commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting
bug at task exit") moves the event->state update to before
list_del_event(). This makes the event->state test in list_del_event()
always false; never calling perf_cgroup_event_disable().

As a result, cpuctx->cgrp won't be cleared properly; causing havoc.

Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
---
 kernel/events/core.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2120,18 +2120,6 @@ list_del_event(struct perf_event *event,
 	if (event->group_leader == event)
 		del_event_from_groups(event, ctx);
 
-	/*
-	 * If event was in error state, then keep it
-	 * that way, otherwise bogus counts will be
-	 * returned on read(). The only way to get out
-	 * of error state is by explicit re-enabling
-	 * of the event
-	 */
-	if (event->state > PERF_EVENT_STATE_OFF) {
-		perf_cgroup_event_disable(event, ctx);
-		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
-	}
-
 	ctx->generation++;
 	event->pmu_ctx->nr_events--;
 }
@@ -2493,11 +2481,14 @@ __perf_remove_from_context(struct perf_e
 		state = PERF_EVENT_STATE_EXIT;
 	if (flags & DETACH_REVOKE)
 		state = PERF_EVENT_STATE_REVOKED;
-	if (flags & DETACH_DEAD) {
-		event->pending_disable = 1;
+	if (flags & DETACH_DEAD)
 		state = PERF_EVENT_STATE_DEAD;
-	}
+
 	event_sched_out(event, ctx);
+
+	if (event->state > PERF_EVENT_STATE_OFF)
+		perf_cgroup_event_disable(event, ctx);
+
 	perf_event_set_state(event, min(event->state, state));
 
 	if (flags & DETACH_GROUP)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 14:00 ` Leo Yan
  2025-06-03 14:44   ` Peter Zijlstra
@ 2025-06-03 15:05   ` Yeoreum Yun
  1 sibling, 0 replies; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03 15:05 UTC (permalink / raw)
  To: Leo Yan
  Cc: peterz, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

Hi Leo,
>
> On Mon, Jun 02, 2025 at 07:40:49PM +0100, Yeoreum Yun wrote:
> > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > changes the event->state update before list_del_event().
> > This change prevents calling perf_cgroup_event_disable() as a result,
> > cpuctx->cgrp can't be cleared properly and point to dangling point of cgroup.
> >
> > Because of this problem, some machin meets the below panic[0]:
> >
> > 863.881960] sysved_call_function_sing le+0x4c/0xc0
> > 863.881301] asm_sysvec_call_function_single+0x16/0x20
> > 869.881344] RIP: 0633:0x7f9alcea3367
> > 663.681373] Code: 00 66 99 b8 ff ff ff ff c3 66 ....
> > 863.881524] RSP: 002b:00007fffa526fcf8 EFLAGS: 00000246
> > 869.881567] RAX: 0000562060c962d0 RBX: 0000000000000002 RCX: 00007f9a1cff1c60
> > 863.881625] RDX: 00007f9a0c000030 RSI: 00007f9alcff1c60 RDI: 00007f9a1ca91c20
> > 863.081682] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f9a1d6217a0
> > 869.881740] R10: 00007f9alca91c10 R11: 0000000000000246 R12: 00007f9a1d70c020
> > 869.881798] R13: 00007fffa5270030 R14: 00007fffa526fd00 R15: 0000000000000000
> > 863.881860] </TASK>
> > 863.881876) Modules linked in: snd_seq_dummy (E) snd_hrtimer (E)...
> > ...
> > 863.887142] button (E)
> > 863.912127] CR2: ffffe4afcc079650
> > 863.914593] --- [ end trace 0000000000000000 1--
> > 864.042750] RIP: 0010:ctx_sched_out+0x1ce/0x210
> > 864.045214] Code: 89 c6 4c 8b b9 de 00 00 00 48 ...
> > 864.050343] RSP: 0000:ffffaa4ec0f3fe60 EFLAGS: 00010086
> > 864.052929] RAX: 0000000000000002 RBX: ffff8e8eeed2a580 RCX: ffff8e8bded9bf00
> > 864.055518] RDX: 000000c92340b051 RSI: 000000c92340b051 RDI: ffff
> > 864.058093] RBP: 0000000000000000 R08: 0000000000000002 R09: 00
> > 864.060654] R10: 0000000000000000 R11: 0000000000000000 R12: 000
> > 864.063183] R13: ffff8e8eeed2a580 R14: 0000000000000007 R15: ffffe4afcc079650
> > 864.065729] FS: 00007f9a1ca91940 (0000) GS:ffff8e8f6b1c3000(0000) knIGS:0000000000000000
> > 864.068312] CS: 0010 DS: 0000 ES: 0000 CRO: 0000000080050033
> > 864.070898] CR2: ffffe4afcc079650 CR3: 00000001136d8000 CR4: 0000000000350ef0
> > 864.673523] Kernel panic - not syncing: Fatal exception in interrupt
> > 864.076410] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xff
> > 864.205401] --- [ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >
> > To address this call the perf_cgroup_event_disable() properly before
> > list_del_event() in __perf_remove_from_context().
> >
> > Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
> > Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > Tested-by: David Wang <00107082@163.com>
> > ---
> >  kernel/events/core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f34c99f8ce8f..909b9d5a65c1 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2498,6 +2498,10 @@ __perf_remove_from_context(struct perf_event *event,
> >  		state = PERF_EVENT_STATE_DEAD;
> >  	}
> >  	event_sched_out(event, ctx);
> > +
> > +	if (event->state > PERF_EVENT_STATE_OFF)
> > +		perf_cgroup_event_disable(event, ctx);
> > +
>
> As we discussed, seems to me, the issue is caused by an ambigous state
> machine transition:
>
> When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
> not transite the state to PERF_EVENT_STATE_OFF. As a result, the
> list_del_event() function skips to clean up cgroup pointer for non OFF
> states. This is different from the code prior to the commit a3c3c6667,
> which transits states EXIT -> INACTIVE -> OFF.

Just for correct this, prior to commit a3c3c6667,
Isn't the state change INACTIVE -> OFF -> EXIT?

when I saw the perf_event_exit_event(),
  perf_remove_from_context(event, detacg_flags
  ...
  perf_event_set_state(event, PERF_EVENT_STATE_EXIT),

So,

  1. event_sched_out() -> INACTIVE
  2. list_del_event() -> OFF
  3. perf_event_set_state() -> EXIT.

Therefore, the final event state is "PREF_EVENT_STATE_EXIT"

Therefore, I think calling "perf_cgroup_event_disable()" is called
as soon as event_sched_out() since commit a3c3c6667 change the state
INCATIVE -> EXIT.

so, the middle state of OFF is gone and I think this middle state "OFF"
doesn't affect any race condition since the removed event "wakeup" is
also called after it changed to "final state" -- in this case, the
PERF_EVENT_STATE_EXIT.

>
> My suggestion is not reliable. Roughly read code, except for the
> PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> up the cgroup pointer.  The reason is I don't see other places to
> clean up the cgroup pointer for these error cases:
>
>   PERF_EVENT_STATE_REVOKED
>   PERF_EVENT_STATE_DEAD
>
> Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> cgroup as this has already been handled in merge_sched_in().
>
> So a correct condition would be:
>
>     if (event->state > PERF_EVENT_STATE_OFF ||
>         event->state <= PERF_EVENT_STATE_EXIT)
>         perf_cgroup_event_disable(event, ctx);
>
> And we need to remove the perf_cgroup_event_disable() from
> list_del_event() to avoid duplicate code.

If we sustain perf_cgroup_event_disable()'s call in *list_del_event()*
But, Should it be? I think its enough to call as soon as after event_sched_out().

>
> Perhaps a better approach for code consolidation would be to modify
> the conditions in list_del_event() to ensure the cgroup pointer is
> cleaned up in error cases. However, I'm not confident that this is the
> correct direction, so I would wait for suggestions from the maintainers.
>
> Thanks,
> Leo
[...]

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 14:44   ` Peter Zijlstra
@ 2025-06-03 15:17     ` Yeoreum Yun
  2025-06-04  7:06       ` Peter Zijlstra
  2025-06-04  8:03     ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-03 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Leo Yan, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang


Hi Peter,

>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2120,18 +2120,6 @@ list_del_event(struct perf_event *event,
>  	if (event->group_leader == event)
>  		del_event_from_groups(event, ctx);
>
> -	/*
> -	 * If event was in error state, then keep it
> -	 * that way, otherwise bogus counts will be
> -	 * returned on read(). The only way to get out
> -	 * of error state is by explicit re-enabling
> -	 * of the event
> -	 */
> -	if (event->state > PERF_EVENT_STATE_OFF) {
> -		perf_cgroup_event_disable(event, ctx);
> -		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> -	}
> -
>  	ctx->generation++;
>  	event->pmu_ctx->nr_events--;

JFYI, this removal should be not included when you backport to v6.15
unless your patch backport together:
  commit 90661365021a ("perf Unify perf_event_free_task() / perf_evenet_exit_task_context()")

Thanks.

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 15:17     ` Yeoreum Yun
@ 2025-06-04  7:06       ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04  7:06 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Leo Yan, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Tue, Jun 03, 2025 at 04:17:04PM +0100, Yeoreum Yun wrote:
> 
> Hi Peter,
> 
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2120,18 +2120,6 @@ list_del_event(struct perf_event *event,
> >  	if (event->group_leader == event)
> >  		del_event_from_groups(event, ctx);
> >
> > -	/*
> > -	 * If event was in error state, then keep it
> > -	 * that way, otherwise bogus counts will be
> > -	 * returned on read(). The only way to get out
> > -	 * of error state is by explicit re-enabling
> > -	 * of the event
> > -	 */
> > -	if (event->state > PERF_EVENT_STATE_OFF) {
> > -		perf_cgroup_event_disable(event, ctx);
> > -		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> > -	}
> > -
> >  	ctx->generation++;
> >  	event->pmu_ctx->nr_events--;
> 
> JFYI, this removal should be not included when you backport to v6.15
> unless your patch backport together:
>   commit 90661365021a ("perf Unify perf_event_free_task() / perf_evenet_exit_task_context()")

Right. But lets just get the patch for the current tree right. Then we
can worry about backports :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-03 14:44   ` Peter Zijlstra
  2025-06-03 15:17     ` Yeoreum Yun
@ 2025-06-04  8:03     ` Peter Zijlstra
  2025-06-04 10:06       ` Yeoreum Yun
  2025-06-04 10:18       ` Leo Yan
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04  8:03 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Tue, Jun 03, 2025 at 04:44:14PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 03, 2025 at 03:00:40PM +0100, Leo Yan wrote:
> 
> > > +	if (event->state > PERF_EVENT_STATE_OFF)
> > > +		perf_cgroup_event_disable(event, ctx);
> > > +
> > 
> > As we discussed, seems to me, the issue is caused by an ambigous state
> > machine transition:
> > 
> > When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
> > not transite the state to PERF_EVENT_STATE_OFF. As a result, the
> > list_del_event() function skips to clean up cgroup pointer for non OFF
> > states. This is different from the code prior to the commit a3c3c6667,
> > which transits states EXIT -> INACTIVE -> OFF.
> 
> Right.
> 
> > My suggestion is not reliable. Roughly read code, except for the
> > PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> > up the cgroup pointer.  The reason is I don't see other places to
> > clean up the cgroup pointer for these error cases:
> > 
> >   PERF_EVENT_STATE_REVOKED
> >   PERF_EVENT_STATE_DEAD
> 
> Those should be done here; on the first transition into these states.
> 
> > Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> > cgroup as this has already been handled in merge_sched_in().
> > 
> > So a correct condition would be:
> > 
> >     if (event->state > PERF_EVENT_STATE_OFF ||
> >         event->state <= PERF_EVENT_STATE_EXIT)
> >         perf_cgroup_event_disable(event, ctx);
> 
> I'm too tired to get my head straight. I'll look tomorrow.


Right; had a sleep. Lets do this :-)


So the normal states are:

	    ACTIVE    ---.
	       ^         |
	       |         |
       sched_{in,out}()  |
	       |         |
	       v         |
      ,---> INACTIVE  <--+
      |                  |
      |            {dis,en}able()
  sched_in()             |
      |       OFF     <--+
      |                  |
      `--->  ERROR    ---'

That is:

sched_in:	INACTIVE          -> {ACTIVE,ERROR}
sched_out:	ACTIVE            -> INACTIVE
disable:	{ACTIVE,INACTIVE} -> OFF
enable:         {OFF,ERROR}       -> INACTIVE

Where OFF/ERROR are 'disable' and have this perf_cgroup_event_disable()
called.

Then we have {EXIT,REVOKED,DEAD} states which are various shades of
defunct events:

 - EXIT means task that the even was assigned to died, but child events
   still live, and further children can still be created. But the event
   itself will never be active again. It can only transition to
   {REVOKED,DEAD};

 - REVOKED means the PMU the event was associated with is gone; all
   functionality is stopped but the event is still alive. Can only
   transition to DEAD;

 - DEAD event really is DYING tearing down state and freeing bits.

And now we have the sitation that __perf_remove_from_context() can do:

  {ACTIVE,INACTIVE,OFF,ERROR} -> {OFF,EXIT,REVOKED,DEAD}

Where the {OFF,ERROR} -> * transition already have
perf_cgroup_event_disable(), but the {ACTIVE,INACTIVE} -> * part has
not.

The condition:

  if (event->state > PERF_EVENT_STATE_OFF)

captures exactly that {ACTIVE,INACTIVE} that still needs the cgroup
disable. Every other state is already a disabled state.


Agreed?

Also, I should probably stick most of all this in a comment somewhere.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04  8:03     ` Peter Zijlstra
@ 2025-06-04 10:06       ` Yeoreum Yun
  2025-06-04 12:37         ` Peter Zijlstra
  2025-06-04 10:18       ` Leo Yan
  1 sibling, 1 reply; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-04 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Leo Yan, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

Hi Peter,

> On Tue, Jun 03, 2025 at 04:44:14PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 03, 2025 at 03:00:40PM +0100, Leo Yan wrote:
> >
> > > > +	if (event->state > PERF_EVENT_STATE_OFF)
> > > > +		perf_cgroup_event_disable(event, ctx);
> > > > +
> > >
> > > As we discussed, seems to me, the issue is caused by an ambigous state
> > > machine transition:
> > >
> > > When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
> > > not transite the state to PERF_EVENT_STATE_OFF. As a result, the
> > > list_del_event() function skips to clean up cgroup pointer for non OFF
> > > states. This is different from the code prior to the commit a3c3c6667,
> > > which transits states EXIT -> INACTIVE -> OFF.
> >
> > Right.
> >
> > > My suggestion is not reliable. Roughly read code, except for the
> > > PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> > > up the cgroup pointer.  The reason is I don't see other places to
> > > clean up the cgroup pointer for these error cases:
> > >
> > >   PERF_EVENT_STATE_REVOKED
> > >   PERF_EVENT_STATE_DEAD
> >
> > Those should be done here; on the first transition into these states.
> >
> > > Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> > > cgroup as this has already been handled in merge_sched_in().
> > >
> > > So a correct condition would be:
> > >
> > >     if (event->state > PERF_EVENT_STATE_OFF ||
> > >         event->state <= PERF_EVENT_STATE_EXIT)
> > >         perf_cgroup_event_disable(event, ctx);
> >
> > I'm too tired to get my head straight. I'll look tomorrow.
>
>
> Right; had a sleep. Lets do this :-)
>
>
> So the normal states are:
>
> 	    ACTIVE    ---.
> 	       ^         |
> 	       |         |
>        sched_{in,out}()  |
> 	       |         |
> 	       v         |
>       ,---> INACTIVE  <--+
>       |                  |
>       |            {dis,en}able()
>   sched_in()             |
>       |       OFF     <--+
>       |                  |
>       `--->  ERROR    ---'
>
> That is:
>
> sched_in:	INACTIVE          -> {ACTIVE,ERROR}
> sched_out:	ACTIVE            -> INACTIVE
> disable:	{ACTIVE,INACTIVE} -> OFF
> enable:         {OFF,ERROR}       -> INACTIVE
>
> Where OFF/ERROR are 'disable' and have this perf_cgroup_event_disable()
> called.
>
> Then we have {EXIT,REVOKED,DEAD} states which are various shades of
> defunct events:
>
>  - EXIT means task that the even was assigned to died, but child events
>    still live, and further children can still be created. But the event
>    itself will never be active again. It can only transition to
>    {REVOKED,DEAD};

I have a slight quetions. after parent event set EXIT,
Does EXIT event should be inherited?

for example

   parent task(0, ...) -> parent_event(0, parent_event:NULL)
     ` child_task(1, parent:0) -> child_event(1, parent_event:0)
         ` child_task(2, parent:1) -> child_event(2, parent_event:0)

In this case when parent task(0) is exited,
parent->event will be set as EXIT state.

But suppose the child_task(2) try to fork (child_task3) and
inherit the event (create child_event(3, parent_event:0),
and at the fork, forking can observe the parent event state as "EXIT".
In thie situation why child_event(3, parent_event:0) should be created for
child_task(3)?

>  - REVOKED means the PMU the event was associated with is gone; all
>    functionality is stopped but the event is still alive. Can only
>    transition to DEAD;
>
>  - DEAD event really is DYING tearing down state and freeing bits.
>
> And now we have the sitation that __perf_remove_from_context() can do:
>
>   {ACTIVE,INACTIVE,OFF,ERROR} -> {OFF,EXIT,REVOKED,DEAD}
>
> Where the {OFF,ERROR} -> * transition already have
> perf_cgroup_event_disable(), but the {ACTIVE,INACTIVE} -> * part has
> not.
>
> The condition:
>
>   if (event->state > PERF_EVENT_STATE_OFF)
>
> captures exactly that {ACTIVE,INACTIVE} that still needs the cgroup
> disable. Every other state is already a disabled state.
>
>
> Agreed?
>
> Also, I should probably stick most of all this in a comment somewhere.

--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04  8:03     ` Peter Zijlstra
  2025-06-04 10:06       ` Yeoreum Yun
@ 2025-06-04 10:18       ` Leo Yan
  2025-06-04 13:58         ` Peter Zijlstra
  2025-06-04 14:16         ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Leo Yan @ 2025-06-04 10:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 10:03:39AM +0200, Peter Zijlstra wrote:

[...]

> > > My suggestion is not reliable. Roughly read code, except for the
> > > PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> > > up the cgroup pointer.  The reason is I don't see other places to
> > > clean up the cgroup pointer for these error cases:
> > > 
> > >   PERF_EVENT_STATE_REVOKED
> > >   PERF_EVENT_STATE_DEAD
> > 
> > Those should be done here; on the first transition into these states.
> > 
> > > Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> > > cgroup as this has already been handled in merge_sched_in().
> > > 
> > > So a correct condition would be:
> > > 
> > >     if (event->state > PERF_EVENT_STATE_OFF ||
> > >         event->state <= PERF_EVENT_STATE_EXIT)
> > >         perf_cgroup_event_disable(event, ctx);
> > 
> > I'm too tired to get my head straight. I'll look tomorrow.
> 
> 
> Right; had a sleep. Lets do this :-)

Thanks a lot for the state machine diagram.

> So the normal states are:
> 
>             ACTIVE    ---.
>                ^         |
>                |         |
>        sched_{in,out}()  |
>                |         |
>                v         |
>       ,---> INACTIVE  <--+
>       |                  |
>       |            {dis,en}able()
>   sched_in()             |
>       |       OFF     <--+
>       |                  |
>       `--->  ERROR    ---'
> 
> That is:
> 
> sched_in:     INACTIVE          -> {ACTIVE,ERROR}

Strictly speaking, sched_in() can keep INACTIVE state if an event is
rotated.

> sched_out:    ACTIVE            -> INACTIVE
> disable:      {ACTIVE,INACTIVE} -> OFF
> enable:       {OFF,ERROR}       -> INACTIVE
> 
> Where OFF/ERROR are 'disable' and have this perf_cgroup_event_disable()
> called.
> 
> Then we have {EXIT,REVOKED,DEAD} states which are various shades of
> defunct events:
> 
>  - EXIT means task that the even was assigned to died, but child events
>    still live, and further children can still be created. But the event
>    itself will never be active again. It can only transition to
>    {REVOKED,DEAD};
> 
>  - REVOKED means the PMU the event was associated with is gone; all
>    functionality is stopped but the event is still alive. Can only
>    transition to DEAD;
> 
>  - DEAD event really is DYING tearing down state and freeing bits.
> 
> And now we have the sitation that __perf_remove_from_context() can do:
> 
>   {ACTIVE,INACTIVE,OFF,ERROR} -> {OFF,EXIT,REVOKED,DEAD}

A detailed transition is:

  Case 1: {ACTIVE} -> {INACTIVE} -> {OFF,EXIT,REVOKED,DEAD}
  Case 2: {ERROR} -> {ERROR,EXIT,REVOKED,DEAD}
  Case 3: {OFF} -> {OFF,EXIT,REVOKED,DEAD}

> Where the {OFF,ERROR} -> * transition already have
> perf_cgroup_event_disable(), but the {ACTIVE,INACTIVE} -> * part has
> not.

Just a minor concern.

I noticed perf_put_aux_event() sets the ERROR state for sibling events
of an AUX event. IIUC, the AUX event is the group leader, so we only
need to clean up the cgroup pointer for the AUX event, and simply set
the ERROR state for its sibling events, correct?

> The condition:
> 
>   if (event->state > PERF_EVENT_STATE_OFF)
> 
> captures exactly that {ACTIVE,INACTIVE} that still needs the cgroup
> disable. Every other state is already a disabled state.
> 
> 
> Agreed?

Except above concern, otherwise, the conclusion makes sense to me.

> Also, I should probably stick most of all this in a comment somewhere.

Totally agree. The comment would be very useful!

Thanks,
Leo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 10:06       ` Yeoreum Yun
@ 2025-06-04 12:37         ` Peter Zijlstra
  2025-06-04 12:54           ` Yeoreum Yun
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04 12:37 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Leo Yan, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 11:06:51AM +0100, Yeoreum Yun wrote:

> >  - EXIT means task that the even was assigned to died, but child events
> >    still live, and further children can still be created. But the event
> >    itself will never be active again. It can only transition to
> >    {REVOKED,DEAD};
> 
> I have a slight quetions. after parent event set EXIT,
> Does EXIT event should be inherited?
> 
> for example
> 
>    parent task(0, ...) -> parent_event(0, parent_event:NULL)
>      ` child_task(1, parent:0) -> child_event(1, parent_event:0)
>          ` child_task(2, parent:1) -> child_event(2, parent_event:0)
> 
> In this case when parent task(0) is exited,
> parent->event will be set as EXIT state.
> 
> But suppose the child_task(2) try to fork (child_task3) and
> inherit the event (create child_event(3, parent_event:0),
> and at the fork, forking can observe the parent event state as "EXIT".
> In thie situation why child_event(3, parent_event:0) should be created for
> child_task(3)?

Yes. You set out to monitor the whole hierarchy, so any child created
after the first task should be monitored, until such time that you close
the event.

Notably, a fair number of daemons go about their business by explicitly
closing their original task in order to detach from the tty.

Also, per the context switch optimization the original event doesn't
need to stay with the original parent, it can end up on a random child.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 12:37         ` Peter Zijlstra
@ 2025-06-04 12:54           ` Yeoreum Yun
  0 siblings, 0 replies; 37+ messages in thread
From: Yeoreum Yun @ 2025-06-04 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Leo Yan, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

[...]

> > But suppose the child_task(2) try to fork (child_task3) and
> > inherit the event (create child_event(3, parent_event:0),
> > and at the fork, forking can observe the parent event state as "EXIT".
> > In thie situation why child_event(3, parent_event:0) should be created for
> > child_task(3)?
>
> Yes. You set out to monitor the whole hierarchy, so any child created
> after the first task should be monitored, until such time that you close
> the event.
>
> Notably, a fair number of daemons go about their business by explicitly
> closing their original task in order to detach from the tty.
>
> Also, per the context switch optimization the original event doesn't
> need to stay with the original parent, it can end up on a random child.

Ahh I overlooked daemon case. Thanks for clarification..!


--
Sincerely,
Yeoreum Yun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 10:18       ` Leo Yan
@ 2025-06-04 13:58         ` Peter Zijlstra
  2025-06-04 15:17           ` Leo Yan
  2025-06-11  9:29           ` [tip: perf/urgent] perf: Add comment to enum perf_event_state tip-bot2 for Peter Zijlstra
  2025-06-04 14:16         ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04 13:58 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 11:18:21AM +0100, Leo Yan wrote:

> Totally agree. The comment would be very useful!

I have the below. Let me go read your email in more than 2 seconds to
see if I should amend things.

I'm not sure doing the INACTIVE->INACTIVE cycle in the ASCII art is
going to make it clearer, might leave that off. But yeah, possible.

---

Subject: perf: Add comment to enum perf_event_state
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jun  4 10:21:38 CEST 2025

Better describe the event states.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |   42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -635,8 +635,46 @@ struct perf_addr_filter_range {
 	unsigned long			size;
 };
 
-/**
- * enum perf_event_state - the states of an event:
+/*
+ * The normal states are:
+ *
+ *            ACTIVE    --.
+ *               ^        |
+ *               |        |
+ *       sched_{in,out}() |
+ *               |        |
+ *               v        |
+ *      ,---> INACTIVE  --+ <-.
+ *      |                 |   |
+ *      |                {dis,en}able()
+ *   sched_in()           |   |
+ *      |       OFF    <--' --+
+ *      |                     |
+ *      `--->  ERROR    ------'
+ *
+ * That is:
+ *
+ * sched_in:       INACTIVE          -> {ACTIVE,ERROR}
+ * sched_out:      ACTIVE            -> INACTIVE
+ * disable:        {ACTIVE,INACTIVE} -> OFF
+ * enable:         {OFF,ERROR}       -> INACTIVE
+ *
+ * Where {OFF,ERROR} are disabled states.
+ *
+ * Then we have the {EXIT,REVOKED,DEAD} states which are various shades of
+ * defunct events:
+ *
+ *  - EXIT means task that the even was assigned to died, but child events
+ *    still live, and further children can still be created. But the event
+ *    itself will never be active again. It can only transition to
+ *    {REVOKED,DEAD};
+ *
+ *  - REVOKED means the PMU the event was associated with is gone; all
+ *    functionality is stopped but the event is still alive. Can only
+ *    transition to DEAD;
+ *
+ *  - DEAD event really is DYING tearing down state and freeing bits.
+ *
  */
 enum perf_event_state {
 	PERF_EVENT_STATE_DEAD		= -5,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 10:18       ` Leo Yan
  2025-06-04 13:58         ` Peter Zijlstra
@ 2025-06-04 14:16         ` Peter Zijlstra
  2025-06-04 15:46           ` Leo Yan
  2025-06-05 11:41           ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04 14:16 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 11:18:21AM +0100, Leo Yan wrote:
> On Wed, Jun 04, 2025 at 10:03:39AM +0200, Peter Zijlstra wrote:

> > And now we have the sitation that __perf_remove_from_context() can do:
> > 
> >   {ACTIVE,INACTIVE,OFF,ERROR} -> {OFF,EXIT,REVOKED,DEAD}
> 
> A detailed transition is:
> 
>   Case 1: {ACTIVE} -> {INACTIVE} -> {OFF,EXIT,REVOKED,DEAD}

It can also start with INACTIVE, but yeah..
 
>   Case 2: {ERROR} -> {ERROR,EXIT,REVOKED,DEAD}
>   Case 3: {OFF} -> {OFF,EXIT,REVOKED,DEAD}

> 
> > Where the {OFF,ERROR} -> * transition already have
> > perf_cgroup_event_disable(), but the {ACTIVE,INACTIVE} -> * part has
> > not.
> 
> Just a minor concern.
> 
> I noticed perf_put_aux_event() sets the ERROR state for sibling events
> of an AUX event. 

There is also perf_remove_sibling_event(), which can cause ERROR.

> IIUC, the AUX event is the group leader, so we only
> need to clean up the cgroup pointer for the AUX event, and simply set
> the ERROR state for its sibling events, correct?

Not sure; I forever forget how the AUX events are supposed to work :-/

It might be prudent to do something like so:


diff --git a/kernel/events/core.c b/kernel/events/core.c
index f34c99f8ce8f..e6583747838a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2149,8 +2149,9 @@ perf_aux_output_match(struct perf_event *event, struct perf_event *aux_event)
 }
 
 static void put_event(struct perf_event *event);
-static void event_sched_out(struct perf_event *event,
-			    struct perf_event_context *ctx);
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state);
 
 static void perf_put_aux_event(struct perf_event *event)
 {
@@ -2183,8 +2184,7 @@ static void perf_put_aux_event(struct perf_event *event)
 		 * state so that we don't try to schedule it again. Note
 		 * that perf_event_enable() will clear the ERROR status.
 		 */
-		event_sched_out(iter, ctx);
-		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+		__event_disable(event, ctx, PERF_EVENT_STATE_ERROR);
 	}
 }
 
@@ -2242,18 +2242,6 @@ static inline struct list_head *get_event_list(struct perf_event *event)
 				    &event->pmu_ctx->flexible_active;
 }
 
-/*
- * Events that have PERF_EV_CAP_SIBLING require being part of a group and
- * cannot exist on their own, schedule them out and move them into the ERROR
- * state. Also see _perf_event_enable(), it will not be able to recover
- * this ERROR state.
- */
-static inline void perf_remove_sibling_event(struct perf_event *event)
-{
-	event_sched_out(event, event->ctx);
-	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-}
-
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *leader = event->group_leader;
@@ -2289,8 +2277,15 @@ static void perf_group_detach(struct perf_event *event)
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
 
+		/*
+		 * Events that have PERF_EV_CAP_SIBLING require being part of
+		 * a group and cannot exist on their own, schedule them out
+		 * and move them into the ERROR state. Also see
+		 * _perf_event_enable(), it will not be able to recover this
+		 * ERROR state.
+		 */
 		if (sibling->event_caps & PERF_EV_CAP_SIBLING)
-			perf_remove_sibling_event(sibling);
+			__event_disable(sibling, ctx, PERF_EVENT_STATE_ERROR);
 
 		sibling->group_leader = sibling;
 		list_del_init(&sibling->sibling_list);
@@ -2562,6 +2557,19 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla
 	event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state)
+{
+	if (event == event->group_leader)
+		group_sched_out(event, ctx);
+	else
+		event_sched_out(event, ctx);
+
+	perf_event_set_state(event, state);
+	perf_cgroup_event_disable(event, ctx);
+}
+
 /*
  * Cross CPU call to disable a performance event
  */
@@ -2575,15 +2583,7 @@ static void __perf_event_disable(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu_ctx->pmu);
 	ctx_time_update_event(ctx, event);
-
-	if (event == event->group_leader)
-		group_sched_out(event, ctx);
-	else
-		event_sched_out(event, ctx);
-
-	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
-	perf_cgroup_event_disable(event, ctx);
-
+	__event_disable(event, ctx, PERF_EVENT_STATE_OFF);
 	perf_pmu_enable(event->pmu_ctx->pmu);
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 13:58         ` Peter Zijlstra
@ 2025-06-04 15:17           ` Leo Yan
  2025-06-11  9:29           ` [tip: perf/urgent] perf: Add comment to enum perf_event_state tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Leo Yan @ 2025-06-04 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 03:58:01PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 04, 2025 at 11:18:21AM +0100, Leo Yan wrote:
> 
> > Totally agree. The comment would be very useful!
> 
> I have the below. Let me go read your email in more than 2 seconds to
> see if I should amend things.
> 
> I'm not sure doing the INACTIVE->INACTIVE cycle in the ASCII art is
> going to make it clearer, might leave that off. But yeah, possible.
> 
> ---
> 
> Subject: perf: Add comment to enum perf_event_state
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Jun  4 10:21:38 CEST 2025
> 
> Better describe the event states.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/perf_event.h |   42 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 40 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -635,8 +635,46 @@ struct perf_addr_filter_range {
>  	unsigned long			size;
>  };
>  
> -/**
> - * enum perf_event_state - the states of an event:
> +/*
> + * The normal states are:
> + *
> + *            ACTIVE    --.
> + *               ^        |
> + *               |        |
> + *       sched_{in,out}() |
> + *               |        |
> + *               v        |
> + *      ,---> INACTIVE  --+ <-.
> + *      |                 |   |
> + *      |                {dis,en}able()
> + *   sched_in()           |   |
> + *      |       OFF    <--' --+
> + *      |                     |
> + *      `--->  ERROR    ------'
> + *
> + * That is:
> + *
> + * sched_in:       INACTIVE          -> {ACTIVE,ERROR}
> + * sched_out:      ACTIVE            -> INACTIVE
> + * disable:        {ACTIVE,INACTIVE} -> OFF
> + * enable:         {OFF,ERROR}       -> INACTIVE
> + *
> + * Where {OFF,ERROR} are disabled states.
> + *
> + * Then we have the {EXIT,REVOKED,DEAD} states which are various shades of
> + * defunct events:
> + *
> + *  - EXIT means task that the even was assigned to died, but child events
> + *    still live, and further children can still be created. But the event
> + *    itself will never be active again. It can only transition to
> + *    {REVOKED,DEAD};
> + *
> + *  - REVOKED means the PMU the event was associated with is gone; all
> + *    functionality is stopped but the event is still alive. Can only
> + *    transition to DEAD;
> + *
> + *  - DEAD event really is DYING tearing down state and freeing bits.
> + *

LGTM:

Reviewed-by: Leo Yan <leo.yan@arm.com>

>   */
>  enum perf_event_state {
>  	PERF_EVENT_STATE_DEAD		= -5,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 14:16         ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
@ 2025-06-04 15:46           ` Leo Yan
  2025-06-04 15:59             ` Peter Zijlstra
  2025-06-05 11:29             ` Peter Zijlstra
  2025-06-05 11:41           ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Leo Yan @ 2025-06-04 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 04:16:40PM +0200, Peter Zijlstra wrote:

[...]

> It might be prudent to do something like so:

Thanks for the patch.

> +static void __event_disable(struct perf_event *event,
> +			    struct perf_event_context *ctx,
> +			    enum perf_event_state state)
> +{
> +	if (event == event->group_leader)
> +		group_sched_out(event, ctx);

I am a bit struggle for this code line. It disables all events in a
group, but only clear cgroup pointer for group leader but miss to clear
for sibling events.

If the cgroup pointer is only used for group leader, maybe we only
maintain (set and clear) the cgroup pointer for the leader?

Thanks,
Leo

> +	else
> +		event_sched_out(event, ctx);
> +
> +	perf_event_set_state(event, state);
> +	perf_cgroup_event_disable(event, ctx);
> +}
> +
>  /*
>   * Cross CPU call to disable a performance event
>   */
> @@ -2575,15 +2583,7 @@ static void __perf_event_disable(struct perf_event *event,
>  
>  	perf_pmu_disable(event->pmu_ctx->pmu);
>  	ctx_time_update_event(ctx, event);
> -
> -	if (event == event->group_leader)
> -		group_sched_out(event, ctx);
> -	else
> -		event_sched_out(event, ctx);
> -
> -	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> -	perf_cgroup_event_disable(event, ctx);
> -
> +	__event_disable(event, ctx, PERF_EVENT_STATE_OFF);
>  	perf_pmu_enable(event->pmu_ctx->pmu);
>  }
>  

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 15:46           ` Leo Yan
@ 2025-06-04 15:59             ` Peter Zijlstra
  2025-06-05 11:29             ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-04 15:59 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 04:46:39PM +0100, Leo Yan wrote:
> On Wed, Jun 04, 2025 at 04:16:40PM +0200, Peter Zijlstra wrote:
> 
> [...]
> 
> > It might be prudent to do something like so:
> 
> Thanks for the patch.
> 
> > +static void __event_disable(struct perf_event *event,
> > +			    struct perf_event_context *ctx,
> > +			    enum perf_event_state state)
> > +{
> > +	if (event == event->group_leader)
> > +		group_sched_out(event, ctx);
> 
> I am a bit struggle for this code line. It disables all events in a
> group, but only clear cgroup pointer for group leader but miss to clear
> for sibling events.
> 
> If the cgroup pointer is only used for group leader, maybe we only
> maintain (set and clear) the cgroup pointer for the leader?

Hmm, so yeah, that is weird indeed.

So perf_cgroup_event_enable() is called in list_add_event(), which is
perf_install_in_context() and that should be every single event.

So yeah, I'm thinking we're having more bugs here still :-(


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 15:46           ` Leo Yan
  2025-06-04 15:59             ` Peter Zijlstra
@ 2025-06-05 11:29             ` Peter Zijlstra
  2025-06-05 12:33               ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-05 11:29 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 04:46:39PM +0100, Leo Yan wrote:
> On Wed, Jun 04, 2025 at 04:16:40PM +0200, Peter Zijlstra wrote:
> 
> [...]
> 
> > It might be prudent to do something like so:
> 
> Thanks for the patch.
> 
> > +static void __event_disable(struct perf_event *event,
> > +			    struct perf_event_context *ctx,
> > +			    enum perf_event_state state)
> > +{
> > +	if (event == event->group_leader)
> > +		group_sched_out(event, ctx);
> 
> I am a bit struggle for this code line. It disables all events in a
> group, but only clear cgroup pointer for group leader but miss to clear
> for sibling events.

What happens is that perf_event_disable() will only mark the group
leader as OFF. By having the group leader marked OFF, the whole group
becomes ineligible to run. But the siblings are still INACTIVE and thus
don't need to get perf_cgroup_event_disable() called.

perf_event_enable() does the symmetric thing, it only marks the group
leader INACTIVE (and then tries to schedule it, possibly resulting in
ACTIVE).

Now, it is entirely possible to call {dis,en}able() on siblings, in
which case only the sibling gets marked OFF and gets scheduled out, but
the rest of the group can continue functioning.

So I think this was indeed correct.

> If the cgroup pointer is only used for group leader, maybe we only
> maintain (set and clear) the cgroup pointer for the leader?

And the patch re-used this function on siblings, do it would never hit
this group_sched_out() case.

But yes, slightly confusing. Let me see if I can make a less confusing
patch, and if not, sprinkle comments.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-04 14:16         ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
  2025-06-04 15:46           ` Leo Yan
@ 2025-06-05 11:41           ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-05 11:41 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Wed, Jun 04, 2025 at 04:16:40PM +0200, Peter Zijlstra wrote:

Alexander, I just noticed:

> @@ -2183,8 +2184,7 @@ static void perf_put_aux_event(struct perf_event *event)
>  		 * state so that we don't try to schedule it again. Note
>  		 * that perf_event_enable() will clear the ERROR status.
>  		 */
> -		event_sched_out(iter, ctx);
> -		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);

That this is scheduling out the sibling, and marking the group leader
ERROR; surely this is a mistake?

> +		__event_disable(event, ctx, PERF_EVENT_STATE_ERROR);
>  	}
>  }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-05 11:29             ` Peter Zijlstra
@ 2025-06-05 12:33               ` Peter Zijlstra
  2025-06-05 17:21                 ` Leo Yan
  2025-06-11  9:29                 ` [tip: perf/urgent] perf: Fix cgroup state vs ERROR tip-bot2 for Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2025-06-05 12:33 UTC (permalink / raw)
  To: Leo Yan
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Thu, Jun 05, 2025 at 01:29:21PM +0200, Peter Zijlstra wrote:

> But yes, slightly confusing. Let me see if I can make a less confusing
> patch, and if not, sprinkle comments.

I've settled on the below.

---
Subject: perf: Fix cgroup state vs ERROR
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Jun 5 12:37:11 CEST 2025

While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.

Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().

Unify the disable paths to ensure this.

Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   51 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2149,8 +2149,9 @@ perf_aux_output_match(struct perf_event
 }
 
 static void put_event(struct perf_event *event);
-static void event_sched_out(struct perf_event *event,
-			    struct perf_event_context *ctx);
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state);
 
 static void perf_put_aux_event(struct perf_event *event)
 {
@@ -2183,8 +2184,7 @@ static void perf_put_aux_event(struct pe
 		 * state so that we don't try to schedule it again. Note
 		 * that perf_event_enable() will clear the ERROR status.
 		 */
-		event_sched_out(iter, ctx);
-		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+		__event_disable(iter, ctx, PERF_EVENT_STATE_ERROR);
 	}
 }
 
@@ -2242,18 +2242,6 @@ static inline struct list_head *get_even
 				    &event->pmu_ctx->flexible_active;
 }
 
-/*
- * Events that have PERF_EV_CAP_SIBLING require being part of a group and
- * cannot exist on their own, schedule them out and move them into the ERROR
- * state. Also see _perf_event_enable(), it will not be able to recover
- * this ERROR state.
- */
-static inline void perf_remove_sibling_event(struct perf_event *event)
-{
-	event_sched_out(event, event->ctx);
-	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-}
-
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *leader = event->group_leader;
@@ -2289,8 +2277,15 @@ static void perf_group_detach(struct per
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
 
+		/*
+		 * Events that have PERF_EV_CAP_SIBLING require being part of
+		 * a group and cannot exist on their own, schedule them out
+		 * and move them into the ERROR state. Also see
+		 * _perf_event_enable(), it will not be able to recover this
+		 * ERROR state.
+		 */
 		if (sibling->event_caps & PERF_EV_CAP_SIBLING)
-			perf_remove_sibling_event(sibling);
+			__event_disable(sibling, ctx, PERF_EVENT_STATE_ERROR);
 
 		sibling->group_leader = sibling;
 		list_del_init(&sibling->sibling_list);
@@ -2562,6 +2557,15 @@ static void perf_remove_from_context(str
 	event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state)
+{
+	event_sched_out(event, ctx);
+	perf_cgroup_event_disable(event, ctx);
+	perf_event_set_state(event, state);
+}
+
 /*
  * Cross CPU call to disable a performance event
  */
@@ -2576,13 +2580,18 @@ static void __perf_event_disable(struct
 	perf_pmu_disable(event->pmu_ctx->pmu);
 	ctx_time_update_event(ctx, event);
 
+	/*
+	 * When disabling a group leader, the whole group becomes ineligible
+	 * to run, so schedule out the full group.
+	 */
 	if (event == event->group_leader)
 		group_sched_out(event, ctx);
-	else
-		event_sched_out(event, ctx);
 
-	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
-	perf_cgroup_event_disable(event, ctx);
+	/*
+	 * But only mark the leader OFF; the siblings will remain
+	 * INACTIVE.
+	 */
+	__event_disable(event, ctx, PERF_EVENT_STATE_OFF);
 
 	perf_pmu_enable(event->pmu_ctx->pmu);
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx
  2025-06-05 12:33               ` Peter Zijlstra
@ 2025-06-05 17:21                 ` Leo Yan
  2025-06-11  9:29                 ` [tip: perf/urgent] perf: Fix cgroup state vs ERROR tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Leo Yan @ 2025-06-05 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yeoreum Yun, mingo, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	linux-perf-users, linux-kernel, David Wang

On Thu, Jun 05, 2025 at 02:33:43PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 05, 2025 at 01:29:21PM +0200, Peter Zijlstra wrote:
> 
> > But yes, slightly confusing. Let me see if I can make a less confusing
> > patch, and if not, sprinkle comments.
> 
> I've settled on the below.
> 
> ---
> Subject: perf: Fix cgroup state vs ERROR
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu Jun 5 12:37:11 CEST 2025
> 
> While chasing down a missing perf_cgroup_event_disable() elsewhere,
> Leo Yan found that both perf_put_aux_event() and
> perf_remove_sibling_event() were also missing one.
> 
> Specifically, the rule is that events that switch to OFF,ERROR need to
> call perf_cgroup_event_disable().
> 
> Unify the disable paths to ensure this.
> 
> Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
> Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
> Reported-by: Leo Yan <leo.yan@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/events/core.c |   51 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 21 deletions(-)
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2149,8 +2149,9 @@ perf_aux_output_match(struct perf_event
>  }
>  
>  static void put_event(struct perf_event *event);
> -static void event_sched_out(struct perf_event *event,
> -			    struct perf_event_context *ctx);
> +static void __event_disable(struct perf_event *event,
> +			    struct perf_event_context *ctx,
> +			    enum perf_event_state state);
>  
>  static void perf_put_aux_event(struct perf_event *event)
>  {
> @@ -2183,8 +2184,7 @@ static void perf_put_aux_event(struct pe
>  		 * state so that we don't try to schedule it again. Note
>  		 * that perf_event_enable() will clear the ERROR status.
>  		 */
> -		event_sched_out(iter, ctx);
> -		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> +		__event_disable(iter, ctx, PERF_EVENT_STATE_ERROR);
>  	}
>  }
>  
> @@ -2242,18 +2242,6 @@ static inline struct list_head *get_even
>  				    &event->pmu_ctx->flexible_active;
>  }
>  
> -/*
> - * Events that have PERF_EV_CAP_SIBLING require being part of a group and
> - * cannot exist on their own, schedule them out and move them into the ERROR
> - * state. Also see _perf_event_enable(), it will not be able to recover
> - * this ERROR state.
> - */
> -static inline void perf_remove_sibling_event(struct perf_event *event)
> -{
> -	event_sched_out(event, event->ctx);
> -	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> -}
> -
>  static void perf_group_detach(struct perf_event *event)
>  {
>  	struct perf_event *leader = event->group_leader;
> @@ -2289,8 +2277,15 @@ static void perf_group_detach(struct per
>  	 */
>  	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
>  
> +		/*
> +		 * Events that have PERF_EV_CAP_SIBLING require being part of
> +		 * a group and cannot exist on their own, schedule them out
> +		 * and move them into the ERROR state. Also see
> +		 * _perf_event_enable(), it will not be able to recover this
> +		 * ERROR state.
> +		 */
>  		if (sibling->event_caps & PERF_EV_CAP_SIBLING)
> -			perf_remove_sibling_event(sibling);
> +			__event_disable(sibling, ctx, PERF_EVENT_STATE_ERROR);
>  
>  		sibling->group_leader = sibling;
>  		list_del_init(&sibling->sibling_list);
> @@ -2562,6 +2557,15 @@ static void perf_remove_from_context(str
>  	event_function_call(event, __perf_remove_from_context, (void *)flags);
>  }
>  
> +static void __event_disable(struct perf_event *event,
> +			    struct perf_event_context *ctx,
> +			    enum perf_event_state state)
> +{
> +	event_sched_out(event, ctx);
> +	perf_cgroup_event_disable(event, ctx);
> +	perf_event_set_state(event, state);
> +}
> +
>  /*
>   * Cross CPU call to disable a performance event
>   */
> @@ -2576,13 +2580,18 @@ static void __perf_event_disable(struct
>  	perf_pmu_disable(event->pmu_ctx->pmu);
>  	ctx_time_update_event(ctx, event);
>  
> +	/*
> +	 * When disabling a group leader, the whole group becomes ineligible
> +	 * to run, so schedule out the full group.
> +	 */
>  	if (event == event->group_leader)
>  		group_sched_out(event, ctx);
> -	else
> -		event_sched_out(event, ctx);
>  
> -	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> -	perf_cgroup_event_disable(event, ctx);
> +	/*
> +	 * But only mark the leader OFF; the siblings will remain
> +	 * INACTIVE.
> +	 */
> +	__event_disable(event, ctx, PERF_EVENT_STATE_OFF);

Here, a group lead will invoke event_sched_out() twice: one is in
group_sched_out() (above) andin __event_disable(). This would be fine,
as the second call to event_sched_out() will directly bail out due to
the following condition:

  if (event->state != PERF_EVENT_STATE_ACTIVE)
      return;

I think you have already noticed this minor redundancy.

Reviewed-by: Leo Yan <leo.yan@arm.com>

And thanks for the explaination in your another reply, it makes sense to
me.

Leo

>  	perf_pmu_enable(event->pmu_ctx->pmu);
>  }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [tip: perf/urgent] perf: Add comment to enum perf_event_state
  2025-06-04 13:58         ` Peter Zijlstra
  2025-06-04 15:17           ` Leo Yan
@ 2025-06-11  9:29           ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-06-11  9:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Leo Yan, x86, linux-kernel

The following commit has been merged into the perf/urgent branch of tip:

Commit-ID:     49b393af3130c7712c7e8f215f4126c9a8060fa6
Gitweb:        https://git.kernel.org/tip/49b393af3130c7712c7e8f215f4126c9a8060fa6
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 04 Jun 2025 10:21:38 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 05 Jun 2025 14:37:53 +02:00

perf: Add comment to enum perf_event_state

Better describe the event states.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Link: https://lkml.kernel.org/r/20250604135801.GK38114@noisy.programming.kicks-ass.net
---
 include/linux/perf_event.h | 42 +++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 52dc7cf..ec9d960 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -635,8 +635,46 @@ struct perf_addr_filter_range {
 	unsigned long			size;
 };
 
-/**
- * enum perf_event_state - the states of an event:
+/*
+ * The normal states are:
+ *
+ *            ACTIVE    --.
+ *               ^        |
+ *               |        |
+ *       sched_{in,out}() |
+ *               |        |
+ *               v        |
+ *      ,---> INACTIVE  --+ <-.
+ *      |                 |   |
+ *      |                {dis,en}able()
+ *   sched_in()           |   |
+ *      |       OFF    <--' --+
+ *      |                     |
+ *      `--->  ERROR    ------'
+ *
+ * That is:
+ *
+ * sched_in:       INACTIVE          -> {ACTIVE,ERROR}
+ * sched_out:      ACTIVE            -> INACTIVE
+ * disable:        {ACTIVE,INACTIVE} -> OFF
+ * enable:         {OFF,ERROR}       -> INACTIVE
+ *
+ * Where {OFF,ERROR} are disabled states.
+ *
+ * Then we have the {EXIT,REVOKED,DEAD} states which are various shades of
+ * defunct events:
+ *
+ *  - EXIT means task that the even was assigned to died, but child events
+ *    still live, and further children can still be created. But the event
+ *    itself will never be active again. It can only transition to
+ *    {REVOKED,DEAD};
+ *
+ *  - REVOKED means the PMU the event was associated with is gone; all
+ *    functionality is stopped but the event is still alive. Can only
+ *    transition to DEAD;
+ *
+ *  - DEAD event really is DYING tearing down state and freeing bits.
+ *
  */
 enum perf_event_state {
 	PERF_EVENT_STATE_DEAD		= -5,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [tip: perf/urgent] perf: Fix cgroup state vs ERROR
  2025-06-05 12:33               ` Peter Zijlstra
  2025-06-05 17:21                 ` Leo Yan
@ 2025-06-11  9:29                 ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-06-11  9:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Leo Yan, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the perf/urgent branch of tip:

Commit-ID:     61988e36dc5457cdff7ae7927e8d9ad1419ee998
Gitweb:        https://git.kernel.org/tip/61988e36dc5457cdff7ae7927e8d9ad1419ee998
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 05 Jun 2025 12:37:11 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 05 Jun 2025 14:37:52 +02:00

perf: Fix cgroup state vs ERROR

While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.

Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().

Unify the disable paths to ensure this.

Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
---
 kernel/events/core.c | 51 +++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 26dec1d..1cc98b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2149,8 +2149,9 @@ perf_aux_output_match(struct perf_event *event, struct perf_event *aux_event)
 }
 
 static void put_event(struct perf_event *event);
-static void event_sched_out(struct perf_event *event,
-			    struct perf_event_context *ctx);
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state);
 
 static void perf_put_aux_event(struct perf_event *event)
 {
@@ -2183,8 +2184,7 @@ static void perf_put_aux_event(struct perf_event *event)
 		 * state so that we don't try to schedule it again. Note
 		 * that perf_event_enable() will clear the ERROR status.
 		 */
-		event_sched_out(iter, ctx);
-		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
+		__event_disable(iter, ctx, PERF_EVENT_STATE_ERROR);
 	}
 }
 
@@ -2242,18 +2242,6 @@ static inline struct list_head *get_event_list(struct perf_event *event)
 				    &event->pmu_ctx->flexible_active;
 }
 
-/*
- * Events that have PERF_EV_CAP_SIBLING require being part of a group and
- * cannot exist on their own, schedule them out and move them into the ERROR
- * state. Also see _perf_event_enable(), it will not be able to recover
- * this ERROR state.
- */
-static inline void perf_remove_sibling_event(struct perf_event *event)
-{
-	event_sched_out(event, event->ctx);
-	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-}
-
 static void perf_group_detach(struct perf_event *event)
 {
 	struct perf_event *leader = event->group_leader;
@@ -2289,8 +2277,15 @@ static void perf_group_detach(struct perf_event *event)
 	 */
 	list_for_each_entry_safe(sibling, tmp, &event->sibling_list, sibling_list) {
 
+		/*
+		 * Events that have PERF_EV_CAP_SIBLING require being part of
+		 * a group and cannot exist on their own, schedule them out
+		 * and move them into the ERROR state. Also see
+		 * _perf_event_enable(), it will not be able to recover this
+		 * ERROR state.
+		 */
 		if (sibling->event_caps & PERF_EV_CAP_SIBLING)
-			perf_remove_sibling_event(sibling);
+			__event_disable(sibling, ctx, PERF_EVENT_STATE_ERROR);
 
 		sibling->group_leader = sibling;
 		list_del_init(&sibling->sibling_list);
@@ -2562,6 +2557,15 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla
 	event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
+static void __event_disable(struct perf_event *event,
+			    struct perf_event_context *ctx,
+			    enum perf_event_state state)
+{
+	event_sched_out(event, ctx);
+	perf_cgroup_event_disable(event, ctx);
+	perf_event_set_state(event, state);
+}
+
 /*
  * Cross CPU call to disable a performance event
  */
@@ -2576,13 +2580,18 @@ static void __perf_event_disable(struct perf_event *event,
 	perf_pmu_disable(event->pmu_ctx->pmu);
 	ctx_time_update_event(ctx, event);
 
+	/*
+	 * When disabling a group leader, the whole group becomes ineligible
+	 * to run, so schedule out the full group.
+	 */
 	if (event == event->group_leader)
 		group_sched_out(event, ctx);
-	else
-		event_sched_out(event, ctx);
 
-	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
-	perf_cgroup_event_disable(event, ctx);
+	/*
+	 * But only mark the leader OFF; the siblings will remain
+	 * INACTIVE.
+	 */
+	__event_disable(event, ctx, PERF_EVENT_STATE_OFF);
 
 	perf_pmu_enable(event->pmu_ctx->pmu);
 }

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-06-11  9:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-02 18:40 [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Yeoreum Yun
2025-06-03  2:01 ` David Wang
2025-06-03  4:46   ` [PATCH " Yeoreum Yun
2025-06-03  5:44     ` David Wang
2025-06-03  6:34       ` Yeoreum Yun
2025-06-03  6:39         ` Yeoreum Yun
2025-06-03  6:47           ` David Wang
2025-06-03  6:42         ` David Wang
2025-06-03  7:16           ` Yeoreum Yun
2025-06-03  7:31             ` David Wang
2025-06-03  8:15             ` David Wang
2025-06-03  6:54         ` David Wang
2025-06-03  9:20         ` Yeoreum Yun
2025-06-03 10:08           ` David Wang
2025-06-03 13:41             ` Yeoreum Yun
2025-06-03 14:02               ` David Wang
2025-06-03 14:00 ` Leo Yan
2025-06-03 14:44   ` Peter Zijlstra
2025-06-03 15:17     ` Yeoreum Yun
2025-06-04  7:06       ` Peter Zijlstra
2025-06-04  8:03     ` Peter Zijlstra
2025-06-04 10:06       ` Yeoreum Yun
2025-06-04 12:37         ` Peter Zijlstra
2025-06-04 12:54           ` Yeoreum Yun
2025-06-04 10:18       ` Leo Yan
2025-06-04 13:58         ` Peter Zijlstra
2025-06-04 15:17           ` Leo Yan
2025-06-11  9:29           ` [tip: perf/urgent] perf: Add comment to enum perf_event_state tip-bot2 for Peter Zijlstra
2025-06-04 14:16         ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
2025-06-04 15:46           ` Leo Yan
2025-06-04 15:59             ` Peter Zijlstra
2025-06-05 11:29             ` Peter Zijlstra
2025-06-05 12:33               ` Peter Zijlstra
2025-06-05 17:21                 ` Leo Yan
2025-06-11  9:29                 ` [tip: perf/urgent] perf: Fix cgroup state vs ERROR tip-bot2 for Peter Zijlstra
2025-06-05 11:41           ` [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx Peter Zijlstra
2025-06-03 15:05   ` Yeoreum Yun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).