Possible race between CPU hotplug and perf_pmu_migrate

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Possible race between CPU hotplug and perf_pmu_migrate_context
@ 2014-09-01 18:18 Mark Rutland
  2014-09-01 19:05 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2014-09-01 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Yan, Zheng, Stephane Eranian, Ingo Molnar

Hi all,

While trying some rework of the ARM CCI PMU driver on v3.17-rc2, I
encountered what seems to be a race between CPU hotplug and perf event
context migration, which results in a BUG in mm/slub.c.

It looks like this is a generic issue as I'm able to cause the same
splat with the uncore_imc driver on a Haswell machine (on v3.16.1 at
least):

[   66.621306] ------------[ cut here ]------------
[   66.625933] kernel BUG at mm/slub.c:3380!
[   66.629947] invalid opcode: 0000 [#1] SMP
[   66.634101] Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) x86_pkg_temp_thermal
[   66.643476] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O  3.16.1-uncore-pmu-test #2
[   66.653132] Hardware name: LENOVO 10A6A03EUK/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[   66.660530] task: ffff88040b584f50 ti: ffff88040b5d4000 task.ti: ffff88040b5d4000
[   66.668009] RIP: 0010:[<ffffffff8114a443>]  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.675615] RSP: 0018:ffff88041dc43ea8  EFLAGS: 00010246
[   66.680930] RAX: 0200000000000400 RBX: ffff88041dc18100 RCX: 00000000000000c8
[   66.688066] RDX: 0200000000000000 RSI: ffff8800db601800 RDI: ffff88041dc18100
[   66.695202] RBP: ffff88041dc43ec0 R08: 00000000000156e0 R09: ffff88041dc556e0
[   66.702334] R10: ffffea0010770600 R11: ffffea00036d8000 R12: ffffffff81c3dec0
[   66.709472] R13: ffffffff8109dd33 R14: ffff880409b96b08 R15: 0000000000000006
[   66.716607] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
[   66.724697] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.730443] CR2: 00007fae8a93b000 CR3: 00000000dc962000 CR4: 00000000001407e0
[   66.737580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   66.744714] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   66.751852] Stack:
[   66.753873]  ffff88041dc4d300 ffffffff81c3dec0 000000000000000a ffff88041dc43f20
[   66.761371]  ffffffff8109dd33 ffff8800db600500 ffff88040b584f50 ffff88040b5d7fd8
[   66.768873]  ffff88041dc4d328 0000000000000000 0000000000000009 ffffffff81c090c8
[   66.776371] Call Trace:
[   66.778823]  <IRQ>
[   66.780759]  [<ffffffff8109dd33>] rcu_process_callbacks+0x1e3/0x540
[   66.787254]  [<ffffffff8104e70e>] __do_softirq+0xee/0x280
[   66.792654]  [<ffffffff8104eaad>] irq_exit+0x9d/0xb0
[   66.797625]  [<ffffffff81032b4f>] smp_apic_timer_interrupt+0x3f/0x50
[   66.803982]  [<ffffffff817de68a>] apic_timer_interrupt+0x6a/0x70
[   66.809994]  <EOI>
[   66.811926]  [<ffffffff81590ce7>] ? cpuidle_enter_state+0x47/0xc0
[   66.818250]  [<ffffffff81590e12>] cpuidle_enter+0x12/0x20
[   66.823650]  [<ffffffff81086aa6>] cpu_startup_entry+0x256/0x3f0
[   66.829572]  [<ffffffff81030d82>] start_secondary+0x192/0x200
[   66.835319] Code: 49 8b 02 31 f6 f6 c4 40 74 04 41 8b 72 68 4c 89 d7 e8 92 ed fb ff eb 93 4c 8b 50 30 48 8b 10 80 e6 80 4c 0f 44 d0 e9 36 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 c7 c0 ea ff ff ff
[   66.855859] RIP  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.861113]  RSP <ffff88041dc43ea8>
[   66.864617] ---[ end trace 825fa0ba52ca10eb ]---
[   66.869240] Kernel panic - not syncing: Fatal exception in interrupt
[   66.875616] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[   66.885791] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Has anything seen anything like this before? Is this a known issue?

I'm testing by opening and closing uncore/system PMU events while
hotplugging CPUs to force migration. I run a few instances of the
following program and script in parallel (please forgive the hardcoded
numbers).

Thanks,
Mark.

---->8----
#include <errno.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

#define PMU_TYPE	6	/* uncore_imc */
#define PMU_EVENT	1	/* data_read */

struct perf_event_attr attr = {
	.type = PMU_TYPE,
	.config = PMU_EVENT,
	.size = sizeof(attr),
};

int main(int argc, char *argv[])
{
	while (1) {
		int ret = perf_event_open(&attr, -1, 0, -1, 0);
		if (ret < 0) {
			fprintf(stderr, "Unable to open event: %d (%d)\n", ret, errno);
			return ret;
		}
		close(ret);
	}

	return 0;
}
----8<----

---->8----
#!/bin/sh

MAX_CPU=7

while true; do
	for i in $(seq 0 ${MAX_CPU}); do
		echo 0 > /sys/devices/system/cpu/cpu${i}/online;
		echo 1 > /sys/devices/system/cpu/cpu${i}/online;
	done
done
----8<----


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-01 18:18 Possible race between CPU hotplug and perf_pmu_migrate_context Mark Rutland
@ 2014-09-01 19:05 ` Peter Zijlstra
  2014-09-02 18:58   ` Mark Rutland
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2014-09-01 19:05 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-kernel, Yan, Zheng, Stephane Eranian, Ingo Molnar

On Mon, Sep 01, 2014 at 07:18:08PM +0100, Mark Rutland wrote:
> Hi all,

> [   66.780759]  [<ffffffff8109dd33>] rcu_process_callbacks+0x1e3/0x540

> Has anything seen anything like this before? Is this a known issue?

I've not seen it reported.. sounds like 'fun' though.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-01 19:05 ` Peter Zijlstra
@ 2014-09-02 18:58   ` Mark Rutland
  2014-09-03 11:50     ` Mark Rutland
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2014-09-02 18:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, Yan@leverpostej.cambridge.arm.com,
	Zheng, Stephane Eranian, Ingo Molnar

On Mon, Sep 01, 2014 at 08:05:34PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 01, 2014 at 07:18:08PM +0100, Mark Rutland wrote:
> > Hi all,
> 
> > [   66.780759]  [<ffffffff8109dd33>] rcu_process_callbacks+0x1e3/0x540
> 
> > Has anything seen anything like this before? Is this a known issue?
> 
> I've not seen it reported.. sounds like 'fun' though.
> 

This has been a tremendous source of 'fun' so far...

The rcu_process_callbacks line is a red herring. What seems to be
happening is:

A CPU goes down, and perf_pmu_migrate_context removes all events from
per_cpu_ptr(pmu->pmu_cpu_context, src_cpu)->ctx. The events are in a
state of limbo, with their ctx pointers pointing at the old context,
whose refcount is 1. The src_ctx->mutex is unlocked.

Concurrently on another CPU the fds are closed, and perf_event_release
goes and removes each event from their event->ctx. We skip the double
detach in list_del_event and carry on to __free_event where we put_ctx
the old context for a second time for each event. The refcount goes to 0
and we queue a kfree_rcu of the context (inside the PMU's percpu
perf_event_cpu_context, allocated with alloc_percpu).

We run the queued kfree_rcu, and explode trying to kfree something we
didn't k*alloc. I'm not sure when exactly we run the queued kfree_rcu
w.r.t. everything else.

So the problem here seems to be a race between the
perf_pmu_migrate_context and something down the perf_event_release
callchain.

Any ideas?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-02 18:58   ` Mark Rutland
@ 2014-09-03 11:50     ` Mark Rutland
  2014-09-04 10:44       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2014-09-03 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, Yan Zheng, Stephane Eranian,
	Ingo Molnar, Vince Weaver

Hi all,

Further to my earlier reply I've come up with a potential fix below,
which has survived my stress test for both my WIP driver and the intel
uncore imc driver.

As it's impossible to synchronize with the event->ctx I'd hoped it would
be possible to synchronize with a field on the event itself.
Unfortunately all I managed to come up with were some shiny new ABBA
deadlocks.

Instead I've followed the example set by perf_event_open and inhibited
CPU hotplug for the portion of put_event that removes an event from its
context, which will prevent perf_pmu_migrate_context from modifying
event->ctx under our feet.

While there's the potential to starve CPU hotplug, that's already the
case for the perf_event_open path, so I'm not sure if that's a big deal
or not.

Thoughts?

Mark.

---->8----
>From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Wed, 3 Sep 2014 11:06:22 +0100
Subject: [PATCH] perf: prevent hotplug race on event->ctx

The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
Introduce perf_pmu_migrate_context()) didn't take the tear-down of
events into account, and left open a race with put_event on event->ctx.
A resulting duplicate put_ctx of an event's original context can lead to
the context being erroneously kfreed via RCU, resulting in the below
splat with the intel uncore_imc PMU driver:

[   66.621306] ------------[ cut here ]------------
[   66.625933] kernel BUG at mm/slub.c:3380!
[   66.629947] invalid opcode: 0000 [#1] SMP
[   66.634101] Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) x86_pkg_temp_thermal
[   66.643476] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O  3.16.1-uncore-pmu-test #2
[   66.653132] Hardware name: LENOVO 10A6A03EUK/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[   66.660530] task: ffff88040b584f50 ti: ffff88040b5d4000 task.ti: ffff88040b5d4000
[   66.668009] RIP: 0010:[<ffffffff8114a443>]  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.675615] RSP: 0018:ffff88041dc43ea8  EFLAGS: 00010246
[   66.680930] RAX: 0200000000000400 RBX: ffff88041dc18100 RCX: 00000000000000c8
[   66.688066] RDX: 0200000000000000 RSI: ffff8800db601800 RDI: ffff88041dc18100
[   66.695202] RBP: ffff88041dc43ec0 R08: 00000000000156e0 R09: ffff88041dc556e0
[   66.702334] R10: ffffea0010770600 R11: ffffea00036d8000 R12: ffffffff81c3dec0
[   66.709472] R13: ffffffff8109dd33 R14: ffff880409b96b08 R15: 0000000000000006
[   66.716607] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
[   66.724697] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.730443] CR2: 00007fae8a93b000 CR3: 00000000dc962000 CR4: 00000000001407e0
[   66.737580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   66.744714] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   66.751852] Stack:
[   66.753873]  ffff88041dc4d300 ffffffff81c3dec0 000000000000000a ffff88041dc43f20
[   66.761371]  ffffffff8109dd33 ffff8800db600500 ffff88040b584f50 ffff88040b5d7fd8
[   66.768873]  ffff88041dc4d328 0000000000000000 0000000000000009 ffffffff81c090c8
[   66.776371] Call Trace:
[   66.778823]  <IRQ>
[   66.780759]  [<ffffffff8109dd33>] rcu_process_callbacks+0x1e3/0x540
[   66.787254]  [<ffffffff8104e70e>] __do_softirq+0xee/0x280
[   66.792654]  [<ffffffff8104eaad>] irq_exit+0x9d/0xb0
[   66.797625]  [<ffffffff81032b4f>] smp_apic_timer_interrupt+0x3f/0x50
[   66.803982]  [<ffffffff817de68a>] apic_timer_interrupt+0x6a/0x70
[   66.809994]  <EOI>
[   66.811926]  [<ffffffff81590ce7>] ? cpuidle_enter_state+0x47/0xc0
[   66.818250]  [<ffffffff81590e12>] cpuidle_enter+0x12/0x20
[   66.823650]  [<ffffffff81086aa6>] cpu_startup_entry+0x256/0x3f0
[   66.829572]  [<ffffffff81030d82>] start_secondary+0x192/0x200
[   66.835319] Code: 49 8b 02 31 f6 f6 c4 40 74 04 41 8b 72 68 4c 89 d7 e8 92 ed fb ff eb 93 4c 8b 50 30 48 8b 10 80 e6 80 4c 0f 44 d0 e9 36 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 c7 c0 ea ff ff ff
[   66.855859] RIP  [<ffffffff8114a443>] kfree+0x133/0x140
[   66.861113]  RSP <ffff88041dc43ea8>
[   66.864617] ---[ end trace 825fa0ba52ca10eb ]---
[   66.869240] Kernel panic - not syncing: Fatal exception in interrupt
[   66.875616] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[   66.885791] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

In response to a CPU notifier an uncore PMU driver calls
perf_pmu_migrate context, which will remove all events from the old CPU
context before placing them all into the new CPU context. For a short
period the events are in limbo and are part of neither context, though
their event->ctx pointers still point at the old context.

During this period another CPU may enter put_event, which will try to
remove the event from event->ctx. As this may still point at the old
context, put_ctx can be called twice for a given event on the original
context. The context's refcount may drop to zero unexpectedly, whereupon
put_ctx will queue up a kfree with RCU. This blows up at the end of the
next grace period as the uncore PMU contexts are housed within
perf_cpu_context and weren't directly allocated with k*alloc.

This patch prevents the issue by inhibiting hotplug for the portion of
put_event which must access event->ctx, preventing the notifiers which
call perf_pmu_migrate_context from running concurrently. Once the event
has been removed from its context perf_pmu_migrate_context will no
longer be able to access it, so it is not necessary to inhibit hotplug
for the duration of event tear-down.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zheng, Yan <zheng.z.yan@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
---
 kernel/events/core.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9c1ed0..9e44cae 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3317,7 +3317,7 @@ static void free_event(struct perf_event *event)
  */
 static void put_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
+	struct perf_event_context *ctx;
 	struct task_struct *owner;

 	if (!atomic_long_dec_and_test(&event->refcount))
@@ -3356,6 +3356,16 @@ static void put_event(struct perf_event *event)
 		put_task_struct(owner);
 	}

+	/*
+	 * We must ensure that perf_pmu_migrate_context can't race on
+	 * event->ctx. Inhibit hotplug (and hence the notifiers
+	 * perf_pmu_migrate_context is called from) until the event is removed
+	 * from its current context.
+	 */
+	get_online_cpus();
+
+	ctx = event->ctx;
+
 	WARN_ON_ONCE(ctx->parent_ctx);
 	/*
 	 * There are two ways this annotation is useful:
@@ -3373,6 +3383,8 @@ static void put_event(struct perf_event *event)
 	perf_remove_from_context(event, true);
 	mutex_unlock(&ctx->mutex);

+	put_online_cpus();
+
 	_free_event(event);
 }

-- 
1.9.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-03 11:50     ` Mark Rutland
@ 2014-09-04 10:44       ` Peter Zijlstra
  2014-09-04 11:07         ` Mark Rutland
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2014-09-04 10:44 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel@vger.kernel.org, Yan Zheng, Stephane Eranian,
	Ingo Molnar, Vince Weaver, Linus Torvalds

On Wed, Sep 03, 2014 at 12:50:14PM +0100, Mark Rutland wrote:
> From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <mark.rutland@arm.com>
> Date: Wed, 3 Sep 2014 11:06:22 +0100
> Subject: [PATCH] perf: prevent hotplug race on event->ctx
> 
> The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
> Introduce perf_pmu_migrate_context()) didn't take the tear-down of
> events into account, and left open a race with put_event on event->ctx.
> A resulting duplicate put_ctx of an event's original context can lead to
> the context being erroneously kfreed via RCU, resulting in the below
> splat with the intel uncore_imc PMU driver:

<snip>

> In response to a CPU notifier an uncore PMU driver calls
> perf_pmu_migrate context, which will remove all events from the old CPU
> context before placing them all into the new CPU context. For a short
> period the events are in limbo and are part of neither context, though
> their event->ctx pointers still point at the old context.
> 
> During this period another CPU may enter put_event, which will try to
> remove the event from event->ctx. As this may still point at the old
> context, put_ctx can be called twice for a given event on the original
> context. The context's refcount may drop to zero unexpectedly, whereupon
> put_ctx will queue up a kfree with RCU. This blows up at the end of the
> next grace period as the uncore PMU contexts are housed within
> perf_cpu_context and weren't directly allocated with k*alloc.
> 
> This patch prevents the issue by inhibiting hotplug for the portion of
> put_event which must access event->ctx, preventing the notifiers which
> call perf_pmu_migrate_context from running concurrently. Once the event
> has been removed from its context perf_pmu_migrate_context will no
> longer be able to access it, so it is not necessary to inhibit hotplug
> for the duration of event tear-down.

Right, so that works I suppose. The thing is, get_online_cpus() is a
global lock and we can potentially do a lot of put_event()s. I had a
patch a while back that rewrote the cpuhotplug locking, but Linus didn't
particularly like that at the time.

I'll try and see if I can come up with anything else, but so far I've
only discovered a lot of ways that don't work (like I'm sure you did
too).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-04 10:44       ` Peter Zijlstra
@ 2014-09-04 11:07         ` Mark Rutland
  2014-09-05 15:16           ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2014-09-04 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, Yan Zheng, Stephane Eranian,
	Ingo Molnar, Vince Weaver, Linus Torvalds

On Thu, Sep 04, 2014 at 11:44:02AM +0100, Peter Zijlstra wrote:
> On Wed, Sep 03, 2014 at 12:50:14PM +0100, Mark Rutland wrote:
> > From 6465beace3ad9b12039127468f4596b8e87a53e8 Mon Sep 17 00:00:00 2001
> > From: Mark Rutland <mark.rutland@arm.com>
> > Date: Wed, 3 Sep 2014 11:06:22 +0100
> > Subject: [PATCH] perf: prevent hotplug race on event->ctx
> > 
> > The perf_pmu_migrate_context code introduced in 0cda4c023132 (perf:
> > Introduce perf_pmu_migrate_context()) didn't take the tear-down of
> > events into account, and left open a race with put_event on event->ctx.
> > A resulting duplicate put_ctx of an event's original context can lead to
> > the context being erroneously kfreed via RCU, resulting in the below
> > splat with the intel uncore_imc PMU driver:
> 
> <snip>
> 
> > In response to a CPU notifier an uncore PMU driver calls
> > perf_pmu_migrate context, which will remove all events from the old CPU
> > context before placing them all into the new CPU context. For a short
> > period the events are in limbo and are part of neither context, though
> > their event->ctx pointers still point at the old context.
> > 
> > During this period another CPU may enter put_event, which will try to
> > remove the event from event->ctx. As this may still point at the old
> > context, put_ctx can be called twice for a given event on the original
> > context. The context's refcount may drop to zero unexpectedly, whereupon
> > put_ctx will queue up a kfree with RCU. This blows up at the end of the
> > next grace period as the uncore PMU contexts are housed within
> > perf_cpu_context and weren't directly allocated with k*alloc.
> > 
> > This patch prevents the issue by inhibiting hotplug for the portion of
> > put_event which must access event->ctx, preventing the notifiers which
> > call perf_pmu_migrate_context from running concurrently. Once the event
> > has been removed from its context perf_pmu_migrate_context will no
> > longer be able to access it, so it is not necessary to inhibit hotplug
> > for the duration of event tear-down.
> 
> Right, so that works I suppose. The thing is, get_online_cpus() is a
> global lock and we can potentially do a lot of put_event()s. I had a
> patch a while back that rewrote the cpuhotplug locking, but Linus didn't
> particularly like that at the time.

Yeah, calling {get,put}_online_cpus() is far from ideal.

When testing open/close and hotplug had a rather noticeable effect on
each others' progress (per visible rate of output over serial, I didn't
make any actual measurements). Killing a few tasks with ~1024 events
open each would crawl to completion over a few seconds.

> I'll try and see if I can come up with anything else, but so far I've
> only discovered a lot of ways that don't work (like I'm sure you did
> too).

Yup; ABBA deadlock or too many/few put_ctx(old_ctx) calls.

Thanks for taking a look. If you have any ideas I'm happy to try another
approach.

Mark.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-04 11:07         ` Mark Rutland
@ 2014-09-05 15:16           ` Peter Zijlstra
  2014-09-05 15:41             ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2014-09-05 15:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel@vger.kernel.org, Yan Zheng, Stephane Eranian,
	Ingo Molnar, Vince Weaver, Linus Torvalds

On Thu, Sep 04, 2014 at 12:07:40PM +0100, Mark Rutland wrote:
> Thanks for taking a look. If you have any ideas I'm happy to try another
> approach.

How horrible is the below patch (performance wise). It does pretty much
the same thing except that percpu_rw_semaphore is a lot saner, its
read side performance should be minimal in the absence of writes.

---
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1573,6 +1573,7 @@ config PERF_EVENTS
 	depends on HAVE_PERF_EVENTS
 	select ANON_INODES
 	select IRQ_WORK
+	select PERCPU_RWSEM
 	help
 	  Enable kernel support for various performance events provided
 	  by software and hardware.
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,7 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/percpu-rwsem.h>
 
 #include "internal.h"
 
@@ -3418,12 +3419,14 @@ static void perf_remove_from_owner(struc
 	}
 }
 
+static struct percpu_rw_semaphore perf_rwsem;
+
 /*
  * Called when the last reference to the file is gone.
  */
 static void put_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
+	struct perf_event_context *ctx;
 
 	if (!atomic_long_dec_and_test(&event->refcount))
 		return;
@@ -3431,6 +3434,9 @@ static void put_event(struct perf_event
 	if (!is_kernel_event(event))
 		perf_remove_from_owner(event);
 
+	percpu_down_read(&perf_rwsem);
+	ctx = event->ctx;
+
 	WARN_ON_ONCE(ctx->parent_ctx);
 	/*
 	 * There are two ways this annotation is useful:
@@ -3447,6 +3453,7 @@ static void put_event(struct perf_event
 	mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
 	perf_remove_from_context(event, true);
 	mutex_unlock(&ctx->mutex);
+	percpu_up_read(&perf_rwsem);
 
 	_free_event(event);
 }
@@ -7534,6 +7541,8 @@ void perf_pmu_migrate_context(struct pmu
 	struct perf_event *event, *tmp;
 	LIST_HEAD(events);
 
+	percpu_down_write(&perf_rwsem);
+
 	src_ctx = &per_cpu_ptr(pmu->pmu_cpu_context, src_cpu)->ctx;
 	dst_ctx = &per_cpu_ptr(pmu->pmu_cpu_context, dst_cpu)->ctx;
 
@@ -7559,6 +7568,8 @@ void perf_pmu_migrate_context(struct pmu
 		get_ctx(dst_ctx);
 	}
 	mutex_unlock(&dst_ctx->mutex);
+
+	percpu_up_write(&perf_rwsem);
 }
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
@@ -8198,6 +8209,7 @@ void __init perf_event_init(void)
 
 	idr_init(&pmu_idr);
 
+	percpu_init_rwsem(&perf_rwsem);
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 15:16           ` Peter Zijlstra
@ 2014-09-05 15:41             ` Linus Torvalds
  2014-09-05 16:50               ` Vince Weaver
  2014-09-05 16:59               ` Mark Rutland
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2014-09-05 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Rutland, linux-kernel@vger.kernel.org, Yan Zheng,
	Stephane Eranian, Ingo Molnar, Vince Weaver

On Fri, Sep 5, 2014 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> How horrible is the below patch (performance wise). It does pretty much
> the same thing except that percpu_rw_semaphore is a lot saner, its
> read side performance should be minimal in the absence of writes.

Ugh. Why do any locking at all (whether a new 'perf_rwsem' or using
'get_online_cpus()').

Wouldn't it be much nicer to just do what memory management routines
are *supposed* to do, and get a reference count to the context while
having a pointer to it?

IOW, why doesn't put_event() just have a

   get_ctx(ctx);
   ..
   put_ctx(ctx);

around its use of the context pointer? So if the context ends up being
migrated during this time, it doesn't get freed.

However, the more fundamental question is "what protects accesses to
'events->ctx'". Why is "put_event()" so special that *it* gets locking
for the reading of "event->ctx", but none of the other cases of
reading the ctx pointer gets it or needs it?

I'm getting the feeling that this race is bigger than just put_event().

            Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 15:41             ` Linus Torvalds
@ 2014-09-05 16:50               ` Vince Weaver
  2014-09-05 16:59               ` Mark Rutland
  1 sibling, 0 replies; 13+ messages in thread
From: Vince Weaver @ 2014-09-05 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mark Rutland, linux-kernel@vger.kernel.org,
	Yan Zheng, Stephane Eranian, Ingo Molnar, Vince Weaver

On Fri, 5 Sep 2014, Linus Torvalds wrote:

> However, the more fundamental question is "what protects accesses to
> 'events->ctx'". Why is "put_event()" so special that *it* gets locking
> for the reading of "event->ctx", but none of the other cases of
> reading the ctx pointer gets it or needs it?
> 
> I'm getting the feeling that this race is bigger than just put_event().

I've been chasing a bug triggered by my perf_fuzzer program (with a 
forking workload) for the past few months.  It will reliably oops the 
machine or worse (I've had it somehow not only take down the test 
machine, but the whole local network somehow).

Often it seems to come from deep inside the perf_event context locking, in 
conjunction with complex open/fork/close/migrate workloads.

Here's a link to an older bug writeup, I've had it happen more recently 
but I've been too busy to bother writing it up.

	http://web.eece.maine.edu/~vweaver/projects/perf_events/fuzzer/3.15-rc5.get_cpu_context_gpf.html

Is there hope that we've finally found a plausible source for this bug?

Vince

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 15:41             ` Linus Torvalds
  2014-09-05 16:50               ` Vince Weaver
@ 2014-09-05 16:59               ` Mark Rutland
  2014-09-05 17:31                 ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2014-09-05 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Yan Zheng,
	Stephane Eranian, Ingo Molnar, Vince Weaver

On Fri, Sep 05, 2014 at 04:41:43PM +0100, Linus Torvalds wrote:
> On Fri, Sep 5, 2014 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > How horrible is the below patch (performance wise). It does pretty much
> > the same thing except that percpu_rw_semaphore is a lot saner, its
> > read side performance should be minimal in the absence of writes.
> 
> Ugh. Why do any locking at all (whether a new 'perf_rwsem' or using
> 'get_online_cpus()').
> 
> Wouldn't it be much nicer to just do what memory management routines
> are *supposed* to do, and get a reference count to the context while
> having a pointer to it?
> 
> IOW, why doesn't put_event() just have a
> 
>    get_ctx(ctx);
>    ..
>    put_ctx(ctx);
> 
> around its use of the context pointer? So if the context ends up being
> migrated during this time, it doesn't get freed.

For the duration of put_event, the event holds a ref on the context. That only
gets decremented _after_ we're done dealing with event->ctx, at the very end of
put_event. Follow the callchain:

	put_event(event)
	-> _free_event(event)
	-> __free_event(event)
	-> put_ctx(event->ctx).

As you point out below, the race on event->ctx is the fundamental issue. That
is what results in decrementing the refcount twice (once on a stale event->ctx
pointer).

> However, the more fundamental question is "what protects accesses to
> 'events->ctx'". Why is "put_event()" so special that *it* gets locking
> for the reading of "event->ctx", but none of the other cases of
> reading the ctx pointer gets it or needs it?

The key point is that it doesn't, which is precisely what this patch attempted
to correct. 

Regardless you're right that other uses of event->ctx are just as broken.  What
perf_pmu_migrate_context failed to take into account was that it is possible to
access an event without going via its owning context and holding ctx->mutex.

> I'm getting the feeling that this race is bigger than just put_event().

We certainly have at least one more race; for event groups perf_read can lock
the stale context.

Mark.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 16:59               ` Mark Rutland
@ 2014-09-05 17:31                 ` Linus Torvalds
  2014-09-05 19:54                   ` Peter Zijlstra
  2014-09-08  8:39                   ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2014-09-05 17:31 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Yan Zheng,
	Stephane Eranian, Ingo Molnar, Vince Weaver

On Fri, Sep 5, 2014 at 9:59 AM, Mark Rutland <mark.rutland@arm.com> wrote:
>
> As you point out below, the race on event->ctx is the fundamental issue. That
> is what results in decrementing the refcount twice (once on a stale event->ctx
> pointer).

So quite frankly, the whole perf_pmu_migrate_context() thing looks
completely and fundamentally broken.

Your patch doesn't really solve anything in general, and would need to
be extended to do that get_online_cpus() around every single case
where we read it. Which isn't really acceptable.

perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
It seems to think that "the event->ctx" pointer is a RCU-protected
pointer, but it really isn't. Its lifetime is protected by
refcounting, and the pointer access itself isn't protected by
anything, because "event->ctx" isn't supposed to change over the
lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
around it, nor access it with rcu_dereference(), because all users
know that it's just fixed.

Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
perf_pmu_migrate_context()") is unfixably broken. The whole thing
needs to be re-thought. It violates everything that everybody else
does with the context. No amount of (sane) locking can fix the
breakage, I suspect.

A possible (but unfortunate) real fix is to try to keep the broken
code around, and make its broken assumptions be actually true. Make
"event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
required rcu read-locking, and make everything really do what that
currently completely broken perf_pmu_migrate_context() thinks it does.

   That implies, for example, that any time you use the thing in a
sleeping context, you'd need to do something like this (possibly with
a helper function to do that: "get_event_ctx()"

        rcu_read_lock();
        ctx = rcu_dereference(event->rcu);
        get_ctx(ctx);
        rcu_read_unlock();

        .. you can now sleep here and use 'ctx' ...

        put_ctx(ctx);

    and if you don't need to sleep, you can just do

        rcu_read_lock();
        ctx = rcu_dereference(event->ctx);
        .. use ctx in an atomic context ...
        rcu_read_unlock();

but that is a much bigger patch than either of the "introduce a
completely broken ad-hoc lock around just one random use case"
patches.

Alternatively (and this sounds like the better fix), use some way to
avoid that broken perf_pmu_migrate_context() model entirely. Never
"migrate" events. Create completely new ones on the new CPU, and leave
the old stale ones alone even on dead CPU's (they will presumably not
actually be activated, and who the hell cares?).

Or even just say: "if somebody takes down a CPU with existing uncore
events, those events are now effectively dead". Don't try to migrate
them, don't try to create new events on another CPU, just let it go.
The CPU is down, the events are inactive.

Keep it simple. Not the current completely broken thing that clearly
doesn't honor the actual real rules for "event->ctx". Because the
current rules are effectively that event->ctx is a constant over the
whole lifetime of the event. Agreed?

                    Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 17:31                 ` Linus Torvalds
@ 2014-09-05 19:54                   ` Peter Zijlstra
  2014-09-08  8:39                   ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2014-09-05 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Rutland, linux-kernel@vger.kernel.org, Yan Zheng,
	Stephane Eranian, Ingo Molnar, Vince Weaver

On Fri, Sep 05, 2014 at 10:31:49AM -0700, Linus Torvalds wrote:
> So quite frankly, the whole perf_pmu_migrate_context() thing looks
> completely and fundamentally broken.

Yes, agreed. We can go play very nasty games, but fundamentally agreed.

> Or even just say: "if somebody takes down a CPU with existing uncore
> events, those events are now effectively dead". Don't try to migrate
> them, don't try to create new events on another CPU, just let it go.
> The CPU is down, the events are inactive.
> 
> Keep it simple. Not the current completely broken thing that clearly
> doesn't honor the actual real rules for "event->ctx". Because the
> current rules are effectively that event->ctx is a constant over the
> whole lifetime of the event. Agreed?

I like this best, I've never really been a supporter of hotplug games. 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Possible race between CPU hotplug and perf_pmu_migrate_context
  2014-09-05 17:31                 ` Linus Torvalds
  2014-09-05 19:54                   ` Peter Zijlstra
@ 2014-09-08  8:39                   ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2014-09-08  8:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Rutland, linux-kernel@vger.kernel.org, Yan Zheng,
	Stephane Eranian, Ingo Molnar, Vince Weaver

[-- Attachment #1: Type: text/plain, Size: 4837 bytes --]

On Fri, Sep 05, 2014 at 10:31:49AM -0700, Linus Torvalds wrote:
> So quite frankly, the whole perf_pmu_migrate_context() thing looks
> completely and fundamentally broken.
> 
> Your patch doesn't really solve anything in general, and would need to
> be extended to do that get_online_cpus() around every single case
> where we read it. Which isn't really acceptable.
> 
> perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
> It seems to think that "the event->ctx" pointer is a RCU-protected
> pointer, but it really isn't. Its lifetime is protected by
> refcounting, and the pointer access itself isn't protected by
> anything, because "event->ctx" isn't supposed to change over the
> lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
> around it, nor access it with rcu_dereference(), because all users
> know that it's just fixed.

Its worse; there's a second site that does exactly the same broken thing
:/

The 'move_group' logic in sys_perf_event_open() migrates a software
event into a hardware context when creating event groups.

> Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
> perf_pmu_migrate_context()") is unfixably broken. The whole thing
> needs to be re-thought. It violates everything that everybody else
> does with the context. No amount of (sane) locking can fix the
> breakage, I suspect.

So yes and no; yes for sanity and consistency, no because after we've
done perf_remove_from_context() the event itself is inactive and we only
need to 'worry' about external (mostly userspace and exit paths)
interaction.

The hard exclusion provided by a rwsem/hotplug lock can do this, but I
agree its quite horrible.

> A possible (but unfortunate) real fix is to try to keep the broken
> code around, and make its broken assumptions be actually true. Make
> "event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
> required rcu read-locking, and make everything really do what that
> currently completely broken perf_pmu_migrate_context() thinks it does.

I'm not entirely sure RCU works here.

>    That implies, for example, that any time you use the thing in a
> sleeping context, you'd need to do something like this (possibly with
> a helper function to do that: "get_event_ctx()"
> 
>         rcu_read_lock();
>         ctx = rcu_dereference(event->rcu);
>         get_ctx(ctx);
>         rcu_read_unlock();
> 
>         .. you can now sleep here and use 'ctx' ...
> 
>         put_ctx(ctx);
> 
>     and if you don't need to sleep, you can just do
> 
>         rcu_read_lock();
>         ctx = rcu_dereference(event->ctx);
>         .. use ctx in an atomic context ...
>         rcu_read_unlock();
> 
> but that is a much bigger patch than either of the "introduce a
> completely broken ad-hoc lock around just one random use case"
> patches.

So the problem with an RCU like approach is that it would allow multiple
users to see different ctxs. Even calling synchronize_rcu() in between
doesn't really solve that because we don't clear the event->ctx pointer
after remove_from_context, install_in_context simply sets a new one.

If we were to clear the context pointer, all usage sites would have to
read/wait until there's a valid pointer again, at which point we've just
build a rwsem with rcu.

> Alternatively (and this sounds like the better fix), use some way to
> avoid that broken perf_pmu_migrate_context() model entirely. Never
> "migrate" events. Create completely new ones on the new CPU, and leave
> the old stale ones alone even on dead CPU's (they will presumably not
> actually be activated, and who the hell cares?).

That appears to simply move the problem; instead of event->ctx being the
problem we then get filp->private_data the exact same problem. It might
be a better delimited problem perhaps, but adds a little complexity
because we need to copy the event state around.

Similarly we can't use RCU here because allowing concurrent operations
on different 'versions' of the event makes it very 'interesting' to sync
state between the old and new object.

> Or even just say: "if somebody takes down a CPU with existing uncore
> events, those events are now effectively dead". Don't try to migrate
> them, don't try to create new events on another CPU, just let it go.
> The CPU is down, the events are inactive.

Which, is indeed very attractive; whoever with the additional
'move_group' problem I'm not entirely seeing how we can make that
happen. It would end up disallowing certain event groups that are
currently possible. Most notable a software group leader with hardware
events -- a most useful construct for those people who don't have
interrupts on their hardware PMUs.

Lemme ponder this a bit more.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-09-08  8:39 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-01 18:18 Possible race between CPU hotplug and perf_pmu_migrate_context Mark Rutland
2014-09-01 19:05 ` Peter Zijlstra
2014-09-02 18:58   ` Mark Rutland
2014-09-03 11:50     ` Mark Rutland
2014-09-04 10:44       ` Peter Zijlstra
2014-09-04 11:07         ` Mark Rutland
2014-09-05 15:16           ` Peter Zijlstra
2014-09-05 15:41             ` Linus Torvalds
2014-09-05 16:50               ` Vince Weaver
2014-09-05 16:59               ` Mark Rutland
2014-09-05 17:31                 ` Linus Torvalds
2014-09-05 19:54                   ` Peter Zijlstra
2014-09-08  8:39                   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox