public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] generic-ipi: patches -v5
@ 2009-02-17 21:59 Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, Rusty Russell
  Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra

against -tip
-- 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] generic-ipi: simplify the barriers
  2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
@ 2009-02-17 21:59 ` Peter Zijlstra
  2009-02-18  0:27   ` Paul E. McKenney
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, Rusty Russell
  Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra

[-- Attachment #1: nick-simplify-generic-smp-barriers.patch --]
[-- Type: text/plain, Size: 5438 bytes --]

From: Nick Piggin <npiggin@suse.de>

Firstly, just unconditionally take the lock and check the list in the
generic_call_function_single_interrupt IPI handler. As we've just taken
an IPI here, the chances are fairly high that there will be work on the
list for us, so do the locking unconditionally. This removes the tricky
lockless list_empty check and dubious barriers. The change looks bigger
than it is because it is just removing an outer loop.

Secondly, clarify architecture specific IPI locking rules. Generic code
has no tools to impose any sane ordering on IPIs if they go outside
normal cache coherency, ergo the arch code must make them appear to
obey cache coherency as a "memory operation" to initiate an IPI, and
a "memory operation" to receive one. This way at least they can be
reasoned about in generic code, and smp_mb used to provide ordering.

The combination of these two changes means that explicit barriers can
be taken out of queue handling for the single case -- shared data is
explicitly locked, and ipi ordering must conform to that, so no
barriers needed. An extra barrier is needed in the many handler, so
as to ensure we load the list element after the IPI is received.

Does any architecture actually need barriers? For the initiator I
could see it, but for the handler I would be surprised. The other
thing we could do for simplicity is just to require that a full
barrier is required before generating an IPI, and after receiving an
IPI. We can't just do that in generic code without auditing
architectures. There have been subtle hangs here on some archs in
the past.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/smp.c |   83 +++++++++++++++++++++++++++++++----------------------------
 1 file changed, 44 insertions(+), 39 deletions(-)

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -74,9 +74,16 @@ static void generic_exec_single(int cpu,
 	spin_unlock_irqrestore(&dst->lock, flags);
 
 	/*
-	 * Make the list addition visible before sending the ipi.
+	 * The list addition should be visible before sending the IPI
+	 * handler locks the list to pull the entry off it because of
+	 * normal cache coherency rules implied by spinlocks.
+	 *
+	 * If IPIs can go out of order to the cache coherency protocol
+	 * in an architecture, sufficient synchronisation should be added
+	 * to arch code to make it appear to obey cache coherency WRT
+	 * locking and barrier primitives. Generic code isn't really equipped
+	 * to do the right thing...
 	 */
-	smp_mb();
 
 	if (ipi)
 		arch_send_call_function_single_ipi(cpu);
@@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
 	int cpu = get_cpu();
 
 	/*
+	 * Ensure entry is visible on call_function_queue after we have
+	 * entered the IPI. See comment in smp_call_function_many.
+	 * If we don't have this, then we may miss an entry on the list
+	 * and never get another IPI to process it.
+	 */
+	smp_mb();
+
+	/*
 	 * It's ok to use list_for_each_rcu() here even though we may delete
 	 * 'pos', since list_del_rcu() doesn't clear ->next
 	 */
@@ -154,49 +169,37 @@ void generic_smp_call_function_single_in
 {
 	struct call_single_queue *q = &__get_cpu_var(call_single_queue);
 	LIST_HEAD(list);
+	unsigned int data_flags;
 
-	/*
-	 * Need to see other stores to list head for checking whether
-	 * list is empty without holding q->lock
-	 */
-	smp_read_barrier_depends();
-	while (!list_empty(&q->list)) {
-		unsigned int data_flags;
-
-		spin_lock(&q->lock);
-		list_replace_init(&q->list, &list);
-		spin_unlock(&q->lock);
-
-		while (!list_empty(&list)) {
-			struct call_single_data *data;
-
-			data = list_entry(list.next, struct call_single_data,
-						list);
-			list_del(&data->list);
+	spin_lock(&q->lock);
+	list_replace_init(&q->list, &list);
+	spin_unlock(&q->lock);
 
-			/*
-			 * 'data' can be invalid after this call if
-			 * flags == 0 (when called through
-			 * generic_exec_single(), so save them away before
-			 * making the call.
-			 */
-			data_flags = data->flags;
+	while (!list_empty(&list)) {
+		struct call_single_data *data;
 
-			data->func(data->info);
+		data = list_entry(list.next, struct call_single_data,
+					list);
+		list_del(&data->list);
 
-			if (data_flags & CSD_FLAG_WAIT) {
-				smp_wmb();
-				data->flags &= ~CSD_FLAG_WAIT;
-			} else if (data_flags & CSD_FLAG_LOCK) {
-				smp_wmb();
-				data->flags &= ~CSD_FLAG_LOCK;
-			} else if (data_flags & CSD_FLAG_ALLOC)
-				kfree(data);
-		}
 		/*
-		 * See comment on outer loop
+		 * 'data' can be invalid after this call if
+		 * flags == 0 (when called through
+		 * generic_exec_single(), so save them away before
+		 * making the call.
 		 */
-		smp_read_barrier_depends();
+		data_flags = data->flags;
+
+		data->func(data->info);
+
+		if (data_flags & CSD_FLAG_WAIT) {
+			smp_wmb();
+			data->flags &= ~CSD_FLAG_WAIT;
+		} else if (data_flags & CSD_FLAG_LOCK) {
+			smp_wmb();
+			data->flags &= ~CSD_FLAG_LOCK;
+		} else if (data_flags & CSD_FLAG_ALLOC)
+			kfree(data);
 	}
 }
 
@@ -375,6 +378,8 @@ void smp_call_function_many(const struct
 
 	/*
 	 * Make the list addition visible before sending the ipi.
+	 * (IPIs must obey or appear to obey normal Linux cache coherency
+	 * rules -- see comment in generic_exec_single).
 	 */
 	smp_mb();
 

-- 



* [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
@ 2009-02-17 21:59 ` Peter Zijlstra
  2009-02-18  0:28   ` Paul E. McKenney
  2009-02-18  5:31   ` Rusty Russell
  2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra
  2 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, Rusty Russell
  Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra

[-- Attachment #1: smp-remove-kmalloc.patch --]
[-- Type: text/plain, Size: 12784 bytes --]

Remove the use of kmalloc() from the smp_call_function_*() calls.

Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu data for
single cpu ipi calls) started the discussion on the use of kmalloc() in
this code and fixed the smp_call_function_single(.wait=0) fallback case.

In this patch we complete that work by also providing means for the
_many() call, which fully removes the need for kmalloc() in this code.

The problem with the _many() call is that other cpus might still be
observing our entry when we're done with it. The old code solved this by
dynamically allocating data elements and RCU-freeing them.

We solve it by using a single per-cpu entry which provides static
storage and solves one half of the problem (avoiding referencing freed
data).

The other half, ensuring that queue iteration is still possible, is done
by placing re-used entries at the head of the list. This means that if
someone was still iterating over that entry when it got moved, they will
now re-visit entries on the list they had already seen, but it avoids
skipping over entries, which would have happened had we placed the new
entry at the end.

Furthermore, visiting entries twice is not a problem, since we remove
our cpu from the entry's cpumask once it has been called.

Many thanks to Oleg for his suggestions and for poking holes in my
earlier attempts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/smp.c |  258 ++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 159 insertions(+), 99 deletions(-)

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -10,23 +10,28 @@
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
 
 static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
-static LIST_HEAD(call_function_queue);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
+
+static struct {
+	struct list_head	queue;
+	spinlock_t		lock;
+} call_function __cacheline_aligned_in_smp = {
+	.queue = LIST_HEAD_INIT(call_function.queue),
+	.lock  = __SPIN_LOCK_UNLOCKED(call_function.lock),
+};
 
 enum {
 	CSD_FLAG_WAIT		= 0x01,
-	CSD_FLAG_ALLOC		= 0x02,
-	CSD_FLAG_LOCK		= 0x04,
+	CSD_FLAG_LOCK		= 0x02,
 };
 
 struct call_function_data {
 	struct call_single_data csd;
 	spinlock_t lock;
 	unsigned int refs;
-	struct rcu_head rcu_head;
-	unsigned long cpumask_bits[];
+	cpumask_var_t cpumask;
 };
 
 struct call_single_queue {
@@ -34,8 +39,45 @@ struct call_single_queue {
 	spinlock_t lock;
 };
 
+static DEFINE_PER_CPU(struct call_function_data, cfd_data) = {
+	.lock = __SPIN_LOCK_UNLOCKED(cfd_data.lock),
+};
+
+static int
+hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
+				cpu_to_node(cpu)))
+			return NOTIFY_BAD;
+		break;
+
+#ifdef CONFIG_CPU_HOTPLUG
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		free_cpumask_var(cfd->cpumask);
+		break;
+#endif
+	};
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata hotplug_cfd_notifier = {
+	.notifier_call = hotplug_cfd,
+};
+
 static int __cpuinit init_call_single_data(void)
 {
+	void *cpu = (void *)(long)smp_processor_id();
 	int i;
 
 	for_each_possible_cpu(i) {
@@ -44,18 +86,61 @@ static int __cpuinit init_call_single_da
 		spin_lock_init(&q->lock);
 		INIT_LIST_HEAD(&q->list);
 	}
+
+	hotplug_cfd(&hotplug_cfd_notifier, CPU_UP_PREPARE, cpu);
+	register_cpu_notifier(&hotplug_cfd_notifier);
+
 	return 0;
 }
 early_initcall(init_call_single_data);
 
-static void csd_flag_wait(struct call_single_data *data)
+/*
+ * csd_wait/csd_complete are used for synchronous ipi calls
+ */
+static void csd_wait_prepare(struct call_single_data *data)
+{
+	data->flags |= CSD_FLAG_WAIT;
+}
+
+static void csd_complete(struct call_single_data *data)
+{
+	if (data->flags & CSD_FLAG_WAIT) {
+		/*
+		 * ensure we're all done before saying we are
+		 */
+		smp_mb();
+		data->flags &= ~CSD_FLAG_WAIT;
+	}
+}
+
+static void csd_wait(struct call_single_data *data)
+{
+	while (data->flags & CSD_FLAG_WAIT)
+		cpu_relax();
+}
+
+/*
+ * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
+ *
+ * For non-synchronous ipi calls the csd can still be in use by the previous
+ * function call. For multi-cpu calls its even more interesting as we'll have
+ * to ensure no other cpu is observing our csd.
+ */
+static void csd_lock(struct call_single_data *data)
 {
-	/* Wait for response */
-	do {
-		if (!(data->flags & CSD_FLAG_WAIT))
-			break;
+	while (data->flags & CSD_FLAG_LOCK)
 		cpu_relax();
-	} while (1);
+	data->flags = CSD_FLAG_LOCK;
+}
+
+static void csd_unlock(struct call_single_data *data)
+{
+	WARN_ON(!(data->flags & CSD_FLAG_LOCK));
+	/*
+	 * ensure we're all done before releasing data
+	 */
+	smp_mb();
+	data->flags &= ~CSD_FLAG_LOCK;
 }
 
 /*
@@ -89,16 +174,7 @@ static void generic_exec_single(int cpu,
 		arch_send_call_function_single_ipi(cpu);
 
 	if (wait)
-		csd_flag_wait(data);
-}
-
-static void rcu_free_call_data(struct rcu_head *head)
-{
-	struct call_function_data *data;
-
-	data = container_of(head, struct call_function_data, rcu_head);
-
-	kfree(data);
+		csd_wait(data);
 }
 
 /*
@@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
 	 * It's ok to use list_for_each_rcu() here even though we may delete
 	 * 'pos', since list_del_rcu() doesn't clear ->next
 	 */
-	rcu_read_lock();
-	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
+	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
-		if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits)))
+		spin_lock(&data->lock);
+		if (!cpumask_test_cpu(cpu, data->cpumask)) {
+			spin_unlock(&data->lock);
 			continue;
+		}
+		cpumask_clear_cpu(cpu, data->cpumask);
+		spin_unlock(&data->lock);
 
 		data->csd.func(data->csd.info);
 
 		spin_lock(&data->lock);
-		cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits));
 		WARN_ON(data->refs == 0);
-		data->refs--;
-		refs = data->refs;
+		refs = --data->refs;
+		if (!refs) {
+			spin_lock(&call_function.lock);
+			list_del_rcu(&data->csd.list);
+			spin_unlock(&call_function.lock);
+		}
 		spin_unlock(&data->lock);
 
 		if (refs)
 			continue;
 
-		spin_lock(&call_function_lock);
-		list_del_rcu(&data->csd.list);
-		spin_unlock(&call_function_lock);
-
-		if (data->csd.flags & CSD_FLAG_WAIT) {
-			/*
-			 * serialize stores to data with the flag clear
-			 * and wakeup
-			 */
-			smp_wmb();
-			data->csd.flags &= ~CSD_FLAG_WAIT;
-		}
-		if (data->csd.flags & CSD_FLAG_ALLOC)
-			call_rcu(&data->rcu_head, rcu_free_call_data);
+		csd_complete(&data->csd);
+		csd_unlock(&data->csd);
 	}
-	rcu_read_unlock();
 
 	put_cpu();
 }
@@ -192,14 +262,14 @@ void generic_smp_call_function_single_in
 
 		data->func(data->info);
 
-		if (data_flags & CSD_FLAG_WAIT) {
-			smp_wmb();
-			data->flags &= ~CSD_FLAG_WAIT;
-		} else if (data_flags & CSD_FLAG_LOCK) {
-			smp_wmb();
-			data->flags &= ~CSD_FLAG_LOCK;
-		} else if (data_flags & CSD_FLAG_ALLOC)
-			kfree(data);
+		if (data_flags & CSD_FLAG_WAIT)
+			csd_complete(data);
+
+		/*
+		 * Unlocked CSDs are valid through generic_exec_single()
+		 */
+		if (data_flags & CSD_FLAG_LOCK)
+			csd_unlock(data);
 	}
 }
 
@@ -218,7 +288,9 @@ static DEFINE_PER_CPU(struct call_single
 int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
 			     int wait)
 {
-	struct call_single_data d;
+	struct call_single_data d = {
+		.flags = 0,
+	};
 	unsigned long flags;
 	/* prevent preemption and reschedule on another processor,
 	   as well as CPU removal */
@@ -239,13 +311,11 @@ int smp_call_function_single(int cpu, vo
 			/*
 			 * We are calling a function on a single CPU
 			 * and we are not going to wait for it to finish.
-			 * We first try to allocate the data, but if we
-			 * fail, we fall back to use a per cpu data to pass
-			 * the information to that CPU. Since all callers
-			 * of this code will use the same data, we must
-			 * synchronize the callers to prevent a new caller
-			 * from corrupting the data before the callee
-			 * can access it.
+			 * We use a per cpu data to pass the information to
+			 * that CPU. Since all callers of this code will
+			 * use the same data, we must synchronize the
+			 * callers to prevent a new caller from corrupting
+			 * the data before the callee can access it.
 			 *
 			 * The CSD_FLAG_LOCK is used to let us know when
 			 * the IPI handler is done with the data.
@@ -255,18 +325,11 @@ int smp_call_function_single(int cpu, vo
 			 * will make sure the callee is done with the
 			 * data before a new caller will use it.
 			 */
-			data = kmalloc(sizeof(*data), GFP_ATOMIC);
-			if (data)
-				data->flags = CSD_FLAG_ALLOC;
-			else {
-				data = &per_cpu(csd_data, me);
-				while (data->flags & CSD_FLAG_LOCK)
-					cpu_relax();
-				data->flags = CSD_FLAG_LOCK;
-			}
+			data = &per_cpu(csd_data, me);
+			csd_lock(data);
 		} else {
 			data = &d;
-			data->flags = CSD_FLAG_WAIT;
+			csd_wait_prepare(data);
 		}
 
 		data->func = func;
@@ -326,14 +389,14 @@ void smp_call_function_many(const struct
 {
 	struct call_function_data *data;
 	unsigned long flags;
-	int cpu, next_cpu;
+	int cpu, next_cpu, me = smp_processor_id();
 
 	/* Can deadlock when called with interrupts disabled */
 	WARN_ON(irqs_disabled());
 
 	/* So, what's a CPU they want?  Ignoring this one. */
 	cpu = cpumask_first_and(mask, cpu_online_mask);
-	if (cpu == smp_processor_id())
+	if (cpu == me)
 		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
 	/* No online cpus?  We're done. */
 	if (cpu >= nr_cpu_ids)
@@ -341,7 +404,7 @@ void smp_call_function_many(const struct
 
 	/* Do we have another CPU which isn't us? */
 	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-	if (next_cpu == smp_processor_id())
+	if (next_cpu == me)
 		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
 
 	/* Fastpath: do that cpu by itself. */
@@ -350,31 +413,28 @@ void smp_call_function_many(const struct
 		return;
 	}
 
-	data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
-	if (unlikely(!data)) {
-		/* Slow path. */
-		for_each_online_cpu(cpu) {
-			if (cpu == smp_processor_id())
-				continue;
-			if (cpumask_test_cpu(cpu, mask))
-				smp_call_function_single(cpu, func, info, wait);
-		}
-		return;
-	}
+	data = &per_cpu(cfd_data, me);
+	csd_lock(&data->csd);
 
-	spin_lock_init(&data->lock);
-	data->csd.flags = CSD_FLAG_ALLOC;
+	spin_lock_irqsave(&data->lock, flags);
 	if (wait)
-		data->csd.flags |= CSD_FLAG_WAIT;
+		csd_wait_prepare(&data->csd);
+
 	data->csd.func = func;
 	data->csd.info = info;
-	cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits));
-	data->refs = cpumask_weight(to_cpumask(data->cpumask_bits));
-
-	spin_lock_irqsave(&call_function_lock, flags);
-	list_add_tail_rcu(&data->csd.list, &call_function_queue);
-	spin_unlock_irqrestore(&call_function_lock, flags);
+	cpumask_and(data->cpumask, mask, cpu_online_mask);
+	cpumask_clear_cpu(me, data->cpumask);
+	data->refs = cpumask_weight(data->cpumask);
+
+	spin_lock(&call_function.lock);
+	/*
+	 * Place entry at the _HEAD_ of the list, so that any cpu still
+	 * observing the entry in generic_smp_call_function_interrupt() will
+	 * not miss any other list entries.
+	 */
+	list_add_rcu(&data->csd.list, &call_function.queue);
+	spin_unlock(&call_function.lock);
+	spin_unlock_irqrestore(&data->lock, flags);
 
 	/*
 	 * Make the list addition visible before sending the ipi.
@@ -384,11 +444,11 @@ void smp_call_function_many(const struct
 	smp_mb();
 
 	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits));
+	arch_send_call_function_ipi_mask(data->cpumask);
 
 	/* optionally wait for the CPUs to complete */
 	if (wait)
-		csd_flag_wait(&data->csd);
+		csd_wait(&data->csd);
 }
 EXPORT_SYMBOL(smp_call_function_many);
 
@@ -418,20 +478,20 @@ EXPORT_SYMBOL(smp_call_function);
 
 void ipi_call_lock(void)
 {
-	spin_lock(&call_function_lock);
+	spin_lock(&call_function.lock);
 }
 
 void ipi_call_unlock(void)
 {
-	spin_unlock(&call_function_lock);
+	spin_unlock(&call_function.lock);
 }
 
 void ipi_call_lock_irq(void)
 {
-	spin_lock_irq(&call_function_lock);
+	spin_lock_irq(&call_function.lock);
 }
 
 void ipi_call_unlock_irq(void)
 {
-	spin_unlock_irq(&call_function_lock);
+	spin_unlock_irq(&call_function.lock);
 }

-- 



* [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT
  2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
@ 2009-02-17 21:59 ` Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, Rusty Russell
  Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra

[-- Attachment #1: smp-wait.patch --]
[-- Type: text/plain, Size: 7199 bytes --]

Oleg noticed that we don't strictly need CSD_FLAG_WAIT; rework the code
so that CSD_FLAG_LOCK can serve both purposes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/blk-softirq.c |    2 -
 include/linux/smp.h |    3 +
 kernel/sched.c      |    2 -
 kernel/smp.c        |   90 +++++++++++++---------------------------------------
 kernel/softirq.c    |    2 -
 5 files changed, 28 insertions(+), 71 deletions(-)

Index: linux-2.6/block/blk-softirq.c
===================================================================
--- linux-2.6.orig/block/blk-softirq.c
+++ linux-2.6/block/blk-softirq.c
@@ -64,7 +64,7 @@ static int raise_blk_irq(int cpu, struct
 		data->info = rq;
 		data->flags = 0;
 
-		__smp_call_function_single(cpu, data);
+		__smp_call_function_single(cpu, data, 0);
 		return 0;
 	}
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1120,7 +1120,7 @@ static void hrtick_start(struct rq *rq, 
 	if (rq == this_rq()) {
 		hrtimer_restart(timer);
 	} else if (!rq->hrtick_csd_pending) {
-		__smp_call_function_single(cpu_of(rq), &rq->hrtick_csd);
+		__smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0);
 		rq->hrtick_csd_pending = 1;
 	}
 }
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -23,8 +23,7 @@ static struct {
 };
 
 enum {
-	CSD_FLAG_WAIT		= 0x01,
-	CSD_FLAG_LOCK		= 0x02,
+	CSD_FLAG_LOCK		= 0x01,
 };
 
 struct call_function_data {
@@ -95,41 +94,21 @@ static int __cpuinit init_call_single_da
 early_initcall(init_call_single_data);
 
 /*
- * csd_wait/csd_complete are used for synchronous ipi calls
- */
-static void csd_wait_prepare(struct call_single_data *data)
-{
-	data->flags |= CSD_FLAG_WAIT;
-}
-
-static void csd_complete(struct call_single_data *data)
-{
-	if (data->flags & CSD_FLAG_WAIT) {
-		/*
-		 * ensure we're all done before saying we are
-		 */
-		smp_mb();
-		data->flags &= ~CSD_FLAG_WAIT;
-	}
-}
-
-static void csd_wait(struct call_single_data *data)
-{
-	while (data->flags & CSD_FLAG_WAIT)
-		cpu_relax();
-}
-
-/*
  * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
  *
  * For non-synchronous ipi calls the csd can still be in use by the previous
  * function call. For multi-cpu calls its even more interesting as we'll have
  * to ensure no other cpu is observing our csd.
  */
-static void csd_lock(struct call_single_data *data)
+static void csd_lock_wait(struct call_single_data *data)
 {
 	while (data->flags & CSD_FLAG_LOCK)
 		cpu_relax();
+}
+
+static void csd_lock(struct call_single_data *data)
+{
+	csd_lock_wait(data);
 	data->flags = CSD_FLAG_LOCK;
 }
 
@@ -147,11 +126,12 @@ static void csd_unlock(struct call_singl
  * Insert a previously allocated call_single_data element for execution
  * on the given CPU. data must already have ->func, ->info, and ->flags set.
  */
-static void generic_exec_single(int cpu, struct call_single_data *data)
+static
+void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 {
 	struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
-	int wait = data->flags & CSD_FLAG_WAIT, ipi;
 	unsigned long flags;
+	int ipi;
 
 	spin_lock_irqsave(&dst->lock, flags);
 	ipi = list_empty(&dst->list);
@@ -174,7 +154,7 @@ static void generic_exec_single(int cpu,
 		arch_send_call_function_single_ipi(cpu);
 
 	if (wait)
-		csd_wait(data);
+		csd_lock_wait(data);
 }
 
 /*
@@ -224,7 +204,6 @@ void generic_smp_call_function_interrupt
 		if (refs)
 			continue;
 
-		csd_complete(&data->csd);
 		csd_unlock(&data->csd);
 	}
 
@@ -262,9 +241,6 @@ void generic_smp_call_function_single_in
 
 		data->func(data->info);
 
-		if (data_flags & CSD_FLAG_WAIT)
-			csd_complete(data);
-
 		/*
 		 * Unlocked CSDs are valid through generic_exec_single()
 		 */
@@ -305,36 +281,16 @@ int smp_call_function_single(int cpu, vo
 		func(info);
 		local_irq_restore(flags);
 	} else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-		struct call_single_data *data;
+		struct call_single_data *data = &d;
 
-		if (!wait) {
-			/*
-			 * We are calling a function on a single CPU
-			 * and we are not going to wait for it to finish.
-			 * We use a per cpu data to pass the information to
-			 * that CPU. Since all callers of this code will
-			 * use the same data, we must synchronize the
-			 * callers to prevent a new caller from corrupting
-			 * the data before the callee can access it.
-			 *
-			 * The CSD_FLAG_LOCK is used to let us know when
-			 * the IPI handler is done with the data.
-			 * The first caller will set it, and the callee
-			 * will clear it. The next caller must wait for
-			 * it to clear before we set it again. This
-			 * will make sure the callee is done with the
-			 * data before a new caller will use it.
-			 */
+		if (!wait)
 			data = &per_cpu(csd_data, me);
-			csd_lock(data);
-		} else {
-			data = &d;
-			csd_wait_prepare(data);
-		}
+
+		csd_lock(data);
 
 		data->func = func;
 		data->info = info;
-		generic_exec_single(cpu, data);
+		generic_exec_single(cpu, data, wait);
 	} else {
 		err = -ENXIO;	/* CPU not online */
 	}
@@ -354,12 +310,15 @@ EXPORT_SYMBOL(smp_call_function_single);
  * instance.
  *
  */
-void __smp_call_function_single(int cpu, struct call_single_data *data)
+void __smp_call_function_single(int cpu, struct call_single_data *data,
+				int wait)
 {
+	csd_lock(data);
+
 	/* Can deadlock when called with interrupts disabled */
-	WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
+	WARN_ON(wait && irqs_disabled());
 
-	generic_exec_single(cpu, data);
+	generic_exec_single(cpu, data, wait);
 }
 
 /* FIXME: Shim for archs using old arch_send_call_function_ipi API. */
@@ -417,9 +376,6 @@ void smp_call_function_many(const struct
 	csd_lock(&data->csd);
 
 	spin_lock_irqsave(&data->lock, flags);
-	if (wait)
-		csd_wait_prepare(&data->csd);
-
 	data->csd.func = func;
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
@@ -448,7 +404,7 @@ void smp_call_function_many(const struct
 
 	/* optionally wait for the CPUs to complete */
 	if (wait)
-		csd_wait(&data->csd);
+		csd_lock_wait(&data->csd);
 }
 EXPORT_SYMBOL(smp_call_function_many);
 
Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -518,7 +518,7 @@ static int __try_remote_softirq(struct c
 		cp->flags = 0;
 		cp->priv = softirq;
 
-		__smp_call_function_single(cpu, cp);
+		__smp_call_function_single(cpu, cp, 0);
 		return 0;
 	}
 	return 1;
Index: linux-2.6/include/linux/smp.h
===================================================================
--- linux-2.6.orig/include/linux/smp.h
+++ linux-2.6/include/linux/smp.h
@@ -82,7 +82,8 @@ smp_call_function_mask(cpumask_t mask, v
 	return 0;
 }
 
-void __smp_call_function_single(int cpuid, struct call_single_data *data);
+void __smp_call_function_single(int cpuid, struct call_single_data *data,
+				int wait);
 
 /*
  * Generic and arch helpers

-- 



* Re: [PATCH 1/3] generic-ipi: simplify the barriers
  2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
@ 2009-02-18  0:27   ` Paul E. McKenney
  2009-02-18  9:45     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-18  0:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar,
	Rusty Russell, linux-kernel, Oleg Nesterov

On Tue, Feb 17, 2009 at 10:59:05PM +0100, Peter Zijlstra wrote:
> From: Nick Piggin <npiggin@suse.de>
> 
> Firstly, just unconditionally take the lock and check the list in the
> generic_call_function_single_interrupt IPI handler. As we've just taken
> an IPI here, the chances are fairly high that there will be work on the
> list for us, so do the locking unconditionally. This removes the tricky
> lockless list_empty check and dubious barriers. The change looks bigger
> than it is because it is just removing an outer loop.
> 
> Secondly, clarify architecture specific IPI locking rules. Generic code
> has no tools to impose any sane ordering on IPIs if they go outside
> normal cache coherency, ergo the arch code must make them appear to
> obey cache coherency as a "memory operation" to initiate an IPI, and
> a "memory operation" to receive one. This way at least they can be
> reasoned about in generic code, and smp_mb used to provide ordering.
> 
> The combination of these two changes means that explicit barriers can
> be taken out of queue handling for the single case -- shared data is
> explicitly locked, and ipi ordering must conform to that, so no
> barriers needed. An extra barrier is needed in the many handler, so
> as to ensure we load the list element after the IPI is received.
> 
> Does any architecture actually need barriers? For the initiator I
> could see it, but for the handler I would be surprised. The other
> thing we could do for simplicity is just to require that a full
> barrier is required before generating an IPI, and after receiving an
> IPI. We can't just do that in generic code without auditing
> architectures. There have been subtle hangs here on some archs in
> the past.

While I sympathize with pushing memory barriers down into the
arch-specific functions, you -are- running this by the various
arch maintainers so that they have an opportunity to adjust, right?

							Thanx, Paul

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/smp.c |   83 +++++++++++++++++++++++++++++++----------------------------
>  1 file changed, 44 insertions(+), 39 deletions(-)
> 
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c
> +++ linux-2.6/kernel/smp.c
> @@ -74,9 +74,16 @@ static void generic_exec_single(int cpu,
>  	spin_unlock_irqrestore(&dst->lock, flags);
> 
>  	/*
> -	 * Make the list addition visible before sending the ipi.
> +	 * The list addition should be visible before sending the IPI
> +	 * handler locks the list to pull the entry off it because of
> +	 * normal cache coherency rules implied by spinlocks.
> +	 *
> +	 * If IPIs can go out of order to the cache coherency protocol
> +	 * in an architecture, sufficient synchronisation should be added
> +	 * to arch code to make it appear to obey cache coherency WRT
> +	 * locking and barrier primitives. Generic code isn't really equipped
> +	 * to do the right thing...
>  	 */
> -	smp_mb();

While I sympathize with the above, you -are- running this by the various
arch maintainers so that they have an opportunity to adjust, right?

> 
>  	if (ipi)
>  		arch_send_call_function_single_ipi(cpu);
> @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
>  	int cpu = get_cpu();
> 
>  	/*
> +	 * Ensure entry is visible on call_function_queue after we have
> +	 * entered the IPI. See comment in smp_call_function_many.
> +	 * If we don't have this, then we may miss an entry on the list
> +	 * and never get another IPI to process it.
> +	 */
> +	smp_mb();
> +
> +	/*
>  	 * It's ok to use list_for_each_rcu() here even though we may delete
>  	 * 'pos', since list_del_rcu() doesn't clear ->next
>  	 */
> @@ -154,49 +169,37 @@ void generic_smp_call_function_single_in
>  {
>  	struct call_single_queue *q = &__get_cpu_var(call_single_queue);
>  	LIST_HEAD(list);
> +	unsigned int data_flags;
> 
> -	/*
> -	 * Need to see other stores to list head for checking whether
> -	 * list is empty without holding q->lock
> -	 */
> -	smp_read_barrier_depends();
> -	while (!list_empty(&q->list)) {
> -		unsigned int data_flags;
> -
> -		spin_lock(&q->lock);
> -		list_replace_init(&q->list, &list);
> -		spin_unlock(&q->lock);
> -
> -		while (!list_empty(&list)) {
> -			struct call_single_data *data;
> -
> -			data = list_entry(list.next, struct call_single_data,
> -						list);
> -			list_del(&data->list);
> +	spin_lock(&q->lock);
> +	list_replace_init(&q->list, &list);
> +	spin_unlock(&q->lock);
> 
> -			/*
> -			 * 'data' can be invalid after this call if
> -			 * flags == 0 (when called through
> -			 * generic_exec_single(), so save them away before
> -			 * making the call.
> -			 */
> -			data_flags = data->flags;
> +	while (!list_empty(&list)) {
> +		struct call_single_data *data;
> 
> -			data->func(data->info);
> +		data = list_entry(list.next, struct call_single_data,
> +					list);
> +		list_del(&data->list);
> 
> -			if (data_flags & CSD_FLAG_WAIT) {
> -				smp_wmb();
> -				data->flags &= ~CSD_FLAG_WAIT;
> -			} else if (data_flags & CSD_FLAG_LOCK) {
> -				smp_wmb();
> -				data->flags &= ~CSD_FLAG_LOCK;
> -			} else if (data_flags & CSD_FLAG_ALLOC)
> -				kfree(data);
> -		}
>  		/*
> -		 * See comment on outer loop
> +		 * 'data' can be invalid after this call if
> +		 * flags == 0 (when called through
> +		 * generic_exec_single(), so save them away before
> +		 * making the call.
>  		 */
> -		smp_read_barrier_depends();
> +		data_flags = data->flags;
> +
> +		data->func(data->info);
> +
> +		if (data_flags & CSD_FLAG_WAIT) {
> +			smp_wmb();
> +			data->flags &= ~CSD_FLAG_WAIT;
> +		} else if (data_flags & CSD_FLAG_LOCK) {
> +			smp_wmb();
> +			data->flags &= ~CSD_FLAG_LOCK;
> +		} else if (data_flags & CSD_FLAG_ALLOC)
> +			kfree(data);
>  	}
>  }
> 
> @@ -375,6 +378,8 @@ void smp_call_function_many(const struct
> 
>  	/*
>  	 * Make the list addition visible before sending the ipi.
> +	 * (IPIs must obey or appear to obey normal Linux cache coherency
> +	 * rules -- see comment in generic_exec_single).
>  	 */
>  	smp_mb();
> 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
@ 2009-02-18  0:28   ` Paul E. McKenney
  2009-02-18 10:42     ` Peter Zijlstra
  2009-02-18 16:15     ` Oleg Nesterov
  2009-02-18  5:31   ` Rusty Russell
  1 sibling, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-18  0:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar,
	Rusty Russell, linux-kernel, Oleg Nesterov

On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> Remove the use of kmalloc() from the smp_call_function_*() calls.
> 
> Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu data for
> single cpu ipi calls) started the discussion on the use of kmalloc() in
> this code and fixed the smp_call_function_single(.wait=0) fallback case.
> 
> In this patch we complete this by also providing means for the _many()
> call, which fully removes the need for kmalloc() in this code.
> 
> The problem with the _many() call is that other cpus might still be
> observing our entry when we're done with it. The old code solved this
> by dynamically allocating data elements and RCU-freeing them.
> 
> We solve it by using a single per-cpu entry which provides static
> storage and solves one half of the problem (avoiding referencing freed
> data).
> 
> The other half, ensuring that queue iteration is still possible, is done
> by placing re-used entries at the head of the list. This means that if
> someone was still iterating that entry when it got moved, he will now
> re-visit the entries on the list he had already seen, but avoids
> skipping over entries, as would have happened had we placed the new
> entry at the end.
> 
> Furthermore, visiting entries twice is not a problem, since we remove
> our cpu from the entry's cpumask once it's called.
> 
> Many thanks to Oleg for his suggestions and for poking holes in my
> earlier attempts.

A couple small questions and one big one below...

							Thanx, Paul

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/smp.c |  258 ++++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 159 insertions(+), 99 deletions(-)
> 
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c
> +++ linux-2.6/kernel/smp.c
> @@ -10,23 +10,28 @@
>  #include <linux/rcupdate.h>
>  #include <linux/rculist.h>
>  #include <linux/smp.h>
> +#include <linux/cpu.h>
> 
>  static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
> -static LIST_HEAD(call_function_queue);
> -__cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock);
> +
> +static struct {
> +	struct list_head	queue;
> +	spinlock_t		lock;
> +} call_function __cacheline_aligned_in_smp = {
> +	.queue = LIST_HEAD_INIT(call_function.queue),
> +	.lock  = __SPIN_LOCK_UNLOCKED(call_function.lock),
> +};
> 
>  enum {
>  	CSD_FLAG_WAIT		= 0x01,
> -	CSD_FLAG_ALLOC		= 0x02,
> -	CSD_FLAG_LOCK		= 0x04,
> +	CSD_FLAG_LOCK		= 0x02,
>  };
> 
>  struct call_function_data {
>  	struct call_single_data csd;
>  	spinlock_t lock;
>  	unsigned int refs;
> -	struct rcu_head rcu_head;
> -	unsigned long cpumask_bits[];
> +	cpumask_var_t cpumask;
>  };
> 
>  struct call_single_queue {
> @@ -34,8 +39,45 @@ struct call_single_queue {
>  	spinlock_t lock;
>  };
> 
> +static DEFINE_PER_CPU(struct call_function_data, cfd_data) = {
> +	.lock = __SPIN_LOCK_UNLOCKED(cfd_data.lock),
> +};
> +
> +static int
> +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> +{
> +	long cpu = (long)hcpu;
> +	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
> +
> +	switch (action) {
> +	case CPU_UP_PREPARE:
> +	case CPU_UP_PREPARE_FROZEN:
> +		if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> +				cpu_to_node(cpu)))
> +			return NOTIFY_BAD;
> +		break;
> +
> +#ifdef CONFIG_CPU_HOTPLUG
> +	case CPU_UP_CANCELED:
> +	case CPU_UP_CANCELED_FROZEN:
> +
> +	case CPU_DEAD:
> +	case CPU_DEAD_FROZEN:
> +		free_cpumask_var(cfd->cpumask);
> +		break;
> +#endif
> +
> +	return NOTIFY_OK;
> +	};
> +}

Hmmm....  Why not the following?  Do we really need to free the cpumask
when a CPU departs, given that it might return later?

+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		if (cfd->cpumask == NULL &&
+		    (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
+				cpu_to_node(cpu))))
+			return NOTIFY_BAD;
+		break;
+
+	return NOTIFY_OK;
+	};
+}

> +
> +static struct notifier_block __cpuinitdata hotplug_cfd_notifier = {
> +	.notifier_call = hotplug_cfd,
> +};
> +
>  static int __cpuinit init_call_single_data(void)
>  {
> +	void *cpu = (void *)(long)smp_processor_id();
>  	int i;
> 
>  	for_each_possible_cpu(i) {
> @@ -44,18 +86,61 @@ static int __cpuinit init_call_single_da
>  		spin_lock_init(&q->lock);
>  		INIT_LIST_HEAD(&q->list);
>  	}
> +
> +	hotplug_cfd(&hotplug_cfd_notifier, CPU_UP_PREPARE, cpu);
> +	register_cpu_notifier(&hotplug_cfd_notifier);
> +
>  	return 0;
>  }
>  early_initcall(init_call_single_data);
> 
> -static void csd_flag_wait(struct call_single_data *data)
> +/*
> + * csd_wait/csd_complete are used for synchronous ipi calls
> + */
> +static void csd_wait_prepare(struct call_single_data *data)
> +{
> +	data->flags |= CSD_FLAG_WAIT;
> +}
> +
> +static void csd_complete(struct call_single_data *data)
> +{
> +	if (data->flags & CSD_FLAG_WAIT) {
> +		/*
> +		 * ensure we're all done before saying we are
> +		 */
> +		smp_mb();
> +		data->flags &= ~CSD_FLAG_WAIT;
> +	}
> +}
> +
> +static void csd_wait(struct call_single_data *data)
> +{
> +	while (data->flags & CSD_FLAG_WAIT)
> +		cpu_relax();
> +}
> +
> +/*
> + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> + *
> + * For non-synchronous ipi calls the csd can still be in use by the previous
> + * function call. For multi-cpu calls its even more interesting as we'll have
> + * to ensure no other cpu is observing our csd.
> + */
> +static void csd_lock(struct call_single_data *data)
>  {
> -	/* Wait for response */
> -	do {
> -		if (!(data->flags & CSD_FLAG_WAIT))
> -			break;
> +	while (data->flags & CSD_FLAG_LOCK)
>  		cpu_relax();
> -	} while (1);
> +	data->flags = CSD_FLAG_LOCK;

We do need an smp_mb() here, otherwise, the call from
smp_call_function_single() could be reordered by either CPU or compiler
as follows:

	data->func = func;
	data->info = info;
	csd_lock(data);

This might come as a bit of a surprise to the other CPU still trying to
use the old values for data->func and data->info.

Note that this smb_mb() is required even if cpu_relax() contains a
memory barrier, as it is possible to execute csd_lock_wait() without
executing the cpu_relax(), if you get there at just the right time.
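The reordering hazard described above can be sketched in userspace with C11 atomics. This is an illustrative analogue, not the kernel's actual primitives: the function names are invented, and a CAS stands in for the kernel's test-and-set loop plus smp_mb().

```c
#include <stdatomic.h>
#include <assert.h>

#define CSD_FLAG_LOCK 0x02

/* Hypothetical userspace analogue of csd_lock()/csd_unlock().  Taking
 * the flag with seq_cst (full-barrier) semantics plays the role of the
 * smp_mb() requested above: later plain stores to ->func / ->info
 * cannot be reordered before the flag acquisition, so another CPU
 * still reading the old func/info never sees them overwritten early. */
struct csd {
	_Atomic unsigned int flags;	/* 0 or CSD_FLAG_LOCK */
	void (*func)(void *);
	void *info;
};

static void csd_lock_sketch(struct csd *data)
{
	unsigned int expected;

	do {
		expected = 0;	/* spin until the flag is clear */
	} while (!atomic_compare_exchange_weak_explicit(&data->flags,
			&expected, CSD_FLAG_LOCK,
			memory_order_seq_cst, memory_order_relaxed));
}

static void csd_unlock_sketch(struct csd *data)
{
	/* release: func/info stores are visible before the flag clears */
	atomic_store_explicit(&data->flags, 0, memory_order_release);
}
```

With this ordering a caller can safely do csd_lock_sketch(d); then assign d->func and d->info, without those stores leaking ahead of the lock acquisition.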

> +}
> +
> +static void csd_unlock(struct call_single_data *data)
> +{
> +	WARN_ON(!(data->flags & CSD_FLAG_LOCK));
> +	/*
> +	 * ensure we're all done before releasing data
> +	 */
> +	smp_mb();
> +	data->flags &= ~CSD_FLAG_LOCK;
>  }
> 
>  /*
> @@ -89,16 +174,7 @@ static void generic_exec_single(int cpu,
>  		arch_send_call_function_single_ipi(cpu);
> 
>  	if (wait)
> -		csd_flag_wait(data);
> -}
> -
> -static void rcu_free_call_data(struct rcu_head *head)
> -{
> -	struct call_function_data *data;
> -
> -	data = container_of(head, struct call_function_data, rcu_head);
> -
> -	kfree(data);
> +		csd_wait(data);
>  }
> 
>  /*
> @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
>  	 * It's ok to use list_for_each_rcu() here even though we may delete
>  	 * 'pos', since list_del_rcu() doesn't clear ->next
>  	 */
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
> +	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {

OK...  What prevents the following sequence of events?

o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
	including CPU 2.  CPU 0 therefore enqueues this on the global
	call_function.queue.  "wait" is not specified, so CPU 0 returns
	immediately after sending the IPIs.

o	CPU 1 disables irqs and leaves them disabled for awhile.

o	CPU 2 receives the IPI, and duly calls the needed function.
	It decrements the ->refs field, but, finding the result
	non-zero, refrains from removing the element that CPU 0
	enqueued, and also refrains from invoking csd_unlock().

o	CPU 3 also receives the IPI, and also calls the needed function.
	Now, only CPU 1 need receive the IPI for the element to be
	removed.

o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
	but -not- including CPU 2.  CPU 3 therefore also enqueues this on the
	global call_function.queue and sends the IPIs, but no IPI for
	CPU 2.	Your choice as to whether CPU 3 waits or not.

o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
	and then...

o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
	finds CPU 0's element on the queue, calls the function,
	decrements the ->refs field, and finds that it is zero.
	So, CPU 1 invokes list_del_rcu() to remove the element
	(OK so far, as list_del_rcu() doesn't overwrite the next
	pointer), then invokes csd_unlock() to release the element.

o	CPU 0 then invokes another smp_call_function_many(), also
	multiple CPUs, but -not- to CPU 2.  It requeues the element
	that was just csd_unlock()ed above, carrying CPU 2 with it.
	It IPIs CPUs 1 and 3, but not CPU 2.

o	CPU 2 continues, and falls off the bottom of the list.  It will
	continue to ignore CPU 0's IPI until some other CPU IPIs it.
	On some architectures, a single-target IPI won't cut it, only
	a multi-target IPI.

Or am I missing something subtle here?

If this really is a problem, there are a number of counter-based solutions
to it.  (Famous last words...)

Abandoning all caution and attempting one on the fly...  Make each CPU
receiving an IPI increment one per-CPU counter upon entry, and increment
it again upon exit with memory barriers after and before, respectively.
Then any CPU with an even value can be ignored, and any CPU whose value
changes can also be ignored.  Of course, this means you have to scan all
CPUs...  But in the worst case, you also had to IPI them all.

Given that this operation is relatively rare, it might be worth using
shared reference counters, possibly one pair of such counters per (say)
16 CPUs.  Then the caller flips the counter.

Alternatively, you can explain to me why my scenario above cannot
happen -- but at present, it will take some serious explaining!!!

>  		int refs;
> 
> -		if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits)))
> +		spin_lock(&data->lock);
> +		if (!cpumask_test_cpu(cpu, data->cpumask)) {
> +			spin_unlock(&data->lock);
>  			continue;
> +		}
> +		cpumask_clear_cpu(cpu, data->cpumask);
> +		spin_unlock(&data->lock);
> 
>  		data->csd.func(data->csd.info);
> 
>  		spin_lock(&data->lock);
> -		cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits));
>  		WARN_ON(data->refs == 0);
> -		data->refs--;
> -		refs = data->refs;
> +		refs = --data->refs;
> +		if (!refs) {
> +			spin_lock(&call_function.lock);
> +			list_del_rcu(&data->csd.list);
> +			spin_unlock(&call_function.lock);
> +		}
>  		spin_unlock(&data->lock);
> 
>  		if (refs)
>  			continue;
> 
> -		spin_lock(&call_function_lock);
> -		list_del_rcu(&data->csd.list);
> -		spin_unlock(&call_function_lock);
> -
> -		if (data->csd.flags & CSD_FLAG_WAIT) {
> -			/*
> -			 * serialize stores to data with the flag clear
> -			 * and wakeup
> -			 */
> -			smp_wmb();
> -			data->csd.flags &= ~CSD_FLAG_WAIT;
> -		}
> -		if (data->csd.flags & CSD_FLAG_ALLOC)
> -			call_rcu(&data->rcu_head, rcu_free_call_data);
> +		csd_complete(&data->csd);
> +		csd_unlock(&data->csd);
>  	}
> -	rcu_read_unlock();
> 
>  	put_cpu();
>  }
> @@ -192,14 +262,14 @@ void generic_smp_call_function_single_in
> 
>  		data->func(data->info);
> 
> -		if (data_flags & CSD_FLAG_WAIT) {
> -			smp_wmb();
> -			data->flags &= ~CSD_FLAG_WAIT;
> -		} else if (data_flags & CSD_FLAG_LOCK) {
> -			smp_wmb();
> -			data->flags &= ~CSD_FLAG_LOCK;
> -		} else if (data_flags & CSD_FLAG_ALLOC)
> -			kfree(data);
> +		if (data_flags & CSD_FLAG_WAIT)
> +			csd_complete(data);
> +
> +		/*
> +		 * Unlocked CSDs are valid through generic_exec_single()
> +		 */
> +		if (data_flags & CSD_FLAG_LOCK)
> +			csd_unlock(data);
>  	}
>  }
> 
> @@ -218,7 +288,9 @@ static DEFINE_PER_CPU(struct call_single
>  int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
>  			     int wait)
>  {
> -	struct call_single_data d;
> +	struct call_single_data d = {
> +		.flags = 0,
> +	};
>  	unsigned long flags;
>  	/* prevent preemption and reschedule on another processor,
>  	   as well as CPU removal */
> @@ -239,13 +311,11 @@ int smp_call_function_single(int cpu, vo
>  			/*
>  			 * We are calling a function on a single CPU
>  			 * and we are not going to wait for it to finish.
> -			 * We first try to allocate the data, but if we
> -			 * fail, we fall back to use a per cpu data to pass
> -			 * the information to that CPU. Since all callers
> -			 * of this code will use the same data, we must
> -			 * synchronize the callers to prevent a new caller
> -			 * from corrupting the data before the callee
> -			 * can access it.
> +			 * We use a per cpu data to pass the information to
> +			 * that CPU. Since all callers of this code will
> +			 * use the same data, we must synchronize the
> +			 * callers to prevent a new caller from corrupting
> +			 * the data before the callee can access it.
>  			 *
>  			 * The CSD_FLAG_LOCK is used to let us know when
>  			 * the IPI handler is done with the data.
> @@ -255,18 +325,11 @@ int smp_call_function_single(int cpu, vo
>  			 * will make sure the callee is done with the
>  			 * data before a new caller will use it.
>  			 */
> -			data = kmalloc(sizeof(*data), GFP_ATOMIC);
> -			if (data)
> -				data->flags = CSD_FLAG_ALLOC;
> -			else {
> -				data = &per_cpu(csd_data, me);
> -				while (data->flags & CSD_FLAG_LOCK)
> -					cpu_relax();
> -				data->flags = CSD_FLAG_LOCK;
> -			}
> +			data = &per_cpu(csd_data, me);
> +			csd_lock(data);
>  		} else {
>  			data = &d;
> -			data->flags = CSD_FLAG_WAIT;
> +			csd_wait_prepare(data);
>  		}
> 
>  		data->func = func;
> @@ -326,14 +389,14 @@ void smp_call_function_many(const struct
>  {
>  	struct call_function_data *data;
>  	unsigned long flags;
> -	int cpu, next_cpu;
> +	int cpu, next_cpu, me = smp_processor_id();
> 
>  	/* Can deadlock when called with interrupts disabled */
>  	WARN_ON(irqs_disabled());
> 
>  	/* So, what's a CPU they want?  Ignoring this one. */
>  	cpu = cpumask_first_and(mask, cpu_online_mask);
> -	if (cpu == smp_processor_id())
> +	if (cpu == me)
>  		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
>  	/* No online cpus?  We're done. */
>  	if (cpu >= nr_cpu_ids)
> @@ -341,7 +404,7 @@ void smp_call_function_many(const struct
> 
>  	/* Do we have another CPU which isn't us? */
>  	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
> -	if (next_cpu == smp_processor_id())
> +	if (next_cpu == me)
>  		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
> 
>  	/* Fastpath: do that cpu by itself. */
> @@ -350,31 +413,28 @@ void smp_call_function_many(const struct
>  		return;
>  	}
> 
> -	data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC);
> -	if (unlikely(!data)) {
> -		/* Slow path. */
> -		for_each_online_cpu(cpu) {
> -			if (cpu == smp_processor_id())
> -				continue;
> -			if (cpumask_test_cpu(cpu, mask))
> -				smp_call_function_single(cpu, func, info, wait);
> -		}
> -		return;
> -	}
> +	data = &per_cpu(cfd_data, me);
> +	csd_lock(&data->csd);
> 
> -	spin_lock_init(&data->lock);
> -	data->csd.flags = CSD_FLAG_ALLOC;
> +	spin_lock_irqsave(&data->lock, flags);
>  	if (wait)
> -		data->csd.flags |= CSD_FLAG_WAIT;
> +		csd_wait_prepare(&data->csd);
> +
>  	data->csd.func = func;
>  	data->csd.info = info;
> -	cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask);
> -	cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits));
> -	data->refs = cpumask_weight(to_cpumask(data->cpumask_bits));
> -
> -	spin_lock_irqsave(&call_function_lock, flags);
> -	list_add_tail_rcu(&data->csd.list, &call_function_queue);
> -	spin_unlock_irqrestore(&call_function_lock, flags);
> +	cpumask_and(data->cpumask, mask, cpu_online_mask);
> +	cpumask_clear_cpu(me, data->cpumask);
> +	data->refs = cpumask_weight(data->cpumask);
> +
> +	spin_lock(&call_function.lock);
> +	/*
> +	 * Place entry at the _HEAD_ of the list, so that any cpu still
> +	 * observing the entry in generic_smp_call_function_interrupt() will
> +	 * not miss any other list entries.
> +	 */
> +	list_add_rcu(&data->csd.list, &call_function.queue);
> +	spin_unlock(&call_function.lock);
> +	spin_unlock_irqrestore(&data->lock, flags);
> 
>  	/*
>  	 * Make the list addition visible before sending the ipi.
> @@ -384,11 +444,11 @@ void smp_call_function_many(const struct
>  	smp_mb();
> 
>  	/* Send a message to all CPUs in the map */
> -	arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits));
> +	arch_send_call_function_ipi_mask(data->cpumask);
> 
>  	/* optionally wait for the CPUs to complete */
>  	if (wait)
> -		csd_flag_wait(&data->csd);
> +		csd_wait(&data->csd);
>  }
>  EXPORT_SYMBOL(smp_call_function_many);
> 
> @@ -418,20 +478,20 @@ EXPORT_SYMBOL(smp_call_function);
> 
>  void ipi_call_lock(void)
>  {
> -	spin_lock(&call_function_lock);
> +	spin_lock(&call_function.lock);
>  }
> 
>  void ipi_call_unlock(void)
>  {
> -	spin_unlock(&call_function_lock);
> +	spin_unlock(&call_function.lock);
>  }
> 
>  void ipi_call_lock_irq(void)
>  {
> -	spin_lock_irq(&call_function_lock);
> +	spin_lock_irq(&call_function.lock);
>  }
> 
>  void ipi_call_unlock_irq(void)
>  {
> -	spin_unlock_irq(&call_function_lock);
> +	spin_unlock_irq(&call_function.lock);
>  }
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
  2009-02-18  0:28   ` Paul E. McKenney
@ 2009-02-18  5:31   ` Rusty Russell
  2009-02-18 10:52     ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Rusty Russell @ 2009-02-18  5:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, linux-kernel, Oleg Nesterov

On Wednesday 18 February 2009 08:29:06 Peter Zijlstra wrote:
> Remove the use of kmalloc() from the smp_call_function_*() calls.

This patch is actually quite nice.  Not sure it's correct, but it's neat :)

One minor quibble:

> +			data = &per_cpu(csd_data, me);

"data = &__get_cpu_var(csd_data);" is slightly preferred.  A few less cycles.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] generic-ipi: simplify the barriers
  2009-02-18  0:27   ` Paul E. McKenney
@ 2009-02-18  9:45     ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-18  9:45 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar,
	Rusty Russell, linux-kernel, Oleg Nesterov

On Tue, 2009-02-17 at 16:27 -0800, Paul E. McKenney wrote:

> While I sympathize with pushing memory barriers down into the
> arch-specific functions, you -are- running this by the various
> arch maintainers so that they have an opportunity to adjust, right?

Nick's initial posting of this patch included linux-arch -- maybe I
should have added that too.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18  0:28   ` Paul E. McKenney
@ 2009-02-18 10:42     ` Peter Zijlstra
  2009-02-18 16:06       ` Paul E. McKenney
  2009-02-18 16:15     ` Oleg Nesterov
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-18 10:42 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar,
	Rusty Russell, linux-kernel, Oleg Nesterov

On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote:
> On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:

> > +static int
> > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > +{
> > +	long cpu = (long)hcpu;
> > +	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
> > +
> > +	switch (action) {
> > +	case CPU_UP_PREPARE:
> > +	case CPU_UP_PREPARE_FROZEN:
> > +		if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> > +				cpu_to_node(cpu)))
> > +			return NOTIFY_BAD;
> > +		break;
> > +
> > +#ifdef CONFIG_CPU_HOTPLUG
> > +	case CPU_UP_CANCELED:
> > +	case CPU_UP_CANCELED_FROZEN:
> > +
> > +	case CPU_DEAD:
> > +	case CPU_DEAD_FROZEN:
> > +		free_cpumask_var(cfd->cpumask);
> > +		break;
> > +#endif
> > +
> > +	return NOTIFY_OK;
> > +	};
> > +}
> 
> Hmmm....  Why not the following?  Do we really need to free the cpumask
> when a CPU departs, given that it might return later?

Probably not, but that's what we have these callbacks for, might as well
use them.

> > +/*
> > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> > + *
> > + * For non-synchronous ipi calls the csd can still be in use by the previous
> > + * function call. For multi-cpu calls its even more interesting as we'll have
> > + * to ensure no other cpu is observing our csd.
> > + */
> > +static void csd_lock(struct call_single_data *data)
> >  {
> > -	/* Wait for response */
> > -	do {
> > -		if (!(data->flags & CSD_FLAG_WAIT))
> > -			break;
> > +	while (data->flags & CSD_FLAG_LOCK)
> >  		cpu_relax();
> > -	} while (1);
> > +	data->flags = CSD_FLAG_LOCK;
> 
> We do need an smp_mb() here, otherwise, the call from
> smp_call_function_single() could be reordered by either CPU or compiler
> as follows:
> 
> 	data->func = func;
> 	data->info = info;
> 	csd_lock(data);
> 
> This might come as a bit of a surprise to the other CPU still trying to
> use the old values for data->func and data->info.
> 
> Note that this smb_mb() is required even if cpu_relax() contains a
> memory barrier, as it is possible to execute csd_lock_wait() without
> executing the cpu_relax(), if you get there at just the right time.

I'm not quite sure I follow, if we observe !(flags & LOCK) then we're
guaranteed that no cpu still needs func and info. They might still
be observing the list entry, but should not find themselves on the
cpumask -- which is guarded by ->lock.

Anyway, I'm happy to add the mb().

> > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
> >  	 * It's ok to use list_for_each_rcu() here even though we may delete
> >  	 * 'pos', since list_del_rcu() doesn't clear ->next
> >  	 */
> > -	rcu_read_lock();
> > -	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
> > +	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> 
> OK...  What prevents the following sequence of events?
> 
> o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> 	including CPU 2.  CPU 0 therefore enqueues this on the global
> 	call_function.queue.  "wait" is not specified, so CPU 0 returns
> 	immediately after sending the IPIs.
> 
> o	CPU 1 disables irqs and leaves them disabled for awhile.
> 
> o	CPU 2 receives the IPI, and duly calls the needed function.
> 	It decrements the ->refs field, but, finding the result
> 	non-zero, refrains from removing the element that CPU 0
> 	enqueued, and also refrains from invoking csd_unlock().
> 
> o	CPU 3 also receives the IPI, and also calls the needed function.
> 	Now, only CPU 1 need receive the IPI for the element to be
> 	removed.
> 
> o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> > 	but -not- including CPU 2.  CPU 3 therefore also enqueues this on the
> 	global call_function.queue and sends the IPIs, but no IPI for
> 	CPU 2.	Your choice as to whether CPU 3 waits or not.

Right, so the queue is Entry3->Entry0 (we place new entries at the
head).

> o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> 	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> 	and then...

CPU0 ? You just excluded cpu2, cpu1 is still blocked, and cpu3 sent the
ipi.

Furthermore, Entry3 would be first, but suppose it is Entry0, that makes
the scenario work best.

> o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> 	finds CPU 0's element on the queue, calls the function,
> 	decrements the ->refs field, and finds that it is zero.
> 	So, CPU 1 invokes list_del_rcu() to remove the element
> 	(OK so far, as list_del_rcu() doesn't overwrite the next
> 	pointer), then invokes csd_unlock() to release the element.

CPU1 will see CPU3's element first, and CPU0's element second. But OK.

> o	CPU 0 then invokes another smp_call_function_many(), also
> 	multiple CPUs, but -not- to CPU 2.  It requeues the element
> 	that was just csd_unlock()ed above, carrying CPU 2 with it.
> 	It IPIs CPUs 1 and 3, but not CPU 2.
> 
> o	CPU 2 continues, and falls off the bottom of the list.  It will
> 	continue to ignore CPU 0's IPI until some other CPU IPIs it.
> 	On some architectures, a single-target IPI won't cut it, only
> 	a multi-target IPI.
> 
> Or am I missing something subtle here?

Ah, right, yes. I place new entries at the HEAD not TAIL, so that in
this case we go from:

Entry3->Entry0
           ^
         CPU2

to

Entry0->Entry3
  ^
CPU2

and CPU2 will continue the list iteration, visiting Entry3 twice, which
is harmless since it will have removed itself from the cpumask the first
time around.
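The HEAD-vs-TAIL argument can be demonstrated with an ordinary singly linked list. This is a userspace sketch with illustrative names, not the kernel's RCU list primitives; the key property mimicked is that removal leaves the removed node's ->next intact, as list_del_rcu() does.

```c
#include <stddef.h>
#include <assert.h>

struct entry {
	struct entry *next;
	int id;
};

static struct entry *queue;		/* head pointer */

/* New (and reused) entries go at the HEAD of the queue. */
static void add_head(struct entry *e)
{
	e->next = queue;
	queue = e;
}

/* Unlink e, but leave e->next untouched (like list_del_rcu()), so an
 * iterator standing on e can still continue down the list. */
static void del_entry(struct entry *e)
{
	struct entry **pp = &queue;

	while (*pp && *pp != e)
		pp = &(*pp)->next;
	if (*pp)
		*pp = e->next;
}
```

When Entry0 is unlocked and reused at the head, an iterator that was standing on Entry0 follows Entry0->next onto Entry3, which it has already seen, rather than falling off the list past unseen entries.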

> If this really is a problem, there are a number of counter-based solutions
> to it.  (Famous last words...)
> 
> Abandoning all caution and attempting one on the fly...  Make each CPU
> receiving an IPI increment one per-CPU counter upon entry, and increment
> it again upon exit with memory barriers after and before, respectively.
> Then any CPU with an even value can be ignored, and any CPU whose value
> changes can also be ignored.  Of course, this means you have to scan all
> CPUs...  But in the worst case, you also had to IPI them all.
> 
> Given that this operation is relatively rare, it might be worth using
> shared reference counters, possibly one pair of such counters per (say)
> 16 CPUs.  Then the caller flips the counter.

Yep, I almost implemented a counting RCU which increments a global
counter on IPI entry and decrements on IPI exit, but then did the above
FIFO->FILO queue thingy.

> Alternatively, you can explain to me why my scenario above cannot
> happen -- but at present, it will take some serious explaining!!!

I hope to have adequately explained it, but please, feel free to poke
more holes into it ;-)




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18  5:31   ` Rusty Russell
@ 2009-02-18 10:52     ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-18 10:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, linux-kernel, Oleg Nesterov

On Wed, 2009-02-18 at 16:01 +1030, Rusty Russell wrote:
> On Wednesday 18 February 2009 08:29:06 Peter Zijlstra wrote:
> > Remove the use of kmalloc() from the smp_call_function_*() calls.
> 
> This patch is actually quite nice.  Not sure it's correct, but it's neat :)

:-)

> One minor quibble:
> 
> > +			data = &per_cpu(csd_data, me);
> 
> "data = &__get_cpu_var(csd_data);" is slightly preferred.  A few less cycles.

Thanks, next version will include the below.

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -325,7 +325,7 @@ int smp_call_function_single(int cpu, vo
 			 * will make sure the callee is done with the
 			 * data before a new caller will use it.
 			 */
-			data = &per_cpu(csd_data, me);
+			data = &__get_cpu_var(csd_data);
 			csd_lock(data);
 		} else {
 			data = &d;
@@ -413,7 +413,7 @@ void smp_call_function_many(const struct
 		return;
 	}
 
-	data = &per_cpu(cfd_data, me);
+	data = &__get_cpu_var(cfd_data);
 	csd_lock(&data->csd);
 
 	spin_lock_irqsave(&data->lock, flags);



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 10:42     ` Peter Zijlstra
@ 2009-02-18 16:06       ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-18 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar,
	Rusty Russell, linux-kernel, Oleg Nesterov

On Wed, Feb 18, 2009 at 11:42:17AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> 
> > > +static int
> > > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > > +{
> > > +	long cpu = (long)hcpu;
> > > +	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
> > > +
> > > +	switch (action) {
> > > +	case CPU_UP_PREPARE:
> > > +	case CPU_UP_PREPARE_FROZEN:
> > > +		if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> > > +				cpu_to_node(cpu)))
> > > +			return NOTIFY_BAD;
> > > +		break;
> > > +
> > > +#ifdef CONFIG_HOTPLUG_CPU
> > > +	case CPU_UP_CANCELED:
> > > +	case CPU_UP_CANCELED_FROZEN:
> > > +
> > > +	case CPU_DEAD:
> > > +	case CPU_DEAD_FROZEN:
> > > +		free_cpumask_var(cfd->cpumask);
> > > +		break;
> > > +#endif
> > > +	};
> > > +
> > > +	return NOTIFY_OK;
> > > +}
> > 
> > Hmmm....  Why not the following?  Do we really need to free the cpumask
> > when a CPU departs, given that it might return later?
> 
> Probably not, but that's what we have these callbacks for, might as well
> use them.

Fair enough...

> > > +/*
> > > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> > > + *
> > > + * For non-synchronous ipi calls the csd can still be in use by the previous
> > > + * function call. For multi-cpu calls its even more interesting as we'll have
> > > + * to ensure no other cpu is observing our csd.
> > > + */
> > > +static void csd_lock(struct call_single_data *data)
> > >  {
> > > -	/* Wait for response */
> > > -	do {
> > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > -			break;
> > > +	while (data->flags & CSD_FLAG_LOCK)
> > >  		cpu_relax();
> > > -	} while (1);
> > > +	data->flags = CSD_FLAG_LOCK;
> > 
> > We do need an smp_mb() here, otherwise, the call from
> > smp_call_function_single() could be reordered by either CPU or compiler
> > as follows:
> > 
> > 	data->func = func;
> > 	data->info = info;
> > 	csd_lock(data);
> > 
> > This might come as a bit of a surprise to the other CPU still trying to
> > use the old values for data->func and data->info.
> > 
> > Note that this smp_mb() is required even if cpu_relax() contains a
> > memory barrier, as it is possible to execute csd_lock_wait() without
> > executing the cpu_relax(), if you get there at just the right time.
> 
> I'm not quite sure I follow, if we observe !(flags & LOCK) then we're
> guaranteed that no cpu still needs func and info. They might still
> be observing the list entry, but should not find themselves on the
> cpumask -- which is guarded by ->lock.

The compiler could reorder as above, in which case the other CPU might
still be looking at the ->func and ->info fields when the stores happen,
but might have done csd_unlock() by the time this CPU does its
csd_lock().

> Anyway, I'm happy to add the mb().

Please!  ;-)

> > > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
> > >  	 * It's ok to use list_for_each_rcu() here even though we may delete
> > >  	 * 'pos', since list_del_rcu() doesn't clear ->next
> > >  	 */
> > > -	rcu_read_lock();
> > > -	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
> > > +	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> > 
> > OK...  What prevents the following sequence of events?
> > 
> > o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> > 	including CPU 2.  CPU 0 therefore enqueues this on the global
> > 	call_function.queue.  "wait" is not specified, so CPU 0 returns
> > 	immediately after sending the IPIs.
> > 
> > o	CPU 1 disables irqs and leaves them disabled for awhile.
> > 
> > o	CPU 2 receives the IPI, and duly calls the needed function.
> > 	It decrements the ->refs field, but, finding the result
> > 	non-zero, refrains from removing the element that CPU 0
> > 	enqueued, and also refrains from invoking csd_unlock().
> > 
> > o	CPU 3 also receives the IPI, and also calls the needed function.
> > 	Now, only CPU 1 need receive the IPI for the element to be
> > 	removed.
> > 
> > o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> > 	but -not- including CPU 2.  CPU 3 therefore also enqueues this on the
> > 	global call_function.queue and sends the IPIs, but no IPI for
> > 	CPU 2.	Your choice as to whether CPU 3 waits or not.
> 
> Right, so the queue is Entry3->Entry0 (we place new entries on at the
> head).

Gah!!!  I even remember looking at the list_add_rcu() and noting that
it was not a list_add_tail_rcu().

Please accept my apologies for my confusion!!!

I suppose that a given CPU might get repeatedly recycled to the beginning
of the list, but someone would have to make that happen before I would
be inclined to worry about it.

> > o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> > 	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> > 	and then...
> 
> CPU0? You just excluded cpu2, cpu1 is still blocked, and cpu3 sent the
> ipi.
> 
> Furthermore, Entry3 would be first, but suppose it is Entry0, that makes
> the scenario work best.
> 
> > o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> > 	finds CPU 0's element on the queue, calls the function,
> > 	decrements the ->refs field, and finds that it is zero.
> > 	So, CPU 1 invokes list_del_rcu() to remove the element
> > 	(OK so far, as list_del_rcu() doesn't overwrite the next
> > 	pointer), then invokes csd_unlock() to release the element.
> 
> CPU1 will see CPU3's element first, and CPU0's element second. But OK.
> 
> > o	CPU 0 then invokes another smp_call_function_many(), also
> > 	multiple CPUs, but -not- to CPU 2.  It requeues the element
> > 	that was just csd_unlock()ed above, carrying CPU 2 with it.
> > 	It IPIs CPUs 1 and 3, but not CPU 2.
> > 
> > o	CPU 2 continues, and falls off the bottom of the list.  It will
> > 	continue to ignore CPU 0's IPI until some other CPU IPIs it.
> > 	On some architectures, a single-target IPI won't cut it, only
> > 	a multi-target IPI.
> > 
> > Or am I missing something subtle here?
> 
> Ah, right, yes. I place new entries at the HEAD not TAIL, so that in
> this case we go from:
> 
> Entry3->Entry0
>            ^
>          CPU2
> 
> to
> 
> Entry0->Entry3
>   ^
> CPU2
> 
> and CPU2 will continue the list iteration, visiting Entry3 twice, which
> is harmless since it will have removed itself from the cpumask the first
> time around.
> 
> > If this really is a problem, there are a number of counter-based solutions
> > to it.  (Famous last words...)
> > 
> > Abandoning all caution and attempting one on the fly...  Make each CPU
> > receiving an IPI increment one per-CPU counter upon entry, and increment
> > it again upon exit with memory barriers after and before, respectively.
> > Then any CPU with an even value can be ignored, and any CPU whose value
> > changes can also be ignored.  Of course, this means you have to scan all
> > CPUs...  But in the worst case, you also had to IPI them all.
> > 
> > Given that this operation is relatively rare, it might be worth using
> > shared reference counters, possibly one pair of such counters per (say)
> > 16 CPUs.  Then the caller flips the counter.
> 
> Yep, I almost implemented a counting RCU which increments a global
> counter on IPI entry and decrements on IPI exit, but then did the above
> FIFO->FILO queue thingy.

A simpler way (if it becomes necessary to add at the tail rather than
the head) would be to periodically check for stalls of this sort and
then resend the IPIs.
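Paul's even/odd counter idea can be sketched in C11 (names and the fixed CPU count are illustrative; this is not proposed kernel code). Each CPU bumps a per-CPU generation counter on IPI entry and again on exit, so an even value means "not in a handler" and any change from a snapshot means "made progress":

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 8

static atomic_long ipi_gen[NR_CPUS];	/* even = idle, odd = in handler */

void ipi_enter(int cpu) { atomic_fetch_add(&ipi_gen[cpu], 1); /* now odd */ }
void ipi_exit(int cpu)  { atomic_fetch_add(&ipi_gen[cpu], 1); /* now even */ }

/*
 * Caller side: a CPU can be ignored once it is seen outside a handler
 * (even counter) or once its counter changes from the snapshot.
 */
void wait_for_quiescence(void)
{
	long snap[NR_CPUS];
	bool done[NR_CPUS] = { false };
	bool all;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		snap[cpu] = atomic_load(&ipi_gen[cpu]);
		if ((snap[cpu] & 1) == 0)
			done[cpu] = true;	/* even: ignore */
	}
	do {
		all = true;
		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			if (done[cpu])
				continue;
			if (atomic_load(&ipi_gen[cpu]) != snap[cpu])
				done[cpu] = true;	/* changed: ignore */
			else
				all = false;
		}
	} while (!all);
}
```

This is the scan-all-CPUs cost Paul mentions; his "pair of shared counters per 16 CPUs" variant would aggregate these counters per group to cut the scan down.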

> > Alternatively, you can explain to me why my scenario above cannot
> > happen -- but at present, it will take some serious explaining!!!
> 
> I hope to have adequately explained it, but please, feel free to poke
> more holes into it ;-)

You have indeed more than adequately explained it.  The only thing
resembling a hole that I have found thus far is the improbable scenario
that repeatedly pulls a given CPU to the beginning of the list, so that
this CPU never reaches the end.
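Peter's "revisiting is harmless" property can be modeled with a plain singly-linked list (a sketch, not the kernel's RCU list code; the bitmask type and entry layout are illustrative). A handler acts on an entry only if its CPU bit is still set in the target mask, so seeing an entry a second time is a no-op:

```c
#include <assert.h>
#include <stdint.h>

struct entry {
	struct entry *next;
	uint64_t cpumask;	/* bit N set: CPU N still has work here */
};

/*
 * Model of the interrupt-side scan: act on an entry only if our bit is
 * still set, then clear it. A revisit finds the bit clear and skips.
 */
int scan(struct entry *head, int cpu)
{
	int handled = 0;

	for (struct entry *e = head; e; e = e->next) {
		if (!(e->cpumask & (1ULL << cpu)))
			continue;	/* handled already, or never ours */
		e->cpumask &= ~(1ULL << cpu);
		handled++;		/* "call the function" */
	}
	return handled;
}
```

So when the Entry3->Entry0 queue becomes Entry0->Entry3 under a racing iterator, visiting Entry3 twice costs at most a wasted list hop.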

Again, please accept my apologies for my confusion!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18  0:28   ` Paul E. McKenney
  2009-02-18 10:42     ` Peter Zijlstra
@ 2009-02-18 16:15     ` Oleg Nesterov
  2009-02-18 19:47       ` Paul E. McKenney
  1 sibling, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2009-02-18 16:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On 02/17, Paul E. McKenney wrote:
>
> On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > +static void csd_lock(struct call_single_data *data)
> >  {
> > -	/* Wait for response */
> > -	do {
> > -		if (!(data->flags & CSD_FLAG_WAIT))
> > -			break;
> > +	while (data->flags & CSD_FLAG_LOCK)
> >  		cpu_relax();
> > -	} while (1);
> > +	data->flags = CSD_FLAG_LOCK;
>
> We do need an smp_mb() here, otherwise, the call from
> smp_call_function_single() could be reordered by either CPU or compiler
> as follows:
>
> 	data->func = func;
> 	data->info = info;
> 	csd_lock(data);
>
> This might come as a bit of a surprise to the other CPU still trying to
> use the old values for data->func and data->info.

Could you explain a bit more here?

The compiler can't re-order this code due to cpu_relax(). The CPU can
re-order, but this doesn't matter because both the sender and ipi
handler take call_single_queue->lock.

And, given that csd_unlock() does mb() before clearing the flag, how
is it possible that the other CPU (ipi handler) still uses the old
values in *data after we see !CSD_FLAG_LOCK?

> Note that this smp_mb() is required even if cpu_relax() contains a
> memory barrier, as it is possible to execute csd_lock_wait() without
> executing the cpu_relax(), if you get there at just the right time.

Can't understand... Nobody can do csd_lock_wait() on this per-cpu entry,
but I guess you meant something else.

> OK...  What prevents the following sequence of events?
>
> o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> 	including CPU 2.  CPU 0 therefore enqueues this on the global
> 	call_function.queue.  "wait" is not specified, so CPU 0 returns
> 	immediately after sending the IPIs.
>
> o	CPU 2 receives the IPI, and duly calls the needed function.
> 	It decrements the ->refs field, but, finding the result
> 	non-zero, refrains from removing the element that CPU 0
> 	enqueued, and also refrains from invoking csd_unlock().
>
> o	CPU 3 also receives the IPI, and also calls the needed function.
> 	Now, only CPU 1 need receive the IPI for the element to be
> 	removed.

so we have a single entry E0 on list,

> o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> 	but -not- including CPU 2.  CPU 3 therefore also enqueues this on the
> 	global call_function.queue and sends the IPIs, but no IPI for
> 	CPU 2.	Your choice as to whether CPU 3 waits or not.

now we have E3 -> E0

> o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> 	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> 	and then...

it actually sees E3, not E0

> o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> 	finds CPU 0's element on the queue, calls the function,
> 	decrements the ->refs field, and finds that it is zero.
> 	So, CPU 1 invokes list_del_rcu() to remove the element
> 	(OK so far, as list_del_rcu() doesn't overwrite the next
> 	pointer), then invokes csd_unlock() to release the element.
>
> o	CPU 0 then invokes another smp_call_function_many(), also
> 	multiple CPUs, but -not- to CPU 2.  It requeues the element
> 	that was just csd_unlock()ed above, carrying CPU 2 with it.
> 	It IPIs CPUs 1 and 3, but not CPU 2.

and inserts the element E0 at the head of the list again,

>
> o	CPU 2 continues, and falls off the bottom of the list.

afaics, it doesn't.

Every time smp_call_function_many() reuses the element, it sets its
->next pointer to the head of the list. If we race with another CPU
which fetches this pointer, this CPU has to re-scan the whole list,
but since we always modify/read data under data->lock this should
be safe, that CPU must notice (!cpumask_test_cpu(cpu, data->cpumask)).

No?

Oleg.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 16:15     ` Oleg Nesterov
@ 2009-02-18 19:47       ` Paul E. McKenney
  2009-02-18 20:12         ` Oleg Nesterov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-18 19:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
> On 02/17, Paul E. McKenney wrote:
> >
> > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > +static void csd_lock(struct call_single_data *data)
> > >  {
> > > -	/* Wait for response */
> > > -	do {
> > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > -			break;
> > > +	while (data->flags & CSD_FLAG_LOCK)
> > >  		cpu_relax();
> > > -	} while (1);
> > > +	data->flags = CSD_FLAG_LOCK;
> >
> > We do need an smp_mb() here, otherwise, the call from
> > smp_call_function_single() could be reordered by either CPU or compiler
> > as follows:
> >
> > 	data->func = func;
> > 	data->info = info;
> > 	csd_lock(data);
> >
> > This might come as a bit of a surprise to the other CPU still trying to
> > use the old values for data->func and data->info.
> 
> Could you explain a bit more here?
> 
> The compiler can't re-order this code due to cpu_relax(). The CPU can
> re-order, but this doesn't matter because both the sender and ipi
> handler take call_single_queue->lock.
> 
> And, given that csd_unlock() does mb() before clearing the flag, how
> is it possible that the other CPU (ipi handler) still uses the old
> values in *data after we see !CSD_FLAG_LOCK?

Good point on cpu_relax(), which appears to be at least a compiler
barrier on all architectures.

I must confess to being in the habit of assuming reordering unless I
can prove that such reordering cannot happen.  I am running tests of
this code snippet on POWER, but do not have access to ARM, which some
say has more tendency to ignore control-flow dependencies than does Power.

I will let people know what comes of the tests.

> > Note that this smp_mb() is required even if cpu_relax() contains a
> > memory barrier, as it is possible to execute csd_lock_wait() without
> > executing the cpu_relax(), if you get there at just the right time.
> 
> Can't understand... Nobody can do csd_lock_wait() on this per-cpu entry,
> but I guess you meant something else.
> 
> > OK...  What prevents the following sequence of events?
> >
> > o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> > 	including CPU 2.  CPU 0 therefore enqueues this on the global
> > 	call_function.queue.  "wait" is not specified, so CPU 0 returns
> > 	immediately after sending the IPIs.
> >
> > o	CPU 2 receives the IPI, and duly calls the needed function.
> > 	It decrements the ->refs field, but, finding the result
> > 	non-zero, refrains from removing the element that CPU 0
> > 	enqueued, and also refrains from invoking csd_unlock().
> >
> > o	CPU 3 also receives the IPI, and also calls the needed function.
> > 	Now, only CPU 1 need receive the IPI for the element to be
> > 	removed.
> 
> so we have a single entry E0 on list,
> 
> > o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> > 	but -not- including CPU 2.  CPU 3 therefore also enqueues this on the
> > 	global call_function.queue and sends the IPIs, but no IPI for
> > 	CPU 2.	Your choice as to whether CPU 3 waits or not.
> 
> now we have E3 -> E0
> 
> > o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> > 	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> > 	and then...
> 
> it actually sees E3, not E0
> 
> > o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> > 	finds CPU 0's element on the queue, calls the function,
> > 	decrements the ->refs field, and finds that it is zero.
> > 	So, CPU 1 invokes list_del_rcu() to remove the element
> > 	(OK so far, as list_del_rcu() doesn't overwrite the next
> > 	pointer), then invokes csd_unlock() to release the element.
> >
> > o	CPU 0 then invokes another smp_call_function_many(), also
> > 	multiple CPUs, but -not- to CPU 2.  It requeues the element
> > 	that was just csd_unlock()ed above, carrying CPU 2 with it.
> > 	It IPIs CPUs 1 and 3, but not CPU 2.
> 
> and inserts the element E0 at the head of the list again,
> 
> >
> > o	CPU 2 continues, and falls off the bottom of the list.
> 
> afaics, it doesn't.
> 
> Every time smp_call_function_many() reuses the element, it sets its
> ->next pointer to the head of the list. If we race with another CPU
> which fetches this pointer, this CPU has to re-scan the whole list,
> but since we always modify/read data under data->lock this should
> be safe, that CPU must notice (!cpumask_test_cpu(cpu, data->cpumask)).
> 
> No?

You are quite correct.  I guess I should have gone home early instead of
reviewing Peter's patch...  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 19:47       ` Paul E. McKenney
@ 2009-02-18 20:12         ` Oleg Nesterov
  2009-02-19  2:40           ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2009-02-18 20:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On 02/18, Paul E. McKenney wrote:
>
> On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
>
> > On 02/17, Paul E. McKenney wrote:
> > >
> > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > > +static void csd_lock(struct call_single_data *data)
> > > >  {
> > > > -	/* Wait for response */
> > > > -	do {
> > > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > > -			break;
> > > > +	while (data->flags & CSD_FLAG_LOCK)
> > > >  		cpu_relax();
> > > > -	} while (1);
> > > > +	data->flags = CSD_FLAG_LOCK;
> > >
> > > We do need an smp_mb() here, otherwise, the call from
> > > smp_call_function_single() could be reordered by either CPU or compiler
> > > as follows:
> > >
> > > 	data->func = func;
> > > 	data->info = info;
> > > 	csd_lock(data);
> > >
> > > This might come as a bit of a surprise to the other CPU still trying to
> > > use the old values for data->func and data->info.
> >
> > Could you explain a bit more here?
> >
> > The compiler can't re-order this code due to cpu_relax(). The CPU can
> > re-order, but this doesn't matter because both the sender and ipi
> > handler take call_single_queue->lock.
> >
> > And, given that csd_unlock() does mb() before clearing the flag, how
> > is it possible that the other CPU (ipi handler) still uses the old
> > values in *data after we see !CSD_FLAG_LOCK?
>
> Good point on cpu_relax(), which appears to be at least a compiler
> barrier on all architectures.
>
> I must confess to being in the habit of assuming reordering unless I
> can prove that such reordering cannot happen.

Yes, probably you are right...

But since almost nobody (except you ;) really understands this magic,
it would be nice to have a comment which explains exactly what is the
reason for mb(). Otherwise it is so hard to read the code, if you
don't understand mb(), then you probably missed something important.

> > Every time smp_call_function_many() reuses the element, it sets its
> > ->next pointer to the head of the list. If we race with another CPU
> > which fetches this pointer, this CPU has to re-scan the whole list,
> > but since we always modify/read data under data->lock this should
> > be safe, that CPU must notice (!cpumask_test_cpu(cpu, data->cpumask)).
>
> You are quite correct.  I guess I should have gone home early instead of
> reviewing Peter's patch...  :-/

In that case I shouldn't even try to read this series ;) I was wrong so
many times...

Oleg.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 20:12         ` Oleg Nesterov
@ 2009-02-19  2:40           ` Paul E. McKenney
  2009-02-19  8:33             ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-19  2:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, Feb 18, 2009 at 09:12:52PM +0100, Oleg Nesterov wrote:
> On 02/18, Paul E. McKenney wrote:
> >
> > On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
> >
> > > On 02/17, Paul E. McKenney wrote:
> > > >
> > > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > > > +static void csd_lock(struct call_single_data *data)
> > > > >  {
> > > > > -	/* Wait for response */
> > > > > -	do {
> > > > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > > > -			break;
> > > > > +	while (data->flags & CSD_FLAG_LOCK)
> > > > >  		cpu_relax();
> > > > > -	} while (1);
> > > > > +	data->flags = CSD_FLAG_LOCK;
> > > >
> > > > We do need an smp_mb() here, otherwise, the call from
> > > > smp_call_function_single() could be reordered by either CPU or compiler
> > > > as follows:
> > > >
> > > > 	data->func = func;
> > > > 	data->info = info;
> > > > 	csd_lock(data);
> > > >
> > > > This might come as a bit of a surprise to the other CPU still trying to
> > > > use the old values for data->func and data->info.
> > >
> > > Could you explain a bit more here?
> > >
> > > The compiler can't re-order this code due to cpu_relax(). The CPU can
> > > re-order, but this doesn't matter because both the sender and ipi
> > > handler take call_single_queue->lock.
> > >
> > > And, given that csd_unlock() does mb() before clearing the flag, how
> > > is it possible that the other CPU (ipi handler) still uses the old
> > > values in *data after we see !CSD_FLAG_LOCK?
> >
> > Good point on cpu_relax(), which appears to be at least a compiler
> > barrier on all architectures.
> >
> > I must confess to being in the habit of assuming reordering unless I
> > can prove that such reordering cannot happen.
> 
> Yes, probably you are right...
> 
> But since almost nobody (except you ;) really understands this magic,
> it would be nice to have a comment which explains exactly what is the
> reason for mb(). Otherwise it is so hard to read the code, if you
> don't understand mb(), then you probably missed something important.

An hour-long stress test on both Power 5 and Power 6 failed to
locate a problem, though that of course does not prove lack of a
problem, particularly for CPUs that don't pay as much attention to
control-dependency ordering as Power does.  :-/

The test may be found in CodeSamples/advsync/special/mbtest/mb_lhs_ws.c
in git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git,
if anyone wants to take a look.

In the meantime, I nominate the following comment at the end of
csd_lock():

	/*
	 * prevent CPU from reordering the above assignment to ->flags
	 * with any subsequent assignments to other fields of the
	 * specified call_single_data structure.
	 */

	smp_mb(); /* See above block comment. */
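Folded into the patch's helper, the result would look roughly like this (a sketch against the csd_lock() shown earlier in the thread, with Paul's nominated comment; not necessarily the final committed form):

```c
static void csd_lock(struct call_single_data *data)
{
	while (data->flags & CSD_FLAG_LOCK)
		cpu_relax();
	data->flags = CSD_FLAG_LOCK;

	/*
	 * prevent CPU from reordering the above assignment to ->flags
	 * with any subsequent assignments to other fields of the
	 * specified call_single_data structure.
	 */
	smp_mb();	/* See above block comment. */
}
```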

> > > Every time smp_call_function_many() reuses the element, it sets its
> > > ->next pointer to the head of the list. If we race with another CPU
> > > which fetches this pointer, this CPU has to re-scan the whole list,
> > > but since we always modify/read data under data->lock this should
> > > be safe, that CPU must notice (!cpumask_test_cpu(cpu, data->cpumask)).
> >
> > You are quite correct.  I guess I should have gone home early instead of
> > reviewing Peter's patch...  :-/
> 
> In that case I shouldn't even try to read this series ;) I was wrong so
> many times...

I suppose that I should be happy to be wrong nine times if I find a subtle
bug on the tenth attempt, but somehow it doesn't feel that way.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-19  2:40           ` Paul E. McKenney
@ 2009-02-19  8:33             ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-19  8:33 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, 2009-02-18 at 18:40 -0800, Paul E. McKenney wrote:
> 
> In the meantime, I nominate the following comment at the end of
> csd_lock():
> 
>         /*
>          * prevent CPU from reordering the above assignment to ->flags
>          * with any subsequent assignments to other fields of the
>          * specified call_single_data structure.
>          */
> 
>         smp_mb(); /* See above block comment. */

Done, thanks Paul!


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-02-19  8:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
2009-02-18  0:27   ` Paul E. McKenney
2009-02-18  9:45     ` Peter Zijlstra
2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
2009-02-18  0:28   ` Paul E. McKenney
2009-02-18 10:42     ` Peter Zijlstra
2009-02-18 16:06       ` Paul E. McKenney
2009-02-18 16:15     ` Oleg Nesterov
2009-02-18 19:47       ` Paul E. McKenney
2009-02-18 20:12         ` Oleg Nesterov
2009-02-19  2:40           ` Paul E. McKenney
2009-02-19  8:33             ` Peter Zijlstra
2009-02-18  5:31   ` Rusty Russell
2009-02-18 10:52     ` Peter Zijlstra
2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra
