* [PATCH 0/3] generic-ipi: patches -v5
@ 2009-02-17 21:59 Peter Zijlstra
2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
Ingo Molnar, Rusty Russell
Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra
against -tip
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/3] generic-ipi: simplify the barriers 2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra @ 2009-02-17 21:59 ` Peter Zijlstra 2009-02-18 0:27 ` Paul E. McKenney 2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra 2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra 2 siblings, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw) To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney, Ingo Molnar, Rusty Russell Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra [-- Attachment #1: nick-simplify-generic-smp-barriers.patch --] [-- Type: text/plain, Size: 5438 bytes --] From: Nick Piggin <npiggin@suse.de> Firstly, just unconditionally take the lock and check the list in the generic_call_function_single_interrupt IPI handler. As we've just taken an IPI here, the chances are fairly high that there will be work on the list for us, so do the locking unconditionally. This removes the tricky lockless list_empty check and dubious barriers. The change looks bigger than it is because it is just removing an outer loop. Secondly, clarify architecture-specific IPI locking rules. Generic code has no tools to impose any sane ordering on IPIs if they go outside normal cache coherency, ergo the arch code must make them appear to obey cache coherency as a "memory operation" to initiate an IPI, and a "memory operation" to receive one. This way at least they can be reasoned about in generic code, and smp_mb() used to provide ordering. The combination of these two changes means that explicit barriers can be taken out of queue handling for the single case -- shared data is explicitly locked, and IPI ordering must conform to that, so no barriers are needed. An extra barrier is needed in the many handler, so as to ensure we load the list element after the IPI is received. Does any architecture actually need barriers? 
For the initiator I could see it, but for the handler I would be surprised. The other thing we could do for simplicity is just to require that a full barrier is required before generating an IPI, and after receiving an IPI. We can't just do that in generic code without auditing architectures. There have been subtle hangs here on some archs in the past. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/smp.c | 83 +++++++++++++++++++++++++++++++---------------------------- 1 file changed, 44 insertions(+), 39 deletions(-) Index: linux-2.6/kernel/smp.c =================================================================== --- linux-2.6.orig/kernel/smp.c +++ linux-2.6/kernel/smp.c @@ -74,9 +74,16 @@ static void generic_exec_single(int cpu, spin_unlock_irqrestore(&dst->lock, flags); /* - * Make the list addition visible before sending the ipi. + * The list addition should be visible before sending the IPI + * handler locks the list to pull the entry off it because of + * normal cache coherency rules implied by spinlocks. + * + * If IPIs can go out of order to the cache coherency protocol + * in an architecture, sufficient synchronisation should be added + * to arch code to make it appear to obey cache coherency WRT + * locking and barrier primitives. Generic code isn't really equipped + * to do the right thing... */ - smp_mb(); if (ipi) arch_send_call_function_single_ipi(cpu); @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt int cpu = get_cpu(); /* + * Ensure entry is visible on call_function_queue after we have + * entered the IPI. See comment in smp_call_function_many. + * If we don't have this, then we may miss an entry on the list + * and never get another IPI to process it. 
+ */ + smp_mb(); + + /* * It's ok to use list_for_each_rcu() here even though we may delete * 'pos', since list_del_rcu() doesn't clear ->next */ @@ -154,49 +169,37 @@ void generic_smp_call_function_single_in { struct call_single_queue *q = &__get_cpu_var(call_single_queue); LIST_HEAD(list); + unsigned int data_flags; - /* - * Need to see other stores to list head for checking whether - * list is empty without holding q->lock - */ - smp_read_barrier_depends(); - while (!list_empty(&q->list)) { - unsigned int data_flags; - - spin_lock(&q->lock); - list_replace_init(&q->list, &list); - spin_unlock(&q->lock); - - while (!list_empty(&list)) { - struct call_single_data *data; - - data = list_entry(list.next, struct call_single_data, - list); - list_del(&data->list); + spin_lock(&q->lock); + list_replace_init(&q->list, &list); + spin_unlock(&q->lock); - /* - * 'data' can be invalid after this call if - * flags == 0 (when called through - * generic_exec_single(), so save them away before - * making the call. - */ - data_flags = data->flags; + while (!list_empty(&list)) { + struct call_single_data *data; - data->func(data->info); + data = list_entry(list.next, struct call_single_data, + list); + list_del(&data->list); - if (data_flags & CSD_FLAG_WAIT) { - smp_wmb(); - data->flags &= ~CSD_FLAG_WAIT; - } else if (data_flags & CSD_FLAG_LOCK) { - smp_wmb(); - data->flags &= ~CSD_FLAG_LOCK; - } else if (data_flags & CSD_FLAG_ALLOC) - kfree(data); - } /* - * See comment on outer loop + * 'data' can be invalid after this call if + * flags == 0 (when called through + * generic_exec_single(), so save them away before + * making the call. 
*/ - smp_read_barrier_depends(); + data_flags = data->flags; + + data->func(data->info); + + if (data_flags & CSD_FLAG_WAIT) { + smp_wmb(); + data->flags &= ~CSD_FLAG_WAIT; + } else if (data_flags & CSD_FLAG_LOCK) { + smp_wmb(); + data->flags &= ~CSD_FLAG_LOCK; + } else if (data_flags & CSD_FLAG_ALLOC) + kfree(data); } } @@ -375,6 +378,8 @@ void smp_call_function_many(const struct /* * Make the list addition visible before sending the ipi. + * (IPIs must obey or appear to obey normal Linux cache coherency + * rules -- see comment in generic_exec_single). */ smp_mb(); -- ^ permalink raw reply [flat|nested] 16+ messages in thread
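The shape this patch leaves behind -- take the queue lock unconditionally, splice all pending entries onto a private list, drop the lock, then run the callbacks -- can be sketched in plain user-space C. This is an illustrative mock, not kernel code: the mock_* names are invented, a pthread mutex stands in for the per-cpu spinlock, and an intrusive singly-linked list stands in for the kernel's list_head machinery.

```c
#include <pthread.h>
#include <stddef.h>

/* Invented stand-ins for struct call_single_data / call_single_queue. */
struct mock_csd {
	struct mock_csd *next;
	void (*func)(void *);
	void *info;
};

struct mock_queue {
	pthread_mutex_t lock;	/* plays the role of q->lock */
	struct mock_csd *head;	/* pending entries */
};

/* Demo callback state, used by the example below. */
static int mock_calls;
static void mock_fn(void *info)
{
	mock_calls += (int)(long)info;
}

/* Initiator side (cf. generic_exec_single): the list addition happens
 * under the lock, so the handler's lock acquisition orders it -- no
 * extra barrier needed in generic code. */
void mock_queue_add(struct mock_queue *q, struct mock_csd *csd)
{
	pthread_mutex_lock(&q->lock);
	csd->next = q->head;
	q->head = csd;
	pthread_mutex_unlock(&q->lock);
	/* the kernel would send the IPI here */
}

/* Handler side (cf. generic_smp_call_function_single_interrupt):
 * unconditionally take the lock, splice everything onto a private
 * list, drop the lock, then run the callbacks -- no outer loop, no
 * lockless list_empty() check. Returns the number of entries handled. */
int mock_handle_ipi(struct mock_queue *q)
{
	struct mock_csd *list, *next;
	int handled = 0;

	pthread_mutex_lock(&q->lock);
	list = q->head;		/* cf. list_replace_init(&q->list, &list) */
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);

	for (; list; list = next) {
		next = list->next;	/* 'list' may be reused once func runs */
		list->func(list->info);
		handled++;
	}
	return handled;
}
```

A caller that enqueues two entries and then simulates the IPI sees both callbacks run exactly once, with the queue left empty -- the property the patch preserves while deleting the barriers.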
* Re: [PATCH 1/3] generic-ipi: simplify the barriers 2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra @ 2009-02-18 0:27 ` Paul E. McKenney 2009-02-18 9:45 ` Peter Zijlstra 0 siblings, 1 reply; 16+ messages in thread From: Paul E. McKenney @ 2009-02-18 0:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel, Oleg Nesterov On Tue, Feb 17, 2009 at 10:59:05PM +0100, Peter Zijlstra wrote: > From: Nick Piggin <npiggin@suse.de> > > Firstly, just unconditionally take the lock and check the list in the > generic_call_function_single_interrupt IPI handler. As we've just taken > an IPI here, the chances are fairly high that there will be work on the > list for us, so do the locking unconditionally. This removes the tricky > lockless list_empty check and dubious barriers. The change looks bigger > than it is because it is just removing an outer loop. > > Secondly, clarify architecture specific IPI locking rules. Generic code > has no tools to impose any sane ordering on IPIs if they go outside > normal cache coherency, ergo the arch code must make them appear to > obey cache coherency as a "memory operation" to initiate an IPI, and > a "memory operation" to receive one. This way at least they can be > reasoned about in generic code, and smp_mb used to provide ordering. > > The combination of these two changes means that explict barriers can > be taken out of queue handling for the single case -- shared data is > explicitly locked, and ipi ordering must conform to that, so no > barriers needed. An extra barrier is needed in the many handler, so > as to ensure we load the list element after the IPI is received. > > Does any architecture actually needs barriers? For the initiator I > could see it, but for the handler I would be surprised. 
The other > thing we could do for simplicity is just to require that a full > barrier is required before generating an IPI, and after receiving an > IPI. We can't just do that in generic code without auditing > architectures. There have been subtle hangs here on some archs in > the past. While I sympathize with pushing memory barriers down into the arch-specific functions, you -are- running this by the various arch maintainers so that they have an opportunity to adjust, right? Thanx, Paul > Signed-off-by: Nick Piggin <npiggin@suse.de> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > kernel/smp.c | 83 +++++++++++++++++++++++++++++++---------------------------- > 1 file changed, 44 insertions(+), 39 deletions(-) > > Index: linux-2.6/kernel/smp.c > =================================================================== > --- linux-2.6.orig/kernel/smp.c > +++ linux-2.6/kernel/smp.c > @@ -74,9 +74,16 @@ static void generic_exec_single(int cpu, > spin_unlock_irqrestore(&dst->lock, flags); > > /* > - * Make the list addition visible before sending the ipi. > + * The list addition should be visible before sending the IPI > + * handler locks the list to pull the entry off it because of > + * normal cache coherency rules implied by spinlocks. > + * > + * If IPIs can go out of order to the cache coherency protocol > + * in an architecture, sufficient synchronisation should be added > + * to arch code to make it appear to obey cache coherency WRT > + * locking and barrier primitives. Generic code isn't really equipped > + * to do the right thing... > */ > - smp_mb(); While I sympathize with the above, you -are- running this by the various arch maintainers so that they have an opportunity to adjust, right? > > if (ipi) > arch_send_call_function_single_ipi(cpu); > @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt > int cpu = get_cpu(); > > /* > + * Ensure entry is visible on call_function_queue after we have > + * entered the IPI. 
See comment in smp_call_function_many. > + * If we don't have this, then we may miss an entry on the list > + * and never get another IPI to process it. > + */ > + smp_mb(); > + > + /* > * It's ok to use list_for_each_rcu() here even though we may delete > * 'pos', since list_del_rcu() doesn't clear ->next > */ > @@ -154,49 +169,37 @@ void generic_smp_call_function_single_in > { > struct call_single_queue *q = &__get_cpu_var(call_single_queue); > LIST_HEAD(list); > + unsigned int data_flags; > > - /* > - * Need to see other stores to list head for checking whether > - * list is empty without holding q->lock > - */ > - smp_read_barrier_depends(); > - while (!list_empty(&q->list)) { > - unsigned int data_flags; > - > - spin_lock(&q->lock); > - list_replace_init(&q->list, &list); > - spin_unlock(&q->lock); > - > - while (!list_empty(&list)) { > - struct call_single_data *data; > - > - data = list_entry(list.next, struct call_single_data, > - list); > - list_del(&data->list); > + spin_lock(&q->lock); > + list_replace_init(&q->list, &list); > + spin_unlock(&q->lock); > > - /* > - * 'data' can be invalid after this call if > - * flags == 0 (when called through > - * generic_exec_single(), so save them away before > - * making the call. > - */ > - data_flags = data->flags; > + while (!list_empty(&list)) { > + struct call_single_data *data; > > - data->func(data->info); > + data = list_entry(list.next, struct call_single_data, > + list); > + list_del(&data->list); > > - if (data_flags & CSD_FLAG_WAIT) { > - smp_wmb(); > - data->flags &= ~CSD_FLAG_WAIT; > - } else if (data_flags & CSD_FLAG_LOCK) { > - smp_wmb(); > - data->flags &= ~CSD_FLAG_LOCK; > - } else if (data_flags & CSD_FLAG_ALLOC) > - kfree(data); > - } > /* > - * See comment on outer loop > + * 'data' can be invalid after this call if > + * flags == 0 (when called through > + * generic_exec_single(), so save them away before > + * making the call. 
> */ > - smp_read_barrier_depends(); > + data_flags = data->flags; > + > + data->func(data->info); > + > + if (data_flags & CSD_FLAG_WAIT) { > + smp_wmb(); > + data->flags &= ~CSD_FLAG_WAIT; > + } else if (data_flags & CSD_FLAG_LOCK) { > + smp_wmb(); > + data->flags &= ~CSD_FLAG_LOCK; > + } else if (data_flags & CSD_FLAG_ALLOC) > + kfree(data); > } > } > > @@ -375,6 +378,8 @@ void smp_call_function_many(const struct > > /* > * Make the list addition visible before sending the ipi. > + * (IPIs must obey or appear to obey normal Linux cache coherency > + * rules -- see comment in generic_exec_single). > */ > smp_mb(); > > > -- > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] generic-ipi: simplify the barriers 2009-02-18 0:27 ` Paul E. McKenney @ 2009-02-18 9:45 ` Peter Zijlstra 0 siblings, 0 replies; 16+ messages in thread From: Peter Zijlstra @ 2009-02-18 9:45 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel, Oleg Nesterov On Tue, 2009-02-17 at 16:27 -0800, Paul E. McKenney wrote: > While I sympathize with pushing memory barriers down into the > arch-specific functions, you -are- running this by the various > arch maintainers so that they have an opportunity to adjust, right? Nick's initial posting of this patch included linux-arch -- maybe I should have added that too. ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/3] generic-ipi: remove kmalloc() 2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra 2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra @ 2009-02-17 21:59 ` Peter Zijlstra 2009-02-18 0:28 ` Paul E. McKenney 2009-02-18 5:31 ` Rusty Russell 2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra 2 siblings, 2 replies; 16+ messages in thread From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw) To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney, Ingo Molnar, Rusty Russell Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra [-- Attachment #1: smp-remove-kmalloc.patch --] [-- Type: text/plain, Size: 12784 bytes --] Remove the use of kmalloc() from the smp_call_function_*() calls. Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu data for single cpu ipi calls) started the discussion on the use of kmalloc() in this code and fixed the smp_call_function_single(.wait=0) fallback case. In this patch we complete this by also providing means for the _many() call, which fully removes the need for kmalloc() in this code. The problem with the _many() call is that other cpus might still be observing our entry when we're done with it. The old code solved this by dynamically allocating data elements and RCU-freeing them. We solve it by using a single per-cpu entry, which provides static storage and solves one half of the problem (avoiding references to freed data). The other half, ensuring that queue iteration is still possible, is done by placing re-used entries at the head of the list. This means that a cpu still iterating over that entry when it gets moved will re-visit entries on the list it has already seen, but it avoids skipping over entries, as would have happened had we placed the new entry at the end. Furthermore, visiting entries twice is not a problem, since we remove our cpu from the entry's cpumask once it is called. 
Many thanks to Oleg for his suggestions and poking him holes in my earlier attempts. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/smp.c | 258 ++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 159 insertions(+), 99 deletions(-) Index: linux-2.6/kernel/smp.c =================================================================== --- linux-2.6.orig/kernel/smp.c +++ linux-2.6/kernel/smp.c @@ -10,23 +10,28 @@ #include <linux/rcupdate.h> #include <linux/rculist.h> #include <linux/smp.h> +#include <linux/cpu.h> static DEFINE_PER_CPU(struct call_single_queue, call_single_queue); -static LIST_HEAD(call_function_queue); -__cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock); + +static struct { + struct list_head queue; + spinlock_t lock; +} call_function __cacheline_aligned_in_smp = { + .queue = LIST_HEAD_INIT(call_function.queue), + .lock = __SPIN_LOCK_UNLOCKED(call_function.lock), +}; enum { CSD_FLAG_WAIT = 0x01, - CSD_FLAG_ALLOC = 0x02, - CSD_FLAG_LOCK = 0x04, + CSD_FLAG_LOCK = 0x02, }; struct call_function_data { struct call_single_data csd; spinlock_t lock; unsigned int refs; - struct rcu_head rcu_head; - unsigned long cpumask_bits[]; + cpumask_var_t cpumask; }; struct call_single_queue { @@ -34,8 +39,45 @@ struct call_single_queue { spinlock_t lock; }; +static DEFINE_PER_CPU(struct call_function_data, cfd_data) = { + .lock = __SPIN_LOCK_UNLOCKED(cfd_data.lock), +}; + +static int +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); + + switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, + cpu_to_node(cpu))) + return NOTIFY_BAD; + break; + +#ifdef CONFIG_CPU_HOTPLUG + case CPU_UP_CANCELED: + case CPU_UP_CANCELED_FROZEN: + + case CPU_DEAD: + case CPU_DEAD_FROZEN: + free_cpumask_var(cfd->cpumask); + break; +#endif + }; + + return 
NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata hotplug_cfd_notifier = { + .notifier_call = hotplug_cfd, +}; + static int __cpuinit init_call_single_data(void) { + void *cpu = (void *)(long)smp_processor_id(); int i; for_each_possible_cpu(i) { @@ -44,18 +86,61 @@ static int __cpuinit init_call_single_da spin_lock_init(&q->lock); INIT_LIST_HEAD(&q->list); } + + hotplug_cfd(&hotplug_cfd_notifier, CPU_UP_PREPARE, cpu); + register_cpu_notifier(&hotplug_cfd_notifier); + return 0; } early_initcall(init_call_single_data); -static void csd_flag_wait(struct call_single_data *data) +/* + * csd_wait/csd_complete are used for synchronous ipi calls + */ +static void csd_wait_prepare(struct call_single_data *data) +{ + data->flags |= CSD_FLAG_WAIT; +} + +static void csd_complete(struct call_single_data *data) +{ + if (data->flags & CSD_FLAG_WAIT) { + /* + * ensure we're all done before saying we are + */ + smp_mb(); + data->flags &= ~CSD_FLAG_WAIT; + } +} + +static void csd_wait(struct call_single_data *data) +{ + while (data->flags & CSD_FLAG_WAIT) + cpu_relax(); +} + +/* + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources + * + * For non-synchronous ipi calls the csd can still be in use by the previous + * function call. For multi-cpu calls its even more interesting as we'll have + * to ensure no other cpu is observing our csd. 
+ */ +static void csd_lock(struct call_single_data *data) { - /* Wait for response */ - do { - if (!(data->flags & CSD_FLAG_WAIT)) - break; + while (data->flags & CSD_FLAG_LOCK) cpu_relax(); - } while (1); + data->flags = CSD_FLAG_LOCK; +} + +static void csd_unlock(struct call_single_data *data) +{ + WARN_ON(!(data->flags & CSD_FLAG_LOCK)); + /* + * ensure we're all done before releasing data + */ + smp_mb(); + data->flags &= ~CSD_FLAG_LOCK; } /* @@ -89,16 +174,7 @@ static void generic_exec_single(int cpu, arch_send_call_function_single_ipi(cpu); if (wait) - csd_flag_wait(data); -} - -static void rcu_free_call_data(struct rcu_head *head) -{ - struct call_function_data *data; - - data = container_of(head, struct call_function_data, rcu_head); - - kfree(data); + csd_wait(data); } /* @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt * It's ok to use list_for_each_rcu() here even though we may delete * 'pos', since list_del_rcu() doesn't clear ->next */ - rcu_read_lock(); - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { int refs; - if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits))) + spin_lock(&data->lock); + if (!cpumask_test_cpu(cpu, data->cpumask)) { + spin_unlock(&data->lock); continue; + } + cpumask_clear_cpu(cpu, data->cpumask); + spin_unlock(&data->lock); data->csd.func(data->csd.info); spin_lock(&data->lock); - cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits)); WARN_ON(data->refs == 0); - data->refs--; - refs = data->refs; + refs = --data->refs; + if (!refs) { + spin_lock(&call_function.lock); + list_del_rcu(&data->csd.list); + spin_unlock(&call_function.lock); + } spin_unlock(&data->lock); if (refs) continue; - spin_lock(&call_function_lock); - list_del_rcu(&data->csd.list); - spin_unlock(&call_function_lock); - - if (data->csd.flags & CSD_FLAG_WAIT) { - /* - * serialize stores to data with the flag clear - * and wakeup - */ - smp_wmb(); - 
data->csd.flags &= ~CSD_FLAG_WAIT; - } - if (data->csd.flags & CSD_FLAG_ALLOC) - call_rcu(&data->rcu_head, rcu_free_call_data); + csd_complete(&data->csd); + csd_unlock(&data->csd); } - rcu_read_unlock(); put_cpu(); } @@ -192,14 +262,14 @@ void generic_smp_call_function_single_in data->func(data->info); - if (data_flags & CSD_FLAG_WAIT) { - smp_wmb(); - data->flags &= ~CSD_FLAG_WAIT; - } else if (data_flags & CSD_FLAG_LOCK) { - smp_wmb(); - data->flags &= ~CSD_FLAG_LOCK; - } else if (data_flags & CSD_FLAG_ALLOC) - kfree(data); + if (data_flags & CSD_FLAG_WAIT) + csd_complete(data); + + /* + * Unlocked CSDs are valid through generic_exec_single() + */ + if (data_flags & CSD_FLAG_LOCK) + csd_unlock(data); } } @@ -218,7 +288,9 @@ static DEFINE_PER_CPU(struct call_single int smp_call_function_single(int cpu, void (*func) (void *info), void *info, int wait) { - struct call_single_data d; + struct call_single_data d = { + .flags = 0, + }; unsigned long flags; /* prevent preemption and reschedule on another processor, as well as CPU removal */ @@ -239,13 +311,11 @@ int smp_call_function_single(int cpu, vo /* * We are calling a function on a single CPU * and we are not going to wait for it to finish. - * We first try to allocate the data, but if we - * fail, we fall back to use a per cpu data to pass - * the information to that CPU. Since all callers - * of this code will use the same data, we must - * synchronize the callers to prevent a new caller - * from corrupting the data before the callee - * can access it. + * We use a per cpu data to pass the information to + * that CPU. Since all callers of this code will + * use the same data, we must synchronize the + * callers to prevent a new caller from corrupting + * the data before the callee can access it. * * The CSD_FLAG_LOCK is used to let us know when * the IPI handler is done with the data. 
@@ -255,18 +325,11 @@ int smp_call_function_single(int cpu, vo * will make sure the callee is done with the * data before a new caller will use it. */ - data = kmalloc(sizeof(*data), GFP_ATOMIC); - if (data) - data->flags = CSD_FLAG_ALLOC; - else { - data = &per_cpu(csd_data, me); - while (data->flags & CSD_FLAG_LOCK) - cpu_relax(); - data->flags = CSD_FLAG_LOCK; - } + data = &per_cpu(csd_data, me); + csd_lock(data); } else { data = &d; - data->flags = CSD_FLAG_WAIT; + csd_wait_prepare(data); } data->func = func; @@ -326,14 +389,14 @@ void smp_call_function_many(const struct { struct call_function_data *data; unsigned long flags; - int cpu, next_cpu; + int cpu, next_cpu, me = smp_processor_id(); /* Can deadlock when called with interrupts disabled */ WARN_ON(irqs_disabled()); /* So, what's a CPU they want? Ignoring this one. */ cpu = cpumask_first_and(mask, cpu_online_mask); - if (cpu == smp_processor_id()) + if (cpu == me) cpu = cpumask_next_and(cpu, mask, cpu_online_mask); /* No online cpus? We're done. */ if (cpu >= nr_cpu_ids) @@ -341,7 +404,7 @@ void smp_call_function_many(const struct /* Do we have another CPU which isn't us? */ next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask); - if (next_cpu == smp_processor_id()) + if (next_cpu == me) next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask); /* Fastpath: do that cpu by itself. */ @@ -350,31 +413,28 @@ void smp_call_function_many(const struct return; } - data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC); - if (unlikely(!data)) { - /* Slow path. 
*/ - for_each_online_cpu(cpu) { - if (cpu == smp_processor_id()) - continue; - if (cpumask_test_cpu(cpu, mask)) - smp_call_function_single(cpu, func, info, wait); - } - return; - } + data = &per_cpu(cfd_data, me); + csd_lock(&data->csd); - spin_lock_init(&data->lock); - data->csd.flags = CSD_FLAG_ALLOC; + spin_lock_irqsave(&data->lock, flags); if (wait) - data->csd.flags |= CSD_FLAG_WAIT; + csd_wait_prepare(&data->csd); + data->csd.func = func; data->csd.info = info; - cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask); - cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits)); - data->refs = cpumask_weight(to_cpumask(data->cpumask_bits)); - - spin_lock_irqsave(&call_function_lock, flags); - list_add_tail_rcu(&data->csd.list, &call_function_queue); - spin_unlock_irqrestore(&call_function_lock, flags); + cpumask_and(data->cpumask, mask, cpu_online_mask); + cpumask_clear_cpu(me, data->cpumask); + data->refs = cpumask_weight(data->cpumask); + + spin_lock(&call_function.lock); + /* + * Place entry at the _HEAD_ of the list, so that any cpu still + * observing the entry in generic_smp_call_function_interrupt() will + * not miss any other list entries. + */ + list_add_rcu(&data->csd.list, &call_function.queue); + spin_unlock(&call_function.lock); + spin_unlock_irqrestore(&data->lock, flags); /* * Make the list addition visible before sending the ipi. 
@@ -384,11 +444,11 @@ void smp_call_function_many(const struct smp_mb(); /* Send a message to all CPUs in the map */ - arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits)); + arch_send_call_function_ipi_mask(data->cpumask); /* optionally wait for the CPUs to complete */ if (wait) - csd_flag_wait(&data->csd); + csd_wait(&data->csd); } EXPORT_SYMBOL(smp_call_function_many); @@ -418,20 +478,20 @@ EXPORT_SYMBOL(smp_call_function); void ipi_call_lock(void) { - spin_lock(&call_function_lock); + spin_lock(&call_function.lock); } void ipi_call_unlock(void) { - spin_unlock(&call_function_lock); + spin_unlock(&call_function.lock); } void ipi_call_lock_irq(void) { - spin_lock_irq(&call_function_lock); + spin_lock_irq(&call_function.lock); } void ipi_call_unlock_irq(void) { - spin_unlock_irq(&call_function_lock); + spin_unlock_irq(&call_function.lock); } -- ^ permalink raw reply [flat|nested] 16+ messages in thread
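The csd_lock()/csd_unlock() pair added above is in effect a one-bit spinlock over a per-cpu csd slot: the next caller spins until the consumer of the previous IPI clears CSD_FLAG_LOCK, and the smp_mb() in csd_unlock() makes sure the callback's work is visible before the slot is handed back. A minimal user-space sketch with C11 atomics -- the demo_* names are invented, and a seq-cst read-modify-write stands in for the kernel's plain flag update plus barrier:

```c
#include <stdatomic.h>

#define DEMO_CSD_FLAG_LOCK 0x02u

struct demo_csd {
	atomic_uint flags;
	void (*func)(void *);
	void *info;
};

/* Initiator: wait until the previous user of this slot is done, then
 * claim it. The seq-cst fetch_or is a full barrier, so later stores
 * to func/info cannot be reordered before the flag is won. */
void demo_csd_lock(struct demo_csd *csd)
{
	unsigned int old;

	do {
		while (atomic_load(&csd->flags) & DEMO_CSD_FLAG_LOCK)
			; /* cpu_relax() would go here */
		old = atomic_fetch_or(&csd->flags, DEMO_CSD_FLAG_LOCK);
	} while (old & DEMO_CSD_FLAG_LOCK); /* lost a race, retry */
}

/* Consumer (IPI handler): done with func/info, release the slot.
 * The RMW's ordering plays the role of the smp_mb() that precedes
 * "data->flags &= ~CSD_FLAG_LOCK" in csd_unlock(). */
void demo_csd_unlock(struct demo_csd *csd)
{
	atomic_fetch_and(&csd->flags, ~DEMO_CSD_FLAG_LOCK);
}
```

One deliberate difference, stated as an assumption: the sketch uses an atomic fetch_or where the patch's csd_lock() does a plain test-then-set. The kernel can get away with the plain store because the per-cpu csd slot is only ever claimed by its owning CPU (with preemption disabled), so there is never more than one claimant; the atomic version is what a multi-claimant variant would need.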
* Re: [PATCH 2/3] generic-ipi: remove kmalloc() 2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra @ 2009-02-18 0:28 ` Paul E. McKenney 2009-02-18 10:42 ` Peter Zijlstra 2009-02-18 16:15 ` Oleg Nesterov 2009-02-18 5:31 ` Rusty Russell 1 sibling, 2 replies; 16+ messages in thread From: Paul E. McKenney @ 2009-02-18 0:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel, Oleg Nesterov On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote: > Remove the use of kmalloc() from the smp_call_function_*() calls. > > Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu data for > single cpu ipi calls) started the discussion on the use of kmalloc() in > this code and fixed the smp_call_function_single(.wait=0) fallback case. > > In this patch we complete this by also providing means for the _many() > call, which fully removes the need for kmalloc() in this code. > > The problem with the _many() call is that other cpus might still be > observing our entry when we're done with it. It solved this by > dynamically allocating data elements and RCU-freeing it. > > We solve it by using a single per-cpu entry which provides static > storage and solves one half of the problem (avoiding referencing freed > data). > > The other half, ensuring the queue iteration it still possible, is done > by placing re-used entries at the head of the list. This means that if > someone was still iterating that entry when it got moved, he will now > re-visit the entries on the list he had already seen, but avoids > skipping over entries like would have happened had we placed the new > entry at the end. > > Furthermore, visiting entries twice is not a problem, since we remove > our cpu from the entry's cpumask once its called. > > Many thanks to Oleg for his suggestions and poking him holes in my > earlier attempts. A couple small questions and one big one below... 
Thanx, Paul > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > kernel/smp.c | 258 ++++++++++++++++++++++++++++++++++++----------------------- > 1 file changed, 159 insertions(+), 99 deletions(-) > > Index: linux-2.6/kernel/smp.c > =================================================================== > --- linux-2.6.orig/kernel/smp.c > +++ linux-2.6/kernel/smp.c > @@ -10,23 +10,28 @@ > #include <linux/rcupdate.h> > #include <linux/rculist.h> > #include <linux/smp.h> > +#include <linux/cpu.h> > > static DEFINE_PER_CPU(struct call_single_queue, call_single_queue); > -static LIST_HEAD(call_function_queue); > -__cacheline_aligned_in_smp DEFINE_SPINLOCK(call_function_lock); > + > +static struct { > + struct list_head queue; > + spinlock_t lock; > +} call_function __cacheline_aligned_in_smp = { > + .queue = LIST_HEAD_INIT(call_function.queue), > + .lock = __SPIN_LOCK_UNLOCKED(call_function.lock), > +}; > > enum { > CSD_FLAG_WAIT = 0x01, > - CSD_FLAG_ALLOC = 0x02, > - CSD_FLAG_LOCK = 0x04, > + CSD_FLAG_LOCK = 0x02, > }; > > struct call_function_data { > struct call_single_data csd; > spinlock_t lock; > unsigned int refs; > - struct rcu_head rcu_head; > - unsigned long cpumask_bits[]; > + cpumask_var_t cpumask; > }; > > struct call_single_queue { > @@ -34,8 +39,45 @@ struct call_single_queue { > spinlock_t lock; > }; > > +static DEFINE_PER_CPU(struct call_function_data, cfd_data) = { > + .lock = __SPIN_LOCK_UNLOCKED(cfd_data.lock), > +}; > + > +static int > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) > +{ > + long cpu = (long)hcpu; > + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); > + > + switch (action) { > + case CPU_UP_PREPARE: > + case CPU_UP_PREPARE_FROZEN: > + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, > + cpu_to_node(cpu))) > + return NOTIFY_BAD; > + break; > + > +#ifdef CONFIG_CPU_HOTPLUG > + case CPU_UP_CANCELED: > + case CPU_UP_CANCELED_FROZEN: > + > + case CPU_DEAD: > + case CPU_DEAD_FROZEN: > + 
free_cpumask_var(cfd->cpumask); > + break; > +#endif > + > + return NOTIFY_OK; > + }; > +} Hmmm.... Why not the following? Do we really need to free the cpumask when a CPU departs, given that it might return later? + switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + if (cfd->cpumask == NULL && + (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, + cpu_to_node(cpu)))) + return NOTIFY_BAD; + break; + + return NOTIFY_OK; + }; +} > + > +static struct notifier_block __cpuinitdata hotplug_cfd_notifier = { > + .notifier_call = hotplug_cfd, > +}; > + > static int __cpuinit init_call_single_data(void) > { > + void *cpu = (void *)(long)smp_processor_id(); > int i; > > for_each_possible_cpu(i) { > @@ -44,18 +86,61 @@ static int __cpuinit init_call_single_da > spin_lock_init(&q->lock); > INIT_LIST_HEAD(&q->list); > } > + > + hotplug_cfd(&hotplug_cfd_notifier, CPU_UP_PREPARE, cpu); > + register_cpu_notifier(&hotplug_cfd_notifier); > + > return 0; > } > early_initcall(init_call_single_data); > > -static void csd_flag_wait(struct call_single_data *data) > +/* > + * csd_wait/csd_complete are used for synchronous ipi calls > + */ > +static void csd_wait_prepare(struct call_single_data *data) > +{ > + data->flags |= CSD_FLAG_WAIT; > +} > + > +static void csd_complete(struct call_single_data *data) > +{ > + if (data->flags & CSD_FLAG_WAIT) { > + /* > + * ensure we're all done before saying we are > + */ > + smp_mb(); > + data->flags &= ~CSD_FLAG_WAIT; > + } > +} > + > +static void csd_wait(struct call_single_data *data) > +{ > + while (data->flags & CSD_FLAG_WAIT) > + cpu_relax(); > +} > + > +/* > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources > + * > + * For non-synchronous ipi calls the csd can still be in use by the previous > + * function call. For multi-cpu calls its even more interesting as we'll have > + * to ensure no other cpu is observing our csd. 
> + */ > +static void csd_lock(struct call_single_data *data) > { > - /* Wait for response */ > - do { > - if (!(data->flags & CSD_FLAG_WAIT)) > - break; > + while (data->flags & CSD_FLAG_LOCK) > cpu_relax(); > - } while (1); > + data->flags = CSD_FLAG_LOCK; We do need an smp_mb() here, otherwise, the call from smp_call_function_single() could be reordered by either CPU or compiler as follows: data->func = func; data->info = info; csd_lock(data); This might come as a bit of a surprise to the other CPU still trying to use the old values for data->func and data->info. Note that this smp_mb() is required even if cpu_relax() contains a memory barrier, as it is possible to execute csd_lock_wait() without executing the cpu_relax(), if you get there at just the right time. > +} > + > +static void csd_unlock(struct call_single_data *data) > +{ > + WARN_ON(!(data->flags & CSD_FLAG_LOCK)); > + /* > + * ensure we're all done before releasing data > + */ > + smp_mb(); > + data->flags &= ~CSD_FLAG_LOCK; > } > > /* > @@ -89,16 +174,7 @@ static void generic_exec_single(int cpu, > arch_send_call_function_single_ipi(cpu); > > if (wait) > - csd_flag_wait(data); > -} > - > -static void rcu_free_call_data(struct rcu_head *head) > -{ > - struct call_function_data *data; > - > - data = container_of(head, struct call_function_data, rcu_head); > - > - kfree(data); > + csd_wait(data); > } > > /* > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt > * It's ok to use list_for_each_rcu() here even though we may delete > * 'pos', since list_del_rcu() doesn't clear ->next > */ > - rcu_read_lock(); > - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { > + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { OK... What prevents the following sequence of events? o CPU 0 calls smp_call_function_many(), targeting numerous CPUs, including CPU 2. CPU 0 therefore enqueues this on the global call_function.queue. 
"wait" is not specified, so CPU 0 returns immediately after sending the IPIs. o CPU 1 disables irqs and leaves them disabled for awhile. o CPU 2 receives the IPI, and duly calls the needed function. It decrements the ->refs field, but, finding the result non-zero, refrains from removing the element that CPU 0 enqueued, and also refrains from invoking csd_unlock(). o CPU 3 also receives the IPI, and also calls the needed function. Now, only CPU 1 need receive the IPI for the element to be removed. o CPU 3 calls smp_call_function_many(), targeting numerous CPUs, but -not- including CPU 2. CPU 3 therefore also this on the global call_function.queue and sends the IPIs, but no IPI for CPU 2. Your choice as to whether CPU 3 waits or not. o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()), and then... o CPU 1 finally re-enables irqs and receives the IPIs!!! It finds CPU 0's element on the queue, calls the function, decrements the ->refs field, and finds that it is zero. So, CPU 1 invokes list_del_rcu() to remove the element (OK so far, as list_del_rcu() doesn't overwrite the next pointer), then invokes csd_unlock() to release the element. o CPU 0 then invokes another smp_call_function_many(), also multiple CPUs, but -not- to CPU 2. It requeues the element that was just csd_unlock()ed above, carrying CPU 2 with it. It IPIs CPUs 1 and 3, but not CPU 2. o CPU 2 continues, and falls off the bottom of the list. It will continue to ignore CPU 0's IPI until some other CPU IPIs it. On some architectures, a single-target IPI won't cut it, only a multi-target IPI. Or am I missing something subtle here? If this really is a problem, there are a number of counter-based solutions to it. (Famous last words...) Abandoning all caution and attempting one on the fly... 
Make each CPU receiving an IPI increment one per-CPU counter upon entry, and increment it again upon exit with memory barriers after and before, respectively. Then any CPU with an even value can be ignored, and any CPU whose value changes can also be ignored. Of course, this means you have to scan all CPUs... But in the worst case, you also had to IPI them all. Given that this operation is relatively rare, it might be worth using shared reference counters, possibly one pair of such counters per (say) 16 CPUs. Then the caller flips the counter. Alternatively, you can explain to me why my scenario above cannot happen -- but at present, it will take some serious explaining!!! > int refs; > > - if (!cpumask_test_cpu(cpu, to_cpumask(data->cpumask_bits))) > + spin_lock(&data->lock); > + if (!cpumask_test_cpu(cpu, data->cpumask)) { > + spin_unlock(&data->lock); > continue; > + } > + cpumask_clear_cpu(cpu, data->cpumask); > + spin_unlock(&data->lock); > > data->csd.func(data->csd.info); > > spin_lock(&data->lock); > - cpumask_clear_cpu(cpu, to_cpumask(data->cpumask_bits)); > WARN_ON(data->refs == 0); > - data->refs--; > - refs = data->refs; > + refs = --data->refs; > + if (!refs) { > + spin_lock(&call_function.lock); > + list_del_rcu(&data->csd.list); > + spin_unlock(&call_function.lock); > + } > spin_unlock(&data->lock); > > if (refs) > continue; > > - spin_lock(&call_function_lock); > - list_del_rcu(&data->csd.list); > - spin_unlock(&call_function_lock); > - > - if (data->csd.flags & CSD_FLAG_WAIT) { > - /* > - * serialize stores to data with the flag clear > - * and wakeup > - */ > - smp_wmb(); > - data->csd.flags &= ~CSD_FLAG_WAIT; > - } > - if (data->csd.flags & CSD_FLAG_ALLOC) > - call_rcu(&data->rcu_head, rcu_free_call_data); > + csd_complete(&data->csd); > + csd_unlock(&data->csd); > } > - rcu_read_unlock(); > > put_cpu(); > } > @@ -192,14 +262,14 @@ void generic_smp_call_function_single_in > > data->func(data->info); > > - if (data_flags & CSD_FLAG_WAIT) { > - 
smp_wmb(); > - data->flags &= ~CSD_FLAG_WAIT; > - } else if (data_flags & CSD_FLAG_LOCK) { > - smp_wmb(); > - data->flags &= ~CSD_FLAG_LOCK; > - } else if (data_flags & CSD_FLAG_ALLOC) > - kfree(data); > + if (data_flags & CSD_FLAG_WAIT) > + csd_complete(data); > + > + /* > + * Unlocked CSDs are valid through generic_exec_single() > + */ > + if (data_flags & CSD_FLAG_LOCK) > + csd_unlock(data); > } > } > > @@ -218,7 +288,9 @@ static DEFINE_PER_CPU(struct call_single > int smp_call_function_single(int cpu, void (*func) (void *info), void *info, > int wait) > { > - struct call_single_data d; > + struct call_single_data d = { > + .flags = 0, > + }; > unsigned long flags; > /* prevent preemption and reschedule on another processor, > as well as CPU removal */ > @@ -239,13 +311,11 @@ int smp_call_function_single(int cpu, vo > /* > * We are calling a function on a single CPU > * and we are not going to wait for it to finish. > - * We first try to allocate the data, but if we > - * fail, we fall back to use a per cpu data to pass > - * the information to that CPU. Since all callers > - * of this code will use the same data, we must > - * synchronize the callers to prevent a new caller > - * from corrupting the data before the callee > - * can access it. > + * We use a per cpu data to pass the information to > + * that CPU. Since all callers of this code will > + * use the same data, we must synchronize the > + * callers to prevent a new caller from corrupting > + * the data before the callee can access it. > * > * The CSD_FLAG_LOCK is used to let us know when > * the IPI handler is done with the data. > @@ -255,18 +325,11 @@ int smp_call_function_single(int cpu, vo > * will make sure the callee is done with the > * data before a new caller will use it. 
> */ > - data = kmalloc(sizeof(*data), GFP_ATOMIC); > - if (data) > - data->flags = CSD_FLAG_ALLOC; > - else { > - data = &per_cpu(csd_data, me); > - while (data->flags & CSD_FLAG_LOCK) > - cpu_relax(); > - data->flags = CSD_FLAG_LOCK; > - } > + data = &per_cpu(csd_data, me); > + csd_lock(data); > } else { > data = &d; > - data->flags = CSD_FLAG_WAIT; > + csd_wait_prepare(data); > } > > data->func = func; > @@ -326,14 +389,14 @@ void smp_call_function_many(const struct > { > struct call_function_data *data; > unsigned long flags; > - int cpu, next_cpu; > + int cpu, next_cpu, me = smp_processor_id(); > > /* Can deadlock when called with interrupts disabled */ > WARN_ON(irqs_disabled()); > > /* So, what's a CPU they want? Ignoring this one. */ > cpu = cpumask_first_and(mask, cpu_online_mask); > - if (cpu == smp_processor_id()) > + if (cpu == me) > cpu = cpumask_next_and(cpu, mask, cpu_online_mask); > /* No online cpus? We're done. */ > if (cpu >= nr_cpu_ids) > @@ -341,7 +404,7 @@ void smp_call_function_many(const struct > > /* Do we have another CPU which isn't us? */ > next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask); > - if (next_cpu == smp_processor_id()) > + if (next_cpu == me) > next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask); > > /* Fastpath: do that cpu by itself. */ > @@ -350,31 +413,28 @@ void smp_call_function_many(const struct > return; > } > > - data = kmalloc(sizeof(*data) + cpumask_size(), GFP_ATOMIC); > - if (unlikely(!data)) { > - /* Slow path. 
*/ > - for_each_online_cpu(cpu) { > - if (cpu == smp_processor_id()) > - continue; > - if (cpumask_test_cpu(cpu, mask)) > - smp_call_function_single(cpu, func, info, wait); > - } > - return; > - } > + data = &per_cpu(cfd_data, me); > + csd_lock(&data->csd); > > - spin_lock_init(&data->lock); > - data->csd.flags = CSD_FLAG_ALLOC; > + spin_lock_irqsave(&data->lock, flags); > if (wait) > - data->csd.flags |= CSD_FLAG_WAIT; > + csd_wait_prepare(&data->csd); > + > data->csd.func = func; > data->csd.info = info; > - cpumask_and(to_cpumask(data->cpumask_bits), mask, cpu_online_mask); > - cpumask_clear_cpu(smp_processor_id(), to_cpumask(data->cpumask_bits)); > - data->refs = cpumask_weight(to_cpumask(data->cpumask_bits)); > - > - spin_lock_irqsave(&call_function_lock, flags); > - list_add_tail_rcu(&data->csd.list, &call_function_queue); > - spin_unlock_irqrestore(&call_function_lock, flags); > + cpumask_and(data->cpumask, mask, cpu_online_mask); > + cpumask_clear_cpu(me, data->cpumask); > + data->refs = cpumask_weight(data->cpumask); > + > + spin_lock(&call_function.lock); > + /* > + * Place entry at the _HEAD_ of the list, so that any cpu still > + * observing the entry in generic_smp_call_function_interrupt() will > + * not miss any other list entries. > + */ > + list_add_rcu(&data->csd.list, &call_function.queue); > + spin_unlock(&call_function.lock); > + spin_unlock_irqrestore(&data->lock, flags); > > /* > * Make the list addition visible before sending the ipi. 
> @@ -384,11 +444,11 @@ void smp_call_function_many(const struct > smp_mb(); > > /* Send a message to all CPUs in the map */ > - arch_send_call_function_ipi_mask(to_cpumask(data->cpumask_bits)); > + arch_send_call_function_ipi_mask(data->cpumask); > > /* optionally wait for the CPUs to complete */ > if (wait) > - csd_flag_wait(&data->csd); > + csd_wait(&data->csd); > } > EXPORT_SYMBOL(smp_call_function_many); > > @@ -418,20 +478,20 @@ EXPORT_SYMBOL(smp_call_function); > > void ipi_call_lock(void) > { > - spin_lock(&call_function_lock); > + spin_lock(&call_function.lock); > } > > void ipi_call_unlock(void) > { > - spin_unlock(&call_function_lock); > + spin_unlock(&call_function.lock); > } > > void ipi_call_lock_irq(void) > { > - spin_lock_irq(&call_function_lock); > + spin_lock_irq(&call_function.lock); > } > > void ipi_call_unlock_irq(void) > { > - spin_unlock_irq(&call_function_lock); > + spin_unlock_irq(&call_function.lock); > } > > -- > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] generic-ipi: remove kmalloc() 2009-02-18 0:28 ` Paul E. McKenney @ 2009-02-18 10:42 ` Peter Zijlstra 2009-02-18 16:06 ` Paul E. McKenney 2009-02-18 16:15 ` Oleg Nesterov 1 sibling, 1 reply; 16+ messages in thread From: Peter Zijlstra @ 2009-02-18 10:42 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel, Oleg Nesterov On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote: > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote: > > +static int > > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) > > +{ > > + long cpu = (long)hcpu; > > + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); > > + > > + switch (action) { > > + case CPU_UP_PREPARE: > > + case CPU_UP_PREPARE_FROZEN: > > + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, > > + cpu_to_node(cpu))) > > + return NOTIFY_BAD; > > + break; > > + > > +#ifdef CONFIG_CPU_HOTPLUG > > + case CPU_UP_CANCELED: > > + case CPU_UP_CANCELED_FROZEN: > > + > > + case CPU_DEAD: > > + case CPU_DEAD_FROZEN: > > + free_cpumask_var(cfd->cpumask); > > + break; > > +#endif > > + > > + return NOTIFY_OK; > > + }; > > +} > > Hmmm.... Why not the following? Do we really need to free the cpumask > when a CPU departs, given that it might return later? Probably not, but that's what we have these callbacks for, might as well use them. > > +/* > > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources > > + * > > + * For non-synchronous ipi calls the csd can still be in use by the previous > > + * function call. For multi-cpu calls its even more interesting as we'll have > > + * to ensure no other cpu is observing our csd. 
> > + */ > > +static void csd_lock(struct call_single_data *data) > > { > > - /* Wait for response */ > > - do { > > - if (!(data->flags & CSD_FLAG_WAIT)) > > - break; > > + while (data->flags & CSD_FLAG_LOCK) > > cpu_relax(); > > - } while (1); > > + data->flags = CSD_FLAG_LOCK; > > We do need an smp_mb() here, otherwise, the call from > smp_call_function_single() could be reordered by either CPU or compiler > as follows: > > data->func = func; > data->info = info; > csd_lock(data); > > This might come as a bit of a surprise to the other CPU still trying to > use the old values for data->func and data->info. > > Note that this smb_mb() is required even if cpu_relax() contains a > memory barrier, as it is possible to execute csd_lock_wait() without > executing the cpu_relax(), if you get there at just the right time. I'm not quite sure I follow, if we observe !(flags & LOCK) then we're guaranteed that no cpu will still needs func and info. They might still be observing the list entry, but should not find themselves on the cpumask -- which is guarded by ->lock. Anyway, I'm happy to add the mb(). > > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt > > * It's ok to use list_for_each_rcu() here even though we may delete > > * 'pos', since list_del_rcu() doesn't clear ->next > > */ > > - rcu_read_lock(); > > - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { > > + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { > > OK... What prevents the following sequence of events? > > o CPU 0 calls smp_call_function_many(), targeting numerous CPUs, > including CPU 2. CPU 0 therefore enqueues this on the global > call_function.queue. "wait" is not specified, so CPU 0 returns > immediately after sending the IPIs. > > o CPU 1 disables irqs and leaves them disabled for awhile. > > o CPU 2 receives the IPI, and duly calls the needed function. 
> It decrements the ->refs field, but, finding the result > non-zero, refrains from removing the element that CPU 0 > enqueued, and also refrains from invoking csd_unlock(). > > o CPU 3 also receives the IPI, and also calls the needed function. > Now, only CPU 1 need receive the IPI for the element to be > removed. > > o CPU 3 calls smp_call_function_many(), targeting numerous CPUs, > but -not- including CPU 2. CPU 3 therefore also this on the > global call_function.queue and sends the IPIs, but no IPI for > CPU 2. Your choice as to whether CPU 3 waits or not. Right, so the queue is Entry3->Entry0 (we place new entries on at the head). > o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the > list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()), > and then... CPU0 ? You just excluded cpu2, cpu1 is still blocked, and cpu3 send the ipi. Furthermore, Entry3 would be first, but suppose it is Entry0, that makes the scenario work best. > o CPU 1 finally re-enables irqs and receives the IPIs!!! It > finds CPU 0's element on the queue, calls the function, > decrements the ->refs field, and finds that it is zero. > So, CPU 1 invokes list_del_rcu() to remove the element > (OK so far, as list_del_rcu() doesn't overwrite the next > pointer), then invokes csd_unlock() to release the element. CPU1 will see CPU3's element first, and CPU0's element second. But OK. > o CPU 0 then invokes another smp_call_function_many(), also > multiple CPUs, but -not- to CPU 2. It requeues the element > that was just csd_unlock()ed above, carrying CPU 2 with it. > It IPIs CPUs 1 and 3, but not CPU 2. > > o CPU 2 continues, and falls off the bottom of the list. It will > continue to ignore CPU 0's IPI until some other CPU IPIs it. > On some architectures, a single-target IPI won't cut it, only > a multi-target IPI. > > Or am I missing something subtle here? Ah, right, yes. 
I place new entries at the HEAD not TAIL, so that in this case we go from: Entry3->Entry0 ^ CPU2 to Entry0->Entry3 ^ CPU2 and CPU2 will continue the list iteration, visiting Entry3 twice, which is harmless since it will have removed itself from the cpumask the first time around. > If this really is a problem, there are a number of counter-based solutions > to it. (Famous last words...) > > Abandoning all caution and attempting one on the fly... Make each CPU > receiving an IPI increment one per-CPU counter upon entry, and increment > it again upon exit with memory barriers after and before, respectively. > Then any CPU with an even value can be ignored, and any CPU whose value > changes can also be ignored. Of course, this means you have to scan all > CPUs... But in the worst case, you also had to IPI them all. > > Given that this operation is relatively rare, it might be worth using > shared reference counters, possibly one pair of such counters per (say) > 16 CPUs. Then the caller flips the counter. Yep, I almost implemented a counting RCU which increments a global counter on IPI entry and decrements on IPI exit, but then did the above FIFO->FILO queue thingy. > Alternatively, you can explain to me why my scenario above cannot > happen -- but at present, it will take some serious explaining!!! I hope to have adequately explained it, but please, feel free to poke more holes into it ;-) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] generic-ipi: remove kmalloc() 2009-02-18 10:42 ` Peter Zijlstra @ 2009-02-18 16:06 ` Paul E. McKenney 0 siblings, 0 replies; 16+ messages in thread From: Paul E. McKenney @ 2009-02-18 16:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel, Oleg Nesterov On Wed, Feb 18, 2009 at 11:42:17AM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote: > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote: > > > > +static int > > > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) > > > +{ > > > + long cpu = (long)hcpu; > > > + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); > > > + > > > + switch (action) { > > > + case CPU_UP_PREPARE: > > > + case CPU_UP_PREPARE_FROZEN: > > > + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, > > > + cpu_to_node(cpu))) > > > + return NOTIFY_BAD; > > > + break; > > > + > > > +#ifdef CONFIG_CPU_HOTPLUG > > > + case CPU_UP_CANCELED: > > > + case CPU_UP_CANCELED_FROZEN: > > > + > > > + case CPU_DEAD: > > > + case CPU_DEAD_FROZEN: > > > + free_cpumask_var(cfd->cpumask); > > > + break; > > > +#endif > > > + > > > + return NOTIFY_OK; > > > + }; > > > +} > > > > Hmmm.... Why not the following? Do we really need to free the cpumask > > when a CPU departs, given that it might return later? > > Probably not, but that's what we have these callbacks for, might as well > use them. Fair enough... > > > +/* > > > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources > > > + * > > > + * For non-synchronous ipi calls the csd can still be in use by the previous > > > + * function call. For multi-cpu calls its even more interesting as we'll have > > > + * to ensure no other cpu is observing our csd. 
> > > + */ > > > +static void csd_lock(struct call_single_data *data) > > > { > > > - /* Wait for response */ > > > - do { > > > - if (!(data->flags & CSD_FLAG_WAIT)) > > > - break; > > > + while (data->flags & CSD_FLAG_LOCK) > > > cpu_relax(); > > > - } while (1); > > > + data->flags = CSD_FLAG_LOCK; > > > > We do need an smp_mb() here, otherwise, the call from > > smp_call_function_single() could be reordered by either CPU or compiler > > as follows: > > > > data->func = func; > > data->info = info; > > csd_lock(data); > > > > This might come as a bit of a surprise to the other CPU still trying to > > use the old values for data->func and data->info. > > > > Note that this smb_mb() is required even if cpu_relax() contains a > > memory barrier, as it is possible to execute csd_lock_wait() without > > executing the cpu_relax(), if you get there at just the right time. > > I'm not quite sure I follow, if we observe !(flags & LOCK) then we're > guaranteed that no cpu will still needs func and info. They might still > be observing the list entry, but should not find themselves on the > cpumask -- which is guarded by ->lock. The compiler could reorder as above, in which case the other CPU might still be looking at the ->func and ->info fields when the stores happen, but might have done csd_unlock() by the time this CPU does its csd_lock(). > Anyway, I'm happy to add the mb(). Please! ;-) > > > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt > > > * It's ok to use list_for_each_rcu() here even though we may delete > > > * 'pos', since list_del_rcu() doesn't clear ->next > > > */ > > > - rcu_read_lock(); > > > - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { > > > + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { > > > > OK... What prevents the following sequence of events? > > > > o CPU 0 calls smp_call_function_many(), targeting numerous CPUs, > > including CPU 2. 
CPU 0 therefore enqueues this on the global > > call_function.queue. "wait" is not specified, so CPU 0 returns > > immediately after sending the IPIs. > > > > o CPU 1 disables irqs and leaves them disabled for awhile. > > > > o CPU 2 receives the IPI, and duly calls the needed function. > > It decrements the ->refs field, but, finding the result > > non-zero, refrains from removing the element that CPU 0 > > enqueued, and also refrains from invoking csd_unlock(). > > > > o CPU 3 also receives the IPI, and also calls the needed function. > > Now, only CPU 1 need receive the IPI for the element to be > > removed. > > > > o CPU 3 calls smp_call_function_many(), targeting numerous CPUs, > > but -not- including CPU 2. CPU 3 therefore also this on the > > global call_function.queue and sends the IPIs, but no IPI for > > CPU 2. Your choice as to whether CPU 3 waits or not. > > Right, so the queue is Entry3->Entry0 (we place new entries on at the > head). Gah!!! I even remember looking at the list_add_rcu() and noting that it was not a list_add_tail_rcu(). Please accept my apologies for my confusion!!! I suppose that a given CPU might get repeatedly recycled to the beginning of the list, but someone would have to make that happen before I would be inclined to worry about it. > > o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the > > list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()), > > and then... > > CPU0 ? You just excluded cpu2, cpu1 is still blocked, and cpu3 send the > ipi. > > Furthermore, Entry3 would be first, but suppose it is Entry0, that makes > the scenario work best. > > > o CPU 1 finally re-enables irqs and receives the IPIs!!! It > > finds CPU 0's element on the queue, calls the function, > > decrements the ->refs field, and finds that it is zero. > > So, CPU 1 invokes list_del_rcu() to remove the element > > (OK so far, as list_del_rcu() doesn't overwrite the next > > pointer), then invokes csd_unlock() to release the element. 
> > CPU1 will see CPU3's element first, and CPU0's element second. But OK. > > > o CPU 0 then invokes another smp_call_function_many(), also > > multiple CPUs, but -not- to CPU 2. It requeues the element > > that was just csd_unlock()ed above, carrying CPU 2 with it. > > It IPIs CPUs 1 and 3, but not CPU 2. > > > > o CPU 2 continues, and falls off the bottom of the list. It will > > continue to ignore CPU 0's IPI until some other CPU IPIs it. > > On some architectures, a single-target IPI won't cut it, only > > a multi-target IPI. > > > > Or am I missing something subtle here? > > Ah, right, yes. I place new entries at the HEAD not TAIL, so that in > this case we go from: > > Entry3->Entry0 > ^ > CPU2 > > to > > Entry0->Entry3 > ^ > CPU2 > > and CPU2 will continue the list iteration, visiting Entry3 twice, which > is harmless since it will have removed itself from the cpumask the first > time around. > > > If this really is a problem, there are a number of counter-based solutions > > to it. (Famous last words...) > > > > Abandoning all caution and attempting one on the fly... Make each CPU > > receiving an IPI increment one per-CPU counter upon entry, and increment > > it again upon exit with memory barriers after and before, respectively. > > Then any CPU with an even value can be ignored, and any CPU whose value > > changes can also be ignored. Of course, this means you have to scan all > > CPUs... But in the worst case, you also had to IPI them all. > > > > Given that this operation is relatively rare, it might be worth using > > shared reference counters, possibly one pair of such counters per (say) > > 16 CPUs. Then the caller flips the counter. > > Yep, I almost implemented a counting RCU which increments a global > counter on IPI entry and decrements on IPI exit, but then did the above > FIFO->FILO queue thingy. 
A simpler way (if it becomes necessary to add at the tail rather than the head) would be to periodically check for stalls of this sort and then resend the IPIs. > > Alternatively, you can explain to me why my scenario above cannot > > happen -- but at present, it will take some serious explaining!!! > > I hope to have adequately explained it, but please, feel free to poke > more holes into it ;-) You have indeed more than adequately explained it. The only thing resembling a hole that I have found thus far is the improbable scenario that repeatedly pulls a given CPU to the beginning of the list, so that this CPU never reaches the end. Again, please accept my apologies for my confusion!!! Thanx, Paul ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] generic-ipi: remove kmalloc() 2009-02-18 0:28 ` Paul E. McKenney 2009-02-18 10:42 ` Peter Zijlstra @ 2009-02-18 16:15 ` Oleg Nesterov 2009-02-18 19:47 ` Paul E. McKenney 1 sibling, 1 reply; 16+ messages in thread From: Oleg Nesterov @ 2009-02-18 16:15 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell, linux-kernel On 02/17, Paul E. McKenney wrote: > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote: > > +static void csd_lock(struct call_single_data *data) > > { > > - /* Wait for response */ > > - do { > > - if (!(data->flags & CSD_FLAG_WAIT)) > > - break; > > + while (data->flags & CSD_FLAG_LOCK) > > cpu_relax(); > > - } while (1); > > + data->flags = CSD_FLAG_LOCK; > > We do need an smp_mb() here, otherwise, the call from > smp_call_function_single() could be reordered by either CPU or compiler > as follows: > > data->func = func; > data->info = info; > csd_lock(data); > > This might come as a bit of a surprise to the other CPU still trying to > use the old values for data->func and data->info. Could you explain a bit more here? The compiler can't re-order this code due to cpu_relax(). The CPU can re-order, but this doesn't matter because both the sender and ipi handler take call_single_queue->lock. And, given that csd_unlock() does mb() before clearing CSD_FLAG_LOCK, how is it possible that the other CPU (ipi handler) still uses the old values in *data after we see !CSD_FLAG_LOCK ? > Note that this smb_mb() is required even if cpu_relax() contains a > memory barrier, as it is possible to execute csd_lock_wait() without > executing the cpu_relax(), if you get there at just the right time. Can't understand... Nobody can do csd_wait() on this per-cpu entry, but I guess you meant something else. > OK... What prevents the following sequence of events? > > o CPU 0 calls smp_call_function_many(), targeting numerous CPUs, > including CPU 2. 
CPU 0 therefore enqueues this on the global > call_function.queue. "wait" is not specified, so CPU 0 returns > immediately after sending the IPIs. > > It decrements the ->refs field, but, finding the result > non-zero, refrains from removing the element that CPU 0 > enqueued, and also refrains from invoking csd_unlock(). > > o CPU 3 also receives the IPI, and also calls the needed function. > Now, only CPU 1 need receive the IPI for the element to be > removed. so we have a single entry E0 on list, > o CPU 3 calls smp_call_function_many(), targeting numerous CPUs, > but -not- including CPU 2. CPU 3 therefore also this on the > global call_function.queue and sends the IPIs, but no IPI for > CPU 2. Your choice as to whether CPU 3 waits or not. now we have E3 -> E0 > o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the > list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()), > and then... it actually sees E3, not E0 > o CPU 1 finally re-enables irqs and receives the IPIs!!! It > finds CPU 0's element on the queue, calls the function, > decrements the ->refs field, and finds that it is zero. > So, CPU 1 invokes list_del_rcu() to remove the element > (OK so far, as list_del_rcu() doesn't overwrite the next > pointer), then invokes csd_unlock() to release the element. > > o CPU 0 then invokes another smp_call_function_many(), also > multiple CPUs, but -not- to CPU 2. It requeues the element > that was just csd_unlock()ed above, carrying CPU 2 with it. > It IPIs CPUs 1 and 3, but not CPU 2. and inserts the element E0 at the head of the list again, > > o CPU 2 continues, and falls off the bottom of the list. afaics, it doesn't. Every time smp_call_function_many() reuses the element, it sets its ->next pointer to the head of the list. 
If we race with another CPU which fetches this pointer, this CPU has to re-scan the whole list, but since we always modify/read data under data->lock this should be safe: that CPU must notice (!cpumask_test_cpu(cpu, data->cpumask)). No? Oleg. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 16:15 ` Oleg Nesterov
@ 2009-02-18 19:47 ` Paul E. McKenney
  2009-02-18 20:12 ` Oleg Nesterov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-18 19:47 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
> On 02/17, Paul E. McKenney wrote:
> >
> > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > +static void csd_lock(struct call_single_data *data)
> > >  {
> > > -	/* Wait for response */
> > > -	do {
> > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > -			break;
> > > +	while (data->flags & CSD_FLAG_LOCK)
> > >  		cpu_relax();
> > > -	} while (1);
> > > +	data->flags = CSD_FLAG_LOCK;
> >
> > We do need an smp_mb() here, otherwise, the call from
> > smp_call_function_single() could be reordered by either CPU or compiler
> > as follows:
> >
> >	data->func = func;
> >	data->info = info;
> >	csd_lock(data);
> >
> > This might come as a bit of a surprise to the other CPU still trying to
> > use the old values for data->func and data->info.
>
> Could you explain a bit more here?
>
> The compiler can't re-order this code due to cpu_relax(). The CPU can
> re-order, but this doesn't matter because both the sender and the ipi
> handler take call_single_queue->lock.
>
> And, given that csd_unlock() does mb() before clearing CSD_FLAG_LOCK,
> how is it possible that the other CPU (the ipi handler) still uses the
> old values in *data after we see !CSD_FLAG_LOCK ?

Good point on cpu_relax(), which appears to be at least a compiler
barrier on all architectures.

I must confess to being in the habit of assuming reordering unless I
can prove that such reordering cannot happen.

I am running tests of this code snippet on POWER, but do not have
access to ARM, which some say has more tendency to ignore control-flow
dependencies than Power does.  I will let people know what comes of
the tests.

> > Note that this smp_mb() is required even if cpu_relax() contains a
> > memory barrier, as it is possible to execute csd_lock_wait() without
> > executing the cpu_relax(), if you get there at just the right time.
>
> Can't understand... Nobody can do csd_wait() on this per-cpu entry,
> but I guess you meant something else.
>
> > OK...  What prevents the following sequence of events?
> >
> > o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> >	including CPU 2.  CPU 0 therefore enqueues this on the global
> >	call_function.queue.  "wait" is not specified, so CPU 0 returns
> >	immediately after sending the IPIs.
> >
> >	It decrements the ->refs field, but, finding the result
> >	non-zero, refrains from removing the element that CPU 0
> >	enqueued, and also refrains from invoking csd_unlock().
> >
> > o	CPU 3 also receives the IPI, and also calls the needed function.
> >	Now, only CPU 1 need receive the IPI for the element to be
> >	removed.
>
> so we have a single entry E0 on list,
>
> > o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> >	but -not- including CPU 2.  CPU 3 therefore also enqueues this
> >	on the global call_function.queue and sends the IPIs, but no IPI
> >	for CPU 2.  Your choice as to whether CPU 3 waits or not.
>
> now we have E3 -> E0
>
> > o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> >	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> >	and then...
>
> it actually sees E3, not E0
>
> > o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> >	finds CPU 0's element on the queue, calls the function,
> >	decrements the ->refs field, and finds that it is zero.
> >	So, CPU 1 invokes list_del_rcu() to remove the element
> >	(OK so far, as list_del_rcu() doesn't overwrite the next
> >	pointer), then invokes csd_unlock() to release the element.
> >
> > o	CPU 0 then invokes another smp_call_function_many(), also
> >	targeting multiple CPUs, but -not- CPU 2.  It requeues the
> >	element that was just csd_unlock()ed above, carrying CPU 2 with
> >	it.  It IPIs CPUs 1 and 3, but not CPU 2.
>
> and inserts the element E0 at the head of the list again,
>
> > o	CPU 2 continues, and falls off the bottom of the list.
>
> afaics, it doesn't.
>
> Every time smp_call_function_many() reuses the element, it sets its
> ->next pointer to the head of the list. If we race with another CPU
> which fetches this pointer, this CPU has to re-scan the whole list,
> but since we always modify/read data under data->lock this should
> be safe; that CPU must notice !cpumask_test_cpu(cpu, data->cpumask).
>
> No?

You are quite correct.  I guess I should have gone home early instead
of reviewing Peter's patch...  :-/

							Thanx, Paul
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 19:47 ` Paul E. McKenney
@ 2009-02-18 20:12 ` Oleg Nesterov
  2009-02-19  2:40 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2009-02-18 20:12 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On 02/18, Paul E. McKenney wrote:
>
> On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
> >
> > On 02/17, Paul E. McKenney wrote:
> > >
> > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > > +static void csd_lock(struct call_single_data *data)
> > > >  {
> > > > -	/* Wait for response */
> > > > -	do {
> > > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > > -			break;
> > > > +	while (data->flags & CSD_FLAG_LOCK)
> > > >  		cpu_relax();
> > > > -	} while (1);
> > > > +	data->flags = CSD_FLAG_LOCK;
> > >
> > > We do need an smp_mb() here, otherwise, the call from
> > > smp_call_function_single() could be reordered by either CPU or
> > > compiler as follows:
> > >
> > >	data->func = func;
> > >	data->info = info;
> > >	csd_lock(data);
> > >
> > > This might come as a bit of a surprise to the other CPU still trying
> > > to use the old values for data->func and data->info.
> >
> > Could you explain a bit more here?
> >
> > The compiler can't re-order this code due to cpu_relax(). The CPU can
> > re-order, but this doesn't matter because both the sender and the ipi
> > handler take call_single_queue->lock.
> >
> > And, given that csd_unlock() does mb() before clearing CSD_FLAG_LOCK,
> > how is it possible that the other CPU (the ipi handler) still uses the
> > old values in *data after we see !CSD_FLAG_LOCK ?
>
> Good point on cpu_relax(), which appears to be at least a compiler
> barrier on all architectures.
>
> I must confess to being in the habit of assuming reordering unless I
> can prove that such reordering cannot happen.

Yes, probably you are right...

But since almost nobody (except you ;) really understands this magic,
it would be nice to have a comment which explains exactly what the
reason for the mb() is.  Otherwise the code is so hard to read: if you
don't understand the mb(), then you have probably missed something
important.

> > Every time smp_call_function_many() reuses the element, it sets its
> > ->next pointer to the head of the list. If we race with another CPU
> > which fetches this pointer, this CPU has to re-scan the whole list,
> > but since we always modify/read data under data->lock this should
> > be safe; that CPU must notice !cpumask_test_cpu(cpu, data->cpumask).
>
> You are quite correct.  I guess I should have gone home early instead
> of reviewing Peter's patch...  :-/

In that case I shouldn't even try to read this series ;) I was wrong so
many times...

Oleg.
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18 20:12 ` Oleg Nesterov
@ 2009-02-19  2:40 ` Paul E. McKenney
  2009-02-19  8:33 ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-19  2:40 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, Feb 18, 2009 at 09:12:52PM +0100, Oleg Nesterov wrote:
> On 02/18, Paul E. McKenney wrote:
> >
> > On Wed, Feb 18, 2009 at 05:15:15PM +0100, Oleg Nesterov wrote:
> > >
> > > On 02/17, Paul E. McKenney wrote:
> > > >
> > > > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> > > > > +static void csd_lock(struct call_single_data *data)
> > > > >  {
> > > > > -	/* Wait for response */
> > > > > -	do {
> > > > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > > > -			break;
> > > > > +	while (data->flags & CSD_FLAG_LOCK)
> > > > >  		cpu_relax();
> > > > > -	} while (1);
> > > > > +	data->flags = CSD_FLAG_LOCK;
> > > >
> > > > We do need an smp_mb() here, otherwise, the call from
> > > > smp_call_function_single() could be reordered by either CPU or
> > > > compiler as follows:
> > > >
> > > >	data->func = func;
> > > >	data->info = info;
> > > >	csd_lock(data);
> > > >
> > > > This might come as a bit of a surprise to the other CPU still
> > > > trying to use the old values for data->func and data->info.
> > >
> > > Could you explain a bit more here?
> > >
> > > The compiler can't re-order this code due to cpu_relax(). The CPU
> > > can re-order, but this doesn't matter because both the sender and
> > > the ipi handler take call_single_queue->lock.
> > >
> > > And, given that csd_unlock() does mb() before clearing
> > > CSD_FLAG_LOCK, how is it possible that the other CPU (the ipi
> > > handler) still uses the old values in *data after we see
> > > !CSD_FLAG_LOCK ?
> >
> > Good point on cpu_relax(), which appears to be at least a compiler
> > barrier on all architectures.
> >
> > I must confess to being in the habit of assuming reordering unless I
> > can prove that such reordering cannot happen.
>
> Yes, probably you are right...
>
> But since almost nobody (except you ;) really understands this magic,
> it would be nice to have a comment which explains exactly what the
> reason for the mb() is.  Otherwise the code is so hard to read: if you
> don't understand the mb(), then you have probably missed something
> important.

An hour-long stress test on both Power 5 and Power 6 failed to locate
a problem, though that of course does not prove the lack of a problem,
particularly for CPUs that don't pay as much attention to
control-dependency ordering as Power does.  :-/

The test may be found in CodeSamples/advsync/special/mbtest/mb_lhs_ws.c
in git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git,
if anyone wants to take a look.

In the meantime, I nominate the following comment at the end of
csd_lock():

	/*
	 * prevent CPU from reordering the above assignment to ->flags
	 * with any subsequent assignments to other fields of the
	 * specified call_single_data structure.
	 */

	smp_mb();  /* See above block comment. */

> > > Every time smp_call_function_many() reuses the element, it sets its
> > > ->next pointer to the head of the list. If we race with another CPU
> > > which fetches this pointer, this CPU has to re-scan the whole list,
> > > but since we always modify/read data under data->lock this should
> > > be safe; that CPU must notice !cpumask_test_cpu(cpu, data->cpumask).
> >
> > You are quite correct.  I guess I should have gone home early instead
> > of reviewing Peter's patch...  :-/
>
> In that case I shouldn't even try to read this series ;) I was wrong so
> many times...

I suppose that I should be happy to be wrong nine times if I find a
subtle bug on the tenth attempt, but somehow it doesn't feel that
way.  ;-)

							Thanx, Paul
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-19  2:40 ` Paul E. McKenney
@ 2009-02-19  8:33 ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-19 8:33 UTC (permalink / raw)
To: paulmck
Cc: Oleg Nesterov, Linus Torvalds, Nick Piggin, Jens Axboe,
	Ingo Molnar, Rusty Russell, linux-kernel

On Wed, 2009-02-18 at 18:40 -0800, Paul E. McKenney wrote:
>
> In the meantime, I nominate the following comment at the end of
> csd_lock():
>
>	/*
>	 * prevent CPU from reordering the above assignment to ->flags
>	 * with any subsequent assignments to other fields of the
>	 * specified call_single_data structure.
>	 */
>
>	smp_mb();  /* See above block comment. */

Done, thanks Paul!
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
  2009-02-18  0:28 ` Paul E. McKenney
@ 2009-02-18  5:31 ` Rusty Russell
  2009-02-18 10:52 ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Rusty Russell @ 2009-02-18 5:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, linux-kernel, Oleg Nesterov

On Wednesday 18 February 2009 08:29:06 Peter Zijlstra wrote:
> Remove the use of kmalloc() from the smp_call_function_*() calls.

This patch is actually quite nice.  Not sure it's correct, but it's neat :)

One minor quibble:

> +	data = &per_cpu(csd_data, me);

"data = &__get_cpu_var(csd_data);" is slightly preferred.  A few less
cycles.

Thanks,
Rusty.
* Re: [PATCH 2/3] generic-ipi: remove kmalloc()
  2009-02-18  5:31 ` Rusty Russell
@ 2009-02-18 10:52 ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-18 10:52 UTC (permalink / raw)
To: Rusty Russell
Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, linux-kernel, Oleg Nesterov

On Wed, 2009-02-18 at 16:01 +1030, Rusty Russell wrote:
> On Wednesday 18 February 2009 08:29:06 Peter Zijlstra wrote:
> > Remove the use of kmalloc() from the smp_call_function_*() calls.
>
> This patch is actually quite nice.  Not sure it's correct, but it's neat :)

:-)

> One minor quibble:
>
> > +	data = &per_cpu(csd_data, me);
>
> "data = &__get_cpu_var(csd_data);" is slightly preferred.  A few less
> cycles.

Thanks, the next version will include the below.

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -325,7 +325,7 @@ int smp_call_function_single(int cpu, vo
 		 * will make sure the callee is done with the
 		 * data before a new caller will use it.
 		 */
-		data = &per_cpu(csd_data, me);
+		data = &__get_cpu_var(csd_data);
 		csd_lock(data);
 	} else {
 		data = &d;
@@ -413,7 +413,7 @@ void smp_call_function_many(const struct
 		return;
 	}

-	data = &per_cpu(cfd_data, me);
+	data = &__get_cpu_var(cfd_data);
 	csd_lock(&data->csd);

 	spin_lock_irqsave(&data->lock, flags);
* [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT
  2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
  2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
@ 2009-02-17 21:59 ` Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-17 21:59 UTC (permalink / raw)
To: Linus Torvalds, Nick Piggin, Jens Axboe, Paul E. McKenney,
	Ingo Molnar, Rusty Russell
Cc: linux-kernel, Oleg Nesterov, Peter Zijlstra

[-- Attachment #1: smp-wait.patch --]
[-- Type: text/plain, Size: 7199 bytes --]

Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework
the code so that we can use CSD_FLAG_LOCK for both purposes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/blk-softirq.c |    2 -
 include/linux/smp.h |    3 +
 kernel/sched.c      |    2 -
 kernel/smp.c        |   90 +++++++++++++---------------------------------------
 kernel/softirq.c    |    2 -
 5 files changed, 28 insertions(+), 71 deletions(-)

Index: linux-2.6/block/blk-softirq.c
===================================================================
--- linux-2.6.orig/block/blk-softirq.c
+++ linux-2.6/block/blk-softirq.c
@@ -64,7 +64,7 @@ static int raise_blk_irq(int cpu, struct
 		data->info = rq;
 		data->flags = 0;

-		__smp_call_function_single(cpu, data);
+		__smp_call_function_single(cpu, data, 0);
 		return 0;
 	}

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1120,7 +1120,7 @@ static void hrtick_start(struct rq *rq,
 	if (rq == this_rq()) {
 		hrtimer_restart(timer);
 	} else if (!rq->hrtick_csd_pending) {
-		__smp_call_function_single(cpu_of(rq), &rq->hrtick_csd);
+		__smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0);
 		rq->hrtick_csd_pending = 1;
 	}
 }

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -23,8 +23,7 @@ static struct {
 };

 enum {
-	CSD_FLAG_WAIT		= 0x01,
-	CSD_FLAG_LOCK		= 0x02,
+	CSD_FLAG_LOCK		= 0x01,
 };

 struct call_function_data {
@@ -95,41 +94,21 @@ static int __cpuinit init_call_single_da
 early_initcall(init_call_single_data);

 /*
- * csd_wait/csd_complete are used for synchronous ipi calls
- */
-static void csd_wait_prepare(struct call_single_data *data)
-{
-	data->flags |= CSD_FLAG_WAIT;
-}
-
-static void csd_complete(struct call_single_data *data)
-{
-	if (data->flags & CSD_FLAG_WAIT) {
-		/*
-		 * ensure we're all done before saying we are
-		 */
-		smp_mb();
-		data->flags &= ~CSD_FLAG_WAIT;
-	}
-}
-
-static void csd_wait(struct call_single_data *data)
-{
-	while (data->flags & CSD_FLAG_WAIT)
-		cpu_relax();
-}
-
-/*
  * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
  *
  * For non-synchronous ipi calls the csd can still be in use by the previous
  * function call. For multi-cpu calls its even more interesting as we'll have
  * to ensure no other cpu is observing our csd.
  */
-static void csd_lock(struct call_single_data *data)
+static void csd_lock_wait(struct call_single_data *data)
 {
 	while (data->flags & CSD_FLAG_LOCK)
 		cpu_relax();
+}
+
+static void csd_lock(struct call_single_data *data)
+{
+	csd_lock_wait(data);
 	data->flags = CSD_FLAG_LOCK;
 }

@@ -147,11 +126,12 @@ static void csd_unlock(struct call_singl
  * Insert a previously allocated call_single_data element for execution
  * on the given CPU. data must already have ->func, ->info, and ->flags set.
  */
-static void generic_exec_single(int cpu, struct call_single_data *data)
+static
+void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 {
 	struct call_single_queue *dst = &per_cpu(call_single_queue, cpu);
-	int wait = data->flags & CSD_FLAG_WAIT, ipi;
 	unsigned long flags;
+	int ipi;

 	spin_lock_irqsave(&dst->lock, flags);
 	ipi = list_empty(&dst->list);
@@ -174,7 +154,7 @@ static void generic_exec_single(int cpu,
 		arch_send_call_function_single_ipi(cpu);

 	if (wait)
-		csd_wait(data);
+		csd_lock_wait(data);
 }

 /*
@@ -224,7 +204,6 @@ void generic_smp_call_function_interrupt
 		if (refs)
 			continue;

-		csd_complete(&data->csd);
 		csd_unlock(&data->csd);
 	}

@@ -262,9 +241,6 @@ void generic_smp_call_function_single_in
 			data->func(data->info);

-			if (data_flags & CSD_FLAG_WAIT)
-				csd_complete(data);
-
 			/*
 			 * Unlocked CSDs are valid through generic_exec_single()
 			 */
@@ -305,36 +281,16 @@ int smp_call_function_single(int cpu, vo
 		func(info);
 		local_irq_restore(flags);
 	} else if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
-		struct call_single_data *data;
+		struct call_single_data *data = &d;

-		if (!wait) {
-			/*
-			 * We are calling a function on a single CPU
-			 * and we are not going to wait for it to finish.
-			 * We use a per cpu data to pass the information to
-			 * that CPU. Since all callers of this code will
-			 * use the same data, we must synchronize the
-			 * callers to prevent a new caller from corrupting
-			 * the data before the callee can access it.
-			 *
-			 * The CSD_FLAG_LOCK is used to let us know when
-			 * the IPI handler is done with the data.
-			 * The first caller will set it, and the callee
-			 * will clear it. The next caller must wait for
-			 * it to clear before we set it again. This
-			 * will make sure the callee is done with the
-			 * data before a new caller will use it.
-			 */
+		if (!wait)
 			data = &per_cpu(csd_data, me);
-			csd_lock(data);
-		} else {
-			data = &d;
-			csd_wait_prepare(data);
-		}
+
+		csd_lock(data);

 		data->func = func;
 		data->info = info;
-		generic_exec_single(cpu, data);
+		generic_exec_single(cpu, data, wait);
 	} else {
 		err = -ENXIO;	/* CPU not online */
 	}

@@ -354,12 +310,15 @@ EXPORT_SYMBOL(smp_call_function_single);
 * instance.
 *
 */
-void __smp_call_function_single(int cpu, struct call_single_data *data)
+void __smp_call_function_single(int cpu, struct call_single_data *data,
+				int wait)
 {
+	csd_lock(data);
+
 	/* Can deadlock when called with interrupts disabled */
-	WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
+	WARN_ON(wait && irqs_disabled());

-	generic_exec_single(cpu, data);
+	generic_exec_single(cpu, data, wait);
 }

 /* FIXME: Shim for archs using old arch_send_call_function_ipi API. */
@@ -417,9 +376,6 @@ void smp_call_function_many(const struct
 	csd_lock(&data->csd);

 	spin_lock_irqsave(&data->lock, flags);
-	if (wait)
-		csd_wait_prepare(&data->csd);
-
 	data->csd.func = func;
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
@@ -448,7 +404,7 @@ void smp_call_function_many(const struct

 	/* optionally wait for the CPUs to complete */
 	if (wait)
-		csd_wait(&data->csd);
+		csd_lock_wait(&data->csd);
 }
 EXPORT_SYMBOL(smp_call_function_many);

Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -518,7 +518,7 @@ static int __try_remote_softirq(struct c
 		cp->flags = 0;
 		cp->priv = softirq;

-		__smp_call_function_single(cpu, cp);
+		__smp_call_function_single(cpu, cp, 0);
 		return 0;
 	}
 	return 1;

Index: linux-2.6/include/linux/smp.h
===================================================================
--- linux-2.6.orig/include/linux/smp.h
+++ linux-2.6/include/linux/smp.h
@@ -82,7 +82,8 @@ smp_call_function_mask(cpumask_t mask, v
 	return 0;
 }

-void __smp_call_function_single(int cpuid, struct call_single_data *data);
+void __smp_call_function_single(int cpuid, struct call_single_data *data,
+				int wait);

 /*
  * Generic and arch helpers

--
end of thread, other threads: [~2009-02-19 8:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2009-02-17 21:59 [PATCH 0/3] generic-ipi: patches -v5 Peter Zijlstra
2009-02-17 21:59 ` [PATCH 1/3] generic-ipi: simplify the barriers Peter Zijlstra
2009-02-18  0:27 ` Paul E. McKenney
2009-02-18  9:45 ` Peter Zijlstra
2009-02-17 21:59 ` [PATCH 2/3] generic-ipi: remove kmalloc() Peter Zijlstra
2009-02-18  0:28 ` Paul E. McKenney
2009-02-18 10:42 ` Peter Zijlstra
2009-02-18 16:06 ` Paul E. McKenney
2009-02-18 16:15 ` Oleg Nesterov
2009-02-18 19:47 ` Paul E. McKenney
2009-02-18 20:12 ` Oleg Nesterov
2009-02-19  2:40 ` Paul E. McKenney
2009-02-19  8:33 ` Peter Zijlstra
2009-02-18  5:31 ` Rusty Russell
2009-02-18 10:52 ` Peter Zijlstra
2009-02-17 21:59 ` [PATCH 3/3] generic-ipi: remove CSD_FLAG_WAIT Peter Zijlstra