From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753083AbZBRKnB (ORCPT ); Wed, 18 Feb 2009 05:43:01 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751559AbZBRKmw (ORCPT ); Wed, 18 Feb 2009 05:42:52 -0500 Received: from bombadil.infradead.org ([18.85.46.34]:51953 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751048AbZBRKmv (ORCPT ); Wed, 18 Feb 2009 05:42:51 -0500 Subject: Re: [PATCH 2/3] generic-ipi: remove kmalloc() From: Peter Zijlstra To: paulmck@linux.vnet.ibm.com Cc: Linus Torvalds , Nick Piggin , Jens Axboe , Ingo Molnar , Rusty Russell , linux-kernel@vger.kernel.org, Oleg Nesterov In-Reply-To: <20090218002823.GA25408@linux.vnet.ibm.com> References: <20090217215904.059250145@chello.nl> <20090217220053.500765201@chello.nl> <20090218002823.GA25408@linux.vnet.ibm.com> Content-Type: text/plain Date: Wed, 18 Feb 2009 11:42:17 +0100 Message-Id: <1234953737.4637.57.camel@laptop> Mime-Version: 1.0 X-Mailer: Evolution 2.25.90 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote: > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote: > > +static int > > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu) > > +{ > > + long cpu = (long)hcpu; > > + struct call_function_data *cfd = &per_cpu(cfd_data, cpu); > > + > > + switch (action) { > > + case CPU_UP_PREPARE: > > + case CPU_UP_PREPARE_FROZEN: > > + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, > > + cpu_to_node(cpu))) > > + return NOTIFY_BAD; > > + break; > > + > > +#ifdef CONFIG_CPU_HOTPLUG > > + case CPU_UP_CANCELED: > > + case CPU_UP_CANCELED_FROZEN: > > + > > + case CPU_DEAD: > > + case CPU_DEAD_FROZEN: > > + free_cpumask_var(cfd->cpumask); > > + break; > > +#endif > > + > > + return NOTIFY_OK; > > + }; > > +} > > Hmmm.... Why not the following? Do we really need to free the cpumask > when a CPU departs, given that it might return later? Probably not, but that's what we have these callbacks for, might as well use them. > > +/* > > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources > > + * > > + * For non-synchronous ipi calls the csd can still be in use by the previous > > + * function call. For multi-cpu calls its even more interesting as we'll have > > + * to ensure no other cpu is observing our csd. > > + */ > > +static void csd_lock(struct call_single_data *data) > > { > > - /* Wait for response */ > > - do { > > - if (!(data->flags & CSD_FLAG_WAIT)) > > - break; > > + while (data->flags & CSD_FLAG_LOCK) > > cpu_relax(); > > - } while (1); > > + data->flags = CSD_FLAG_LOCK; > > We do need an smp_mb() here, otherwise, the call from > smp_call_function_single() could be reordered by either CPU or compiler > as follows: > > data->func = func; > data->info = info; > csd_lock(data); > > This might come as a bit of a surprise to the other CPU still trying to > use the old values for data->func and data->info. > > Note that this smb_mb() is required even if cpu_relax() contains a > memory barrier, as it is possible to execute csd_lock_wait() without > executing the cpu_relax(), if you get there at just the right time. I'm not quite sure I follow, if we observe !(flags & LOCK) then we're guaranteed that no cpu will still needs func and info. They might still be observing the list entry, but should not find themselves on the cpumask -- which is guarded by ->lock. Anyway, I'm happy to add the mb(). > > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt > > * It's ok to use list_for_each_rcu() here even though we may delete > > * 'pos', since list_del_rcu() doesn't clear ->next > > */ > > - rcu_read_lock(); > > - list_for_each_entry_rcu(data, &call_function_queue, csd.list) { > > + list_for_each_entry_rcu(data, &call_function.queue, csd.list) { > > OK... What prevents the following sequence of events? > > o CPU 0 calls smp_call_function_many(), targeting numerous CPUs, > including CPU 2. CPU 0 therefore enqueues this on the global > call_function.queue. "wait" is not specified, so CPU 0 returns > immediately after sending the IPIs. > > o CPU 1 disables irqs and leaves them disabled for awhile. > > o CPU 2 receives the IPI, and duly calls the needed function. > It decrements the ->refs field, but, finding the result > non-zero, refrains from removing the element that CPU 0 > enqueued, and also refrains from invoking csd_unlock(). > > o CPU 3 also receives the IPI, and also calls the needed function. > Now, only CPU 1 need receive the IPI for the element to be > removed. > > o CPU 3 calls smp_call_function_many(), targeting numerous CPUs, > but -not- including CPU 2. CPU 3 therefore also this on the > global call_function.queue and sends the IPIs, but no IPI for > CPU 2. Your choice as to whether CPU 3 waits or not. Right, so the queue is Entry3->Entry0 (we place new entries on at the head). > o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the > list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()), > and then... CPU0 ? You just excluded cpu2, cpu1 is still blocked, and cpu3 send the ipi. Furthermore, Entry3 would be first, but suppose it is Entry0, that makes the scenario work best. > o CPU 1 finally re-enables irqs and receives the IPIs!!! It > finds CPU 0's element on the queue, calls the function, > decrements the ->refs field, and finds that it is zero. > So, CPU 1 invokes list_del_rcu() to remove the element > (OK so far, as list_del_rcu() doesn't overwrite the next > pointer), then invokes csd_unlock() to release the element. CPU1 will see CPU3's element first, and CPU0's element second. But OK. > o CPU 0 then invokes another smp_call_function_many(), also > multiple CPUs, but -not- to CPU 2. It requeues the element > that was just csd_unlock()ed above, carrying CPU 2 with it. > It IPIs CPUs 1 and 3, but not CPU 2. > > o CPU 2 continues, and falls off the bottom of the list. It will > continue to ignore CPU 0's IPI until some other CPU IPIs it. > On some architectures, a single-target IPI won't cut it, only > a multi-target IPI. > > Or am I missing something subtle here? Ah, right, yes. I place new entries at the HEAD not TAIL, so that in this case we go from: Entry3->Entry0 ^ CPU2 to Entry0->Entry3 ^ CPU2 and CPU2 will continue the list iteration, visiting Entry3 twice, which is harmless since it will have removed itself from the cpumask the first time around. > If this really is a problem, there are a number of counter-based solutions > to it. (Famous last words...) > > Abandoning all caution and attempting one on the fly... Make each CPU > receiving an IPI increment one per-CPU counter upon entry, and increment > it again upon exit with memory barriers after and before, respectively. > Then any CPU with an even value can be ignored, and any CPU whose value > changes can also be ignored. Of course, this means you have to scan all > CPUs... But in the worst case, you also had to IPI them all. > > Given that this operation is relatively rare, it might be worth using > shared reference counters, possibly one pair of such counters per (say) > 16 CPUs. Then the caller flips the counter. Yep, I almost implemented a counting RCU which increments a global counter on IPI entry and decrements on IPI exit, but then did the above FIFO->FILO queue thingy. > Alternatively, you can explain to me why my scenario above cannot > happen -- but at present, it will take some serious explaining!!! I hope to have adequately explained it, but please, feel free to poke more holes into it ;-)