Date: Wed, 18 Feb 2009 08:06:37 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Jens Axboe, Ingo Molnar, Rusty Russell,
	linux-kernel@vger.kernel.org, Oleg Nesterov
Subject: Re: [PATCH 2/3] generic-ipi: remove kmalloc()
Message-ID: <20090218160637.GA7011@linux.vnet.ibm.com>
References: <20090217215904.059250145@chello.nl>
	<20090217220053.500765201@chello.nl>
	<20090218002823.GA25408@linux.vnet.ibm.com>
	<1234953737.4637.57.camel@laptop>
In-Reply-To: <1234953737.4637.57.camel@laptop>

On Wed, Feb 18, 2009 at 11:42:17AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:
> >
> > > +static int
> > > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > > +{
> > > +	long cpu = (long)hcpu;
> > > +	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
> > > +
> > > +	switch (action) {
> > > +	case CPU_UP_PREPARE:
> > > +	case CPU_UP_PREPARE_FROZEN:
> > > +		if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> > > +				cpu_to_node(cpu)))
> > > +			return NOTIFY_BAD;
> > > +		break;
> > > +
> > > +#ifdef CONFIG_HOTPLUG_CPU
> > > +	case CPU_UP_CANCELED:
> > > +	case CPU_UP_CANCELED_FROZEN:
> > > +
> > > +	case CPU_DEAD:
> > > +	case CPU_DEAD_FROZEN:
> > > +		free_cpumask_var(cfd->cpumask);
> > > +		break;
> > > +#endif
> > > +	};
> > > +
> > > +	return NOTIFY_OK;
> > > +}
> >
> > Hmmm....  Why not the following?  Do we really need to free the cpumask
> > when a CPU departs, given that it might return later?
>
> Probably not, but that's what we have these callbacks for, might as well
> use them.

Fair enough...

> > > +/*
> > > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> > > + *
> > > + * For non-synchronous ipi calls the csd can still be in use by the previous
> > > + * function call. For multi-cpu calls its even more interesting as we'll have
> > > + * to ensure no other cpu is observing our csd.
> > > + */
> > > +static void csd_lock(struct call_single_data *data)
> > >  {
> > > -	/* Wait for response */
> > > -	do {
> > > -		if (!(data->flags & CSD_FLAG_WAIT))
> > > -			break;
> > > +	while (data->flags & CSD_FLAG_LOCK)
> > >  		cpu_relax();
> > > -	} while (1);
> > > +	data->flags = CSD_FLAG_LOCK;
> >
> > We do need an smp_mb() here, otherwise, the call from
> > smp_call_function_single() could be reordered by either CPU or compiler
> > as follows:
> >
> >	data->func = func;
> >	data->info = info;
> >	csd_lock(data);
> >
> > This might come as a bit of a surprise to the other CPU still trying to
> > use the old values for data->func and data->info.
> >
> > Note that this smp_mb() is required even if cpu_relax() contains a
> > memory barrier, as it is possible to execute csd_lock_wait() without
> > executing the cpu_relax(), if you get there at just the right time.
>
> I'm not quite sure I follow, if we observe !(flags & LOCK) then we're
> guaranteed that no cpu still needs func and info.  They might still
> be observing the list entry, but should not find themselves on the
> cpumask -- which is guarded by ->lock.

The compiler could reorder as above, in which case the other CPU might
still be looking at the ->func and ->info fields when those stores happen,
but might have done csd_unlock() by the time this CPU does its csd_lock().

> Anyway, I'm happy to add the mb().

Please!  ;-)
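For concreteness, here is a rough sketch of what I am asking for --
untested, written against my reading of your patch, and the comment
wording is of course mine:

static void csd_lock(struct call_single_data *data)
{
	while (data->flags & CSD_FLAG_LOCK)
		cpu_relax();
	data->flags = CSD_FLAG_LOCK;

	/*
	 * Order the above store to ->flags before the caller's
	 * subsequent stores to ->func and ->info, so that the previous
	 * owner of this csd cannot observe the new values while it is
	 * still entitled to the old ones.
	 */
	smp_mb();
}

Whether the barrier lives here or out in the callers is your call, of
course, as long as it is there somewhere.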
> > > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
> > >  	 * It's ok to use list_for_each_rcu() here even though we may delete
> > >  	 * 'pos', since list_del_rcu() doesn't clear ->next
> > >  	 */
> > > -	rcu_read_lock();
> > > -	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
> > > +	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> >
> > OK...  What prevents the following sequence of events?
> >
> > o	CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> >	including CPU 2.  CPU 0 therefore enqueues this on the global
> >	call_function.queue.  "wait" is not specified, so CPU 0 returns
> >	immediately after sending the IPIs.
> >
> > o	CPU 1 disables irqs and leaves them disabled for awhile.
> >
> > o	CPU 2 receives the IPI, and duly calls the needed function.
> >	It decrements the ->refs field, but, finding the result
> >	non-zero, refrains from removing the element that CPU 0
> >	enqueued, and also refrains from invoking csd_unlock().
> >
> > o	CPU 3 also receives the IPI, and also calls the needed function.
> >	Now, only CPU 1 need receive the IPI for the element to be
> >	removed.
> >
> > o	CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> >	but -not- including CPU 2.  CPU 3 therefore also enqueues this
> >	on the global call_function.queue and sends the IPIs, but no
> >	IPI for CPU 2.  Your choice as to whether CPU 3 waits or not.
>
> Right, so the queue is Entry3->Entry0 (we place new entries at the
> head).

Gah!!!  I even remember looking at the list_add_rcu() and noting that
it was not a list_add_tail_rcu().  Please accept my apologies for my
confusion!!!

I suppose that a given CPU might get repeatedly recycled to the
beginning of the list, but someone would have to make that happen
before I would be inclined to worry about it.

> > o	CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> >	list.  CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> >	and then...
>
> CPU0 ? You just excluded cpu2, cpu1 is still blocked, and cpu3 sent the
> ipi.
>
> Furthermore, Entry3 would be first, but suppose it is Entry0, that makes
> the scenario work best.
>
> > o	CPU 1 finally re-enables irqs and receives the IPIs!!!  It
> >	finds CPU 0's element on the queue, calls the function,
> >	decrements the ->refs field, and finds that it is zero.
> >	So, CPU 1 invokes list_del_rcu() to remove the element
> >	(OK so far, as list_del_rcu() doesn't overwrite the next
> >	pointer), then invokes csd_unlock() to release the element.
>
> CPU1 will see CPU3's element first, and CPU0's element second.  But OK.
>
> > o	CPU 0 then invokes another smp_call_function_many(), also
> >	multiple CPUs, but -not- to CPU 2.  It requeues the element
> >	that was just csd_unlock()ed above, carrying CPU 2 with it.
> >	It IPIs CPUs 1 and 3, but not CPU 2.
> >
> > o	CPU 2 continues, and falls off the bottom of the list.  It will
> >	continue to ignore CPU 0's IPI until some other CPU IPIs it.
> >	On some architectures, a single-target IPI won't cut it, only
> >	a multi-target IPI.
> >
> > Or am I missing something subtle here?
>
> Ah, right, yes.  I place new entries at the HEAD not TAIL, so that in
> this case we go from:
>
>  Entry3->Entry0
>            ^
>           CPU2
>
> to
>
>  Entry0->Entry3
>     ^
>    CPU2
>
> and CPU2 will continue the list iteration, visiting Entry3 twice, which
> is harmless since it will have removed itself from the cpumask the first
> time around.
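And just to confirm that I am now reading this the same way that you
are: the property we are relying on is that smp_call_function_many()
adds new work at the head of the list, something like the following
(paraphrasing from memory, so the exact names might be off):

	/*
	 * New entries go at the HEAD of the queue.  A re-used element can
	 * therefore only move towards the head, so a CPU already iterating
	 * past it still reaches the end of the list.  At worst it revisits
	 * an entry, which is harmless because it will already have removed
	 * itself from that entry's cpumask.
	 */
	spin_lock_irqsave(&call_function.lock, flags);
	list_add_rcu(&data->csd.list, &call_function.queue);
	spin_unlock_irqrestore(&call_function.lock, flags);

A list_add_tail_rcu() there would reintroduce the lost-IPI scenario
above, so a comment at that point might save the next reviewer from
my confusion.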
> > If this really is a problem, there are a number of counter-based
> > solutions to it.  (Famous last words...)
> >
> > Abandoning all caution and attempting one on the fly...  Make each CPU
> > receiving an IPI increment one per-CPU counter upon entry, and increment
> > it again upon exit, with memory barriers after and before, respectively.
> > Then any CPU with an even value can be ignored, and any CPU whose value
> > changes can also be ignored.  Of course, this means you have to scan all
> > CPUs...  But in the worst case, you also had to IPI them all.
> >
> > Given that this operation is relatively rare, it might be worth using
> > shared reference counters, possibly one pair of such counters per (say)
> > 16 CPUs.  Then the caller flips the counter.
>
> Yep, I almost implemented a counting RCU which increments a global
> counter on IPI entry and decrements on IPI exit, but then did the above
> FIFO->FILO queue thingy.

A simpler way (if it becomes necessary to add at the tail rather than
the head) would be to periodically check for stalls of this sort and
then resend the IPIs.

> > Alternatively, you can explain to me why my scenario above cannot
> > happen -- but at present, it will take some serious explaining!!!
>
> I hope to have adequately explained it, but please, feel free to poke
> more holes into it ;-)

You have indeed more than adequately explained it.  The only thing
resembling a hole that I have found thus far is the improbable scenario
that repeatedly pulls a given CPU to the beginning of the list, so that
this CPU never reaches the end.

Again, please accept my apologies for my confusion!!!

							Thanx, Paul
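PS.  For completeness, here is roughly what I was hand-waving at with
the counter-based approach, in case tail insertion ever does become
necessary.  Completely untested, and the ipi_gen names are invented on
the spot:

static DEFINE_PER_CPU(unsigned long, ipi_gen);

/* Invoked on entry to the smp_call_function IPI handler. */
static void ipi_gen_enter(void)
{
	__get_cpu_var(ipi_gen)++;
	smp_mb();  /* Order the increment before the handler's list accesses. */
}

/* Invoked on exit from the smp_call_function IPI handler. */
static void ipi_gen_exit(void)
{
	smp_mb();  /* Order the handler's list accesses before the increment. */
	__get_cpu_var(ipi_gen)++;
}

/*
 * Has @cpu stopped referencing elements that were already queued when
 * @snap was taken?  An even snapshot means that it was not in the
 * handler at the time, and any change since then means that it has
 * passed through the handler at least once.
 */
static int ipi_gen_quiesced(int cpu, unsigned long snap)
{
	return !(snap & 1) || per_cpu(ipi_gen, cpu) != snap;
}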