From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
Date: Sun, 29 May 2011 10:54:13 +0300
Message-ID: <4DE1FBA5.6080905@redhat.com>
References: <20110526180518.GA3572@elte.hu> <4DDE97CE.4000302@redhat.com>
 <20110526202531.GA2765@elte.hu> <20110526230508.GA15983@Krystal>
 <20110527102533.GA24608@elte.hu> <20110527110729.GA26920@elte.hu>
 <4DE13AF0.2080001@redhat.com> <20110528183259.GA15019@elte.hu>
 <4DE1EA93.6040401@redhat.com> <20110529073550.GA21254@elte.hu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Mathieu Desnoyers, Pekka Enberg, Sasha Levin, john@jfloren.net,
 kvm@vger.kernel.org, asias.hejun@gmail.com, gorcunov@gmail.com,
 prasadjoshi124@gmail.com, "Paul E. McKenney"
To: Ingo Molnar
Return-path: 
Received: from mx1.redhat.com ([209.132.183.28]:51206 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751550Ab1E2Hyy
 (ORCPT ); Sun, 29 May 2011 03:54:54 -0400
In-Reply-To: <20110529073550.GA21254@elte.hu>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On 05/29/2011 10:35 AM, Ingo Molnar wrote:
> * Avi Kivity wrote:
>
> > On 05/28/2011 09:32 PM, Ingo Molnar wrote:
> > > * Avi Kivity wrote:
> > >
> > >> > So if you set a notification signal via fcntl(F_SETOWN) on the
> > >> > scheduler context switch event fd, the user-space RCU code will
> > >> > get a signal on every context switch.
> > >>
> > >> Context switches are completely uninteresting for userspace rcu:
> > >>
> > >>     rcu_read_lock();
> > >>     ---> context switch
> > >>
> > >> have we learned anything from that? no. User code is always
> > >> preemptible and migratable. If rcu_read_lock() prevented migration
> > >> somehow, then we'd know that a context switch means we've started a
> > >> grace period for this thread. But it doesn't, so we don't.
> > >
> > > Well, in the next mail i mentioned that we can do migration events as
> > > well, which would be useful: instead of having to keep track of
> > > nr_tasks RCU grace periods we could simplify it down to nr_cpus.
> >
> > I don't see how a migration event helps. It is completely
> > transparent from the task's point of view.
>
> It's not transparent at all if you index RCU data structures by the
> current CPU index, which the kernel implementation does.

But that's completely broken for userspace. The "current cpu index"
doesn't even exist, since you can't disable preemption.

> Doing that has the advantage of being much more cache-compressed than
> the TID index,

If you have more tasks than cpus; which isn't a given.

> and also having better worst-case grace period latency
> properties than a TID index.
>
> > > But if we indexed by the TID then we wouldnt need any scheduler
> > > bindings at all - this is the simpler approach.
> >
> > Yes, and it maps 1:1 to the kernel implementation (cpu = task).
>
> No, the kernel indexes grace period tracking (and the
> write-completion queues) by CPU index.

Do a conceptual #define cpu task and it all works out.

> > >> What's needed are explicit notifications about grace periods. For
> > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point.
> > >> For I/O threads, completion of processing of an event is also a
> > >> good point.
> > >
> > > Grace period notifications are needed too, obviously.
> >
> > I'd think they're sufficient, no? Is something else needed?
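As a sketch of what such an explicit notification could look like -
this assumes liburcu's QSBR flavour (urcu-qsbr.h), and vcpu_fd /
handle_exit() are made-up placeholders, not the actual kvm tools code:

/*
 * Sketch only: a vcpu thread announces a quiescent state every time it
 * comes back from KVM_RUN.  Assumes liburcu's QSBR flavour.
 */
#include <urcu-qsbr.h>          /* rcu_register_thread(), rcu_quiescent_state() */
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stddef.h>

static void handle_exit(int vcpu_fd)
{
        /* placeholder: look at the exit reason, emulate mmio/pio, ... */
        (void)vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
        int vcpu_fd = *(int *)arg;
        int ret;

        rcu_register_thread();  /* make this thread known to synchronize_rcu() */

        for (;;) {
                rcu_thread_offline();   /* guest may run for a long time */
                ret = ioctl(vcpu_fd, KVM_RUN, 0);
                rcu_thread_online();

                if (ret < 0)
                        break;

                /*
                 * Back in userspace, outside any read-side critical
                 * section: this is the explicit grace period notification.
                 */
                rcu_quiescent_state();

                handle_exit(vcpu_fd);
        }

        rcu_unregister_thread();
        return NULL;
}

With that, synchronize_rcu() on the writer side only has to wait until
every registered thread has passed through one of these points.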
>
> I think you are missing the fact that in the kernel we index RCU data
> structures by CPU number:
>
>   static void rcu_preempt_qs(int cpu)
>   {
>           struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
>
>   ...

s/per_cpu/__thread/

>   static void rcu_preempt_note_context_switch(int cpu)
>   {
>           struct task_struct *t = current;
>           unsigned long flags;
>           struct rcu_data *rdp;
>           struct rcu_node *rnp;
>
>           if (t->rcu_read_lock_nesting &&
>               (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
>
>                   /* Possibly blocking in an RCU read-side critical section. */
>                   rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
>
>   ...
>
> Which could be changed over to be per task in user-space by treating
> the TID as a 'virtual CPU' equivalent.
>
> This probably lengthens worst-case rcu_sync() latencies rather
> significantly though - possibly turning urcu into a
> stop_machine_run() equivalent in the worst-case. (but i could be
> wrong about this last bit)

I believe you are. urcu does stress scaling, since it's much easier to
add tasks than it is to add cpus, but it's conceptually the same
problem.

-- 
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
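
To illustrate the s/per_cpu/__thread/ point above, a toy sketch of
grace period tracking indexed by thread instead of by CPU - all names
(reader_state, toy_synchronize, MAX_READERS, ...) are made up, and this
is not urcu's actual implementation; registration locking vs. the
writer side is hand-waved:

/*
 * Toy sketch of "s/per_cpu/__thread/": each reader thread keeps its own
 * grace period state, and the writer waits for every registered thread.
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_READERS 64

struct reader_state {
        atomic_ulong gp;                /* last grace period this thread observed */
};

static struct reader_state *readers[MAX_READERS];
static int nr_readers;
static atomic_ulong global_gp = 1;      /* current grace period number */
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* The per-task equivalent of the kernel's per-CPU rcu_data. */
static __thread struct reader_state self;

static void reader_register(void)
{
        pthread_mutex_lock(&reg_lock);
        readers[nr_readers++] = &self;
        pthread_mutex_unlock(&reg_lock);
}

/* Called by a reader whenever it is outside a read-side critical section. */
static void reader_quiescent(void)
{
        atomic_store(&self.gp, atomic_load(&global_gp));
}

/* Writer side: start a new grace period, wait for every reader to see it. */
static void toy_synchronize(void)
{
        unsigned long gp = atomic_fetch_add(&global_gp, 1) + 1;

        for (int i = 0; i < nr_readers; i++)
                while (atomic_load(&readers[i]->gp) < gp)
                        sched_yield();  /* worst case grows with the thread count */
}

The writer loop makes the latency trade-off concrete: the wait covers
nr_threads entries rather than nr_cpus, which is where the worst-case
concern above comes from, but each thread only has to pass a single
quiescent point per grace period.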