From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
Date: Sun, 29 May 2011 10:54:13 +0300
Message-ID: <4DE1FBA5.6080905@redhat.com>
References: <20110526180518.GA3572@elte.hu> <4DDE97CE.4000302@redhat.com>
 <20110526202531.GA2765@elte.hu> <20110526230508.GA15983@Krystal>
 <20110527102533.GA24608@elte.hu> <20110527110729.GA26920@elte.hu>
 <4DE13AF0.2080001@redhat.com> <20110528183259.GA15019@elte.hu>
 <4DE1EA93.6040401@redhat.com> <20110529073550.GA21254@elte.hu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Mathieu Desnoyers, Pekka Enberg, Sasha Levin, john@jfloren.net,
 kvm@vger.kernel.org, asias.hejun@gmail.com, gorcunov@gmail.com,
 prasadjoshi124@gmail.com, "Paul E. McKenney"
To: Ingo Molnar
Return-path: 
Received: from mx1.redhat.com ([209.132.183.28]:51206 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751550Ab1E2Hyy
 (ORCPT ); Sun, 29 May 2011 03:54:54 -0400
In-Reply-To: <20110529073550.GA21254@elte.hu>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On 05/29/2011 10:35 AM, Ingo Molnar wrote:
> * Avi Kivity wrote:
>
> > On 05/28/2011 09:32 PM, Ingo Molnar wrote:
> > > * Avi Kivity wrote:
> > >
> > >> > So if you set a notification signal via fcntl(F_SETOWN) on the
> > >> > scheduler context switch event fd, the user-space RCU code will
> > >> > get a signal on every context switch.
> > >>
> > >> Context switches are completely uninteresting for userspace rcu:
> > >>
> > >>     rcu_read_lock();
> > >>     ---> context switch
> > >>
> > >> have we learned anything from that? no. User code is always
> > >> preemptible and migratable. If rcu_read_lock() prevented migration
> > >> somehow, then we'd know that a context switch means we've started a
> > >> grace period for this thread. But it doesn't, so we don't.
> > >
> > > Well, in the next mail i mentioned that we can do migration events as
> > > well, which would be useful: instead of having to keep track of
> > > nr_tasks RCU grace periods we could simplify it down to nr_cpus.
> >
> > I don't see how a migration event helps. It is completely
> > transparent from the task's point of view.
>
> It's not transparent at all if you index RCU data structures by the
> current CPU index, which the kernel implementation does.

But that's completely broken for userspace. The "current cpu index"
doesn't even exist, since you can't disable preemption.

> Doing that has the advantage of being much more cache-compressed than
> the TID index,

If you have more tasks than cpus; which isn't a given.

> and also having better worst-case grace period latency
> properties than a TID index.
>
> > > But if we indexed by the TID then we wouldnt need any scheduler
> > > bindings at all - this is the simpler approach.
> >
> > Yes, and it maps 1:1 to the kernel implementation (cpu = task).
>
> No, the kernel indexes grace period tracking (and the
> write-completion queues) by CPU index.

Do a conceptual #define cpu task and it all works out.

> > >> What's needed are explicit notifications about grace periods. For
> > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point.
> > >> For I/O threads, completion of processing of an event is also a
> > >> good point.
> > >
> > > Grace period notifications are needed too, obviously.
> >
> > I'd think they're sufficient, no? Is something else needed?
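As a sketch of what such an explicit notification could look like -
this assumes liburcu's QSBR flavour (urcu-qsbr.h), and vcpu_fd /
handle_exit() are made-up placeholders, not the actual kvm tools code:

/*
 * Sketch only: a vcpu thread announces a quiescent state every time it
 * comes back from KVM_RUN.  Assumes liburcu's QSBR flavour.
 */
#include <urcu-qsbr.h>          /* rcu_register_thread(), rcu_quiescent_state() */
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stddef.h>

static void handle_exit(int vcpu_fd)
{
        /* placeholder: look at the exit reason, emulate mmio/pio, ... */
        (void)vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
        int vcpu_fd = *(int *)arg;
        int ret;

        rcu_register_thread();  /* make this thread known to synchronize_rcu() */

        for (;;) {
                rcu_thread_offline();   /* guest may run for a long time */
                ret = ioctl(vcpu_fd, KVM_RUN, 0);
                rcu_thread_online();

                if (ret < 0)
                        break;

                /*
                 * Back in userspace, outside any read-side critical
                 * section: this is the explicit grace period notification.
                 */
                rcu_quiescent_state();

                handle_exit(vcpu_fd);
        }

        rcu_unregister_thread();
        return NULL;
}

With that, synchronize_rcu() on the writer side only has to wait until
every registered thread has passed through one of these points.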
>
> I think you are missing the fact that in the kernel we index RCU data
> structures by CPU number:
>
>   static void rcu_preempt_qs(int cpu)
>   {
>           struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
>
>   ...

s/per_cpu/__thread/

>   static void rcu_preempt_note_context_switch(int cpu)
>   {
>           struct task_struct *t = current;
>           unsigned long flags;
>           struct rcu_data *rdp;
>           struct rcu_node *rnp;
>
>           if (t->rcu_read_lock_nesting &&
>               (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
>
>                   /* Possibly blocking in an RCU read-side critical section. */
>                   rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
>
>   ...
>
> Which could be changed over to be per task in user-space by treating
> the TID as a 'virtual CPU' equivalent.
>
> This probably lengthens worst-case rcu_sync() latencies rather
> significantly though - possibly turning urcu into a
> stop_machine_run() equivalent in the worst-case. (but i could be
> wrong about this last bit)

I believe you are. urcu does stress scaling, since it's much easier to
add tasks than it is to add cpus, but it's conceptually the same
problem.

-- 
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
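
To illustrate the s/per_cpu/__thread/ point above, a toy sketch of
grace period tracking indexed by thread instead of by CPU - all names
(reader_state, toy_synchronize, MAX_READERS, ...) are made up, and this
is not urcu's actual implementation; registration locking vs. the
writer side is hand-waved:

/*
 * Toy sketch of "s/per_cpu/__thread/": each reader thread keeps its own
 * grace period state, and the writer waits for every registered thread.
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_READERS 64

struct reader_state {
        atomic_ulong gp;                /* last grace period this thread observed */
};

static struct reader_state *readers[MAX_READERS];
static int nr_readers;
static atomic_ulong global_gp = 1;      /* current grace period number */
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* The per-task equivalent of the kernel's per-CPU rcu_data. */
static __thread struct reader_state self;

static void reader_register(void)
{
        pthread_mutex_lock(&reg_lock);
        readers[nr_readers++] = &self;
        pthread_mutex_unlock(&reg_lock);
}

/* Called by a reader whenever it is outside a read-side critical section. */
static void reader_quiescent(void)
{
        atomic_store(&self.gp, atomic_load(&global_gp));
}

/* Writer side: start a new grace period, wait for every reader to see it. */
static void toy_synchronize(void)
{
        unsigned long gp = atomic_fetch_add(&global_gp, 1) + 1;

        for (int i = 0; i < nr_readers; i++)
                while (atomic_load(&readers[i]->gp) < gp)
                        sched_yield();  /* worst case grows with the thread count */
}

The writer loop makes the latency trade-off concrete: the wait covers
nr_threads entries rather than nr_cpus, which is where the worst-case
concern above comes from, but each thread only has to pass a single
quiescent point per grace period.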