From: Anthony Liguori
Date: Fri, 22 Jun 2012 17:56:30 -0500
Subject: Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
To: Jan Kiszka
Cc: "qemu-devel@nongnu.org", liu ping fan, Stefan Hajnoczi

On 06/22/2012 05:27 PM, Jan Kiszka wrote:
> On 2012-06-22 23:44, Anthony Liguori wrote:
>> 1) unlock iothread before entering the do {} loop in kvm_cpu_exec()
>>    a) reacquire the lock after the loop
>>    b) reacquire the lock in kvm_handle_io()
>>    c) introduce an unlocked memory accessor that, for now, just requires
>>       the iothread lock and calls cpu_physical_memory_rw()
>
> Right, that's what we have here as well. The latter is modeled as a
> so-called "I/O pathway", a thread-based execution context for
> frontend/backend pairs with some logic to transfer certain I/O requests
> asynchronously to the pathway thread.

Interesting, so the VCPU threads always hold the iothread mutex, but some
requests are routed to other threads?  I hadn't considered a design like
that.

I've been thinking about a longer-term architecture that's a bit more
invasive.  What we think of as the I/O thread today wouldn't be special.
It would be one of N I/O threads, all running separate copies of the main
loop.  All of the functions that defer dispatch to a main loop would take
a context as an argument, and devices would essentially take a "vector"
array of main loops as input.

So virtio-net would probably have two main loop "vectors", since it would
like to schedule tx and rx independently.  There's nothing that says you
can't pass the same main loop context for each vector, but that's a
configuration choice.

Dispatch from VCPU context would behave the same as it does today, but
obviously per-device locking is needed.

> The tricky part was to get nested requests right, i.e. when a request
> triggers another one from within the device model. This is where things
> get ugly. In theory, you can end up with a VM deadlock if you just apply
> per-device locking. I'm currently trying to rebase our patches, review
> and document the logic behind it.

I really think the only way to solve this is to separate map()'d DMA
access (where the device really wants to deal with RAM only) from
copy-based access (where devices map DMA to other devices).  For
copy-based access, we really ought to move to a callback-based API.  It
adds quite a bit of complexity, but it's really the only way to solve the
problem robustly.
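To make that a bit more concrete, the copy-based side could look something
along these lines.  This is purely a sketch: none of these names exist in
the tree, the "implementation" just takes the iothread lock around
cpu_physical_memory_rw() until real per-region dispatch exists, and it
assumes the caller does not already hold the iothread lock:

/* Hypothetical sketch only, not an existing QEMU interface.  A copy-based
 * DMA access completes through a callback instead of returning
 * synchronously, so a device never calls into another device while
 * holding its own lock. */

typedef void DMACopyCompletionFunc(void *opaque, int ret);

typedef struct DMACopyRequest {
    target_phys_addr_t addr;      /* guest-physical address */
    uint8_t *buf;                 /* caller-owned bounce buffer */
    int len;
    int is_write;
    DMACopyCompletionFunc *cb;    /* invoked once the copy has finished */
    void *opaque;
    QEMUBH *bh;
} DMACopyRequest;

static void dma_copy_complete(void *opaque)
{
    DMACopyRequest *req = opaque;

    qemu_bh_delete(req->bh);
    req->cb(req->opaque, 0);
}

/* For now: do the copy under the iothread lock via cpu_physical_memory_rw()
 * and defer the completion to a bottom half.  Later, the copy itself would
 * be handed to whichever main loop owns the target region. */
void dma_copy_submit(DMACopyRequest *req)
{
    qemu_mutex_lock_iothread();
    cpu_physical_memory_rw(req->addr, req->buf, req->len, req->is_write);
    qemu_mutex_unlock_iothread();

    req->bh = qemu_bh_new(dma_copy_complete, req);
    qemu_bh_schedule(req->bh);
}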
>> 2) focus initially on killing the lock in kvm_handle_io()
>>    a) the ioport table is pretty simplistic, so adding fine-grained
>>       locking won't be hard
>>    b) reacquire the lock right before ioport dispatch
>>
>> 3) allow for registering ioport handlers w/o the dispatch function
>>    carrying an iothread lock
>>    a) this is mostly memory API plumbing
>
> We skipped this as our NICs didn't do PIO, but you clearly need it for
> virtio.

Right.

>> 4) focus on going back and adding fine-grained locking to the
>>    cpu_physical_memory_rw() accessor
>
> In the end, PIO and MMIO should use the same patterns - and will face
> the same challenges. Ideally, we model things very similarly right from
> the start.

Yes.

> And then there is also
>
> 5) provide direct IRQ delivery from the device model to the IRQ chip.
> That's much like what we need for VFIO and KVM device assignment. But
> here we won't be able to cheat and ignore correct generation of vmstates
> of the bypassed PCI host bridges etc.  Which leads me to that other
> thread about how to handle this for PCI device pass-through.
> Contributions to that discussion are welcome as well.

I think you mean direct delivery to the in-kernel IRQ chip.  I'm still
thinking about this, so I don't have a plan yet that I'm ready to share.
I have some ideas, though.

>> Note that whenever possible, we should be using rwlocks instead of a
>> normal mutex.  In particular, for the ioport data structures, a rwlock
>> seems pretty obvious.
>
> I think we should mostly be fine with a "big hammer" rwlock: unlocked
> read access from VCPUs and iothreads, and vmstop/resume around
> modifications of fast-path data structures (like the memory region
> hierarchy or the PIO table).

Ack.

> Where that's not sufficient, RCU will be needed. Sleeping rwlocks have
> horrible semantics (specifically when thread priorities come into play)
> and are performance-wise inferior. We should avoid them completely.

Yes, I think RCU is inevitable here, but I think starting with rwlocks
will help with the big refactoring.

>> To be clear, I'm not advocating introducing cpu_lock.  We should do
>> whatever makes the most sense to not have to hold the iothread lock
>> while processing an exit from KVM.
>
> Good that we agree. :)
>
>> Note that this is an RFC; the purpose of this series is to have this
>> discussion :-)
>
> Yep, I think we have it now ;). Hope I can contribute some code bits to
> it soon, though I didn't schedule this task for the next week.

Great!  If you have something you can share, I'd be eager to look at it
regardless of the condition of the code.

Regards,

Anthony Liguori
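P.S. For (2), the interim "coarse rwlock now, RCU later" shape I have in
mind for the PIO table is roughly the following.  Again, just a sketch with
made-up names on a plain pthread rwlock, not the real ioport code:

/* Rough sketch only.  Dispatch paths (VCPU threads, iothreads) take the
 * lock for reading, so lookups can run in parallel without the big lock;
 * handler registration is rare and takes the write lock (or happens with
 * the VM stopped).  Handlers must not (un)register ports from within
 * dispatch, or the read lock would be held across the write lock. */
#include <pthread.h>
#include <stdint.h>

#define MAX_IOPORTS 65536

typedef uint32_t (IOPortReadFunc)(void *opaque, uint32_t addr);

static pthread_rwlock_t ioport_lock = PTHREAD_RWLOCK_INITIALIZER;
static IOPortReadFunc *ioport_read_table[MAX_IOPORTS];
static void *ioport_opaque[MAX_IOPORTS];

uint32_t ioport_read(uint32_t addr)
{
    uint32_t val = (uint32_t)-1;     /* "no device" default */

    pthread_rwlock_rdlock(&ioport_lock);
    if (ioport_read_table[addr]) {
        val = ioport_read_table[addr](ioport_opaque[addr], addr);
    }
    pthread_rwlock_unlock(&ioport_lock);

    return val;
}

void ioport_register_read(uint32_t addr, IOPortReadFunc *func, void *opaque)
{
    pthread_rwlock_wrlock(&ioport_lock);
    ioport_read_table[addr] = func;
    ioport_opaque[addr] = opaque;
    pthread_rwlock_unlock(&ioport_lock);
}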