From: Jan Kiszka
Date: Sat, 23 Jun 2012 00:27:47 +0200
Subject: Re: [Qemu-devel] [RFC] use little granularity lock to substitute qemu_mutex_lock_iothread
To: Anthony Liguori
Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 2012-06-22 23:44, Anthony Liguori wrote:
> On 06/22/2012 04:14 PM, Jan Kiszka wrote:
>> On 2012-06-22 22:11, Anthony Liguori wrote:
>>> On 06/22/2012 05:37 AM, Jan Kiszka wrote:
>>>> On 2012-06-22 12:24, liu ping fan wrote:
>>>>> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka wrote:
>>>>>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>>>>>> Nowadays, we use
>>>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to protect
>>>>>>> against races on the emulated devices accessed by vcpu threads &
>>>>>>> the iothread.
>>>>>>>
>>>>>>> But this lock is too big. We can break it down. These patches
>>>>>>> separate the CPUArchState's protection from the other devices, so
>>>>>>> we can have a per-cpu lock for each CPUArchState, not the big
>>>>>>> lock any longer.
>>>>>>
>>>>>> Anything that reduces lock dependencies is generally welcome. But
>>>>>> can you specify in more detail what you gain, and under which
>>>>>> conditions?
>>>>>
>>>>> In fact, there are several steps to break down the QEMU big lock.
>>>>> This step just aims to shrink the code area protected by
>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
>>>>> working on the following steps, which focus on breaking down the
>>>>> big lock when calling handle_{io,mmio}.
>>>>
>>>> Then let us discuss the strategy. This is important as it is
>>>> unrealistic to break up the lock for all code paths. We really need
>>>> to focus on goals that provide benefits for relevant use cases.
>>>
>>> Stefan put together a proof of concept that implemented the
>>> data-plane portion of virtio-blk in a separate thread. This is
>>> possible because of I/O eventfd (we were able to select() on that fd
>>> in a separate thread).
>>>
>>> The performance difference between virtio-blk-pci and
>>> virtio-blk-data-plane is staggering when dealing with a very large
>>> storage system.
>>>
>>> So we'd like to get the infrastructure in place where we can start
>>> multithreading devices in QEMU so we can integrate this work.
>>
>> Can you name the primary bits? We really need to see the whole
>> picture before adding new locks. They alone are not the solution.
>
> Sorry, not sure what you mean by "the primary bits".
>
> At a high level, the plan is to:
>
> 1) unlock the iothread before entering the do {} loop in kvm_cpu_exec()
>    a) reacquire the lock after the loop
>    b) reacquire the lock in kvm_handle_io()
>    c) introduce an unlocked memory accessor that, for now, just takes
>       the iothread lock and calls cpu_physical_memory_rw()

Right, that's what we have here as well. The latter is modeled as a
so-called "I/O pathway", a thread-based execution context for
frontend/backend pairs with some logic to transfer certain I/O requests
asynchronously to the pathway thread.

The tricky part was to get nested requests right, i.e. when a request
triggers another one from within the device model. This is where things
get ugly. In theory, you can end up with a vm deadlock if you just
apply per-device locking. I'm currently trying to rebase our patches,
review and document the logic behind it.

> 2) focus initially on killing the lock in kvm_handle_io()
>    a) the ioport table is pretty simplistic, so adding fine-grained
>       locking won't be hard
>    b) reacquire the lock right before ioport dispatch
>
> 3) allow registering ioport handlers w/o the dispatch function
>    carrying the iothread lock
>    a) this is mostly memory API plumbing

We skipped this as our NICs didn't do PIO, but you clearly need it for
virtio.

> 4) focus on going back and adding fine-grained locking to the
>    cpu_physical_memory_rw() accessor

In the end, PIO and MMIO should use the same patterns - and will face
the same challenges. Ideally, we model things very similarly right from
the start.

And then there is also

5) provide direct IRQ delivery from the device model to the IRQ chip

That's much like what we need for VFIO and KVM device assignment. But
here we won't be able to cheat and ignore correct generation of
vmstates of the bypassed PCI host bridges etc. Which leads me to that
other thread about how to handle this for PCI device pass-through.
Contributions to that discussion are welcome as well.
>
> Note that whenever possible, we should be using rwlocks instead of a
> normal mutex. In particular, for the ioport data structures, a rwlock
> seems pretty obvious.

I think we should mostly be fine with a "big hammer" rwlock: unlocked
read access from VCPUs and iothreads, and vmstop/resume around
modifications of fast-path data structures (like the memory region
hierarchy or the PIO table). Where that's not sufficient, RCU will be
needed.

Sleeping rwlocks have horrible semantics (specifically when thread
priorities come into play) and are performance-wise inferior. We should
avoid them completely.

>>> The basic plan is to introduce granular locking starting at the KVM
>>> dispatch level until we can get to MemoryRegion dispatch. We'll then
>>> have some way to indicate that a MemoryRegion's callbacks should be
>>> invoked without holding the qemu global mutex.
>>
>> I don't disagree, but this end really looks like starting at the
>> wrong edge. The changes are not isolated and surely not yet correct
>> (run_on_cpu is broken for tcg, e.g.).
>>
>> Then, none of this locking should be needed for in-kernel irqchips.
>> All touched states are thread-local or should be modifiable
>> atomically - if not, let's fix *that*, it's more beneficial.
>>
>> Actually, cpu_lock is counterproductive as it adds locking ops to a
>> path where we will not need them later on in the normal
>> configuration. Userspace irqchip is a slow path and perfectly fine
>> to handle under BQL. So is VCPU control (pause/resume/run-on). It's
>> better to focus on the fast path first.
>
> To be clear, I'm not advocating introducing cpu_lock. We should do
> whatever makes the most sense to not have to hold the iothread lock
> while processing an exit from KVM.

Good that we agree. :)

> Note that this is an RFC; the purpose of this series is to have this
> discussion :-)

Yep, I think we have it now ;).
Hope I can contribute some code bits to this soon, though I haven't
scheduled that task for the coming week.

Jan