Message-ID: <507405B5.4060108@redhat.com>
Date: Tue, 09 Oct 2012 13:08:37 +0200
From: Paolo Bonzini
MIME-Version: 1.0
References: <1348577763-12920-1-git-send-email-pbonzini@redhat.com>
 <20121008113932.GB16332@stefanha-thinkpad.redhat.com>
 <5072CE54.8020208@redhat.com>
 <20121009090811.GB13775@stefanha-thinkpad.redhat.com>
 <5073EDB3.3020804@redhat.com>
 <5073FE3A.1090903@redhat.com>
 <507401D8.8090203@redhat.com>
In-Reply-To: <507401D8.8090203@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Block I/O outside the QEMU global mutex was "Re: [RFC PATCH 00/17] Support for multiple "AIO contexts""
To: Avi Kivity
Cc: Kevin Wolf, Stefan Hajnoczi, Anthony Liguori, Ping Fan Liu, qemu-devel@nongnu.org

On 09/10/2012 12:52, Avi Kivity wrote:
> On 10/09/2012 12:36 PM, Paolo Bonzini wrote:
>> On 09/10/2012 11:26, Avi Kivity wrote:
>>> On 10/09/2012 11:08 AM, Stefan Hajnoczi wrote:
>>>> Here are the steps that have been mentioned:
>>>>
>>>> 1. aio fastpath - for raw-posix and other aio block drivers, can we reduce I/O
>>>> request latency by skipping block layer coroutines?
>>>
>>> Is coroutine overhead noticeable?
>>
>> I'm thinking more about throughput than latency.  If the iothread
>> becomes CPU-bound, then everything is noticeable.
>
> That's not strictly a coroutine issue.  Switching to ordinary threads
> may make the problem worse, since there will clearly be contention.

The point is that you don't need either coroutines or userspace threads
if you use native AIO.  longjmp/setjmp is probably a smaller overhead
than the many syscalls involved in poll + eventfd reads + io_submit +
io_getevents, but it's not cheap either.

Also, if you process AIO in batches you risk overflowing the pool of
free coroutines, which gets expensive real fast (allocate/free the
stack, do the expensive getcontext/swapcontext instead of the cheaper
longjmp/setjmp, etc.).  It seems better to sidestep the issue
completely; it's a small amount of work.

> What is the I/O processing time we have?  If it's say 10 microseconds,
> then we'll have 100,000 context switches per second assuming a device
> lock and a saturated iothread (split into multiple threads).

Hopefully, with a saturated dedicated iothread you would not have any
context switches at all, and a single CPU would simply be dedicated to
virtio processing.

> The coroutine work may have laid the groundwork for fine-grained
> locking.  I'm doubtful we should use qcow when we want >100K IOPS though.

Yep.  Going away from coroutines is a solution in search of a problem;
it would introduce several new variables (kernel scheduling, more
expensive lock contention, starving the thread pool with locked
threads, ...), all for a case where performance hardly matters.
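
To put the comparison in concrete terms, the poll + eventfd read +
io_submit + io_getevents chain looks roughly like the following with
plain Linux AIO.  This is a minimal, untested sketch (link with -laio);
the file name, sizes and queue depth are made up and error handling is
omitted, but the libaio calls themselves are the standard ones.

    /* Native AIO sketch: submit a read, wait on the eventfd, reap the
     * completion directly -- no coroutine or userspace thread involved. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <sys/eventfd.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define MAX_EVENTS 128

    int main(void)
    {
        io_context_t ctx = 0;
        int efd = eventfd(0, EFD_CLOEXEC);
        int fd = open("disk.img", O_RDONLY | O_DIRECT);  /* made-up image */
        struct iocb iocb;
        struct iocb *iocbs[1] = { &iocb };
        struct io_event events[MAX_EVENTS];
        struct pollfd pfd;
        uint64_t n;
        void *buf;
        int done;

        io_setup(MAX_EVENTS, &ctx);
        posix_memalign(&buf, 512, 4096);

        /* Submit one read; completion is signalled through the eventfd. */
        io_prep_pread(&iocb, fd, buf, 4096, 0);
        io_set_eventfd(&iocb, efd);
        io_submit(ctx, 1, iocbs);

        /* Event-loop side: poll the eventfd, drain it, reap completions. */
        pfd.fd = efd;
        pfd.events = POLLIN;
        poll(&pfd, 1, -1);
        read(efd, &n, sizeof(n));
        done = io_getevents(ctx, 1, MAX_EVENTS, events, NULL);
        /* events[0..done-1] can be completed right here. */

        io_destroy(ctx);
        close(efd);
        close(fd);
        free(buf);
        return done < 0;
    }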

>>>> I'm also curious about virtqueue_pop()/virtqueue_push() outside the QEMU mutex
>>>> although that might be blocked by the current work around MMIO/PIO dispatch
>>>> outside the global mutex.
>>>
>>> It is, yes.
>>
>> It should only require unlocked memory map/unmap, not MMIO dispatch.
>> The MMIO/PIO bits are taken care of by ioeventfd.
>
> The ring, or indirect descriptors, or the data, can all be on mmio.
> IIRC the virtio spec forbids that, but the APIs have to be general.  We
> don't have cpu_physical_memory_map_nommio() (or
> address_space_map_nommio(), as soon as the coding style committee
> ratifies struct literals).

cpu_physical_memory_map could still take the QEMU lock in the slow
bounce-buffer case; see the sketch of that locking scheme below.

BTW the block layer has been using struct literals for a long time, and
we're just as happy as you are about them. :)
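
To make the bounce-buffer point concrete, here is a self-contained toy
model of the idea.  It is not QEMU code: the pthread mutex stands in
for the global mutex, and guest_ram_ptr()/bounce_map() are invented
helpers, not QEMU APIs.  The point is only that the RAM-backed fast
path never touches the lock; only the rare non-RAM fallback does.

    /* Toy model: direct RAM mappings need no lock; only the bounce-buffer
     * fallback (MMIO or out-of-range addresses) takes the global mutex. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
    static uint8_t guest_ram[4096];             /* pretend guest RAM */

    /* Fast path: if the range is RAM-backed, return a direct pointer. */
    static void *guest_ram_ptr(uint64_t addr, uint64_t len)
    {
        if (addr + len <= sizeof(guest_ram)) {
            return guest_ram + addr;
        }
        return NULL;                            /* MMIO or out of range */
    }

    /* Slow path: copy through a bounce buffer while holding the lock. */
    static void *bounce_map(uint64_t addr, uint64_t len)
    {
        void *bounce;
        pthread_mutex_lock(&global_mutex);
        bounce = malloc(len);
        memset(bounce, 0, len);   /* stand-in for filling the buffer */
        pthread_mutex_unlock(&global_mutex);
        return bounce;
    }

    static void *map(uint64_t addr, uint64_t len, bool *bounced)
    {
        void *p = guest_ram_ptr(addr, len);
        *bounced = (p == NULL);
        return p ? p : bounce_map(addr, len);
    }

    int main(void)
    {
        bool bounced;
        void *p;

        p = map(0x100, 64, &bounced);     /* RAM-backed: lock never taken */
        printf("first map bounced: %s\n", bounced ? "yes" : "no");

        p = map(0x10000, 64, &bounced);   /* not RAM: bounce buffer + lock */
        printf("second map bounced: %s\n", bounced ? "yes" : "no");
        if (bounced) {
            free(p);
        }
        return 0;
    }

Paolo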