* [Qemu-devel] converting the block layer from coroutines to threads
From: Paolo Bonzini @ 2012-02-24 19:02 UTC
To: qemu-devel, Stefan Hajnoczi
Hi all,
a few weeks ago Stefan Hajnoczi pointed me to his work on virtio-blk
performance.
Stefan's work had two sides. First, he captured very nice performance
data of the block layer at
http://www.linux-kvm.org/page/Virtio/Block/Latency; second, in order to
measure peak performance, he basically implemented "vhost-blk" in
userspace. This second part adds a thread for each block device that is
activated via ioeventfd and converts virtio requests to AIO calls on the
host. It is quite restricted, as it only supports raw files and Linux
AIO, but it achieved performance improvements of IIRC 2-3x on big machines.
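For the curious, the per-device thread is conceptually just an eventfd
read loop; a minimal sketch, with every struct and helper name invented
for illustration:

#include <stdint.h>
#include <unistd.h>

struct vq_context {                /* hypothetical: one per device */
    int ioeventfd;                 /* eventfd that KVM signals on a kick */
    /* ... virtqueue state, AIO context ... */
};

void process_virtqueue(struct vq_context *ctx);   /* hypothetical helper */

static void *dataplane_thread(void *opaque)
{
    struct vq_context *ctx = opaque;
    uint64_t count;

    for (;;) {
        /* Blocks until the guest kicks the virtqueue; KVM signals the
         * eventfd directly, without a trip through the main iothread. */
        if (read(ctx->ioeventfd, &count, sizeof(count)) != sizeof(count)) {
            break;
        }
        process_virtqueue(ctx);    /* pop requests, io_submit() them */
    }
    return NULL;
}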
We talked a bit about how to generalize the work and bring it upstream,
and one idea was to make a fully multi-threaded block/AIO layer. This
is a prerequisite for processing devices in their own thread, but it
would also be an occasion to clean up several pieces of code, magically
add AIO support for Windows, and probably be a good step towards making
libblockformat.
And it turns out that multi-threading the block layer is not that difficult.
There are basically two parts in converting from coroutines to threads.
One is to protect shared data, and we do quite well here because we
already protect most bits with CoMutexes. The second is to remove
manual scheduling (CoQueues) and replace it with standard thread
synchronization primitives, such as condition variables and semaphores.
For the first part, there are relatively few pieces of data that are
shared by multiple coroutines and need to be protected by locks, namely
bottom halves and timers.
For the second part, CoQueues are used by the throttled and tracked
request lists. These lists also have to be protected by their own mutex.
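For example, where a coroutine now does qemu_co_queue_wait() on the
throttled list, the threaded version would block on a condvar instead; a
sketch (the lock/cond fields and the predicate are invented):

    /* today, in a coroutine: */
    qemu_co_queue_wait(&bs->throttled_reqs);

    /* threaded version: */
    qemu_mutex_lock(&bs->reqs_lock);
    while (bdrv_exceed_io_limits(bs)) {               /* hypothetical */
        qemu_cond_wait(&bs->throttled_cond, &bs->reqs_lock);
    }
    qemu_mutex_unlock(&bs->reqs_lock);

    /* ...and whoever completes a request does: */
    qemu_cond_broadcast(&bs->throttled_cond);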
I have made an attempt at it in my github repo's thread-blocks branch.
The ingredients are:
- a common representation of requests, based on BdrvTrackedRequest but
subsuming RwCo and other structs. This representation follows a request
from beginning to end. Lists of BdrvTrackedRequests replace the CoQueues;
- a generic threadpool, similar to the one in posix-aio-compat.c but
with extra services similar to coroutine enter/yield. These services
are just wrappers around condition variables, and do not prevent you
from using regular mutexes and condvars in the work items (a sketch
follows this list);
- an AIO fast path, used when none of the coroutine-enabled goodies are
active;
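To make the second ingredient concrete, the core of such a pool is just
a mutex, a condvar and a queue; a minimal sketch using plain pthreads
(names invented, error handling and pool setup omitted):

#include <pthread.h>
#include <stddef.h>

typedef struct WorkItem {
    void (*fn)(struct WorkItem *item); /* runs synchronously in a worker */
    struct WorkItem *next;
} WorkItem;

typedef struct ThreadPool {
    pthread_mutex_t lock;
    pthread_cond_t  has_work;
    WorkItem       *queue;             /* linked list of pending items */
} ThreadPool;

static void *worker(void *opaque)
{
    ThreadPool *pool = opaque;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->queue == NULL) {
            pthread_cond_wait(&pool->has_work, &pool->lock);
        }
        WorkItem *item = pool->queue;
        pool->queue = item->next;
        pthread_mutex_unlock(&pool->lock);
        item->fn(item);                /* e.g. a synchronous preadv() */
    }
    return NULL;
}

The enter/yield-like services are then nothing more than extra condvars
that a work item can block on and that the iothread can signal.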
A work item in the thread-pool replaces a coroutine and, unlike
coroutines, executes completely asynchronously with respect to the
iothread and VCPU threads. posix-aio-compat code is replaced by
synchronous entry-points into the "file" driver. An interesting point
is that (unlike coroutine-gthread) there is hardly any overhead from
removing the coroutines, because:
- when using the raw format, more or less the same code is executed,
only in raw-posix.c rather than in posix-aio-compat.c;
- when using qcow2, file I/O will execute synchronously in the same work
item that is already being used for the format code. So format I/O is
more expensive to start, but this is compensated completely by cheaper
protocol I/O. There can be slowdowns in special cases such as reading
from non-allocated clusters, of course.
qcow2 is the only format that uses CoQueues internally. These are
replaced by a condition variable.
rbd, iSCSI and other aio-based protocols will require new locking, but
are otherwise not a problem.
NBD and sheepdog have to be almost rewritten, but I expect the result to
be simpler because they can mostly use blocking I/O instead of
coroutines. Everything else works with s/CoMutex/QemuMutex/;
s/co_mutex/mutex/.
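The mechanical part really is mechanical; for a typical format driver
the conversion is just:

    /* before */                          /* after */
    CoMutex lock;                         QemuMutex lock;
    qemu_co_mutex_lock(&s->lock);         qemu_mutex_lock(&s->lock);
    qemu_co_mutex_unlock(&s->lock);       qemu_mutex_unlock(&s->lock);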
When using raw on AIO, the QEMU thread pool can be bypassed completely
and I/O can be submitted directly from the iothread. Stefan measured
the cost of an io_submit to be comparable to the cost of waking up a
thread in posix-aio-compat.c, and the same holds for my generic thread pool.
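The fast path then boils down to a single io_submit call; a minimal
libaio sketch (error handling and completion via io_getevents omitted):

#include <libaio.h>
#include <sys/uio.h>

int submit_direct(io_context_t ctx, struct iocb *iocb, int fd,
                  struct iovec *iov, int iovcnt, long long offset)
{
    struct iocb *iocbs[1] = { iocb };

    io_prep_preadv(iocb, fd, iov, iovcnt, offset);
    return io_submit(ctx, 1, iocbs);   /* one syscall, no thread wakeup */
}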
Except for fixing non-file protocols (and testing and debugging of
course), this is where I'm at; code is in branch thread-blocks at
git://github.com/bonzini/qemu.git. It boots a raw image, so it cannot
be that bad! :) I didn't test any of throttling or copy-on-read.
However, the changes are largely mechanical and can be posted in
fairly small chunks.
Anyhow, with this in place, aio.c can be rethought and generalized.
Interaction with other threads and the system can occur in terms of
EventNotifiers, so that portability to Windows falls out almost
automatically just by porting those. The main improvement then would be
separate contexts for asynchronous I/O, so that it is possible to have
per-device threads as in Stefan's original proof of concept.
Thoughts?
Paolo
* Re: [Qemu-devel] converting the block layer from coroutines to threads
From: Anthony Liguori @ 2012-02-24 19:22 UTC
To: Paolo Bonzini; +Cc: Michael S. Tsirkin, qemu-devel, Stefan Hajnoczi
On 02/24/2012 01:02 PM, Paolo Bonzini wrote:
> Hi all,
>
> a few weeks ago Stefan Hajnoczi pointed me to his work on virtio-blk
> performance.
>
> Stefan's work had two sides. First, he captured very nice performance
> data of the block layer at
> http://www.linux-kvm.org/page/Virtio/Block/Latency; second, in order to
> measure peak performance, he basically implemented "vhost-blk" in
> userspace.
I don't think the improvements here have anything to do with the block layer.
We've done the same thing with virtio-net and seen impressive performance results
as a consequence. Conversely, we see a similar improvement by applying the same
technique to vhost-net.
Virtio really wants each virtqueue to be processed in a separate thread. On a
multicore system, there's considerable improvement doing this.
I think that's where we ought to start. We really just need the block layer to
be re-entrant; we don't actually need qcow2 or anything else that uses
coroutines to use full threads.
Or at least, as far as I know, we don't have any performance data to suggest
that we do.
Regards,
Anthony Liguori
* Re: [Qemu-devel] converting the block layer from coroutines to threads
From: Paolo Bonzini @ 2012-02-24 20:43 UTC
To: Anthony Liguori; +Cc: qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin
On 02/24/2012 08:22 PM, Anthony Liguori wrote:
> Virtio really wants each virtqueue to be processed in a separate
> thread. On a multicore system, there's considerable improvement doing
> this. I think that's where we ought to start.
Well, that's where we ought to *get*. Stefan's work is awesome but with
the current feature set it would be hard to justify it upstream.
To get it upstream we need to generalize it and make it work well with
the block layer. And vice versa make the block layer work well with
threads, which is what I care about here.
> We really just need the block layer to be re-entrant; we don't
> actually need qcow2 or anything else that uses coroutines to use full
> threads.
Once you can issue I/O from two threads at the same time (such as
streaming in the iothread and guest I/O in the virtqueue thread),
everything already needs to be thread-safe. It is a pretty short step
from there to thread pools for everything.
> Or at least, as far as I know, we don't have any performance data to
> suggest that we do.
No, it's not about speed, though of course it only works if there is no
performance dip. It is just an enabling step.
That said, my weekend officially begins now. :)
Paolo
* Re: [Qemu-devel] converting the block layer from coroutines to threads
From: Anthony Liguori @ 2012-02-24 21:01 UTC
To: Paolo Bonzini; +Cc: Michael S. Tsirkin, qemu-devel, Stefan Hajnoczi
On 02/24/2012 02:43 PM, Paolo Bonzini wrote:
> On 02/24/2012 08:22 PM, Anthony Liguori wrote:
>> Virtio really wants each virtqueue to be processed in a separate
>> thread. On a multicore system, there's considerable improvement doing
>> this. I think that's where we ought to start.
>
> Well, that's where we ought to *get*. Stefan's work is awesome but with
> the current feature set it would be hard to justify it upstream.
>
> To get it upstream we need to generalize it and make it work well with
> the block layer. And vice versa make the block layer work well with
> threads, which is what I care about here.
>
>> We really just need the block layer to be re-entrant; we don't
>> actually need qcow2 or anything else that uses coroutines to use full
>> threads.
>
> Once you can issue I/O from two threads at the same time (such as
> streaming in the iothread and guest I/O in the virtqueue thread),
> everything already needs to be thread-safe. It is a pretty short step
> from there to thread pools for everything.
If you start with a thread safe API for submitting block requests, that could be
implemented as

bapi_aiocb *bapi_submit_readv(bapi_driver *d, struct iovec *iov, int iovcnt,
                              off_t offset)
{
    bapi_request *req = make_bapi_request(BAPI_READ, iov, iovcnt, offset);

    return bapi_queue_add_req(req);
}
Which would schedule the I/O thread to actually implement the operation. You
could then start incrementally refactoring specific drivers to be re-entrant
(like linux-aio). But anything that already needs to use a thread pool to do
its I/O probably wouldn't benefit from threading virtio.
More importantly, the above would give you good performance to start with,
instead of refactoring a bunch of code hoping to eventually get to good performance.
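For concreteness, bapi_queue_add_req could itself be as small as this
sketch (all names still hypothetical, including the globals):

bapi_aiocb *bapi_queue_add_req(bapi_request *req)
{
    /* hand the request to the I/O thread and return a handle the
     * caller can poll or cancel; queue_lock, pending_reqs and
     * queue_has_work would be per-queue state */
    qemu_mutex_lock(&queue_lock);
    QTAILQ_INSERT_TAIL(&pending_reqs, req, next);
    qemu_cond_signal(&queue_has_work);  /* wake the I/O thread */
    qemu_mutex_unlock(&queue_lock);
    return &req->common;
}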
>
>> Or at least, as far as I know, we don't have any performance data to
>> suggest that we do.
>
> No, it's not about speed, though of course it only works if there is no
> performance dip. It is just an enabling step.
>
> That said, my weekend officially begins now. :)
Enjoy!!
Regards,
Anthony Liguori
>
> Paolo
>
* Re: [Qemu-devel] converting the block layer from coroutines to threads
From: Paolo Bonzini @ 2012-02-28 9:34 UTC
To: Anthony Liguori; +Cc: qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin
On 02/24/2012 10:01 PM, Anthony Liguori wrote:
>> Once you can issue I/O from two threads at the same time (such as
>> streaming in the iothread and guest I/O in the virtqueue thread),
>> everything already needs to be thread-safe. It is a pretty short step
>> from there to thread pools for everything.
>
> If you start with a thread safe API for submitting block requests, that
> could be implemented as
>
> bapi_aiocb *bapi_submit_readv(bapi_driver *d, struct iovec *iov, int
> iovcnt, off_t offset)
> {
>     bapi_request *req = make_bapi_request(BAPI_READ, iov, iovcnt, offset);
>
>     return bapi_queue_add_req(req);
> }
>
> Which would schedule the I/O thread to actually implement the
> operation. You could then start incrementally refactoring specific
> drivers to be re-entrant (like linux-aio).
My proposal has exactly these two ingredients: a representation of
block layer requests, and a fast path for AIO.
But there are really two complementary parts to it. One is generalizing
thread-pool support to non-raw formats, the other is doing I/O from
multiple threads at the same time. The first is quite easy overall.
The second is hard, because it's not just about reentrancy. You need
various pieces of infrastructure that do not yet exist; for example you
need freeze/unfreeze, because drain/drain_all is not enough if other
threads can submit I/O concurrently.
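Freeze would look roughly like this (a sketch; the lock, condvar and
flag are invented, tracked_requests is the existing list in block.c):

void bdrv_freeze(BlockDriverState *bs)
{
    qemu_mutex_lock(&bs->reqs_lock);
    bs->frozen = true;                     /* new submitters now block */
    while (!QLIST_EMPTY(&bs->tracked_requests)) {
        /* signalled whenever a tracked request completes */
        qemu_cond_wait(&bs->reqs_done, &bs->reqs_lock);
    }
    qemu_mutex_unlock(&bs->reqs_lock);
}

void bdrv_unfreeze(BlockDriverState *bs)
{
    qemu_mutex_lock(&bs->reqs_lock);
    bs->frozen = false;
    qemu_cond_broadcast(&bs->reqs_resume); /* wake blocked submitters */
    qemu_mutex_unlock(&bs->reqs_lock);
}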
> But anything that already needs to use a thread pool to do its I/O
> probably wouldn't benefit from threading virtio.
Linux AIO _is_ a thread-pool in the end. It is surprising how close the
latencies are between io_submit and cond_signal, for example.
Paolo