public inbox for kvm@vger.kernel.org
From: Badari Pulavarty <pbadari@us.ibm.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: kvm@vger.kernel.org
Subject: Re: [RFC] vhost-blk implementation
Date: Wed, 24 Mar 2010 13:22:37 -0700	[thread overview]
Message-ID: <4BAA748D.40509@us.ibm.com> (raw)
In-Reply-To: <20100324200402.GA22272@infradead.org>

Christoph Hellwig wrote:
>> Inspired by the vhost-net implementation, I did an initial prototype
>> of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
>> I haven't handled all the error cases, fixed naming conventions, etc.,
>> but the implementation is stable enough to play with. I tried not to
>> deviate from the vhost-net implementation where possible.
>>     
>
> Can you also send the qemu side of it?
>
>   
>> with vhost-blk:
>> ----------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
>>
>> real    2m6.137s
>> user    0m0.281s
>> sys     0m14.725s
>>
>> without vhost-blk: (virtio)
>> ---------------------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
>>
>> real    4m35.468s
>> user    0m0.373s
>> sys     0m48.074s
>>     
>
> Which caching mode is this?  I assume data=writeback, because otherwise
> you'd be doing synchronous I/O directly from the handler.
>   

Yes. This is with the default (writeback) cache model. As mentioned
earlier, readahead is helping here, and in most cases the data would
already be in the pagecache.
>   
>> +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
>> +			struct iovec *iov, int in)
>> +{
>> +	loff_t pos = sector << 9;	/* virtio-blk sectors are 512 bytes */
>> +	int ret = 0;
>> +
>> +	if (type & VIRTIO_BLK_T_FLUSH)  {
>> +		ret = vfs_fsync(file, file->f_path.dentry, 1);
>> +	} else if (type & VIRTIO_BLK_T_OUT) {
>> +		ret = vfs_writev(file, iov, in, &pos);
>> +	} else {
>> +		ret = vfs_readv(file, iov, in, &pos);
>> +	}
>> +	return ret;
>>     
>
> I have to admit I don't understand the vhost architecture at all, but
> where do the actual data pointers used by the iovecs reside?
> vfs_readv/writev expect both the iovec itself and the buffers
> pointed to by it to reside in userspace, so just using kernel buffers
> here will break badly on architectures with different user/kernel
> mappings.  A lot of this is fixable using simple set_fs & co tricks,
> but for direct I/O which uses get_user_pages even that will fail badly.
>   
The iovecs and buffers are user-space pointers (from the host kernel's
point of view); they are guest addresses. So I don't need to do any
set_fs tricks.
> Also it seems like you're doing all the I/O synchronous here?  For
> data=writeback operations that could explain the read speedup
> as you're avoiding context switches, but for actual write I/O
> which has to get data to disk (either directly from vfs_writev or
> later through vfs_fsync) this seems like a really bad idea stealing
> a lot of guest time that should happen in the background.
>   
Yes. QEMU virtio-blk batches up all the writes and hands the work off
to another thread; when the writes complete, it sends the status
completion. Since I am doing everything synchronously (even though the
writes only go to pagecache), one request at a time, that explains the
slowdown. We need to find a way to

1) batch I/O writes together
2) hand them off to another thread, so that the vhost thread can handle
   the next set of requests
3) update the status on completion

What should I do here? I can create a bunch of kernel threads to do the
I/O for me, or somehow fit in and reuse the AIO io_submit() mechanism.
What's the best way here? I hate to duplicate all the code VFS is doing.
>
> Other than that the code seems quite nice and simple, but one huge
> problem is that it'll only support raw images, and thus misses out
> on all the "nice" image formats used in qemu deployments, especially
> qcow2.  It's also missing the ioctl magic we're having in various
> places, both for controlling host devices like cdroms and SG
> passthrough.
>   
True... unfortunately, I don't understand all of those (qcow2) details
yet!! I need to read up on them before I can even comment :(

Thanks,
Badari




Thread overview: 36+ messages
2010-03-23  1:00 [RFC] vhost-blk implementation Badari Pulavarty
2010-03-23  1:16 ` Anthony Liguori
2010-03-23  1:45   ` Badari Pulavarty
2010-03-23  2:00     ` Anthony Liguori
2010-03-23  2:50       ` Badari Pulavarty
2010-03-23 10:05         ` Avi Kivity
2010-03-23 14:48           ` Badari Pulavarty
2010-03-23 10:03 ` Avi Kivity
2010-03-23 14:55   ` Badari Pulavarty
2010-03-23 16:53     ` Avi Kivity
2010-03-24 20:05   ` Christoph Hellwig
2010-03-25  6:29     ` Avi Kivity
2010-03-25 15:48       ` Christoph Hellwig
2010-03-25 15:51         ` Avi Kivity
2010-03-25 15:00     ` Asdo
2010-04-05 19:59       ` Christoph Hellwig
2010-04-07  0:36         ` [RFC] vhost-blk implementation (v2) Badari Pulavarty
2010-03-23 10:09 ` [RFC] vhost-blk implementation Eran Rom
2010-03-24 20:04 ` Christoph Hellwig
2010-03-24 20:22   ` Badari Pulavarty [this message]
2010-03-25  7:57     ` Avi Kivity
2010-03-25 14:36       ` Badari Pulavarty
2010-03-25 15:57     ` Christoph Hellwig
2010-03-26 18:53       ` Eran Rom
2010-04-08 16:17         ` Stefan Hajnoczi
2010-04-05 19:23     ` Christoph Hellwig
2010-04-05 23:17       ` Badari Pulavarty
2010-03-24 20:27   ` Badari Pulavarty
2010-03-29 15:41   ` Badari Pulavarty
2010-03-29 18:20     ` Chris Wright
2010-03-29 20:37       ` Avi Kivity
2010-03-29 22:51         ` Badari Pulavarty
2010-03-29 23:56           ` Chris Wright
2010-03-30 12:43           ` Avi Kivity
2010-04-05 14:22     ` Stefan Hajnoczi
2010-04-06  2:27       ` Badari Pulavarty
