All of lore.kernel.org
 help / color / mirror / Atom feed
From: Badari Pulavarty <pbadari@us.ibm.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: kvm@vger.kernel.org
Subject: Re: [RFC] vhost-blk implementation
Date: Wed, 24 Mar 2010 13:22:37 -0700	[thread overview]
Message-ID: <4BAA748D.40509@us.ibm.com> (raw)
In-Reply-To: <20100324200402.GA22272@infradead.org>

Christoph Hellwig wrote:
>> Inspired by vhost-net implementation, I did initial prototype 
>> of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
>> I haven't handled all the error cases, fixed naming conventions etc.,
>> but the implementation is stable to play with. I tried not to deviate
>> from vhost-net implementation where possible.
>>     
>
> Can you also send the qemu side of it?
>
>   
>> with vhost-blk:
>> ----------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
>>
>> real    2m6.137s
>> user    0m0.281s
>> sys     0m14.725s
>>
>> without vhost-blk: (virtio)
>> ---------------------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
>>
>> real    4m35.468s
>> user    0m0.373s
>> sys     0m48.074s
>>     
>
> Which caching mode is this?  I assume data=writeback, because otherwise
> you'd be doing synchronous I/O directly from the handler.
>   

Yes. This is with default (writeback) cache model. As mentioned earlier, 
readhead is helping here
and most cases, data would be ready in the pagecache.
>   
>> +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
>> +			struct iovec *iov, int in)
>> +{
>> +	loff_t pos = sector << 8;
>> +	int ret = 0;
>> +
>> +	if (type & VIRTIO_BLK_T_FLUSH)  {
>> +		ret = vfs_fsync(file, file->f_path.dentry, 1);
>> +	} else if (type & VIRTIO_BLK_T_OUT) {
>> +		ret = vfs_writev(file, iov, in, &pos);
>> +	} else {
>> +		ret = vfs_readv(file, iov, in, &pos);
>> +	}
>> +	return ret;
>>     
>
> I have to admit I don't understand the vhost architecture at all, but
> where do the actual data pointers used by the iovecs reside?
> vfs_readv/writev expect both the iovec itself and the buffers
> pointed to by it to reside in userspace, so just using kernel buffers
> here will break badly on architectures with different user/kernel
> mappings.  A lot of this is fixable using simple set_fs & co tricks,
> but for direct I/O which uses get_user_pages even that will fail badly.
>   
iovecs and buffers are user-space pointers (from the host kernel point 
of view). They are
guest address. So, I don't need to do any set_fs tricks.
> Also it seems like you're doing all the I/O synchronous here?  For
> data=writeback operations that could explain the read speedup
> as you're avoiding context switches, but for actual write I/O
> which has to get data to disk (either directly from vfs_writev or
> later through vfs_fsync) this seems like a really bad idea stealing
> a lot of guest time that should happen in the background.
>   
Yes. QEMU virtio-blk is batching up all the writes and handing of the 
work to another
thread. When the writes() are complete, its sending a status completion. 
Since I am
doing everything synchronous (even though its write to pagecache) one 
request at a
time, that explains the slow down. We need to find a way to

1) batch IO writes together
2) hand off to another thread to do the IO, so that vhost-thread can handle
next set of requests
3) update the status on the completion

What do should I do here ? I can create bunch of kernel threads to do 
the IO for me.
Or some how fit and reuse AIO io_submit() mechanism. Whats the best way 
here ?
I hate do duplicate all the code VFS is doing.
>
> Other than that the code seems quite nice and simple, but one huge
> problem is that it'll only support raw images, and thus misses out
> on all the "nice" image formats used in qemu deployments, especially
> qcow2.  It's also missing the ioctl magic we're having in various
> places, both for controlling host devices like cdroms and SG
> passthrough.
>   
True... unfortunately, I don't understand all of those (qcow2) details 
yet !! I need to read up on those,
to even make a comment :(

Thanks,
Badari



  reply	other threads:[~2010-03-24 20:22 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-03-23  1:00 [RFC] vhost-blk implementation Badari Pulavarty
2010-03-23  1:16 ` Anthony Liguori
2010-03-23  1:45   ` Badari Pulavarty
2010-03-23  2:00     ` Anthony Liguori
2010-03-23  2:50       ` Badari Pulavarty
2010-03-23 10:05         ` Avi Kivity
2010-03-23 14:48           ` Badari Pulavarty
2010-03-23 10:03 ` Avi Kivity
2010-03-23 14:55   ` Badari Pulavarty
2010-03-23 16:53     ` Avi Kivity
2010-03-24 20:05   ` Christoph Hellwig
2010-03-25  6:29     ` Avi Kivity
2010-03-25 15:48       ` Christoph Hellwig
2010-03-25 15:51         ` Avi Kivity
2010-03-25 15:00     ` Asdo
2010-04-05 19:59       ` Christoph Hellwig
2010-04-07  0:36         ` [RFC] vhost-blk implementation (v2) Badari Pulavarty
2010-04-07  0:36         ` Badari Pulavarty
2010-03-23 10:09 ` [RFC] vhost-blk implementation Eran Rom
2010-03-24 20:04 ` Christoph Hellwig
2010-03-24 20:22   ` Badari Pulavarty [this message]
2010-03-25  7:57     ` Avi Kivity
2010-03-25 14:36       ` Badari Pulavarty
2010-03-25 15:57     ` Christoph Hellwig
2010-03-26 18:53       ` Eran Rom
2010-04-08 16:17         ` Stefan Hajnoczi
2010-04-05 19:23     ` Christoph Hellwig
2010-04-05 23:17       ` Badari Pulavarty
2010-03-24 20:27   ` Badari Pulavarty
2010-03-29 15:41   ` Badari Pulavarty
2010-03-29 18:20     ` Chris Wright
2010-03-29 20:37       ` Avi Kivity
2010-03-29 22:51         ` Badari Pulavarty
2010-03-29 23:56           ` Chris Wright
2010-03-30 12:43           ` Avi Kivity
2010-04-05 14:22     ` Stefan Hajnoczi
2010-04-06  2:27       ` Badari Pulavarty
  -- strict thread matches above, loose matches on Subject: below --
2010-03-23  0:34 Badari Pulavarty
2010-03-23 12:39 ` Michael S. Tsirkin
2010-03-23 14:47   ` Badari Pulavarty
2010-03-23 17:57   ` Badari Pulavarty
2010-03-23 18:06     ` Michael S. Tsirkin
2010-03-23 19:55       ` Badari Pulavarty
2010-03-24  9:52         ` Michael S. Tsirkin
2010-03-24 11:23         ` Michael S. Tsirkin
2010-03-24 17:58           ` Badari Pulavarty
2010-03-24 18:28             ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4BAA748D.40509@us.ibm.com \
    --to=pbadari@us.ibm.com \
    --cc=hch@infradead.org \
    --cc=kvm@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.