From mboxrd@z Thu Jan 1 00:00:00 1970
From: Badari Pulavarty
Subject: Re: [RFC] vhost-blk implementation
Date: Wed, 24 Mar 2010 13:22:37 -0700
Message-ID: <4BAA748D.40509@us.ibm.com>
References: <1269306023.7931.72.camel@badari-desktop> <20100324200402.GA22272@infradead.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Christoph Hellwig
In-Reply-To: <20100324200402.GA22272@infradead.org>
Sender: kvm-owner@vger.kernel.org

Christoph Hellwig wrote:
>> Inspired by the vhost-net implementation, I did an initial prototype
>> of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
>> I haven't handled all the error cases, fixed naming conventions etc.,
>> but the implementation is stable to play with. I tried not to deviate
>> from the vhost-net implementation where possible.
>
> Can you also send the qemu side of it?
>
>> with vhost-blk:
>> ----------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
>>
>> real    2m6.137s
>> user    0m0.281s
>> sys     0m14.725s
>>
>> without vhost-blk: (virtio)
>> ---------------------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
>>
>> real    4m35.468s
>> user    0m0.373s
>> sys     0m48.074s
>
> Which caching mode is this? I assume data=writeback, because otherwise
> you'd be doing synchronous I/O directly from the handler.

Yes, this is with the default (writeback) cache model. As mentioned
earlier, readahead is helping here; in most cases the data is already
in the pagecache.

>> +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
>> +			struct iovec *iov, int in)
>> +{
>> +	/* virtio-blk sector size is 512 bytes */
>> +	loff_t pos = sector << 9;
>> +	int ret = 0;
>> +
>> +	if (type & VIRTIO_BLK_T_FLUSH) {
>> +		ret = vfs_fsync(file, file->f_path.dentry, 1);
>> +	} else if (type & VIRTIO_BLK_T_OUT) {
>> +		ret = vfs_writev(file, iov, in, &pos);
>> +	} else {
>> +		ret = vfs_readv(file, iov, in, &pos);
>> +	}
>> +	return ret;
>
> I have to admit I don't understand the vhost architecture at all, but
> where do the actual data pointers used by the iovecs reside?
> vfs_readv/writev expect both the iovec itself and the buffers
> pointed to by it to reside in userspace, so just using kernel buffers
> here will break badly on architectures with different user/kernel
> mappings. A lot of this is fixable using simple set_fs & co tricks,
> but for direct I/O which uses get_user_pages even that will fail badly.

The iovecs and buffers are user-space pointers (from the host kernel's
point of view); they are guest addresses. So I don't need to do any
set_fs tricks.
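This works because the vhost work functions run with qemu's mm adopted
via use_mm(), the same way vhost-net's handlers do, so the guest
addresses behave like ordinary user pointers for vfs_readv()/vfs_writev().
A minimal sketch -- the struct and function names below are made up for
illustration, not the actual vhost-blk code:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>	/* use_mm()/unuse_mm() */
#include "vhost.h"		/* struct vhost_dev */

/* illustrative container, not the real vhost-blk structure */
struct vhost_blk {
	struct vhost_dev dev;
	struct work_struct vq_work;
};

static void vhost_blk_vq_work(struct work_struct *work)
{
	struct vhost_blk *blk = container_of(work, struct vhost_blk, vq_work);

	use_mm(blk->dev.mm);	/* adopt qemu's address space */
	/* ... drain the vring, calling do_handle_io() per request ... */
	unuse_mm(blk->dev.mm);
}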
> Also it seems like you're doing all the I/O synchronous here? For
> data=writeback operations that could explain the read speedup
> as you're avoiding context switches, but for actual write I/O
> which has to get data to disk (either directly from vfs_writev or
> later through vfs_fsync) this seems like a really bad idea stealing
> a lot of guest time that should happen in the background.

Yes. QEMU virtio-blk batches up all the writes and hands the work off
to another thread; when the writes complete, it sends the status
completion. Since I am doing everything synchronously (even though it
is a write to the pagecache), one request at a time, that explains the
slowdown.

We need to find a way to

1) batch the write IOs together
2) hand them off to another thread, so the vhost thread can handle the
   next set of requests
3) update the status on completion

What should I do here? I can create a bunch of kernel threads to do the
IO for me, or somehow fit into and reuse the AIO io_submit() mechanism.
What's the best way? I hate to duplicate all the code VFS is doing.
(A rough sketch of the hand-off I have in mind is in the P.S. below.)

> Other than that the code seems quite nice and simple, but one huge
> problem is that it'll only support raw images, and thus misses out
> on all the "nice" image formats used in qemu deployments, especially
> qcow2. It's also missing the ioctl magic we're having in various
> places, both for controlling host devices like cdroms and SG
> passthrough.

True... unfortunately, I don't understand all of those (qcow2) details
yet! I need to read up on them before I can even comment. :(

Thanks,
Badari
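P.S. An untested sketch of the hand-off, just for concreteness. struct
vhost_blk_req and its layout are made up for illustration;
vhost_add_used_and_signal() and vq_err() are the existing vhost APIs:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>	/* use_mm()/unuse_mm() */
#include <linux/virtio_blk.h>	/* VIRTIO_BLK_T_*, VIRTIO_BLK_S_* */
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/uio.h>
#include "vhost.h"

struct vhost_blk_req {
	struct work_struct	work;
	struct vhost_virtqueue	*vq;	/* where to post the completion */
	unsigned int		head;	/* descriptor to return to the guest */
	struct file		*file;
	loff_t			pos;
	struct iovec		*iov;	/* guest buffers (user addresses) */
	unsigned long		nvecs;
	uint32_t		type;
	uint8_t __user		*status; /* guest-visible status byte */
};

static void vhost_blk_req_work(struct work_struct *work)
{
	struct vhost_blk_req *req =
			container_of(work, struct vhost_blk_req, work);
	uint8_t status;
	ssize_t ret;

	use_mm(req->vq->dev->mm);	/* guest addresses live in qemu's mm */
	/* same vfs_readv/vfs_writev caveats as in do_handle_io() apply */
	if (req->type & VIRTIO_BLK_T_OUT)
		ret = vfs_writev(req->file, req->iov, req->nvecs, &req->pos);
	else
		ret = vfs_readv(req->file, req->iov, req->nvecs, &req->pos);
	status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
	if (copy_to_user(req->status, &status, sizeof(status)))
		vq_err(req->vq, "failed to write status byte\n");
	/* put the buffer in the used ring and interrupt the guest */
	vhost_add_used_and_signal(req->vq->dev, req->vq, req->head, 0);
	unuse_mm(req->vq->dev->mm);
	kfree(req);
}

The vhost handler would then just allocate and fill in a request,
INIT_WORK(&req->work, vhost_blk_req_work), queue_work() it, and go
straight back to the vring for the next descriptor, so it is never
blocked on the actual IO.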