From mboxrd@z Thu Jan 1 00:00:00 1970
From: Badari Pulavarty
Subject: Re: [RFC] vhost-blk implementation
Date: Wed, 24 Mar 2010 13:22:37 -0700
Message-ID: <4BAA748D.40509@us.ibm.com>
References: <1269306023.7931.72.camel@badari-desktop> <20100324200402.GA22272@infradead.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Christoph Hellwig
In-Reply-To: <20100324200402.GA22272@infradead.org>
Sender: kvm-owner@vger.kernel.org

Christoph Hellwig wrote:
>> Inspired by the vhost-net implementation, I did an initial prototype
>> of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
>> I haven't handled all the error cases, fixed naming conventions etc.,
>> but the implementation is stable to play with. I tried not to deviate
>> from the vhost-net implementation where possible.
>
> Can you also send the qemu side of it?
>
>> with vhost-blk:
>> ----------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 126.135 seconds, 665 MB/s
>>
>> real    2m6.137s
>> user    0m0.281s
>> sys     0m14.725s
>>
>> without vhost-blk: (virtio)
>> ---------------------------
>>
>> # time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
>> 640000+0 records in
>> 640000+0 records out
>> 83886080000 bytes (84 GB) copied, 275.466 seconds, 305 MB/s
>>
>> real    4m35.468s
>> user    0m0.373s
>> sys     0m48.074s
>
> Which caching mode is this? I assume data=writeback, because otherwise
> you'd be doing synchronous I/O directly from the handler.

Yes, this is with the default (writeback) cache model. As mentioned
earlier, readahead is helping here; in most cases the data is already
in the pagecache.

>> +static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
>> +			struct iovec *iov, int in)
>> +{
>> +	/* virtio-blk sector size is 512 bytes */
>> +	loff_t pos = sector << 9;
>> +	int ret = 0;
>> +
>> +	if (type & VIRTIO_BLK_T_FLUSH) {
>> +		ret = vfs_fsync(file, file->f_path.dentry, 1);
>> +	} else if (type & VIRTIO_BLK_T_OUT) {
>> +		ret = vfs_writev(file, iov, in, &pos);
>> +	} else {
>> +		ret = vfs_readv(file, iov, in, &pos);
>> +	}
>> +	return ret;
>
> I have to admit I don't understand the vhost architecture at all, but
> where do the actual data pointers used by the iovecs reside?
> vfs_readv/writev expect both the iovec itself and the buffers
> pointed to by it to reside in userspace, so just using kernel buffers
> here will break badly on architectures with different user/kernel
> mappings. A lot of this is fixable using simple set_fs & co tricks,
> but for direct I/O which uses get_user_pages even that will fail badly.

The iovecs and buffers are user-space pointers (from the host kernel's
point of view); they are guest addresses. So I don't need to do any
set_fs tricks.
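This works because the vhost work functions run with qemu's mm adopted
via use_mm(), the same way vhost-net's handlers do, so the guest
addresses behave like ordinary user pointers for vfs_readv()/vfs_writev().
A minimal sketch -- the struct and function names below are made up for
illustration, not the actual vhost-blk code:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>	/* use_mm()/unuse_mm() */
#include "vhost.h"		/* struct vhost_dev */

/* illustrative container, not the real vhost-blk structure */
struct vhost_blk {
	struct vhost_dev dev;
	struct work_struct vq_work;
};

static void vhost_blk_vq_work(struct work_struct *work)
{
	struct vhost_blk *blk = container_of(work, struct vhost_blk, vq_work);

	use_mm(blk->dev.mm);	/* adopt qemu's address space */
	/* ... drain the vring, calling do_handle_io() per request ... */
	unuse_mm(blk->dev.mm);
}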
> Also it seems like you're doing all the I/O synchronous here? For
> data=writeback operations that could explain the read speedup
> as you're avoiding context switches, but for actual write I/O
> which has to get data to disk (either directly from vfs_writev or
> later through vfs_fsync) this seems like a really bad idea stealing
> a lot of guest time that should happen in the background.

Yes. QEMU virtio-blk batches up all the writes and hands the work off
to another thread; when the writes complete, it sends the status
completion. Since I am doing everything synchronously (even though it
is a write to the pagecache), one request at a time, that explains the
slowdown.

We need to find a way to

1) batch the write IOs together
2) hand them off to another thread, so the vhost thread can handle the
   next set of requests
3) update the status on completion

What should I do here? I can create a bunch of kernel threads to do the
IO for me, or somehow fit into and reuse the AIO io_submit() mechanism.
What's the best way? I hate to duplicate all the code VFS is doing.
(A rough sketch of the hand-off I have in mind is in the P.S. below.)

> Other than that the code seems quite nice and simple, but one huge
> problem is that it'll only support raw images, and thus misses out
> on all the "nice" image formats used in qemu deployments, especially
> qcow2. It's also missing the ioctl magic we're having in various
> places, both for controlling host devices like cdroms and SG
> passthrough.

True... unfortunately, I don't understand all of those (qcow2) details
yet! I need to read up on them before I can even comment. :(

Thanks,
Badari
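P.S. An untested sketch of the hand-off, just for concreteness. struct
vhost_blk_req and its layout are made up for illustration;
vhost_add_used_and_signal() and vq_err() are the existing vhost APIs:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>	/* use_mm()/unuse_mm() */
#include <linux/virtio_blk.h>	/* VIRTIO_BLK_T_*, VIRTIO_BLK_S_* */
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/uio.h>
#include "vhost.h"

struct vhost_blk_req {
	struct work_struct	work;
	struct vhost_virtqueue	*vq;	/* where to post the completion */
	unsigned int		head;	/* descriptor to return to the guest */
	struct file		*file;
	loff_t			pos;
	struct iovec		*iov;	/* guest buffers (user addresses) */
	unsigned long		nvecs;
	uint32_t		type;
	uint8_t __user		*status; /* guest-visible status byte */
};

static void vhost_blk_req_work(struct work_struct *work)
{
	struct vhost_blk_req *req =
			container_of(work, struct vhost_blk_req, work);
	uint8_t status;
	ssize_t ret;

	use_mm(req->vq->dev->mm);	/* guest addresses live in qemu's mm */
	/* same vfs_readv/vfs_writev caveats as in do_handle_io() apply */
	if (req->type & VIRTIO_BLK_T_OUT)
		ret = vfs_writev(req->file, req->iov, req->nvecs, &req->pos);
	else
		ret = vfs_readv(req->file, req->iov, req->nvecs, &req->pos);
	status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
	if (copy_to_user(req->status, &status, sizeof(status)))
		vq_err(req->vq, "failed to write status byte\n");
	/* put the buffer in the used ring and interrupt the guest */
	vhost_add_used_and_signal(req->vq->dev, req->vq, req->head, 0);
	unuse_mm(req->vq->dev->mm);
	kfree(req);
}

The vhost handler would then just allocate and fill in a request,
INIT_WORK(&req->work, vhost_blk_req_work), queue_work() it, and go
straight back to the vring for the next descriptor, so it is never
blocked on the actual IO.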