Date: Thu, 20 Jun 2013 15:25:09 +0100
From: Alex Bligh
In-Reply-To: <20130620094618.GC15672@stefanha-thinkpad.redhat.com>
References: <7029962A8C6EFDBC98B51E44@nimrod.local> <20130411092548.GE8904@stefanha-thinkpad.redhat.com> <20130620094618.GC15672@stefanha-thinkpad.redhat.com>
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
To: Stefan Hajnoczi
Cc: josh.durgin@inktank.com, qemu-devel@nongnu.org, Alex Bligh, sage@inktank.com

Stefan,

--On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi wrote:

>> The concrete problem here is that flashcache/dm-cache/bcache don't
>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>> cache access to block devices (in the host layer), and with rbd
>> (for instance) there is no access to a block device at all. block/rbd.c
>> simply calls librbd which calls librados etc.
>>
>> So the context switches etc. I am avoiding are the ones that would
>> be introduced by using kernel rbd devices rather than librbd.
>
> I understand the limitations with kernel block devices - their
> setup/teardown is an extra step outside QEMU and privileges need to be
> managed. That basically means you need to use a management tool like
> libvirt to make it usable.

It's not just the management tool (we have one of those). Kernel devices
are a pain. As a trivial example, duplication of UUIDs, LVM IDs etc. by
hostile guests can cause issues.

> But I don't understand the performance angle here. Do you have profiles
> that show kernel rbd is a bottleneck due to context switching?

I don't have test figures - perhaps this is just received wisdom, but I'd
understood that's why librbd was faster.

> We use the kernel page cache for -drive file=test.img,cache=writeback
> and no one has suggested reimplementing the page cache inside QEMU for
> better performance.

That's true, but I'd argue that is a little different because nothing
blocks on the page cache (it being in RAM). You don't get the situation
where the task sleeps awaiting data (from the page cache), the data
arrives, and the task then needs to be scheduled in.

I will admit to a degree of handwaving here, as I hadn't realised the
claim that qemu+rbd was more efficient than qemu+blockdevice+kernelrbd
was controversial.

> Also, how do you want to manage QEMU page cache with multiple guests
> running? They are independent and know nothing about each other. Their
> process memory consumption will be bloated and the kernel memory
> management will end up having to sort out who gets to stay in physical
> memory.

I don't think that one's an issue. Currently QEMU processes with
cache=writeback contend for physical memory via the page cache. I'm not
changing that bit.
I'm proposing allocating SSD (rather than RAM) for the cache, so if
anything that should reduce RAM use, as it will be quicker to flush the
cache to 'disk' (the second layer of caching). I was proposing allocating
each task a fixed amount of SSD space.

In terms of how this is done, one way would be to mmap a large file on
the SSD, which would mean the page cache used would be whatever page
cache is used for the SSD. You've got more control over this (with
madvise etc.) than you have with aio, I think. A rough sketch of the idea
is at the end of this mail.

> You can see I'm skeptical of this

Which is no bad thing!

> and think it's premature optimization,

... and I'm only too keen to avoid work if it brings no gain.

> but if there's really a case for it with performance profiles then I
> guess it would be necessary. But we should definitely get feedback from
> the Ceph folks too.

The specific problem we are trying to solve (in case that's not obvious)
is the non-locality of data read/written by Ceph. Whilst you can use
placement to localise data to the rack level, even if one of your OSDs is
in the same machine you still end up waiting on network traffic. That is
apparently hard to solve inside Ceph.

However, this would be applicable to sheepdog, gluster, nfs, the internal
iscsi initiator, etc. rather than just to Ceph.

I'm also keen to hear from the Ceph guys, as if they have a way of
keeping lots of reads and writes in the box and not crossing the network,
I'd be only too keen to use that.

-- 
Alex Bligh
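
For concreteness, a minimal sketch of the mmap-on-SSD idea. The cache
file path and the fixed 1 GiB per-guest size are made up purely for
illustration, and error handling is minimal; this is a sketch of the
mechanism under those assumptions, not a proposed implementation.

/*
 * Sketch: reserve a fixed-size cache file on the SSD, mmap() it, and let
 * the kernel's page cache / writeback handle the RAM side, with
 * madvise() and msync() giving finer-grained control.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_PATH "/mnt/ssd/qemu-cache.img"   /* illustrative path */
#define CACHE_SIZE (1ULL << 30)                /* e.g. 1 GiB per guest */

int main(void)
{
    int fd = open(CACHE_PATH, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Reserve the fixed amount of SSD space allocated to this guest. */
    if (ftruncate(fd, CACHE_SIZE) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* Writes to this mapping go through the host page cache and are
     * written back to the SSD by the kernel, or by msync() when we need
     * ordering (e.g. on a guest flush). */
    unsigned char *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (cache == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Cache lookups are effectively random access. */
    madvise(cache, CACHE_SIZE, MADV_RANDOM);

    memset(cache, 0, 4096);               /* dirty the first cache block */
    msync(cache, 4096, MS_SYNC);          /* force it out to the SSD */
    madvise(cache, 4096, MADV_DONTNEED);  /* drop the RAM copy; the data
                                             stays in the file on SSD */

    munmap(cache, CACHE_SIZE);
    close(fd);
    return 0;
}

The point being that RAM use is bounded by whatever page cache the kernel
chooses to keep for the SSD file, and madvise()/msync() give us hints and
ordering that we don't get, or get less directly, with aio.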