Date: Thu, 20 Jun 2013 15:25:09 +0100
From: Alex Bligh
In-Reply-To: <20130620094618.GC15672@stefanha-thinkpad.redhat.com>
References: <7029962A8C6EFDBC98B51E44@nimrod.local> <20130411092548.GE8904@stefanha-thinkpad.redhat.com> <20130620094618.GC15672@stefanha-thinkpad.redhat.com>
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
To: Stefan Hajnoczi
Cc: josh.durgin@inktank.com, qemu-devel@nongnu.org, Alex Bligh, sage@inktank.com

Stefan,

--On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi wrote:

>> The concrete problem here is that flashcache/dm-cache/bcache don't
>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>> cache access to block devices (in the host layer), and with rbd
>> (for instance) there is no access to a block device at all. block/rbd.c
>> simply calls librbd which calls librados etc.
>>
>> So the context switches etc. I am avoiding are the ones that would
>> be introduced by using kernel rbd devices rather than librbd.
>
> I understand the limitations with kernel block devices - their
> setup/teardown is an extra step outside QEMU and privileges need to be
> managed. That basically means you need to use a management tool like
> libvirt to make it usable.

It's not just the management tool (we have one of those). Kernel devices
are a pain. As a trivial example, duplication of UUIDs, LVM IDs etc. by
hostile guests can cause issues.

> But I don't understand the performance angle here. Do you have profiles
> that show kernel rbd is a bottleneck due to context switching?

I don't have test figures - perhaps this is just received wisdom, but I'd
understood that's why librbd was faster.

> We use the kernel page cache for -drive file=test.img,cache=writeback
> and no one has suggested reimplementing the page cache inside QEMU for
> better performance.

That's true, but I'd argue that is a little different because nothing
blocks on the page cache (it being in RAM). You don't get the situation
where the task sleeps awaiting data (from the page cache), the data
arrives, and the task then needs to be scheduled in.

I will admit to a degree of handwaving here, as I hadn't realised the
claim that qemu+rbd was more efficient than qemu+blockdevice+kernelrbd
was controversial.

> Also, how do you want to manage QEMU page cache with multiple guests
> running? They are independent and know nothing about each other. Their
> process memory consumption will be bloated and the kernel memory
> management will end up having to sort out who gets to stay in physical
> memory.

I don't think that one's an issue. Currently QEMU processes with
cache=writeback contend for physical memory via the page cache. I'm not
changing that bit.
I'm proposing allocating SSD (rather than RAM) for the cache, so if
anything that should reduce RAM use, as it will be quicker to flush the
cache to 'disk' (the second layer of caching). I was proposing allocating
each task a fixed amount of SSD space.

In terms of how this is done, one way would be to mmap a large file on
the SSD, which would mean the page cache used would be whatever page
cache is used for the SSD. You've got more control over this (with
madvise etc.) than you have with aio, I think. A rough sketch of the idea
is at the end of this mail.

> You can see I'm skeptical of this

Which is no bad thing!

> and think it's premature optimization,

... and I'm only too keen to avoid work if it brings no gain.

> but if there's really a case for it with performance profiles then I
> guess it would be necessary. But we should definitely get feedback from
> the Ceph folks too.

The specific problem we are trying to solve (in case that's not obvious)
is the non-locality of data read/written by Ceph. Whilst you can use
placement to localise data to the rack level, even if one of your OSDs is
in the same machine you still end up waiting on network traffic. That is
apparently hard to solve inside Ceph.

However, this would be applicable to sheepdog, gluster, nfs, the internal
iscsi initiator, etc. rather than just to Ceph.

I'm also keen to hear from the Ceph guys, as if they have a way of
keeping lots of reads and writes in the box and not crossing the network,
I'd be only too keen to use that.

-- 
Alex Bligh
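
For concreteness, a minimal sketch of the mmap-on-SSD idea. The cache
file path and the fixed 1 GiB per-guest size are made up purely for
illustration, and error handling is minimal; this is a sketch of the
mechanism under those assumptions, not a proposed implementation.

/*
 * Sketch: reserve a fixed-size cache file on the SSD, mmap() it, and let
 * the kernel's page cache / writeback handle the RAM side, with
 * madvise() and msync() giving finer-grained control.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_PATH "/mnt/ssd/qemu-cache.img"   /* illustrative path */
#define CACHE_SIZE (1ULL << 30)                /* e.g. 1 GiB per guest */

int main(void)
{
    int fd = open(CACHE_PATH, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Reserve the fixed amount of SSD space allocated to this guest. */
    if (ftruncate(fd, CACHE_SIZE) < 0) {
        perror("ftruncate");
        return 1;
    }

    /* Writes to this mapping go through the host page cache and are
     * written back to the SSD by the kernel, or by msync() when we need
     * ordering (e.g. on a guest flush). */
    unsigned char *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (cache == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Cache lookups are effectively random access. */
    madvise(cache, CACHE_SIZE, MADV_RANDOM);

    memset(cache, 0, 4096);               /* dirty the first cache block */
    msync(cache, 4096, MS_SYNC);          /* force it out to the SSD */
    madvise(cache, 4096, MADV_DONTNEED);  /* drop the RAM copy; the data
                                             stays in the file on SSD */

    munmap(cache, CACHE_SIZE);
    close(fd);
    return 0;
}

The point being that RAM use is bounded by whatever page cache the kernel
chooses to keep for the SSD file, and madvise()/msync() give us hints and
ordering that we don't get, or get less directly, with aio.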