From: Stefan Hajnoczi <stefanha@gmail.com>
To: Liu Yuan <namei.unix@gmail.com>
Cc: josh.durgin@inktank.com, Sage Weil <sage@inktank.com>,
	Alex Bligh <alex@alex.org.uk>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
Date: Mon, 24 Jun 2013 11:31:35 +0200
Message-ID: <20130624093135.GC19900@stefanha-thinkpad.redhat.com>
In-Reply-To: <51C46EAF.40705@gmail.com>

On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't
> >>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> >>> cache access to block devices (in the host layer), and with rbd
> >>> (for instance) there is no access to a block device at all. block/rbd.c
> >>> simply calls librbd which calls librados etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would
> >>> be introduced by using kernel rbd devices rather than librbd.
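
Roughly, that userspace path looks like the following sketch: everything runs inside the QEMU process and goes straight over the network, with no block device in between. This uses the librados/librbd API rather than the actual block/rbd.c code; the pool and image names are placeholders and error handling is omitted.

    #include <stddef.h>
    #include <rados/librados.h>
    #include <rbd/librbd.h>

    /* Everything below executes inside the QEMU process:
     * librbd -> librados -> network, with no block device or host page
     * cache in the path.  Pool/image names are placeholders; error
     * handling is omitted for brevity. */
    static int read_first_bytes(char *buf, size_t len)
    {
        rados_t cluster;
        rados_ioctx_t ioctx;
        rbd_image_t image;

        rados_create(&cluster, NULL);           /* default client id */
        rados_conf_read_file(cluster, NULL);    /* ceph.conf from the usual locations */
        rados_connect(cluster);

        rados_ioctx_create(cluster, "rbd", &ioctx);   /* placeholder pool */
        rbd_open(ioctx, "test-image", &image, NULL);  /* placeholder image */

        rbd_read(image, 0, len, buf);           /* straight to the OSDs */

        rbd_close(image);
        rados_ioctx_destroy(ioctx);
        rados_shutdown(cluster);
        return 0;
    }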
> >>
> >> I understand the limitations with kernel block devices - their
> >> setup/teardown is an extra step outside QEMU and privileges need to be
> >> managed.  That basically means you need to use a management tool like
> >> libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here.  Do you have profiles
> >> that show kernel rbd is a bottleneck due to context switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback
> >> and no one has suggested reimplementing the page cache inside QEMU for
> >> better performance.
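
A simplified sketch of what that means in practice (not the exact block/raw-posix.c logic): the cache= mode mostly decides whether the image is opened with O_DIRECT, and a guest flush becomes an fdatasync() of the image file.

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Simplified: cache=none/directsync bypass the host page cache with
     * O_DIRECT, while cache=writeback and cache=writethrough go through it. */
    static int image_open(const char *path, const char *cache_mode)
    {
        int flags = O_RDWR;

        if (!strcmp(cache_mode, "none") || !strcmp(cache_mode, "directsync"))
            flags |= O_DIRECT;

        return open(path, flags);
    }

    /* A guest flush is forwarded as fdatasync(), persisting any dirty pages
     * sitting in the host page cache (skipped only for cache=unsafe). */
    static int image_flush(int fd)
    {
        return fdatasync(fd);
    }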
> >>
> >> Also, how do you want to manage QEMU page cache with multiple guests
> >> running?  They are independent and know nothing about each other.  Their
> >> process memory consumption will be bloated and the kernel memory
> >> management will end up having to sort out who gets to stay in physical
> >> memory.
> >>
> >> You can see I'm skeptical of this and think it's premature optimization,
> >> but if there's really a case for it with performance profiles then I
> >> guess it would be necessary.  But we should definitely get feedback from
> >> the Ceph folks too.
> >>
> >> I'd like to hear from the Ceph folks what their position on kernel rbd vs
> >> librados is.  Which one do they recommend for QEMU guests and what are the
> >> pros/cons?
> > 
> > I agree that a flashcache/bcache-like persistent cache would be a big win 
> > for qemu + rbd users.  
> > 
> > There are a few important issues with librbd vs kernel rbd:
> > 
> >  * librbd tends to get new features more quickly than kernel rbd
> >    (although now that layering has landed in 3.10 this will be less
> >    painful than it was).
> > 
> >  * Using kernel rbd means users need bleeding edge kernels, a non-starter 
> >    for many orgs that are still running things like RHEL.  Bug fixes are 
> >    difficult to roll out, etc.
> > 
> >  * librbd has an in-memory cache that behaves similarly to an HDD's cache
> >    (e.g., it forces writeback on flush; see the sketch below).  This
> >    improves performance significantly for many workloads.  Of course,
> >    having a bcache-like layer mitigates this.
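
A minimal sketch of that HDD-like behaviour (not librbd's actual code; backend_write() is a hypothetical stand-in for the write that goes to RADOS): writes complete out of a volatile in-memory buffer, and only a guest flush forces them out.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for the write that actually goes to RADOS. */
    int backend_write(uint64_t offset, const void *data, size_t len);

    struct cached_write {
        uint64_t offset;
        size_t len;
        void *data;
        struct cached_write *next;
    };

    static struct cached_write *dirty_list;     /* volatile, in memory only */

    /* Guest write: buffer the data and complete immediately, the way a disk
     * acknowledges out of its volatile write cache. */
    static void cache_write(uint64_t offset, const void *buf, size_t len)
    {
        struct cached_write *w = malloc(sizeof(*w));

        w->offset = offset;
        w->len = len;
        w->data = malloc(len);
        memcpy(w->data, buf, len);
        w->next = dirty_list;
        dirty_list = w;
    }

    /* Guest flush: force writeback of all dirty data before acknowledging,
     * just as a disk drains its write cache on FLUSH CACHE. */
    static int cache_flush(void)
    {
        while (dirty_list) {
            struct cached_write *w = dirty_list;

            if (backend_write(w->offset, w->data, w->len) < 0)
                return -1;
            dirty_list = w->next;
            free(w->data);
            free(w);
        }
        return 0;
    }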
> > 
> > I'm not really sure what the best path forward is.  Putting the 
> > functionality in qemu would benefit lots of other storage backends, 
> > putting it in librbd would capture various other librbd users (xen, tgt, 
> > and future users like hyper-v), and using new kernels works today but 
> > creates a lot of friction for operations.
> > 
> 
> I think I can share some implementation details about a persistent cache
> for guests, because 1) Sheepdog has a persistent object-oriented cache
> exactly like what Alex described, 2) Sheepdog and Ceph's RADOS both provide
> volumes on top of an object store, and 3) Sheepdog chose a persistent cache
> on local disk while Ceph chose an in-memory cache approach.
> 
> The main motivation for the object cache is to reduce network traffic and
> improve performance; the cache can be seen as a hard disk's internal
> write cache, which modern kernels support well.
> 
> As a background introduction, Sheepdog's object cache works similarly to
> the kernel's page cache, except that we cache a 4 MB object of a volume on
> disk while the kernel caches 4 KB pages of a file in memory. We use a
> per-volume LRU list for reclaim and a dirty list to track dirty objects
> for writeback. We always read ahead a whole object if it is not cached.
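
A minimal sketch of that scheme (not Sheepdog's actual code; the helper functions are hypothetical, named only for illustration): 4 MB objects cached as files on the local disk, a per-volume LRU list for reclaim, a dirty list for writeback, and whole-object read-ahead on a miss.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define OBJECT_SIZE (4UL * 1024 * 1024)   /* cache granularity: one 4 MB object */

    /* One cached object, backed by a file on the local cache disk. */
    struct cache_object {
        uint64_t idx;                     /* object index within the volume */
        bool dirty;                       /* queued for writeback */
        struct cache_object *lru_next;    /* per-volume LRU list, for reclaim */
        struct cache_object *dirty_next;  /* per-volume dirty list */
    };

    struct volume_cache {
        struct cache_object *lru_head;    /* most recently used first */
        struct cache_object *dirty_head;  /* dirty objects awaiting writeback */
    };

    /* Hypothetical helpers, for illustration only. */
    struct cache_object *cache_lookup(struct volume_cache *vc, uint64_t idx);
    struct cache_object *cache_pull_object(struct volume_cache *vc, uint64_t idx);
    void cache_copy_range(struct cache_object *obj, void *buf, size_t len, uint64_t off);
    void lru_touch(struct volume_cache *vc, struct cache_object *obj);

    /* Guest read: on a miss, read ahead the whole 4 MB object, then serve the
     * requested range from the local copy and bump the object in the LRU. */
    static int volume_read(struct volume_cache *vc, uint64_t off, void *buf, size_t len)
    {
        struct cache_object *obj = cache_lookup(vc, off / OBJECT_SIZE);

        if (!obj)
            obj = cache_pull_object(vc, off / OBJECT_SIZE);   /* whole-object read-ahead */

        cache_copy_range(obj, buf, len, off % OBJECT_SIZE);
        lru_touch(vc, obj);
        return 0;
    }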
> 
> The benefits of a disk cache over a memory cache, in my opinion, are:
> 1) the VM gets smoother performance because the cache doesn't consume
> memory (if memory is at its high watermark, guest I/O latency becomes
> very high);
> 2) a smaller memory requirement, leaving all the memory to the guest;
> 3) objects from a base image can be shared by all its child snapshots and
> clones (see the sketch below);
> 4) a more efficient reclaim algorithm, because the sheep daemon knows the
> workload better than the kernel's dm-cache/bcache/flashcache does;
> 5) it can easily take advantage of an SSD as the cache backend.
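
Point 3 in sketch form (not Sheepdog's code; object_owner() is a hypothetical helper): if cached objects are keyed by the volume that actually owns the data, every copy-on-write child of a base image resolves an unmodified object to the same local cache entry.

    #include <stdint.h>

    struct cache_key {
        uint32_t vid;    /* volume that owns the object's data */
        uint64_t idx;    /* object index within the volume */
    };

    /* Hypothetical: walks up to the base image when the child has not yet
     * copied the object, so all siblings resolve to the same owner. */
    uint32_t object_owner(uint32_t child_vid, uint64_t idx);

    static struct cache_key make_key(uint32_t child_vid, uint64_t idx)
    {
        /* Two clones of the same base map to the same key and therefore
         * share one cached copy of the object on the local disk. */
        struct cache_key k = { .vid = object_owner(child_vid, idx), .idx = idx };

        return k;
    }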

It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host.  That way it
can do things like share the cached snapshot data between volumes.

This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.

Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before adding
one to Ceph.

Stefan
