From: Stefan Hajnoczi <stefanha@gmail.com>
To: Liu Yuan <namei.unix@gmail.com>
Cc: josh.durgin@inktank.com, Sage Weil <sage@inktank.com>,
Alex Bligh <alex@alex.org.uk>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Adding a persistent writeback cache to qemu
Date: Mon, 24 Jun 2013 11:31:35 +0200
Message-ID: <20130624093135.GC19900@stefanha-thinkpad.redhat.com>
In-Reply-To: <51C46EAF.40705@gmail.com>
On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't
> >>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> >>> cache access to block devices (in the host layer), and with rbd
> >>> (for instance) there is no access to a block device at all; block/rbd.c
> >>> simply calls librbd, which calls librados, etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would
> >>> be introduced by using kernel rbd devices rather than librbd.
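For anyone skimming, a rough sketch of the two setups being contrasted
here; pool, image and device names are only placeholders:

  # kernel rbd: the image becomes a host block device, so
  # bcache/dm-cache/flashcache can layer on top of it
  rbd map mypool/myimage                  # appears as e.g. /dev/rbd0
  make-bcache -C /dev/ssd -B /dev/rbd0    # plus the usual sysfs attach step
  qemu-system-x86_64 ... -drive file=/dev/bcache0,format=raw,if=virtio

  # librbd: QEMU calls librbd/librados directly, so there is no host
  # block device for those caching layers to attach to
  qemu-system-x86_64 ... -drive file=rbd:mypool/myimage,format=raw,if=virtio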
> >>
> >> I understand the limitations with kernel block devices - their
> >> setup/teardown is an extra step outside QEMU and privileges need to be
> >> managed. That basically means you need to use a management tool like
> >> libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here. Do you have profiles
> >> that show kernel rbd is a bottleneck due to context switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback
> >> and no one has suggested reimplementing the page cache inside QEMU for
> >> better performance.
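For reference, that mode versus one that bypasses the page cache looks
like this on the command line (the format option is only illustrative):

  -drive file=test.img,format=raw,cache=writeback  # host page cache, writeback
  -drive file=test.img,format=raw,cache=none       # O_DIRECT, no page cache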
> >>
> >> Also, how do you want to manage QEMU page cache with multiple guests
> >> running? They are independent and know nothing about each other. Their
> >> process memory consumption will be bloated and the kernel memory
> >> management will end up having to sort out who gets to stay in physical
> >> memory.
> >>
> >> You can see I'm skeptical of this and think it's premature optimization,
> >> but if there's really a case for it with performance profiles then I
> >> guess it would be necessary. But we should definitely get feedback from
> >> the Ceph folks too.
> >>
> >> I'd like to hear from Ceph folks what their position on kernel rbd vs
> >> librados is. Which one do they recommend for QEMU guests and what are the
> >> pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big win
> > for qemu + rbd users.
> >
> > There are a few important issues with librbd vs kernel rbd:
> >
> > * librbd tends to get new features more quickly than kernel rbd
> > (although now that layering has landed in 3.10 this will be less
> > painful than it was).
> >
> > * Using kernel rbd means users need bleeding edge kernels, a non-starter
> > for many orgs that are still running things like RHEL. Bug fixes are
> > difficult to roll out, etc.
> >
> > * librbd has an in-memory cache that behaves similarly to an HDD's cache
> > (e.g., it forces writeback on flush). This improves performance
> > significantly for many workloads. Of course, having a bcache-like
> > layer mitigates this.
> >
> > I'm not really sure what the best path forward is. Putting the
> > functionality in qemu would benefit lots of other storage backends,
> > putting it in librbd would capture various other librbd users (xen, tgt,
> > and future users like hyper-v), and using new kernels works today but
> > creates a lot of friction for operations.
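For reference, the librbd in-memory cache mentioned above is configured on
the client side roughly like this (the option names are librbd's; the
values are only illustrative, not recommendations):

  [client]
      rbd cache = true
      rbd cache size = 33554432                  # bytes of in-memory cache
      rbd cache max dirty = 25165824             # 0 means writethrough
      rbd cache writethrough until flush = true  # writeback only after the
                                                 # guest's first flush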
> >
>
> I think I can share some implementation details about a persistent cache
> for guests because 1) Sheepdog has a persistent object-oriented cache
> exactly like what Alex described, 2) Sheepdog and Ceph's RADOS both provide
> volumes on top of an object store, and 3) Sheepdog chose a persistent cache
> on local disk while Ceph chose an in-memory cache approach.
>
> The main motivation of the object cache is to reduce network traffic and
> improve performance; the cache can be seen as a hard disk's internal
> write cache, which modern kernels support well.
>
> For a background introduction, Sheepdog's object cache works similarly to
> the kernel's page cache, except that we cache a 4M object of a volume on
> disk while the kernel caches a 4k page of a file in memory. We use an LRU
> list per volume for reclaim and a dirty list to track dirty objects for
> writeback. We always read ahead a whole object if it is not cached.
>
> The benefits of a disk cache over a memory cache, in my opinion, are:
> 1) VMs get smoother performance because the cache doesn't consume memory
> (if memory is at its high water mark, the latency of guest IO will be very
> high),
> 2) a smaller memory requirement, leaving all the memory to the guest,
> 3) objects from a base image can be shared by all of its child snapshots &
> clones,
> 4) a more efficient reclaim algorithm, because the sheep daemon knows its
> volumes better than the kernel's dm-cache/bcache/flashcache do, and
> 5) SSDs can easily be used as the cache backend.
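To make that structure concrete, here is a rough sketch in C (not actual
sheep daemon code; the names, the store I/O helpers and the linear lookup
are simplified) of a per-volume cache with an LRU list for reclaim and a
dirty list for writeback:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

#define OBJECT_SIZE (4 << 20)   /* cache granularity: one whole 4M object */

struct cache_object {
    uint64_t oid;                           /* object index within the volume */
    bool dirty;                             /* has writes not yet pushed out */
    TAILQ_ENTRY(cache_object) lru;          /* position in the per-volume LRU */
    TAILQ_ENTRY(cache_object) dirty_link;   /* position in the dirty list */
};

struct volume_cache {
    TAILQ_HEAD(, cache_object) lru_list;    /* most recently used at the head */
    TAILQ_HEAD(, cache_object) dirty_list;  /* objects waiting for writeback */
};

/* Stand-ins for the real store I/O (network reads/writes of whole objects). */
static int fetch_object_from_store(uint64_t oid) { (void)oid; return 0; }
static int push_object_to_store(uint64_t oid)    { (void)oid; return 0; }

/*
 * Look up an object.  A hit moves it to the LRU head; a miss reads the
 * whole 4M object ahead (the disk-cache analogue of the page cache
 * reading a 4k page).  A write also puts the object on the dirty list.
 */
static struct cache_object *cache_access(struct volume_cache *vc,
                                         uint64_t oid, bool is_write)
{
    struct cache_object *obj;

    TAILQ_FOREACH(obj, &vc->lru_list, lru) {
        if (obj->oid == oid) {
            TAILQ_REMOVE(&vc->lru_list, obj, lru);
            TAILQ_INSERT_HEAD(&vc->lru_list, obj, lru);
            goto found;
        }
    }

    obj = calloc(1, sizeof(*obj));
    if (!obj || fetch_object_from_store(oid) < 0) {
        free(obj);
        return NULL;
    }
    obj->oid = oid;
    TAILQ_INSERT_HEAD(&vc->lru_list, obj, lru);

found:
    if (is_write && !obj->dirty) {
        obj->dirty = true;
        TAILQ_INSERT_TAIL(&vc->dirty_list, obj, dirty_link);
    }
    return obj;
}

/*
 * Flush from the guest: push every dirty object back to the store, the
 * same way a disk's internal write cache is drained by a FLUSH command.
 */
static int cache_writeback(struct volume_cache *vc)
{
    struct cache_object *obj;

    while (!TAILQ_EMPTY(&vc->dirty_list)) {
        obj = TAILQ_FIRST(&vc->dirty_list);
        if (push_object_to_store(obj->oid) < 0) {
            return -1;
        }
        TAILQ_REMOVE(&vc->dirty_list, obj, dirty_link);
        obj->dirty = false;
    }
    return 0;
}

A real implementation obviously needs a proper index, locking, reclaim and
on-disk cache files; the LRU/dirty split is just the part relevant to the
reclaim and writeback behaviour described above.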
It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host. That way it
can do things like share the cached snapshot data between volumes.
This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.
Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before adding
one to Ceph.
Stefan
Thread overview: 15+ messages
2013-04-01 13:21 [Qemu-devel] Adding a persistent writeback cache to qemu Alex Bligh
2013-04-11 9:25 ` Stefan Hajnoczi
2013-06-19 21:28 ` Alex Bligh
2013-06-20 9:46 ` Stefan Hajnoczi
2013-06-20 14:25 ` Alex Bligh
2013-06-21 12:55 ` Stefan Hajnoczi
2013-06-21 13:54 ` Alex Bligh
2013-06-21 15:45 ` Sage Weil
2013-06-20 15:58 ` Sage Weil
2013-06-21 11:18 ` Alex Bligh
2013-06-21 15:40 ` Sage Weil
2013-06-21 13:20 ` Stefan Hajnoczi
2013-06-21 15:18 ` Liu Yuan
2013-06-24 9:31 ` Stefan Hajnoczi [this message]
2013-06-24 10:25 ` Alex Bligh