qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Adding a persistent writeback cache to qemu
@ 2013-04-01 13:21 Alex Bligh
  2013-04-11  9:25 ` Stefan Hajnoczi
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Bligh @ 2013-04-01 13:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Bligh

I'd like to experiment with adding a persistent writeback cache to qemu.
The use case here is where non-local storage is used (e.g. rbd, ceph)
using the qemu drivers, together with a local cache as a file on
a much faster locally mounted device, for instance an SSD (possibly
replicated). This would I think give a similar performance boost to
using an rbd block device plus flashcache/dm-cache/bcache, but without
introducing all the context switches and limitations of having to
use real block devices. I appreciate it would need to be live migration
aware (worst case solution: flush and turn off caching during live
migrate), and ideally be capable of replaying a dirty writeback cache
in the event the host crashes.
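
To give a flavour of what I have in mind (the syntax below is entirely
made up; no such options exist today), I imagine something along these
lines, where 'pcache' and 'pcache-size' are hypothetical option names
for the persistent cache file and its size:

  qemu-system-x86_64 ... \
    -drive file=rbd:rbd/vm-disk-1,format=raw,cache=writeback,pcache=/ssd/vm-disk-1.cache,pcache-size=4G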

Is there any support for this already? Has anyone worked on this before?
If not, would there be any interest in it?

-- 
Alex Bligh


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-04-01 13:21 [Qemu-devel] Adding a persistent writeback cache to qemu Alex Bligh
@ 2013-04-11  9:25 ` Stefan Hajnoczi
  2013-06-19 21:28   ` Alex Bligh
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2013-04-11  9:25 UTC (permalink / raw)
  To: Alex Bligh; +Cc: qemu-devel

On Mon, Apr 01, 2013 at 01:21:45PM +0000, Alex Bligh wrote:
> I'd like to experiment with adding persistent writeback cache to qemu.
> The use case here is where non-local storage is used (e.g. rbd, ceph)
> using the qemu drivers, together with a local cache as a file on
> a much faster locally mounted device, for instance an SSD (possibly
> replicated). This would I think give a similar performance boost to
> using an rbd block device plus flashcache/dm-cache/bcache, but without
> introducing all the context switches and limitations of having to
> use real block devices. I appreciate it would need to be live migration
> aware (worst case solution: flush and turn off caching during live
> migrate), and ideally be capable of replaying a dirty writeback cache
> in the event the host crashes.
> 
> Is there any support for this already? Has anyone worked on this before?
> If not, would there be any interest in it?

I'm concerned about the complexity this would introduce in QEMU.
Therefore I'm a fan of using existing solutions like the Linux block
layer instead of reimplementing this stuff in QEMU.

What concrete issues are there with using rbd plus
flashcache/dm-cache/bcache?

I'm not sure I understand the context switch problem since implementing
it in user space will still require system calls to do all the actual
cache I/O.

Stefan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-04-11  9:25 ` Stefan Hajnoczi
@ 2013-06-19 21:28   ` Alex Bligh
  2013-06-20  9:46     ` Stefan Hajnoczi
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Bligh @ 2013-06-19 21:28 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, Alex Bligh

Stefan,

--On 11 April 2013 11:25:48 +0200 Stefan Hajnoczi <stefanha@gmail.com> 
wrote:

>> I'd like to experiment with adding persistent writeback cache to qemu.
>> The use case here is where non-local storage is used (e.g. rbd, ceph)
>> using the qemu drivers, together with a local cache as a file on
>> a much faster locally mounted device, for instance an SSD (possibly
>> replicated). This would I think give a similar performance boost to
>> using an rbd block device plus flashcache/dm-cache/bcache, but without
>> introducing all the context switches and limitations of having to
>> use real block devices. I appreciate it would need to be live migration
>> aware (worst case solution: flush and turn off caching during live
>> migrate), and ideally be capable of replaying a dirty writeback cache
>> in the event the host crashes.
>>
>> Is there any support for this already? Has anyone worked on this before?
>> If not, would there be any interest in it?
>
> I'm concerned about the complexity this would introduce in QEMU.
> Therefore I'm a fan of using existing solutions like the Linux block
> layer instead of reimplementing this stuff in Linux.
>
> What concrete issues are there with using rbd plus
> flashcache/dm-cache/bcache?
>
> I'm not sure I understand the context switch problem since implementing
> it in user space will still require system calls to do all the actual
> cache I/O.

I failed to see your reply and got distracted from this. Apologies.
So several months later ...

The concrete problem here is that flashcache/dm-cache/bcache don't
work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
cache access to block devices (in the host layer), and with rbd
(for instance) there is no access to a block device at all. block/rbd.c
simply calls librbd which calls librados etc.

So the context switches etc. I am avoiding are the ones that would
be introduced by using kernel rbd devices rather than librbd.
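
To make the two attachment paths concrete, a rough sketch (exact option
spellings vary between qemu and ceph versions, so treat this as
illustrative only):

  # librbd path: qemu's block/rbd.c talks to the cluster from user space
  qemu-system-x86_64 -drive file=rbd:rbd/vm-disk-1,format=raw,cache=writeback ...

  # kernel rbd path: map a block device first, then hand it to qemu
  rbd map rbd/vm-disk-1        # appears as e.g. /dev/rbd0
  qemu-system-x86_64 -drive file=/dev/rbd0,format=raw,cache=none ...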

I had planned to introduce this as a sort of layer on top of any
existing block driver; I believe qemu's block drivers are already
layered in that way.

-- 
Alex Bligh


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-19 21:28   ` Alex Bligh
@ 2013-06-20  9:46     ` Stefan Hajnoczi
  2013-06-20 14:25       ` Alex Bligh
  2013-06-20 15:58       ` Sage Weil
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Hajnoczi @ 2013-06-20  9:46 UTC (permalink / raw)
  To: Alex Bligh; +Cc: josh.durgin, qemu-devel, sage

On Wed, Jun 19, 2013 at 10:28:53PM +0100, Alex Bligh wrote:
> --On 11 April 2013 11:25:48 +0200 Stefan Hajnoczi
> <stefanha@gmail.com> wrote:
> 
> >>I'd like to experiment with adding persistent writeback cache to qemu.
> >>The use case here is where non-local storage is used (e.g. rbd, ceph)
> >>using the qemu drivers, together with a local cache as a file on
> >>a much faster locally mounted device, for instance an SSD (possibly
> >>replicated). This would I think give a similar performance boost to
> >>using an rbd block device plus flashcache/dm-cache/bcache, but without
> >>introducing all the context switches and limitations of having to
> >>use real block devices. I appreciate it would need to be live migration
> >>aware (worst case solution: flush and turn off caching during live
> >>migrate), and ideally be capable of replaying a dirty writeback cache
> >>in the event the host crashes.
> >>
> >>Is there any support for this already? Has anyone worked on this before?
> >>If not, would there be any interest in it?
> >
> >I'm concerned about the complexity this would introduce in QEMU.
> >Therefore I'm a fan of using existing solutions like the Linux block
> >layer instead of reimplementing this stuff in Linux.
> >
> >What concrete issues are there with using rbd plus
> >flashcache/dm-cache/bcache?
> >
> >I'm not sure I understand the context switch problem since implementing
> >it in user space will still require system calls to do all the actual
> >cache I/O.
> 
> I failed to see your reply and got distracted from this. Apologies.
> So several months later ...

Happens to me sometimes too ;-).

> The concrete problem here is that flashcache/dm-cache/bcache don't
> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> cache access to block devices (in the host layer), and with rbd
> (for instance) there is no access to a block device at all. block/rbd.c
> simply calls librbd which calls librados etc.
> 
> So the context switches etc. I am avoiding are the ones that would
> be introduced by using kernel rbd devices rather than librbd.

I understand the limitations with kernel block devices - their
setup/teardown is an extra step outside QEMU and privileges need to be
managed.  That basically means you need to use a management tool like
libvirt to make it usable.

But I don't understand the performance angle here.  Do you have profiles
that show kernel rbd is a bottleneck due to context switching?

We use the kernel page cache for -drive file=test.img,cache=writeback
and no one has suggested reimplementing the page cache inside QEMU for
better performance.

Also, how do you want to manage QEMU page cache with multiple guests
running?  They are independent and know nothing about each other.  Their
process memory consumption will be bloated and the kernel memory
management will end up having to sort out who gets to stay in physical
memory.

You can see I'm skeptical of this and think it's premature optimization,
but if there's really a case for it with performance profiles then I
guess it would be necessary.  But we should definitely get feedback from
the Ceph folks too.

I'd like to hear from Ceph folks what their position on kernel rbd vs
librados is.  Which one do they recommend for QEMU guests and what are the
pros/cons?

CCed Sage and Josh

Stefan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20  9:46     ` Stefan Hajnoczi
@ 2013-06-20 14:25       ` Alex Bligh
  2013-06-21 12:55         ` Stefan Hajnoczi
  2013-06-20 15:58       ` Sage Weil
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Bligh @ 2013-06-20 14:25 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh, sage

Stefan,

--On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

>> The concrete problem here is that flashcache/dm-cache/bcache don't
>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>> cache access to block devices (in the host layer), and with rbd
>> (for instance) there is no access to a block device at all. block/rbd.c
>> simply calls librbd which calls librados etc.
>>
>> So the context switches etc. I am avoiding are the ones that would
>> be introduced by using kernel rbd devices rather than librbd.
>
> I understand the limitations with kernel block devices - their
> setup/teardown is an extra step outside QEMU and privileges need to be
> managed.  That basically means you need to use a management tool like
> libvirt to make it usable.

It's not just the management tool (we have one of those). Kernel
devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs,
etc. by hostile guests can cause issues.

> But I don't understand the performance angle here.  Do you have profiles
> that show kernel rbd is a bottleneck due to context switching?

I don't have test figures - perhaps this is just received wisdom, but I'd
understood that's why they were faster.

> We use the kernel page cache for -drive file=test.img,cache=writeback
> and no one has suggested reimplementing the page cache inside QEMU for
> better performance.

That's true, but I'd argue that is a little different because nothing
blocks on the page cache (it being in RAM). You don't get the situation
where the task sleeps awaiting data (from the page cache), the data
arrives, and the task then needs to be scheduled in. I will admit
to a degree of handwaving here as I hadn't realised the claim qemu+rbd
was more efficient than qemu+blockdevice+kernelrbd was controversial.

> Also, how do you want to manage QEMU page cache with multiple guests
> running?  They are independent and know nothing about each other.  Their
> process memory consumption will be bloated and the kernel memory
> management will end up having to sort out who gets to stay in physical
> memory.

I don't think that one's an issue. Currently QEMU processes with
cache=writeback contend for physical memory via the page cache. I'm
not changing that bit. I'm proposing allocating SSD (rather than
RAM) for cache, so if anything that should reduce RAM use as it
will be quicker to flush the cache to 'disk' (the second layer
of caching). I was proposing allocating each task a fixed amount
of SSD space.

In terms of how this is done, one way would be to mmap a large
file on SSD, which would mean the page cache used would be
whatever page cache is used for the SSD. You've got more control
over this (with madvise etc) than you have with aio I think.
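
As a purely illustrative sketch of that approach (not qemu code; error
handling and the cache index are omitted, and the file name is made
up), the mechanism would be roughly:

  /* Sketch only: map a cache file on the SSD and treat stores into the
   * mapping as cache writes; a guest flush/FUA becomes an msync(). */
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define CACHE_SIZE (1024ULL * 1024 * 1024)  /* 1 GiB of SSD-backed cache */

  int main(void)
  {
      int fd = open("/ssd/qemu-cache.img", O_RDWR | O_CREAT, 0600);
      ftruncate(fd, CACHE_SIZE);

      void *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

      madvise(cache, CACHE_SIZE, MADV_RANDOM);  /* access pattern hint  */

      memcpy(cache, "dirty data", 10);          /* a cached guest write */
      msync(cache, CACHE_SIZE, MS_SYNC);        /* guest flush -> msync */

      munmap(cache, CACHE_SIZE);
      close(fd);
      return 0;
  }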

> You can see I'm skeptical of this

Which is no bad thing!

> and think it's premature optimization,

... and I'm only too keen to avoid work if it brings no gain.

> but if there's really a case for it with performance profiles then I
> guess it would be necessary.  But we should definitely get feedback from
> the Ceph folks too.

The specific problem we are trying to solve (in case that's not
obvious) is the non-locality of data read/written by ceph. Whilst
you can use placement to localise data to the rack level, even if
one of your OSDs is in the machine you end up waiting on network
traffic. That is apparently hard to solve inside Ceph.

However, this would be applicable to sheepdog, gluster, nfs,
the internal iscsi initiator, etc. etc. rather than just to Ceph.

I'm also keen to hear from the Ceph guys, as if they have a way of
keeping lots of reads and writes in the box without crossing the
network, I'd be only too keen to use that.

-- 
Alex Bligh


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20  9:46     ` Stefan Hajnoczi
  2013-06-20 14:25       ` Alex Bligh
@ 2013-06-20 15:58       ` Sage Weil
  2013-06-21 11:18         ` Alex Bligh
                           ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Sage Weil @ 2013-06-20 15:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh

On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> > The concrete problem here is that flashcache/dm-cache/bcache don't
> > work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
> > cache access to block devices (in the host layer), and with rbd
> > (for instance) there is no access to a block device at all. block/rbd.c
> > simply calls librbd which calls librados etc.
> > 
> > So the context switches etc. I am avoiding are the ones that would
> > be introduced by using kernel rbd devices rather than librbd.
> 
> I understand the limitations with kernel block devices - their
> setup/teardown is an extra step outside QEMU and privileges need to be
> managed.  That basically means you need to use a management tool like
> libvirt to make it usable.
> 
> But I don't understand the performance angle here.  Do you have profiles
> that show kernel rbd is a bottleneck due to context switching?
> 
> We use the kernel page cache for -drive file=test.img,cache=writeback
> and no one has suggested reimplementing the page cache inside QEMU for
> better performance.
> 
> Also, how do you want to manage QEMU page cache with multiple guests
> running?  They are independent and know nothing about each other.  Their
> process memory consumption will be bloated and the kernel memory
> management will end up having to sort out who gets to stay in physical
> memory.
> 
> You can see I'm skeptical of this and think it's premature optimization,
> but if there's really a case for it with performance profiles then I
> guess it would be necessary.  But we should definitely get feedback from
> the Ceph folks too.
> 
> I'd like to hear from Ceph folks what their position on kernel rbd vs
> librados is.  Why one do they recommend for QEMU guests and what are the
> pros/cons?

I agree that a flashcache/bcache-like persistent cache would be a big win 
for qemu + rbd users.  

There are a few important issues with librbd vs kernel rbd:

 * librbd tends to get new features more quickly than the kernel rbd 
   (although now that layering has landed in 3.10 this will be less 
   painful than it was).

 * Using kernel rbd means users need bleeding edge kernels, a non-starter 
   for many orgs that are still running things like RHEL.  Bug fixes are 
   difficult to roll out, etc.

 * librbd has an in-memory cache that behaves similar to an HDD's cache 
   (e.g., it forces writeback on flush).  This improves performance 
   significantly for many workloads.  Of course, having a bcache-like 
   layer mitigates this..
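
   For reference, that in-memory cache is turned on from ceph.conf with
   something like the following (illustrative values only; option names
   and defaults vary between releases):

     [client]
         rbd cache = true
         rbd cache size = 33554432        # in-memory cache per image
         rbd cache max dirty = 25165824   # writeback starts above this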

I'm not really sure what the best path forward is.  Putting the 
functionality in qemu would benefit lots of other storage backends, 
putting it in librbd would capture various other librbd users (xen, tgt, 
and future users like hyper-v), and using new kernels works today but 
creates a lot of friction for operations.

sage


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20 15:58       ` Sage Weil
@ 2013-06-21 11:18         ` Alex Bligh
  2013-06-21 15:40           ` Sage Weil
  2013-06-21 13:20         ` Stefan Hajnoczi
  2013-06-21 15:18         ` Liu Yuan
  2 siblings, 1 reply; 15+ messages in thread
From: Alex Bligh @ 2013-06-21 11:18 UTC (permalink / raw)
  To: Sage Weil, Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh

Sage,

--On 20 June 2013 08:58:19 -0700 Sage Weil <sage@inktank.com> wrote:

>> I'd like to hear from Ceph folks what their position on kernel rbd vs
>> librados is.  Why one do they recommend for QEMU guests and what are the
>> pros/cons?
>
> I agree that a flashcache/bcache-like persistent cache would be a big win
> for qemu + rbd users.

Great.

I think Stefan was really after testing my received wisdom that
ceph+librbd gives better performance than ceph+blkdev+kernelrbd
(even without the persistent cache), and if so why.

> There are few important issues with librbd vs kernel rbd:
>
>  * librbd tends to get new features more quickly that the kernel rbd
>    (although now that layering has landed in 3.10 this will be less
>    painful than it was).
>
>  * Using kernel rbd means users need bleeding edge kernels, a non-starter
>    for many orgs that are still running things like RHEL.  Bug fixes are
>    difficult to roll out, etc.
>
>  * librbd has an in-memory cache that behaves similar to an HDD's cache
>    (e.g., it forces writeback on flush).  This improves performance
>    significantly for many workloads.  Of course, having a bcache-like
>    layer mitigates this..
>
> I'm not really sure what the best path forward is.  Putting the
> functionality in qemu would benefit lots of other storage backends,
> putting it in librbd would capture various other librbd users (xen, tgt,
> and future users like hyper-v), and using new kernels works today but
> creates a lot of friction for operations.

To be honest I'd not even thought of putting it in librbd (which might
be simpler). I suspect it might be easier to get patches into librbd
than into qemu, and that ensuring cache coherency might be simpler.
If I get time to look at this, would you be interested in taking patches
for this?

-- 
Alex Bligh


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20 14:25       ` Alex Bligh
@ 2013-06-21 12:55         ` Stefan Hajnoczi
  2013-06-21 13:54           ` Alex Bligh
  2013-06-21 15:45           ` Sage Weil
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Hajnoczi @ 2013-06-21 12:55 UTC (permalink / raw)
  To: Alex Bligh; +Cc: Josh Durgin, qemu-devel, sage

On Thu, Jun 20, 2013 at 4:25 PM, Alex Bligh <alex@alex.org.uk> wrote:
> Stefan,
>
>
> --On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
>>> The concrete problem here is that flashcache/dm-cache/bcache don't
>>> work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
>>> cache access to block devices (in the host layer), and with rbd
>>> (for instance) there is no access to a block device at all. block/rbd.c
>>> simply calls librbd which calls librados etc.
>>>
>>> So the context switches etc. I am avoiding are the ones that would
>>> be introduced by using kernel rbd devices rather than librbd.
>>
>>
>> I understand the limitations with kernel block devices - their
>> setup/teardown is an extra step outside QEMU and privileges need to be
>> managed.  That basically means you need to use a management tool like
>> libvirt to make it usable.
>
>
> It's not just the management tool (we have one of those). Kernel
> devices are pain. As a trivial example, duplication of UUIDs, LVM IDs
> etc. by hostile guests can cause issues.

If you have those problems then something is wrong:

LVM definitely shouldn't be scanning guest devices.

As for disk UUIDs, they come from the SCSI target which is under your
control, right?  In fact, you can assign different serial numbers to
drives attached in QEMU; the host serial number will not be used.
Therefore, there is a clean separation there and guests do not control
host UUIDs.

The one true weakness here is that Linux reads partition tables
automatically.  Not sure if there's a way to turn it off or how hard
it would be to add that.

>> But I don't understand the performance angle here.  Do you have profiles
>> that show kernel rbd is a bottleneck due to context switching?
>
>
> I don't have test figures - perhaps this is just received wisdom, but I'd
> understood that's why they were faster.
>
>
>> We use the kernel page cache for -drive file=test.img,cache=writeback
>> and no one has suggested reimplementing the page cache inside QEMU for
>> better performance.
>
>
> That's true, but I'd argue that is a little different because nothing
> blocks on the page cache (it being in RAM). You don't get the situation
> where the tasks sleeps awaiting data (from the page cache), the data
> arrives, and the task then needs to to be scheduled in. I will admit
> to a degree of handwaving here as I hadn't realised the claim qemu+rbd
> was more efficient than qemu+blockdevice+kernelrbd was controversial.

It may or may not be more efficient; without some performance
analysis we don't know how big the difference is, or why.

>> but if there's really a case for it with performance profiles then I
>> guess it would be necessary.  But we should definitely get feedback from
>> the Ceph folks too.
>
>
> The specific problem we are trying to solve (in case that's not
> obvious) is the non-locality of data read/written by ceph. Whilst
> you can use placement to localise data to the rack level, even if
> one of your OSDs is in the machine you end up waiting on network
> traffic. That is apparently hard to solve inside Ceph.

I'm not up-to-speed on Ceph architecture: is this because you need to
visit a metadata server before you access the storage?  Even when the
data is colocated on the same machine, you'll need to ask the metadata
server first?

Stefan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20 15:58       ` Sage Weil
  2013-06-21 11:18         ` Alex Bligh
@ 2013-06-21 13:20         ` Stefan Hajnoczi
  2013-06-21 15:18         ` Liu Yuan
  2 siblings, 0 replies; 15+ messages in thread
From: Stefan Hajnoczi @ 2013-06-21 13:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, qemu-devel, Alex Bligh

On Thu, Jun 20, 2013 at 5:58 PM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> I'm not really sure what the best path forward is.  Putting the
> functionality in qemu would benefit lots of other storage backends,
> putting it in librbd would capture various other librbd users (xen, tgt,
> and future users like hyper-v), and using new kernels works today but
> creates a lot of friction for operations.

If this is a common Ceph performance bottleneck, then librbd is a good
place to put it.

I believe you don't have to go over the external network with Gluster
or pNFS if data is colocated on the same host, because they talk
directly to the storage node.  (There are probably configurations
where this isn't possible though.)

Stefan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-21 12:55         ` Stefan Hajnoczi
@ 2013-06-21 13:54           ` Alex Bligh
  2013-06-21 15:45           ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Alex Bligh @ 2013-06-21 13:54 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Josh Durgin, qemu-devel, Alex Bligh, sage

Stefan,

--On 21 June 2013 14:55:20 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

>>> I understand the limitations with kernel block devices - their
>>> setup/teardown is an extra step outside QEMU and privileges need to be
>>> managed.  That basically means you need to use a management tool like
>>> libvirt to make it usable.
>>
>>
>> It's not just the management tool (we have one of those). Kernel
>> devices are pain. As a trivial example, duplication of UUIDs, LVM IDs
>> etc. by hostile guests can cause issues.
>
> If you have those problems then something is wrong:
>
> LVM shouldn't definitely not be scanning guest devices.
>
> As for disk UUIDs, they come from the SCSI target which is under your
> control, right?  In fact, you can assign different serial numbers to
> drives attached in QEMU, the host serial number will not be used.
> Therefore, there is a clean separation there and guests do not control
> host UUIDs.
>
> The one true weakness here is that Linux reads partition tables
> automatically.  Not sure if there's a way to turn it off or how hard
> it would be to add that.

Most things are work-roundable, but the whole thing is 'fail open'.
See
 http://lwn.net/Articles/474067/
for example (not the greatest example, as I've failed to find
the lwn.net article that talked about malicious disk labels).

When a disk is inserted (guest disk mounted), its partition table
gets scanned, and various other stuff happens from udev triggers,
based on the UUID of the disk (I believe the relevant UUID is
actually stored in the file system), and the UUID/label on the GPT.
LVM scanning is also done by default, as is dm stuff. The
same problem happens (in theory) with disk labels. Yes, you
can disable this, but making (e.g.) dm and lvm work on attached
scsi disks but not iscsi disks, in general, when you don't know
the iscsi or scsi vendor, is non-trivial (yes, I've done it).
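
(For what it's worth, the sort of thing needed is an lvm.conf filter
along the lines below. The regexes are illustrative only; which device
names carry guest data differs from host to host.)

  devices {
      # reject devices that carry guest data, accept everything else
      filter = [ "r|^/dev/rbd|", "r|^/dev/disk/by-path/ip-.*-iscsi-.*|", "a|.*|" ]
  }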

I have not found a way yet to avoid reading the partition table at
all (which would be useful).

There used to be other problems when iscsi was used in anger. One,
for instance, is that the default iscsi client scans the scsi
bus (normally unnecessarily) at the drop of a hat. Even if you
know all information about what you are mounting, it scans it.
This leads to an O(n^2) problem starting VMs - several minutes.
Again, I know how to fix this - if you are interested:
  https://github.com/abligh/open-iscsi/tree/add-no-scanning-option

All this is solved by using the inbuilt iscsi client.

>> That's true, but I'd argue that is a little different because nothing
>> blocks on the page cache (it being in RAM). You don't get the situation
>> where the tasks sleeps awaiting data (from the page cache), the data
>> arrives, and the task then needs to to be scheduled in. I will admit
>> to a degree of handwaving here as I hadn't realised the claim qemu+rbd
>> was more efficient than qemu+blockdevice+kernelrbd was controversial.
>
> It may or may not be more efficient, unless there is some performance
> analysis we don't know how big a difference and why.

Sure. I hope Sage comes back on this one.

>>> but if there's really a case for it with performance profiles then I
>>> guess it would be necessary.  But we should definitely get feedback from
>>> the Ceph folks too.
>>
>>
>> The specific problem we are trying to solve (in case that's not
>> obvious) is the non-locality of data read/written by ceph. Whilst
>> you can use placement to localise data to the rack level, even if
>> one of your OSDs is in the machine you end up waiting on network
>> traffic. That is apparently hard to solve inside Ceph.
>
> I'm not up-to-speed on Ceph architecture, is this because you need to
> visit a metadata server before you access the storage.  Even when the
> data is colocated on the same machine you'll need to ask the metadata
> server first?

Well, Sage would be the expert, but I understand the problem is
simpler than that. Firstly, in order to mark the write as complete
it has to be written to at least a quorum of OSDs, and a quorum is
larger than one. Hence at least one write is non-local. Secondly,
Ceph's placement group feature does not (so $inktank guy told me)
work well for localising at the level of particular servers; so
even if somehow you made ceph happy with writing just one replica
and saying it was done (and doing the rest in the background), you'd
be hard pressed to ensure the first replica written was always
(or nearly always) on a local spindle. Hence my idea of adding a
layer in front which would acknowledge the write - even if
a flush/fua had come in - on the basis that it had been written to
persistent storage and can be recovered after a host reboot, and
then go and sort the rest out in the background.
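
A rough sketch of that write path (pure illustration: the helpers
marked hypothetical do not exist anywhere, and this is not qemu or
librbd code), just to show the ordering:

  /* Acknowledge the guest once the write is persistent locally; a
   * background thread pushes it to the cluster afterwards. */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/types.h>
  #include <unistd.h>

  struct cache {
      int journal_fd;              /* journal file on the local SSD */
      bool stopping;
  };

  /* hypothetical helpers */
  void journal_append(struct cache *c, const void *buf, size_t len, off_t off);
  void mark_dirty(struct cache *c, off_t off, size_t len);
  void writeback_oldest_dirty_range(struct cache *c);

  int guest_write(struct cache *c, const void *buf, size_t len, off_t off)
  {
      journal_append(c, buf, len, off);   /* land it on the SSD first    */
      fdatasync(c->journal_fd);           /* survive a host crash        */
      mark_dirty(c, off, len);
      return 0;                           /* ack now, even for flush/FUA */
  }

  void *writeback_thread(void *arg)
  {
      struct cache *c = arg;
      while (!c->stopping)
          writeback_oldest_dirty_range(c); /* push to the cluster later  */
      return NULL;
  }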

-- 
Alex Bligh


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-20 15:58       ` Sage Weil
  2013-06-21 11:18         ` Alex Bligh
  2013-06-21 13:20         ` Stefan Hajnoczi
@ 2013-06-21 15:18         ` Liu Yuan
  2013-06-24  9:31           ` Stefan Hajnoczi
  2 siblings, 1 reply; 15+ messages in thread
From: Liu Yuan @ 2013-06-21 15:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: Stefan Hajnoczi, qemu-devel, Alex Bligh, josh.durgin

On 06/20/2013 11:58 PM, Sage Weil wrote:
> [...]
> 
> I agree that a flashcache/bcache-like persistent cache would be a big win 
> for qemu + rbd users.  
> 
> There are few important issues with librbd vs kernel rbd:
> 
>  * librbd tends to get new features more quickly that the kernel rbd 
>    (although now that layering has landed in 3.10 this will be less 
>    painful than it was).
> 
>  * Using kernel rbd means users need bleeding edge kernels, a non-starter 
>    for many orgs that are still running things like RHEL.  Bug fixes are 
>    difficult to roll out, etc.
> 
>  * librbd has an in-memory cache that behaves similar to an HDD's cache 
>    (e.g., it forces writeback on flush).  This improves performance 
>    significantly for many workloads.  Of course, having a bcache-like 
>    layer mitigates this..
> 
> I'm not really sure what the best path forward is.  Putting the 
> functionality in qemu would benefit lots of other storage backends, 
> putting it in librbd would capture various other librbd users (xen, tgt, 
> and future users like hyper-v), and using new kernels works today but 
> creates a lot of friction for operations.
> 

I think I can share some implementation details about a persistent cache
for guests because 1) Sheepdog has a persistent object cache exactly as
Alex described, 2) Sheepdog and Ceph's RADOS both provide volumes on top
of an object store, and 3) Sheepdog chose a persistent cache on local
disk while Ceph chose an in-memory cache approach.

The main motivation of the object cache is to reduce network traffic and
improve performance; the cache can be seen as a hard disk's internal
write cache, which modern kernels support well.

For a background introduction, Sheepdog's object cache works similarly
to the kernel's page cache, except that we cache a 4M object of a volume
on disk while the kernel caches a 4k page of a file in memory. We use an
LRU list per volume for reclaim and a dirty list to track dirty objects
for writeback. We always read ahead a whole object if it is not cached.
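
In rough C terms (an illustration only, not Sheepdog's actual code or
type names), the per-volume bookkeeping looks like this:

  /* Per-volume cache bookkeeping: an LRU list for reclaim and a dirty
   * list for writeback, tracking 4M objects cached on local disk. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <sys/queue.h>

  struct cached_object {
      uint64_t idx;                       /* which 4M object of the volume */
      bool dirty;
      TAILQ_ENTRY(cached_object) lru;     /* position in per-volume LRU    */
      TAILQ_ENTRY(cached_object) dirtyq;  /* linked only while dirty       */
  };

  struct volume_cache {
      TAILQ_HEAD(, cached_object) lru_list;    /* reclaim candidates       */
      TAILQ_HEAD(, cached_object) dirty_list;  /* awaiting writeback       */
      uint64_t cached_bytes;                   /* triggers reclaim         */
  };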

The benefits of a disk cache over a memory cache, in my opinion, are
1) VMs get smoother performance because the cache doesn't consume memory
(if memory is at its high water mark, the latency of guest IO will be
very high).
2) smaller memory requirements, leaving all the memory to the guest
3) objects from a base image can be shared by all its children snapshots
& clones
4) a more efficient reclaim algorithm, because the sheep daemon knows
better than the kernel's dm-cache/bcache/flashcache.
5) it can easily take advantage of an SSD as a cache backend

There are no migration problems for sheepdog with the client cache
because we can release the cache during migration.

If QEMU had a persistent cache built into a generic layer, say the block
layer, Sheepdog's object cache code could be removed. There are also
advantages besides code reduction to building this cache into QEMU: for
example, we could teach QEMU to connect to more than one sheep daemon to
get better HA without worrying about cache consistency problems.

I believe sheepdog and RBD can share much of the persistent cache code,
but currently only RBD and Sheepdog use an object store to provide
volumes; other formats/protocols use a file abstraction, so it is hard to
reuse code for them. Maybe we can provide a vfs-like layer to accommodate
all the block storage systems, no matter whether they are on top of an
object store or a file store. This is tough work but worth a try.

Thanks,
Yuan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-21 11:18         ` Alex Bligh
@ 2013-06-21 15:40           ` Sage Weil
  0 siblings, 0 replies; 15+ messages in thread
From: Sage Weil @ 2013-06-21 15:40 UTC (permalink / raw)
  To: Alex Bligh; +Cc: Stefan Hajnoczi, qemu-devel, josh.durgin

On Fri, 21 Jun 2013, Alex Bligh wrote:
> Sage,
> 
> --On 20 June 2013 08:58:19 -0700 Sage Weil <sage@inktank.com> wrote:
> 
> > > I'd like to hear from Ceph folks what their position on kernel rbd vs
> > > librados is.  Why one do they recommend for QEMU guests and what are the
> > > pros/cons?
> > 
> > I agree that a flashcache/bcache-like persistent cache would be a big win
> > for qemu + rbd users.
> 
> Great.
> 
> I think Stefan was really after testing my received wisdom that
> ceph+librbd will be greater performance than ceph+blkdev+kernelrbd
> (even without the persistent cache), and if so why.

Oh, right.  At this point the performance differential is strictly related 
to the cache behavior.  If there were feature parity, I would not expect 
any significant difference.  There may be a net difference of a data copy, 
but I'm not sure it will be significant.

> > There are few important issues with librbd vs kernel rbd:
> > 
> >  * librbd tends to get new features more quickly that the kernel rbd
> >    (although now that layering has landed in 3.10 this will be less
> >    painful than it was).
> > 
> >  * Using kernel rbd means users need bleeding edge kernels, a non-starter
> >    for many orgs that are still running things like RHEL.  Bug fixes are
> >    difficult to roll out, etc.
> > 
> >  * librbd has an in-memory cache that behaves similar to an HDD's cache
> >    (e.g., it forces writeback on flush).  This improves performance
> >    significantly for many workloads.  Of course, having a bcache-like
> >    layer mitigates this..
> > 
> > I'm not really sure what the best path forward is.  Putting the
> > functionality in qemu would benefit lots of other storage backends,
> > putting it in librbd would capture various other librbd users (xen, tgt,
> > and future users like hyper-v), and using new kernels works today but
> > creates a lot of friction for operations.
> 
> To be honest I'd not even thought of putting it in librbd (which might
> be simpler). I suspect it might be easier to get patches into librbd
> than into qemu, and that ensuring cache coherency might be simpler.
> If I get time to look at this, would you be interested in taking patches
> for this?

Certainly!  It will be a bit tricky to integrate in a lightweight way, 
however, so I would be sure to sketch out a design before diving too far 
into the coding.  I suspect the best path forward would be to extend the 
ObjectCacher.  This has the added benefit that ceph-fuse and libcephfs 
could benefit as well.

The dev summit for the next release (emperor) is coming up in a few 
weeks.. this would be a great project to submit a blueprint for so we can 
discuss it then.

sage


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-21 12:55         ` Stefan Hajnoczi
  2013-06-21 13:54           ` Alex Bligh
@ 2013-06-21 15:45           ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2013-06-21 15:45 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Josh Durgin, qemu-devel, Alex Bligh

On Fri, 21 Jun 2013, Stefan Hajnoczi wrote:
> >> but if there's really a case for it with performance profiles then I
> >> guess it would be necessary.  But we should definitely get feedback from
> >> the Ceph folks too.
> >
> >
> > The specific problem we are trying to solve (in case that's not
> > obvious) is the non-locality of data read/written by ceph. Whilst
> > you can use placement to localise data to the rack level, even if
> > one of your OSDs is in the machine you end up waiting on network
> > traffic. That is apparently hard to solve inside Ceph.
> 
> I'm not up-to-speed on Ceph architecture, is this because you need to
> visit a metadata server before you access the storage.  Even when the
> data is colocated on the same machine you'll need to ask the metadata
> server first?

The data location is determined by a fancy hash function, so there is no 
metadata lookup step and the client can directly contact the right server.  
The trade-off is that the client doesn't get to choose where to write--the 
hash deterministically tells us that based on the object name and current 
cluster state.
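
(A toy illustration of the "no lookup step" point, and nothing like the
real CRUSH calculation: given only the object name and some shared
cluster state, every client computes the same placement.)

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t toy_hash(const char *s)          /* FNV-1a, for show */
  {
      uint32_t h = 2166136261u;
      while (*s)
          h = (h ^ (uint8_t)*s++) * 16777619u;
      return h;
  }

  int main(void)
  {
      const int num_osds = 12;                     /* pretend cluster map */
      const char *obj = "rbd_data.1234.0000000000000007";
      printf("object %s -> osd.%u\n", obj, toy_hash(obj) % num_osds);
      return 0;
  }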

In the end this means there is some lower bound on latency because we are 
reading/writing over the network...

sage


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-21 15:18         ` Liu Yuan
@ 2013-06-24  9:31           ` Stefan Hajnoczi
  2013-06-24 10:25             ` Alex Bligh
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2013-06-24  9:31 UTC (permalink / raw)
  To: Liu Yuan; +Cc: josh.durgin, Sage Weil, Alex Bligh, qemu-devel

On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> [...]
> I think I can share some implementation details about persistent cache
> for guest because 1) Sheepdog has a persistent object-oriented cache as
> exactly what Alex described 2) Sheepdog and Ceph's RADOS both provide
> volumes on top of object store. 3) Sheepdog choose a persistent cache on
> local disk while Ceph choose a in memory cache approach.
> 
> The main motivation of object cache is to reduce network traffic and
> improve performance and the cache can be seen as a hard disk' internal
> write cache, which modern kernels support well.
> 
> For a background introduction, Sheepdog's object cache works similar to
> kernel's page cache, except that we cache a 4M object of a volume in
> disk while kernel cache 4k page of a file in memory. We use LRU list per
> volume to do reclaim and dirty list to track dirty objects for
> writeback. We always readahead a whole object if not cached.
> 
> The benefit of a disk cache over a memory cache, in my option, is
> 1) VM get a more smooth performance because cache don't consume memory
> (if memory is on high water mark, the latency of guest IO will be very
> high).
> 2) smaller memory requirement and leave all the memory to guest
> 3) objects from base can be shared by all its children snapshots & clone
> 4) more efficient reclaim algorithm because sheep daemon knows better
> than kernel's dm-cache/bcacsh/flashcache.
> 5) can easily take advantage of SSD as a cache backend

It sounds like the cache is in the sheep daemon and therefore has a
global view of all volumes being accessed from this host.  That way it
can do things like share the cached snapshot data between volumes.

This is what I was pointing out about putting the cache in QEMU - you
only know about this QEMU process, not all volumes being accessed from
this host.

Even if Ceph and Sheepdog don't share code, it sounds like they have a
lot in common and it's worth looking at the Sheepdog cache before adding
one to Ceph.

Stefan


* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
  2013-06-24  9:31           ` Stefan Hajnoczi
@ 2013-06-24 10:25             ` Alex Bligh
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Bligh @ 2013-06-24 10:25 UTC (permalink / raw)
  To: Stefan Hajnoczi, Liu Yuan; +Cc: josh.durgin, Sage Weil, Alex Bligh, qemu-devel

Stefan,

--On 24 June 2013 11:31:35 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

> It sounds like the cache is in the sheep daemon and therefore has a
> global view of all volumes being accessed from this host.  That way it
> can do things like share the cached snapshot data between volumes.

Yes, that's potentially interesting. I think you'd only need to
share data for read-only objects. Read caching (at least of read
only objects) could also persist beyond the life of one qemu process.
So stopping qemu and restarting it could be done with a hot cache.

However, you can't do that for writeable objects, as your write
back cache needs to be fully written out before another qemu
process can be launched referencing the same object, and moreover
writeable objects are not in general shared between multiple
processes anyway (if we ignore corner cases like OCFS2 on RBD).

I had assumed (because of some earlier work I did on a different 
distributed storage project) that the main advantage would be the 
write-back caching rather than the read caching; that might be untrue with 
multiple guests running the same image.

> This is what I was pointing out about putting the cache in QEMU - you
> only know about this QEMU process, not all volumes being accessed from
> this host.
>
> Even if Ceph and Sheepdog don't share code, it sounds like they have a
> lot in common and it's worth looking at the Sheepdog cache before adding
> one to Ceph.

One problem there is that Ceph is in C++ while Sheepdog is in C.

Another is that sheepdog has a daemon running on the client which
can keep state etc. as qemu comes and goes. Ceph just has a library.

-- 
Alex Bligh

