* [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-04-01 13:21 UTC
To: qemu-devel; +Cc: Alex Bligh

I'd like to experiment with adding a persistent writeback cache to qemu. The use case here is where non-local storage is used (e.g. rbd, ceph) using the qemu drivers, together with a local cache as a file on a much faster locally mounted device, for instance an SSD (possibly replicated). This would, I think, give a similar performance boost to using an rbd block device plus flashcache/dm-cache/bcache, but without introducing all the context switches and limitations of having to use real block devices. I appreciate it would need to be live-migration aware (worst case solution: flush and turn off caching during live migrate), and ideally be capable of replaying a dirty writeback cache in the event the host crashes.

Is there any support for this already? Has anyone worked on this before? If not, would there be any interest in it?

-- 
Alex Bligh
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi @ 2013-04-11 9:25 UTC
To: Alex Bligh; +Cc: qemu-devel

On Mon, Apr 01, 2013 at 01:21:45PM +0000, Alex Bligh wrote:
> I'd like to experiment with adding a persistent writeback cache to qemu. The use case here is where non-local storage is used (e.g. rbd, ceph) using the qemu drivers, together with a local cache as a file on a much faster locally mounted device, for instance an SSD (possibly replicated). This would, I think, give a similar performance boost to using an rbd block device plus flashcache/dm-cache/bcache, but without introducing all the context switches and limitations of having to use real block devices. I appreciate it would need to be live-migration aware (worst case solution: flush and turn off caching during live migrate), and ideally be capable of replaying a dirty writeback cache in the event the host crashes.
>
> Is there any support for this already? Has anyone worked on this before? If not, would there be any interest in it?

I'm concerned about the complexity this would introduce in QEMU. Therefore I'm a fan of using existing solutions like the Linux block layer instead of reimplementing this stuff in QEMU.

What concrete issues are there with using rbd plus flashcache/dm-cache/bcache?

I'm not sure I understand the context switch problem, since implementing it in user space will still require system calls to do all the actual cache I/O.

Stefan
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-06-19 21:28 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, Alex Bligh

Stefan,

--On 11 April 2013 11:25:48 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

>> I'd like to experiment with adding a persistent writeback cache to qemu. The use case here is where non-local storage is used (e.g. rbd, ceph) using the qemu drivers, together with a local cache as a file on a much faster locally mounted device, for instance an SSD (possibly replicated). This would, I think, give a similar performance boost to using an rbd block device plus flashcache/dm-cache/bcache, but without introducing all the context switches and limitations of having to use real block devices. I appreciate it would need to be live-migration aware (worst case solution: flush and turn off caching during live migrate), and ideally be capable of replaying a dirty writeback cache in the event the host crashes.
>>
>> Is there any support for this already? Has anyone worked on this before? If not, would there be any interest in it?
>
> I'm concerned about the complexity this would introduce in QEMU. Therefore I'm a fan of using existing solutions like the Linux block layer instead of reimplementing this stuff in QEMU.
>
> What concrete issues are there with using rbd plus flashcache/dm-cache/bcache?
>
> I'm not sure I understand the context switch problem, since implementing it in user space will still require system calls to do all the actual cache I/O.

I failed to see your reply and got distracted from this. Apologies. So several months later ...

The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.

So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.

I had planned to introduce this as a sort of layer on top of any existing block device handler; I believe they are layered at the moment.

-- 
Alex Bligh
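As a rough illustration of the layering Alex describes, a cache layer sitting in front of an arbitrary existing block backend might be shaped roughly like the sketch below. The interface is hypothetical and heavily simplified (it is not QEMU's actual BlockDriver API), and the block size, bitmaps and sizing are assumptions for illustration only.

    /* Hypothetical, heavily simplified interface: this is NOT QEMU's real
     * BlockDriver API, just the shape of a pass-through caching layer that
     * sits between the guest-visible device and an existing backend such as
     * block/rbd.c.  Block size, bitmaps and sizing are assumptions. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLK_SHIFT  12u                  /* cache in 4 KiB blocks (assumption) */
    #define MAX_BLOCKS (1u << 20)           /* enough for a 4 GiB disk in this sketch */

    struct backend {
        int (*pread)(void *opaque, uint64_t off, void *buf, size_t len);
        int (*pwrite)(void *opaque, uint64_t off, const void *buf, size_t len);
        int (*flush)(void *opaque);
        void *opaque;
    };

    struct cache_layer {
        struct backend *below;              /* e.g. the librbd-backed driver */
        struct backend *local;              /* file on a local SSD */
        bool valid[MAX_BLOCKS];             /* block present in the local cache? */
        bool dirty[MAX_BLOCKS];             /* block newer locally than 'below'? */
    };

    /* Reads are served locally on a hit; misses go down to the backend and
     * populate the cache on the way back.  (The sketch assumes block-aligned
     * requests that do not cross a block boundary.) */
    static int cache_pread(struct cache_layer *c, uint64_t off, void *buf, size_t len)
    {
        uint64_t blk = off >> BLK_SHIFT;

        if (c->valid[blk])
            return c->local->pread(c->local->opaque, off, buf, len);

        int r = c->below->pread(c->below->opaque, off, buf, len);
        if (r == 0 && c->local->pwrite(c->local->opaque, off, buf, len) == 0)
            c->valid[blk] = true;
        return r;
    }

    /* Writes land in the local cache and are acknowledged; dirty blocks are
     * destaged to 'below' later by a background flusher (not shown). */
    static int cache_pwrite(struct cache_layer *c, uint64_t off, const void *buf, size_t len)
    {
        uint64_t blk = off >> BLK_SHIFT;

        int r = c->local->pwrite(c->local->opaque, off, buf, len);
        if (r == 0) {
            c->valid[blk] = true;
            c->dirty[blk] = true;
        }
        return r;
    }

Because the layer only uses the generic read/write/flush operations of whatever sits below it, it would in principle work for rbd, sheepdog, iscsi or any other backend, which is the point Alex makes later in the thread.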
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi @ 2013-06-20 9:46 UTC
To: Alex Bligh; +Cc: josh.durgin, qemu-devel, sage

On Wed, Jun 19, 2013 at 10:28:53PM +0100, Alex Bligh wrote:
> --On 11 April 2013 11:25:48 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> >> I'd like to experiment with adding a persistent writeback cache to qemu. The use case here is where non-local storage is used (e.g. rbd, ceph) using the qemu drivers, together with a local cache as a file on a much faster locally mounted device, for instance an SSD (possibly replicated). This would, I think, give a similar performance boost to using an rbd block device plus flashcache/dm-cache/bcache, but without introducing all the context switches and limitations of having to use real block devices. I appreciate it would need to be live-migration aware (worst case solution: flush and turn off caching during live migrate), and ideally be capable of replaying a dirty writeback cache in the event the host crashes.
> >>
> >> Is there any support for this already? Has anyone worked on this before? If not, would there be any interest in it?
> >
> > I'm concerned about the complexity this would introduce in QEMU. Therefore I'm a fan of using existing solutions like the Linux block layer instead of reimplementing this stuff in QEMU.
> >
> > What concrete issues are there with using rbd plus flashcache/dm-cache/bcache?
> >
> > I'm not sure I understand the context switch problem, since implementing it in user space will still require system calls to do all the actual cache I/O.
>
> I failed to see your reply and got distracted from this. Apologies. So several months later ...

Happens to me sometimes too ;-).

> The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
>
> So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.

I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.

But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?

We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.

Also, how do you want to manage the QEMU page cache with multiple guests running? They are independent and know nothing about each other. Their process memory consumption will be bloated and the kernel memory management will end up having to sort out who gets to stay in physical memory.

You can see I'm skeptical of this and think it's premature optimization, but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.

I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?

CCed Sage and Josh

Stefan
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-06-20 14:25 UTC
To: Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh, sage

Stefan,

--On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

>> The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
>>
>> So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.
>
> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.

It's not just the management tool (we have one of those). Kernel devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs etc. by hostile guests can cause issues.

> But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?

I don't have test figures - perhaps this is just received wisdom, but I'd understood that's why they were faster.

> We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.

That's true, but I'd argue that is a little different because nothing blocks on the page cache (it being in RAM). You don't get the situation where the task sleeps awaiting data (from the page cache), the data arrives, and the task then needs to be scheduled in. I will admit to a degree of handwaving here, as I hadn't realised the claim that qemu+rbd was more efficient than qemu+blockdevice+kernelrbd was controversial.

> Also, how do you want to manage the QEMU page cache with multiple guests running? They are independent and know nothing about each other. Their process memory consumption will be bloated and the kernel memory management will end up having to sort out who gets to stay in physical memory.

I don't think that one's an issue. Currently QEMU processes with cache=writeback contend for physical memory via the page cache. I'm not changing that bit. I'm proposing allocating SSD (rather than RAM) for cache, so if anything that should reduce RAM use, as it will be quicker to flush the cache to 'disk' (the second layer of caching). I was proposing allocating each task a fixed amount of SSD space.

In terms of how this is done, one way would be to mmap a large file on SSD, which would mean the page cache used would be whatever page cache is used for the SSD. You've got more control over this (with madvise etc.) than you have with aio, I think.

> You can see I'm skeptical of this

Which is no bad thing!

> and think it's premature optimization,

... and I'm only too keen to avoid work if it brings no gain.

> but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.

The specific problem we are trying to solve (in case that's not obvious) is the non-locality of data read/written by ceph. Whilst you can use placement to localise data to the rack level, even if one of your OSDs is in the machine you end up waiting on network traffic. That is apparently hard to solve inside Ceph.

However, this would be applicable to sheepdog, gluster, nfs, the internal iscsi initiator, etc. etc. rather than just to Ceph.

I'm also keen to hear from the Ceph guys, as if they have a way of keeping lots of reads and writes in the box and not crossing the network, I'd be only too keen to use that.

-- 
Alex Bligh
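A minimal sketch of the mmap-backed SSD cache file Alex mentions above could look like the following; the path, cache size and madvise hint are illustrative assumptions, not anything specified in the thread.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/ssd/qemu-cache.img";   /* assumed cache location */
        const size_t cache_size = 1ULL << 30;           /* 1 GiB of cache, for example */

        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, cache_size) < 0) {
            perror("open/ftruncate");
            return 1;
        }

        /* Map the cache file; dirty pages are written back to the SSD by the
         * kernel (or explicitly with msync), so the cache survives a qemu
         * restart, unlike anonymous RAM. */
        void *cache = mmap(NULL, cache_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Access-pattern hint: guest I/O into the cache is essentially random. */
        madvise(cache, cache_size, MADV_RANDOM);

        /* "Write" some guest data into a cache slot... */
        memcpy((char *)cache + 4096, "guest data", 10);

        /* ...and make that region durable before acknowledging a guest flush. */
        if (msync((char *)cache + 4096, 4096, MS_SYNC) < 0) {
            perror("msync");
            return 1;
        }

        munmap(cache, cache_size);
        close(fd);
        return 0;
    }

The design point here is the one Alex raises: the host page cache usage is whatever the kernel chooses to keep for the SSD file, and madvise/msync give the process more explicit control over residency and durability than it would have over an opaque kernel block-device cache.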
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi @ 2013-06-21 12:55 UTC
To: Alex Bligh; +Cc: Josh Durgin, qemu-devel, sage

On Thu, Jun 20, 2013 at 4:25 PM, Alex Bligh <alex@alex.org.uk> wrote:
> Stefan,
>
> --On 20 June 2013 11:46:18 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
>>> The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
>>>
>>> So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.
>>
>> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.
>
> It's not just the management tool (we have one of those). Kernel devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs etc. by hostile guests can cause issues.

If you have those problems then something is wrong:

LVM definitely shouldn't be scanning guest devices.

As for disk UUIDs, they come from the SCSI target, which is under your control, right? In fact, you can assign different serial numbers to drives attached in QEMU; the host serial number will not be used. Therefore, there is a clean separation there and guests do not control host UUIDs.

The one true weakness here is that Linux reads partition tables automatically. Not sure if there's a way to turn it off or how hard it would be to add that.

>> But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?
>
> I don't have test figures - perhaps this is just received wisdom, but I'd understood that's why they were faster.
>
>> We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.
>
> That's true, but I'd argue that is a little different because nothing blocks on the page cache (it being in RAM). You don't get the situation where the task sleeps awaiting data (from the page cache), the data arrives, and the task then needs to be scheduled in. I will admit to a degree of handwaving here, as I hadn't realised the claim that qemu+rbd was more efficient than qemu+blockdevice+kernelrbd was controversial.

It may or may not be more efficient; without some performance analysis we don't know how big the difference is, or why.

>> but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
>
> The specific problem we are trying to solve (in case that's not obvious) is the non-locality of data read/written by ceph. Whilst you can use placement to localise data to the rack level, even if one of your OSDs is in the machine you end up waiting on network traffic. That is apparently hard to solve inside Ceph.

I'm not up to speed on the Ceph architecture: is this because you need to visit a metadata server before you access the storage? Even when the data is colocated on the same machine, you'll need to ask the metadata server first?

Stefan
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-06-21 13:54 UTC
To: Stefan Hajnoczi; +Cc: Josh Durgin, qemu-devel, Alex Bligh, sage

Stefan,

--On 21 June 2013 14:55:20 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

>>> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.
>>
>> It's not just the management tool (we have one of those). Kernel devices are a pain. As a trivial example, duplication of UUIDs, LVM IDs etc. by hostile guests can cause issues.
>
> If you have those problems then something is wrong:
>
> LVM definitely shouldn't be scanning guest devices.
>
> As for disk UUIDs, they come from the SCSI target, which is under your control, right? In fact, you can assign different serial numbers to drives attached in QEMU; the host serial number will not be used. Therefore, there is a clean separation there and guests do not control host UUIDs.
>
> The one true weakness here is that Linux reads partition tables automatically. Not sure if there's a way to turn it off or how hard it would be to add that.

Most things are work-roundable, but the whole thing is 'fail open'. See http://lwn.net/Articles/474067/ for example (not the greatest example, as I've failed to find the lwn.net article that talked about malicious disk labels).

When a disk is inserted (guest disk mounted), its partition table gets scanned, and various other stuff happens from udev triggers, based on the UUID of the disk (I believe the relevant UUID is actually stored in the file system) and the UUID/label on the GPT. lvm scanning is also done by default, as is dm stuff. The same problem happens (in theory) with disk labels. Yes, you can disable this, but making (e.g.) dm and lvm work on attached scsi disks but not iscsi disks, in general, when you don't know the iscsi or scsi vendor, is non-trivial (yes, I've done it). I have not found a way yet to avoid reading the partition table at all (which would be useful).

There are other problems when iscsi is used in anger. One, for instance, is that the default iscsi client scans the scsi bus (normally unnecessarily) at the drop of a hat. Even if you know all the information about what you are mounting, it scans it. This leads to an O(n^2) problem starting VMs - several minutes. Again, I know how to fix this - if you are interested: https://github.com/abligh/open-iscsi/tree/add-no-scanning-option

All this is solved by using the inbuilt iscsi client.

>> That's true, but I'd argue that is a little different because nothing blocks on the page cache (it being in RAM). You don't get the situation where the task sleeps awaiting data (from the page cache), the data arrives, and the task then needs to be scheduled in. I will admit to a degree of handwaving here, as I hadn't realised the claim that qemu+rbd was more efficient than qemu+blockdevice+kernelrbd was controversial.
>
> It may or may not be more efficient; without some performance analysis we don't know how big the difference is, or why.

Sure. I hope Sage comes back on this one.

>>> but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
>>
>> The specific problem we are trying to solve (in case that's not obvious) is the non-locality of data read/written by ceph. Whilst you can use placement to localise data to the rack level, even if one of your OSDs is in the machine you end up waiting on network traffic. That is apparently hard to solve inside Ceph.
>
> I'm not up to speed on the Ceph architecture: is this because you need to visit a metadata server before you access the storage? Even when the data is colocated on the same machine, you'll need to ask the metadata server first?

Well, Sage would be the expert, but I understand the problem is simpler than that. Firstly, in order to mark the write as complete it has to be written to at least a quorum of OSDs, and a quorum is larger than one. Hence at least one write is non-local. Secondly, Ceph's placement group feature does not (so an $inktank guy told me) work well for localising at the level of particular servers; so even if somehow you made ceph happy with writing just one replica and saying it was done (and doing the rest in the background), you'd be hard pressed to ensure the first replica written was always (or nearly always) on a local spindle.

Hence my idea of adding a layer in front which would acknowledge the write - even if a flush/fua had come in - on the basis that it had been written to persistent storage and it can recover on a reboot after this point, then go sort the rest out in the background.

-- 
Alex Bligh
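A minimal sketch of the write path Alex outlines (acknowledge writes and flushes once the data is durable on local SSD, then destage to the cluster in the background, replaying the journal after a crash) might look like the following. The journal layout, file names and the cluster_write() stand-in are assumptions for illustration; a real implementation would use librbd/librados calls and asynchronous I/O.

    /* Sketch only: a tiny write journal on a local SSD.  A write (or flush) is
     * acknowledged to the guest as soon as the record is durable locally; the
     * cluster write happens later, or is replayed after a host crash.
     * cluster_write() is a stand-in for the real backend call (e.g. librbd)
     * and is stubbed out here; names and the record format are assumptions. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    struct journal_rec {
        uint64_t guest_offset;
        uint32_t len;
        uint8_t  data[4096];
    };

    static int journal_fd = -1;

    /* Stand-in for the real network write (librbd, sheepdog, iscsi, ...). */
    static int cluster_write(uint64_t off, const void *buf, uint32_t len)
    {
        (void)off; (void)buf; (void)len;
        return 0;
    }

    int cache_open(const char *path)            /* e.g. a file on the local SSD */
    {
        journal_fd = open(path, O_RDWR | O_CREAT | O_APPEND, 0600);
        return journal_fd < 0 ? -1 : 0;
    }

    /* Guest write: made durable locally, then acknowledged. */
    int guest_write(uint64_t off, const void *buf, uint32_t len)
    {
        struct journal_rec rec = { .guest_offset = off, .len = len };

        if (len > sizeof(rec.data))
            return -1;                          /* sketch assumes small writes */
        memcpy(rec.data, buf, len);

        if (write(journal_fd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec))
            return -1;
        return fdatasync(journal_fd);           /* durable locally: safe to ack */
    }

    /* Guest flush/FUA: local durability is enough, because the journal can be
     * replayed after a crash; nothing needs to be pushed over the network. */
    int guest_flush(void)
    {
        return fdatasync(journal_fd);
    }

    /* Background destage (or crash-recovery replay): walk the journal, push
     * records to the cluster, then empty the journal. */
    int destage_all(void)
    {
        struct journal_rec rec;

        lseek(journal_fd, 0, SEEK_SET);
        while (read(journal_fd, &rec, sizeof(rec)) == (ssize_t)sizeof(rec)) {
            if (cluster_write(rec.guest_offset, rec.data, rec.len) < 0)
                return -1;
        }
        return ftruncate(journal_fd, 0);
    }

A real version would also need to index the journal so that reads of not-yet-destaged data are served from local storage, and to bound the journal size by destaging continuously rather than all at once; those details are omitted here.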
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Sage Weil @ 2013-06-21 15:45 UTC
To: Stefan Hajnoczi; +Cc: Josh Durgin, qemu-devel, Alex Bligh

On Fri, 21 Jun 2013, Stefan Hajnoczi wrote:

> >> but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
> >
> > The specific problem we are trying to solve (in case that's not obvious) is the non-locality of data read/written by ceph. Whilst you can use placement to localise data to the rack level, even if one of your OSDs is in the machine you end up waiting on network traffic. That is apparently hard to solve inside Ceph.
>
> I'm not up to speed on the Ceph architecture: is this because you need to visit a metadata server before you access the storage? Even when the data is colocated on the same machine, you'll need to ask the metadata server first?

The data location is determined by a fancy hash function, so there is no metadata lookup step and the client can directly contact the right server. The trade-off is that the client doesn't get to choose where to write: the hash deterministically tells us that based on the object name and current cluster state.

In the end this means there is some lower bound on latency because we are reading/writing over the network...

sage
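As a rough illustration of the deterministic placement Sage describes, the sketch below shows why no metadata lookup is needed on the I/O path. It is a drastic simplification with a made-up hash, mixing step and cluster map; it is not CRUSH or Ceph code.

    /* Drastically simplified stand-in for CRUSH-style placement: the client
     * computes where an object lives from its name plus the current cluster
     * map, so no metadata server is consulted on the I/O path.  The hash,
     * PG/OSD counts and mixing below are made up for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PGS  256     /* placement groups (assumption) */
    #define NUM_OSDS 12      /* OSDs in the cluster map (assumption) */
    #define REPLICAS 3

    static uint32_t hash_str(const char *s)   /* FNV-1a; Ceph uses its own hashes */
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Object name -> placement group -> ordered list of OSDs.  A real
     * algorithm also avoids picking the same OSD twice and respects failure
     * domains; this sketch does not. */
    static void place(const char *object, uint32_t cluster_epoch, int osds_out[REPLICAS])
    {
        uint32_t pg = hash_str(object) % NUM_PGS;

        for (int r = 0; r < REPLICAS; r++)
            osds_out[r] = (pg * 2654435761u + (uint32_t)r * 40503u + cluster_epoch) % NUM_OSDS;
    }

    int main(void)
    {
        int osds[REPLICAS];

        place("rbd_data.1234.0000000000000007", 42, osds);
        printf("object maps to OSDs %d, %d, %d\n", osds[0], osds[1], osds[2]);
        /* Nothing guarantees any of these OSDs is local to the client, which
         * is exactly the locality problem discussed in this thread. */
        return 0;
    }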
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Sage Weil @ 2013-06-20 15:58 UTC
To: Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh

On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:

> > The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
> >
> > So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.
>
> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.
>
> But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?
>
> We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.
>
> Also, how do you want to manage the QEMU page cache with multiple guests running? They are independent and know nothing about each other. Their process memory consumption will be bloated and the kernel memory management will end up having to sort out who gets to stay in physical memory.
>
> You can see I'm skeptical of this and think it's premature optimization, but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
>
> I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?

I agree that a flashcache/bcache-like persistent cache would be a big win for qemu + rbd users.

There are a few important issues with librbd vs kernel rbd:

* librbd tends to get new features more quickly than kernel rbd (although now that layering has landed in 3.10 this will be less painful than it was).

* Using kernel rbd means users need bleeding-edge kernels, a non-starter for many orgs that are still running things like RHEL. Bug fixes are difficult to roll out, etc.

* librbd has an in-memory cache that behaves similarly to an HDD's cache (e.g., it forces writeback on flush). This improves performance significantly for many workloads. Of course, having a bcache-like layer mitigates this.

I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.

sage
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-06-21 11:18 UTC
To: Sage Weil, Stefan Hajnoczi; +Cc: josh.durgin, qemu-devel, Alex Bligh

Sage,

--On 20 June 2013 08:58:19 -0700 Sage Weil <sage@inktank.com> wrote:

>> I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?
>
> I agree that a flashcache/bcache-like persistent cache would be a big win for qemu + rbd users.

Great.

I think Stefan was really after testing my received wisdom that ceph+librbd will give greater performance than ceph+blkdev+kernelrbd (even without the persistent cache), and if so why.

> There are a few important issues with librbd vs kernel rbd:
>
> * librbd tends to get new features more quickly than kernel rbd (although now that layering has landed in 3.10 this will be less painful than it was).
>
> * Using kernel rbd means users need bleeding-edge kernels, a non-starter for many orgs that are still running things like RHEL. Bug fixes are difficult to roll out, etc.
>
> * librbd has an in-memory cache that behaves similarly to an HDD's cache (e.g., it forces writeback on flush). This improves performance significantly for many workloads. Of course, having a bcache-like layer mitigates this.
>
> I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.

To be honest I'd not even thought of putting it in librbd (which might be simpler). I suspect it might be easier to get patches into librbd than into qemu, and that ensuring cache coherency might be simpler. If I get time to look at this, would you be interested in taking patches for this?

-- 
Alex Bligh
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Sage Weil @ 2013-06-21 15:40 UTC
To: Alex Bligh; +Cc: Stefan Hajnoczi, qemu-devel, josh.durgin

On Fri, 21 Jun 2013, Alex Bligh wrote:

> Sage,
>
> --On 20 June 2013 08:58:19 -0700 Sage Weil <sage@inktank.com> wrote:
>
> > > I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big win for qemu + rbd users.
>
> Great.
>
> I think Stefan was really after testing my received wisdom that ceph+librbd will give greater performance than ceph+blkdev+kernelrbd (even without the persistent cache), and if so why.

Oh, right. At this point the performance differential is strictly related to the cache behavior. If there were feature parity, I would not expect any significant difference. There may be a net difference of a data copy, but I'm not sure it will be significant.

> > There are a few important issues with librbd vs kernel rbd:
> >
> > * librbd tends to get new features more quickly than kernel rbd (although now that layering has landed in 3.10 this will be less painful than it was).
> >
> > * Using kernel rbd means users need bleeding-edge kernels, a non-starter for many orgs that are still running things like RHEL. Bug fixes are difficult to roll out, etc.
> >
> > * librbd has an in-memory cache that behaves similarly to an HDD's cache (e.g., it forces writeback on flush). This improves performance significantly for many workloads. Of course, having a bcache-like layer mitigates this.
> >
> > I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.
>
> To be honest I'd not even thought of putting it in librbd (which might be simpler). I suspect it might be easier to get patches into librbd than into qemu, and that ensuring cache coherency might be simpler. If I get time to look at this, would you be interested in taking patches for this?

Certainly! It will be a bit tricky to integrate in a lightweight way, however, so I would be sure to sketch out a design before diving too far into the coding. I suspect the best path forward would be to extend the ObjectCacher. This has the added benefit that ceph-fuse and libcephfs could benefit as well.

The dev summit for the next release (emperor) is coming up in a few weeks... this would be a great project to submit a blueprint for so we can discuss it then.

sage
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi @ 2013-06-21 13:20 UTC
To: Sage Weil; +Cc: Josh Durgin, qemu-devel, Alex Bligh

On Thu, Jun 20, 2013 at 5:58 PM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.

If this is a common Ceph performance bottleneck, then librbd is a good place to put it.

I believe you don't have to go over the external network with Gluster or pNFS if data is colocated on the same host, because they talk directly to the storage node. (There are probably configurations where this isn't possible though.)

Stefan
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Liu Yuan @ 2013-06-21 15:18 UTC
To: Sage Weil; +Cc: Stefan Hajnoczi, qemu-devel, Alex Bligh, josh.durgin

On 06/20/2013 11:58 PM, Sage Weil wrote:
> On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
>>> The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
>>>
>>> So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.
>>
>> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.
>>
>> But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?
>>
>> We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.
>>
>> Also, how do you want to manage the QEMU page cache with multiple guests running? They are independent and know nothing about each other. Their process memory consumption will be bloated and the kernel memory management will end up having to sort out who gets to stay in physical memory.
>>
>> You can see I'm skeptical of this and think it's premature optimization, but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
>>
>> I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?
>
> I agree that a flashcache/bcache-like persistent cache would be a big win for qemu + rbd users.
>
> There are a few important issues with librbd vs kernel rbd:
>
> * librbd tends to get new features more quickly than kernel rbd (although now that layering has landed in 3.10 this will be less painful than it was).
>
> * Using kernel rbd means users need bleeding-edge kernels, a non-starter for many orgs that are still running things like RHEL. Bug fixes are difficult to roll out, etc.
>
> * librbd has an in-memory cache that behaves similarly to an HDD's cache (e.g., it forces writeback on flush). This improves performance significantly for many workloads. Of course, having a bcache-like layer mitigates this.
>
> I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.

I think I can share some implementation details about a persistent cache for guests, because 1) Sheepdog has a persistent object-oriented cache exactly like what Alex described, 2) Sheepdog and Ceph's RADOS both provide volumes on top of an object store, and 3) Sheepdog uses a persistent cache on local disk while Ceph uses an in-memory cache approach.

The main motivation for the object cache is to reduce network traffic and improve performance; the cache can be seen as a hard disk's internal write cache, which modern kernels support well.

For a background introduction, Sheepdog's object cache works similarly to the kernel's page cache, except that we cache a 4M object of a volume on disk while the kernel caches a 4k page of a file in memory. We use an LRU list per volume for reclaim and a dirty list to track dirty objects for writeback. We always read ahead a whole object if it is not cached.

The benefits of a disk cache over a memory cache, in my opinion, are:
1) the VM gets smoother performance because the cache doesn't consume memory (if memory is at the high water mark, guest IO latency will be very high)
2) a smaller memory requirement, leaving all the memory to the guest
3) objects from a base image can be shared by all its child snapshots & clones
4) a more efficient reclaim algorithm, because the sheep daemon knows better than the kernel's dm-cache/bcache/flashcache
5) it can easily take advantage of an SSD as a cache backend

There are no migration problems for sheepdog with the client cache because we can release the cache on migration.

If QEMU had a persistent cache built into a generic layer, say the block layer, Sheepdog's object cache code could be removed. There are also advantages besides code reduction to building this cache into QEMU: for example, we could teach QEMU to connect to more than one sheep daemon to get better HA without worrying about the cache consistency problem.

I believe sheepdog and RBD can share much of the persistent cache code, but currently only RBD and Sheepdog use an object store to provide volumes; other formats/protocols use a file abstraction, so it is hard to reuse the code for them. Maybe we can provide a vfs-like layer to accommodate all the block storage systems, no matter whether they are on top of an object store or a file store. This is tough work but worth a try.

Thanks,
Yuan
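A rough sketch of the bookkeeping Yuan describes (whole 4M objects cached on local disk, a per-volume LRU for reclaim and a dirty list for writeback) might look like this; the names and structures are illustrative assumptions, not Sheepdog's actual code.

    /* Illustrative sketch of a Sheepdog-style object cache; this is not the
     * sheep daemon's real implementation, just the shape of the bookkeeping
     * described above. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/queue.h>

    #define OBJECT_SIZE (4u << 20)           /* cache granularity: whole 4 MiB objects */

    struct cached_object {
        uint64_t idx;                        /* object index within the volume */
        bool dirty;                          /* needs writeback to the cluster */
        TAILQ_ENTRY(cached_object) lru;      /* per-volume LRU, for reclaim */
        TAILQ_ENTRY(cached_object) dirtyq;   /* per-volume dirty list, for writeback */
    };

    struct volume_cache {
        TAILQ_HEAD(, cached_object) lru_list;   /* most recently used at the head */
        TAILQ_HEAD(, cached_object) dirty_list; /* flushed by a background worker */
        uint64_t cached_objects;
        uint64_t max_objects;                   /* reclaim from the LRU tail beyond this */
    };

    /* Any access (hit) promotes the object to the front of the LRU; on a miss
     * the whole 4 MiB object would be read ahead from the cluster first. */
    static void cache_touch(struct volume_cache *vc, struct cached_object *o)
    {
        TAILQ_REMOVE(&vc->lru_list, o, lru);
        TAILQ_INSERT_HEAD(&vc->lru_list, o, lru);
    }

    /* A guest write dirties the object; background writeback walks dirty_list
     * and pushes whole objects back to the cluster. */
    static void cache_mark_dirty(struct volume_cache *vc, struct cached_object *o)
    {
        if (!o->dirty) {
            o->dirty = true;
            TAILQ_INSERT_TAIL(&vc->dirty_list, o, dirtyq);
        }
        cache_touch(vc, o);
    }

Because the bookkeeping lives in a daemon with a view of every volume on the host (rather than inside one QEMU process), base-image objects can sit on a single LRU and be shared by all clones, which is the point Stefan picks up on below.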
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Stefan Hajnoczi @ 2013-06-24 9:31 UTC
To: Liu Yuan; +Cc: josh.durgin, Sage Weil, Alex Bligh, qemu-devel

On Fri, Jun 21, 2013 at 11:18:07PM +0800, Liu Yuan wrote:
> On 06/20/2013 11:58 PM, Sage Weil wrote:
> > On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
> >>> The concrete problem here is that flashcache/dm-cache/bcache don't work with the rbd (librbd) driver, as flashcache/dm-cache/bcache cache access to block devices (in the host layer), and with rbd (for instance) there is no access to a block device at all. block/rbd.c simply calls librbd, which calls librados, etc.
> >>>
> >>> So the context switches etc. I am avoiding are the ones that would be introduced by using kernel rbd devices rather than librbd.
> >>
> >> I understand the limitations with kernel block devices - their setup/teardown is an extra step outside QEMU and privileges need to be managed. That basically means you need to use a management tool like libvirt to make it usable.
> >>
> >> But I don't understand the performance angle here. Do you have profiles that show kernel rbd is a bottleneck due to context switching?
> >>
> >> We use the kernel page cache for -drive file=test.img,cache=writeback and no one has suggested reimplementing the page cache inside QEMU for better performance.
> >>
> >> Also, how do you want to manage the QEMU page cache with multiple guests running? They are independent and know nothing about each other. Their process memory consumption will be bloated and the kernel memory management will end up having to sort out who gets to stay in physical memory.
> >>
> >> You can see I'm skeptical of this and think it's premature optimization, but if there's really a case for it with performance profiles then I guess it would be necessary. But we should definitely get feedback from the Ceph folks too.
> >>
> >> I'd like to hear from Ceph folks what their position on kernel rbd vs librados is. Which one do they recommend for QEMU guests and what are the pros/cons?
> >
> > I agree that a flashcache/bcache-like persistent cache would be a big win for qemu + rbd users.
> >
> > There are a few important issues with librbd vs kernel rbd:
> >
> > * librbd tends to get new features more quickly than kernel rbd (although now that layering has landed in 3.10 this will be less painful than it was).
> >
> > * Using kernel rbd means users need bleeding-edge kernels, a non-starter for many orgs that are still running things like RHEL. Bug fixes are difficult to roll out, etc.
> >
> > * librbd has an in-memory cache that behaves similarly to an HDD's cache (e.g., it forces writeback on flush). This improves performance significantly for many workloads. Of course, having a bcache-like layer mitigates this.
> >
> > I'm not really sure what the best path forward is. Putting the functionality in qemu would benefit lots of other storage backends, putting it in librbd would capture various other librbd users (xen, tgt, and future users like hyper-v), and using new kernels works today but creates a lot of friction for operations.
>
> I think I can share some implementation details about a persistent cache for guests, because 1) Sheepdog has a persistent object-oriented cache exactly like what Alex described, 2) Sheepdog and Ceph's RADOS both provide volumes on top of an object store, and 3) Sheepdog uses a persistent cache on local disk while Ceph uses an in-memory cache approach.
>
> The main motivation for the object cache is to reduce network traffic and improve performance; the cache can be seen as a hard disk's internal write cache, which modern kernels support well.
>
> For a background introduction, Sheepdog's object cache works similarly to the kernel's page cache, except that we cache a 4M object of a volume on disk while the kernel caches a 4k page of a file in memory. We use an LRU list per volume for reclaim and a dirty list to track dirty objects for writeback. We always read ahead a whole object if it is not cached.
>
> The benefits of a disk cache over a memory cache, in my opinion, are:
> 1) the VM gets smoother performance because the cache doesn't consume memory (if memory is at the high water mark, guest IO latency will be very high)
> 2) a smaller memory requirement, leaving all the memory to the guest
> 3) objects from a base image can be shared by all its child snapshots & clones
> 4) a more efficient reclaim algorithm, because the sheep daemon knows better than the kernel's dm-cache/bcache/flashcache
> 5) it can easily take advantage of an SSD as a cache backend

It sounds like the cache is in the sheep daemon and therefore has a global view of all volumes being accessed from this host. That way it can do things like share the cached snapshot data between volumes.

This is what I was pointing out about putting the cache in QEMU - you only know about this QEMU process, not all volumes being accessed from this host.

Even if Ceph and Sheepdog don't share code, it sounds like they have a lot in common and it's worth looking at the Sheepdog cache before adding one to Ceph.

Stefan
* Re: [Qemu-devel] Adding a persistent writeback cache to qemu
From: Alex Bligh @ 2013-06-24 10:25 UTC
To: Stefan Hajnoczi, Liu Yuan; +Cc: josh.durgin, Sage Weil, Alex Bligh, qemu-devel

Stefan,

--On 24 June 2013 11:31:35 +0200 Stefan Hajnoczi <stefanha@gmail.com> wrote:

> It sounds like the cache is in the sheep daemon and therefore has a global view of all volumes being accessed from this host. That way it can do things like share the cached snapshot data between volumes.

Yes, that's potentially interesting. I think you'd only need to share data for read-only objects.

Read caching (at least of read-only objects) could also persist beyond the life of one qemu process. So stopping qemu and restarting it could be done with a hot cache.

However, you can't do that for writeable objects, as your writeback cache needs to be fully written out before another qemu process can be launched referencing the same object; moreover, writeable objects are not in general shared between multiple processes anyway (if we ignore corner cases like OCFS2 on RBD).

I had assumed (because of some earlier work I did on a different distributed storage project) that the main advantage would be the write-back caching rather than the read caching; that might be untrue with multiple guests running the same image.

> This is what I was pointing out about putting the cache in QEMU - you only know about this QEMU process, not all volumes being accessed from this host.
>
> Even if Ceph and Sheepdog don't share code, it sounds like they have a lot in common and it's worth looking at the Sheepdog cache before adding one to Ceph.

One problem there is that Ceph is in C++ and Sheepdog is in C. Another is that sheepdog has a daemon running on the client which can keep state etc. as qemu comes and goes. Ceph just has a library.

-- 
Alex Bligh