qemu-devel.nongnu.org archive mirror
* [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
@ 2010-10-21 14:07 Kevin Wolf
  2010-10-21 15:07 ` Anthony Liguori
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kevin Wolf @ 2010-10-21 14:07 UTC (permalink / raw)
  To: Christian Brunner, Laurent Vivier, MORITA Kazutaka; +Cc: Qemu-devel

Hi all,

I'm currently looking into adding a return value to qemu's bdrv_flush
function and I noticed that your block drivers (nbd, rbd and sheepdog)
don't implement bdrv_flush at all. bdrv_flush is going to return
-ENOTSUP for any block driver not implementing this, effectively
breaking these three drivers for anything but cache=unsafe.

Is there a specific reason why your drivers don't implement this? I
think I remember that one of the drivers always provides
cache=writethrough semantics. It would be okay to silently "upgrade" to
cache=writethrough, so in this case I'd just need to add an empty
bdrv_flush implementation.

Otherwise, we really cannot allow any option except cache=unsafe because
that's the semantics provided by the driver.

In any case, I think it would be a good idea to implement a real
bdrv_flush function to allow the write-back cache modes cache=off and
cache=writeback in order to improve performance over writethrough.

Is this possible with your protocols, or can the protocol be changed to
consider this? Any hints on how to proceed?

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-21 14:07 [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog Kevin Wolf
@ 2010-10-21 15:07 ` Anthony Liguori
  2010-10-21 19:32   ` Laurent Vivier
  2010-10-22  5:43 ` MORITA Kazutaka
       [not found] ` <AANLkTikHAm7opg1TzUrUWis53ENT_z6DjfT9GPeBdqA0@mail.gmail.com>
  2 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-10-21 15:07 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Qemu-devel, Christian Brunner, MORITA Kazutaka, Laurent Vivier

On 10/21/2010 09:07 AM, Kevin Wolf wrote:
> Hi all,
>
> I'm currently looking into adding a return value to qemu's bdrv_flush
> function and I noticed that your block drivers (nbd, rbd and sheepdog)
> don't implement bdrv_flush at all. bdrv_flush is going to return
> -ENOTSUP for any block driver not implementing this, effectively
> breaking these three drivers for anything but cache=unsafe.
>
> Is there a specific reason why your drivers don't implement this?

NBD doesn't have a notion of flush; it only supports read/write.  The
block-nbd implementation doesn't do write-caching, so a flush would be
a nop.

I'm not sure what the right semantics would be for QEMU.  My guess is a 
nop flush.

Regards,

Anthony Liguori

>   I
> think I remember that one of the drivers always provides
> cache=writethrough semantics. It would be okay to silently "upgrade" to
> cache=writethrough, so in this case I'd just need to add an empty
> bdrv_flush implementation.
>
> Otherwise, we really cannot allow any option except cache=unsafe because
> that's the semantics provided by the driver.
>
> In any case, I think it would be a good idea to implement a real
> bdrv_flush function to allow the write-back cache modes cache=off and
> cache=writeback in order to improve performance over writethrough.
>
> Is this possible with your protocols, or can the protocol be changed to
> consider this? Any hints on how to proceed?
>
> Kevin
>
>    

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-21 15:07 ` Anthony Liguori
@ 2010-10-21 19:32   ` Laurent Vivier
  2010-10-22  8:29     ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2010-10-21 19:32 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Qemu-devel

Le jeudi 21 octobre 2010 à 10:07 -0500, Anthony Liguori a écrit :
> On 10/21/2010 09:07 AM, Kevin Wolf wrote:
> > Hi all,
> >
> > I'm currently looking into adding a return value to qemu's bdrv_flush
> > function and I noticed that your block drivers (nbd, rbd and sheepdog)
> > don't implement bdrv_flush at all. bdrv_flush is going to return
> > -ENOTSUP for any block driver not implementing this, effectively
> > breaking these three drivers for anything but cache=unsafe.
> >
> > Is there a specific reason why your drivers don't implement this?
> 
> NBD doesn't have a notion of flush.  Only read/write and the block-nbd 
> implementation doesn't do write-caching so flush would be a nop.
> 
> I'm not sure what the right semantics would be for QEMU.  My guess is a 
> nop flush.

I agree.

Regards,
Laurent

> Regards,
> 
> Anthony Liguori
> 
> >   I
> > think I remember that one of the drivers always provides
> > cache=writethrough semantics. It would be okay to silently "upgrade" to
> > cache=writethrough, so in this case I'd just need to add an empty
> > bdrv_flush implementation.
> >
> > Otherwise, we really cannot allow any option except cache=unsafe because
> > that's the semantics provided by the driver.
> >
> > In any case, I think it would be a good idea to implement a real
> > bdrv_flush function to allow the write-back cache modes cache=off and
> > cache=writeback in order to improve performance over writethrough.
> >
> > Is this possible with your protocols, or can the protocol be changed to
> > consider this? Any hints on how to proceed?
> >
> > Kevin
> >
> >    
> 

-- 
--------------------- laurent@vivier.eu ----------------------
"Tout ce qui est impossible reste à accomplir"    Jules Verne
"Things are only impossible until they're not" Jean-Luc Picard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel]  bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-21 14:07 [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog Kevin Wolf
  2010-10-21 15:07 ` Anthony Liguori
@ 2010-10-22  5:43 ` MORITA Kazutaka
  2010-10-22  8:47   ` Kevin Wolf
       [not found] ` <AANLkTikHAm7opg1TzUrUWis53ENT_z6DjfT9GPeBdqA0@mail.gmail.com>
  2 siblings, 1 reply; 15+ messages in thread
From: MORITA Kazutaka @ 2010-10-22  5:43 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Qemu-devel, Christian Brunner, MORITA Kazutaka, Laurent Vivier

At Thu, 21 Oct 2010 16:07:28 +0200,
Kevin Wolf wrote:
> 
> Hi all,
> 
> I'm currently looking into adding a return value to qemu's bdrv_flush
> function and I noticed that your block drivers (nbd, rbd and sheepdog)
> don't implement bdrv_flush at all. bdrv_flush is going to return
> -ENOTSUP for any block driver not implementing this, effectively
> breaking these three drivers for anything but cache=unsafe.
> 
> Is there a specific reason why your drivers don't implement this? I
> think I remember that one of the drivers always provides
> cache=writethrough semantics. It would be okay to silently "upgrade" to
> cache=writethrough, so in this case I'd just need to add an empty
> bdrv_flush implementation.
> 
> Otherwise, we really cannot allow any option except cache=unsafe because
> that's the semantics provided by the driver.
> 
> In any case, I think it would be a good idea to implement a real
> bdrv_flush function to allow the write-back cache modes cache=off and
> cache=writeback in order to improve performance over writethrough.
> 
> Is this possible with your protocols, or can the protocol be changed to
> consider this? Any hints on how to proceed?
> 

It is a bit difficult to implement an effective bdrv_flush in the
sheepdog block driver.  Sheepdog virtual disks are split up and
distributed across all cluster servers, so the block driver needs to
send flush requests to all of them.  I'm not sure this would improve
performance over writethrough semantics.

So I think it is better to support only writethrough semantics for now
(I'll modify the sheepdog server code to open stored objects with
O_SYNC or O_DIRECT) and leave write-back semantics as future work.

Thanks,

Kazutaka

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-21 19:32   ` Laurent Vivier
@ 2010-10-22  8:29     ` Kevin Wolf
  2010-10-22 12:58       ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-10-22  8:29 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: Qemu-devel

Am 21.10.2010 21:32, schrieb Laurent Vivier:
> Le jeudi 21 octobre 2010 à 10:07 -0500, Anthony Liguori a écrit :
>> On 10/21/2010 09:07 AM, Kevin Wolf wrote:
>>> Hi all,
>>>
>>> I'm currently looking into adding a return value to qemu's bdrv_flush
>>> function and I noticed that your block drivers (nbd, rbd and sheepdog)
>>> don't implement bdrv_flush at all. bdrv_flush is going to return
>>> -ENOTSUP for any block driver not implementing this, effectively
>>> breaking these three drivers for anything but cache=unsafe.
>>>
>>> Is there a specific reason why your drivers don't implement this?
>>
>> NBD doesn't have a notion of flush.  Only read/write and the block-nbd 
>> implementation doesn't do write-caching so flush would be a nop.
>>
>> I'm not sure what the right semantics would be for QEMU.  My guess is a 
>> nop flush.
> 
> I agree.

Of course, as Laurent said a while ago, there is no specification for
NBD, so it's hard to say what the intended semantics is.

However, I did have a look at the nbd-server code, and it looks as if
it implements something similar to writethrough (namely an fsync after
each write) only if configured that way on the server side.  qemu-nbd
defaults to writethrough, but can be configured to use cache=none.  So
with either server, qemu as a client can't tell whether the data is
safe on disk or not.

In my book this is a strong argument for refusing to open nbd
connections with anything but cache=unsafe.

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fwd: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
       [not found]   ` <Pine.LNX.4.64.1010211155301.18946@cobra.newdream.net>
@ 2010-10-22  8:39     ` Kevin Wolf
  2010-10-22 16:22       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-10-22  8:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Christian Brunner, Qemu-devel

[ Adding qemu-devel to CC again ]

Am 21.10.2010 20:59, schrieb Sage Weil:
> On Thu, 21 Oct 2010, Christian Brunner wrote:
>> Hi,
>>
>> is there a flush operation in librados? - I guess the only way to
>> handle this, would be waiting until all aio requests are finished?

That's not the semantics of bdrv_flush, you don't need to wait for
running requests. You just need to make sure that all completed requests
are safe on disk so that they would persist even in case of a
crash/power failure.

> There is no flush currently.  But librados does no caching, so in this 
> case at least silently upgrading to cache=writethrough should work.

You're making sure that the data can't be cached in the server's page
cache or volatile disk cache either, e.g. by using O_SYNC for the image
file? If so, upgrading would be safe.

> If that's a problem, we can implement a flush.  Just let us know.

Presumably providing a writeback mode with explicit flushes could
improve performance. Upgrading to writethrough is not a correctness
problem, though, so it's your decision if you want to implement it.

Kevin

>> ---------- Forwarded message ----------
>> From: Kevin Wolf <kwolf@redhat.com>
>> Date: 2010/10/21
>> Subject: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
>> To: Christian Brunner <chb@muc.de>, Laurent Vivier
>> <Laurent@vivier.eu>, MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
>> Cc: Qemu-devel@nongnu.org
>>
>>
>> Hi all,
>>
>> I'm currently looking into adding a return value to qemu's bdrv_flush
>> function and I noticed that your block drivers (nbd, rbd and sheepdog)
>> don't implement bdrv_flush at all. bdrv_flush is going to return
>> -ENOTSUP for any block driver not implementing this, effectively
>> breaking these three drivers for anything but cache=unsafe.
>>
>> Is there a specific reason why your drivers don't implement this? I
>> think I remember that one of the drivers always provides
>> cache=writethrough semantics. It would be okay to silently "upgrade" to
>> cache=writethrough, so in this case I'd just need to add an empty
>> bdrv_flush implementation.
>>
>> Otherwise, we really cannot allow any option except cache=unsafe because
>> that's the semantics provided by the driver.
>>
>> In any case, I think it would be a good idea to implement a real
>> bdrv_flush function to allow the write-back cache modes cache=off and
>> cache=writeback in order to improve performance over writethrough.
>>
>> Is this possible with your protocols, or can the protocol be changed to
>> consider this? Any hints on how to proceed?
>>
>> Kevin
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel]  bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22  5:43 ` MORITA Kazutaka
@ 2010-10-22  8:47   ` Kevin Wolf
  2010-10-25  5:31     ` MORITA Kazutaka
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-10-22  8:47 UTC (permalink / raw)
  To: MORITA Kazutaka; +Cc: Qemu-devel, Christian Brunner, Laurent Vivier

Am 22.10.2010 07:43, schrieb MORITA Kazutaka:
> At Thu, 21 Oct 2010 16:07:28 +0200,
> Kevin Wolf wrote:
>>
>> Hi all,
>>
>> I'm currently looking into adding a return value to qemu's bdrv_flush
>> function and I noticed that your block drivers (nbd, rbd and sheepdog)
>> don't implement bdrv_flush at all. bdrv_flush is going to return
>> -ENOTSUP for any block driver not implementing this, effectively
>> breaking these three drivers for anything but cache=unsafe.
>>
>> Is there a specific reason why your drivers don't implement this? I
>> think I remember that one of the drivers always provides
>> cache=writethrough semantics. It would be okay to silently "upgrade" to
>> cache=writethrough, so in this case I'd just need to add an empty
>> bdrv_flush implementation.
>>
>> Otherwise, we really cannot allow any option except cache=unsafe because
>> that's the semantics provided by the driver.
>>
>> In any case, I think it would be a good idea to implement a real
>> bdrv_flush function to allow the write-back cache modes cache=off and
>> cache=writeback in order to improve performance over writethrough.
>>
>> Is this possible with your protocols, or can the protocol be changed to
>> consider this? Any hints on how to proceed?
>>
> 
> It is a bit difficult to implement an effective bdrv_flush in the
> sheepdog block driver.  Sheepdog virtual disks are split and
> distributed to all cluster servers, so the block driver needs to send
> flush requests to all of them.  I'm not sure this could improve
> performance more than writethrough semantics.

It could probably be optimized so that you only send flush requests to
servers that have actually received write requests since the last flush.

But yes, that's probably a valid point. I guess there's only one way to
find out how it performs: Trying it out.

> So I think it is better to support only writethrough semantics
> > currently (I'll modify the sheepdog server code to open stored objects
> with O_SYNC or O_DIRECT) and leave write-back semantics as a future
> work.

I agree, that makes sense.

Note that O_DIRECT does not provide write-through semantics. It bypasses
the page cache, but it doesn't flush other caches like a volatile disk
write cache. If you want to use it, you still need explicit flushes or
O_DIRECT | O_SYNC.

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22  8:29     ` Kevin Wolf
@ 2010-10-22 12:58       ` Anthony Liguori
  2010-10-22 13:35         ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-10-22 12:58 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Laurent Vivier, Qemu-devel

On 10/22/2010 03:29 AM, Kevin Wolf wrote:
>> I agree.
>>      
> Of course, as Laurent said a while ago, there is no specification for
> NBD, so it's hard to say what the intended semantics is.
>
> However, I did have a look at the nbdserver code and it looks as if it
> implements something similar to writethrough (namely fsync after each
> write) only if configured this way on the server side. qemu-nbd defaults
> to writethrough, but can be configured to use cache=none. So with either
> server qemu as a client can't tell whether the data is safe on disk or not.
>
> In my book this is a strong argument for refusing to open nbd
> connections with anything but cache=unsafe.
>    

On a physical system, if you don't have a battery backed disk and you 
enable the WC on your disk, then even with cache=writethrough we're unsafe.

Likewise, if you mount your filesystem with barrier=0, QEMU is unsafe.

QEMU can't guarantee safety.  The underlying storage needs to be 
configured correctly.  As long as we're not introducing caching within 
QEMU, I don't think we should assume we're unsafe.

Do we have any place where we can add docs on a per-block format basis?  
It would be good to at least mention for each block device how the 
backing storage needed to be configured for safety.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22 12:58       ` Anthony Liguori
@ 2010-10-22 13:35         ` Kevin Wolf
  2010-10-22 13:45           ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-10-22 13:35 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Laurent Vivier, Qemu-devel

Am 22.10.2010 14:58, schrieb Anthony Liguori:
> On 10/22/2010 03:29 AM, Kevin Wolf wrote:
>>> I agree.
>>>      
>> Of course, as Laurent said a while ago, there is no specification for
>> NBD, so it's hard to say what the intended semantics is.
>>
>> However, I did have a look at the nbdserver code and it looks as if it
>> implements something similar to writethrough (namely fsync after each
>> write) only if configured this way on the server side. qemu-nbd defaults
>> to writethrough, but can be configured to use cache=none. So with either
>> server qemu as a client can't tell whether the data is safe on disk or not.
>>
>> In my book this is a strong argument for refusing to open nbd
>> connections with anything but cache=unsafe.
>>    
> 
> On a physical system, if you don't have a battery backed disk and you 
> enable the WC on your disk, then even with cache=writethrough we're unsafe.

I don't think that's right. O_SYNC should guarantee that the volatile
disk cache is flushed.

> Likewise, if you mount your filesystem with barrier=0, QEMU is unsafe.

Yeah, if you do something equivalent to cache=unsafe on a lower layer,
then qemu can't do much about it. Maybe you can apply the same argument
to NBD, even though it's unsafe by default.

> QEMU can't guarantee safety.  The underlying storage needs to be 
> configured correctly.  As long as we're not introducing caching within 
> QEMU, I don't think we should assume we're unsafe.
> 
> Do we have any place where we can add docs on a per-block format basis?  
> It would be good to at least mention for each block device how the 
> backing storage needed to be configured for safety.

docs/block-protocols.txt?

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22 13:35         ` Kevin Wolf
@ 2010-10-22 13:45           ` Anthony Liguori
  2010-10-22 13:57             ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2010-10-22 13:45 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Laurent Vivier, Qemu-devel

On 10/22/2010 08:35 AM, Kevin Wolf wrote:
> Am 22.10.2010 14:58, schrieb Anthony Liguori:
>    
>> On 10/22/2010 03:29 AM, Kevin Wolf wrote:
>>      
>>>> I agree.
>>>>
>>>>          
>>> Of course, as Laurent said a while ago, there is no specification for
>>> NBD, so it's hard to say what the intended semantics is.
>>>
>>> However, I did have a look at the nbdserver code and it looks as if it
>>> implements something similar to writethrough (namely fsync after each
>>> write) only if configured this way on the server side. qemu-nbd defaults
>>> to writethrough, but can be configured to use cache=none. So with either
>>> server qemu as a client can't tell whether the data is safe on disk or not.
>>>
>>> In my book this is a strong argument for refusing to open nbd
>>> connections with anything but cache=unsafe.
>>>
>>>        
>> On a physical system, if you don't have a battery backed disk and you
>> enable the WC on your disk, then even with cache=writethrough we're unsafe.
>>      
> I don't think that's right. O_SYNC should guarantee that the volatile
> disk cache is flushed.
>    

If your filesystem does the right thing, which an awful lot of them
don't today.

>> Likewise, if you mount your filesystem with barrier=0, QEMU is unsafe.
>>      
> Yeah, if you do something equivalent to cache=unsafe on a lower layer,
> then qemu can't do much about it. Maybe you can apply the same argument
> to NBD, even though it's unsafe by default.
>
>    
>> QEMU can't guarantee safety.  The underlying storage needs to be
>> configured correctly.  As long as we're not introducing caching within
>> QEMU, I don't think we should assume we're unsafe.
>>
>> Do we have any place where we can add docs on a per-block format basis?
>> It would be good to at least mention for each block device how the
>> backing storage needed to be configured for safety.
>>      
> docs/block-protocols.txt?
>    

Maybe docs/block/<name>.txt?  Would be a good home for the qed spec too.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22 13:45           ` Anthony Liguori
@ 2010-10-22 13:57             ` Kevin Wolf
  2010-10-22 14:01               ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-10-22 13:57 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Laurent Vivier, Qemu-devel

Am 22.10.2010 15:45, schrieb Anthony Liguori:
>>> On a physical system, if you don't have a battery backed disk and you
>>> enable the WC on your disk, then even with cache=writethrough we're unsafe.
>>>      
>> I don't think that's right. O_SYNC should guarantee that the volatile
>> disk cache is flushed.
>>    
> 
> If your filesystem does the right thing which an awful lot of them don't 
> today.

The list of really relevant filesystems is rather short, though.

>>> Likewise, if you mount your filesystem with barrier=0, QEMU is unsafe.
>>>      
>> Yeah, if you do something equivalent to cache=unsafe on a lower layer,
>> then qemu can't do much about it. Maybe you can apply the same argument
>> to NBD, even though it's unsafe by default.
>>
>>    
>>> QEMU can't guarantee safety.  The underlying storage needs to be
>>> configured correctly.  As long as we're not introducing caching within
>>> QEMU, I don't think we should assume we're unsafe.
>>>
>>> Do we have any place where we can add docs on a per-block format basis?
>>> It would be good to at least mention for each block device how the
>>> backing storage needed to be configured for safety.
>>>      
>> docs/block-protocols.txt?
>>    
> 
> Maybe docs/block/<name>.txt?  Would be a good home for the qed spec too.

I think spec and documentation for users should be kept separate. I
thought that's the reason why docs/specs/ exists.

And if you exclude specs, I'm not sure if there's a lot left to say for
each format. Having ten files under docs/block/ which consist of two
lines each would be ridiculous. If contrary to my expectations we
actually do have content for it, docs/block/<name>.txt works for me as well.

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22 13:57             ` Kevin Wolf
@ 2010-10-22 14:01               ` Anthony Liguori
  0 siblings, 0 replies; 15+ messages in thread
From: Anthony Liguori @ 2010-10-22 14:01 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Laurent Vivier, Qemu-devel

On 10/22/2010 08:57 AM, Kevin Wolf wrote:
> Am 22.10.2010 15:45, schrieb Anthony Liguori:
>    
>>>> On a physical system, if you don't have a battery backed disk and you
>>>> enable the WC on your disk, then even with cache=writethrough we're unsafe.
>>>>
>>>>          
>>> I don't think that's right. O_SYNC should guarantee that the volatile
>>> disk cache is flushed.
>>>
>>>        
>> If your filesystem does the right thing which an awful lot of them don't
>> today.
>>      
> The list of really relevant filesystems is rather short, though.
>
>    
>>>> Likewise, if you mount your filesystem with barrier=0, QEMU is unsafe.
>>>>
>>>>          
>>> Yeah, if you do something equivalent to cache=unsafe on a lower layer,
>>> then qemu can't do much about it. Maybe you can apply the same argument
>>> to NBD, even though it's unsafe by default.
>>>
>>>
>>>        
>>>> QEMU can't guarantee safety.  The underlying storage needs to be
>>>> configured correctly.  As long as we're not introducing caching within
>>>> QEMU, I don't think we should assume we're unsafe.
>>>>
>>>> Do we have any place where we can add docs on a per-block format basis?
>>>> It would be good to at least mention for each block device how the
>>>> backing storage needed to be configured for safety.
>>>>
>>>>          
>>> docs/block-protocols.txt?
>>>
>>>        
>> Maybe docs/block/<name>.txt?  Would be a good home for the qed spec too.
>>      
> I think spec and documentation for users should be kept separate. I
> thought that's the reason why docs/specs/ exists.
>
> And if you exclude specs, I'm not sure if there's a lot left to say for
> each format. Having ten files under docs/block/ which consist of two
> lines each would be ridiculous. If contrary to my expectations we
> actually do have content for it, docs/block/<name>.txt works for me as well.
>    

Okay, sounds reasonable.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Fwd: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22  8:39     ` Fwd: " Kevin Wolf
@ 2010-10-22 16:22       ` Sage Weil
  2010-10-25  7:58         ` Kevin Wolf
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2010-10-22 16:22 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: ceph-devel, Christian Brunner, Qemu-devel

On Fri, 22 Oct 2010, Kevin Wolf wrote:
> [ Adding qemu-devel to CC again ]
> 
> Am 21.10.2010 20:59, schrieb Sage Weil:
> > On Thu, 21 Oct 2010, Christian Brunner wrote:
> >> Hi,
> >>
> >> is there a flush operation in librados? - I guess the only way to
> >> handle this, would be waiting until all aio requests are finished?
> 
> That's not the semantics of bdrv_flush, you don't need to wait for
> running requests. You just need to make sure that all completed requests
> are safe on disk so that they would persist even in case of a
> crash/power failure.

Okay, in that case we're fine.  librados doesn't declare a write committed 
until it is safely on disk on multiple backend nodes.  There is a 
mechanism to get an ack sooner, but the qemu storage driver does not use 
it.  

> > There is no flush currently.  But librados does no caching, so in this 
> > case at least silently upgrading to cache=writethrough should work.
> 
> You're making sure that the data can't be cached in the server's page
> cache or volatile disk cache either, e.g. by using O_SYNC for the image
> file? If so, upgrading would be safe.

Right.

> > If that's a problem, we can implement a flush.  Just let us know.
> 
> Presumably providing a writeback mode with explicit flushes could
> improve performance. Upgrading to writethrough is not a correctness
> problem, though, so it's your decision if you want to implement it.

So is a bdrv_flush generated when e.g. the guest filesystem issues a 
barrier, or would otherwise normally ask a SATA disk to flush its cache?

sage



> Kevin
> 
> >> ---------- Forwarded message ----------
> >> From: Kevin Wolf <kwolf@redhat.com>
> >> Date: 2010/10/21
> >> Subject: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
> >> To: Christian Brunner <chb@muc.de>, Laurent Vivier
> >> <Laurent@vivier.eu>, MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> >> Cc: Qemu-devel@nongnu.org
> >>
> >>
> >> Hi all,
> >>
> >> I'm currently looking into adding a return value to qemu's bdrv_flush
> >> function and I noticed that your block drivers (nbd, rbd and sheepdog)
> >> don't implement bdrv_flush at all. bdrv_flush is going to return
> >> -ENOTSUP for any block driver not implementing this, effectively
> >> breaking these three drivers for anything but cache=unsafe.
> >>
> >> Is there a specific reason why your drivers don't implement this? I
> >> think I remember that one of the drivers always provides
> >> cache=writethrough semantics. It would be okay to silently "upgrade" to
> >> cache=writethrough, so in this case I'd just need to add an empty
> >> bdrv_flush implementation.
> >>
> >> Otherwise, we really cannot allow any option except cache=unsafe because
> >> that's the semantics provided by the driver.
> >>
> >> In any case, I think it would be a good idea to implement a real
> >> bdrv_flush function to allow the write-back cache modes cache=off and
> >> cache=writeback in order to improve performance over writethrough.
> >>
> >> Is this possible with your protocols, or can the protocol be changed to
> >> consider this? Any hints on how to proceed?
> >>
> >> Kevin
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel]  bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22  8:47   ` Kevin Wolf
@ 2010-10-25  5:31     ` MORITA Kazutaka
  0 siblings, 0 replies; 15+ messages in thread
From: MORITA Kazutaka @ 2010-10-25  5:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Laurent Vivier, Qemu-devel, MORITA Kazutaka, Christian Brunner

At Fri, 22 Oct 2010 10:47:44 +0200,
Kevin Wolf wrote:
> 
> Am 22.10.2010 07:43, schrieb MORITA Kazutaka:
> > At Thu, 21 Oct 2010 16:07:28 +0200,
> > Kevin Wolf wrote:
> >>
> >> Hi all,
> >>
> >> I'm currently looking into adding a return value to qemu's bdrv_flush
> >> function and I noticed that your block drivers (nbd, rbd and sheepdog)
> >> don't implement bdrv_flush at all. bdrv_flush is going to return
> >> -ENOTSUP for any block driver not implementing this, effectively
> >> breaking these three drivers for anything but cache=unsafe.
> >>
> >> Is there a specific reason why your drivers don't implement this? I
> >> think I remember that one of the drivers always provides
> >> cache=writethrough semantics. It would be okay to silently "upgrade" to
> >> cache=writethrough, so in this case I'd just need to add an empty
> >> bdrv_flush implementation.
> >>
> >> Otherwise, we really cannot allow any option except cache=unsafe because
> >> that's the semantics provided by the driver.
> >>
> >> In any case, I think it would be a good idea to implement a real
> >> bdrv_flush function to allow the write-back cache modes cache=off and
> >> cache=writeback in order to improve performance over writethrough.
> >>
> >> Is this possible with your protocols, or can the protocol be changed to
> >> consider this? Any hints on how to proceed?
> >>
> > 
> > It is a bit difficult to implement an effective bdrv_flush in the
> > sheepdog block driver.  Sheepdog virtual disks are split up and
> > distributed across all cluster servers, so the block driver needs to send
> > flush requests to all of them.  I'm not sure this would improve
> > performance over writethrough semantics.
> 
> It could probably be optimized so that you only send flush requests to
> servers that have actually received write requests since the last flush.
> 
> But yes, that's probably a valid point. I guess there's only one way to
> find out how it performs: Trying it out.

Agreed, I'll try it out.

> 
> > So I think it is better to support only writethrough semantics
> > for now (I'll modify the sheepdog server code to open stored objects
> > with O_SYNC or O_DIRECT) and leave write-back semantics as future
> > work.
> 
> I agree, that makes sense.
> 
> Note that O_DIRECT does not provide write-through semantics. It bypasses
> the page cache, but it doesn't flush other caches like a volatile disk
> write cache. If you want to use it, you still need explicit flushes or
> O_DIRECT | O_SYNC.

Thanks for your comment.  I've modified the server code to use O_SYNC,
so sheepdog now always gives cache=writethrough semantics.

Kazutaka


* Re: Fwd: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog
  2010-10-22 16:22       ` Sage Weil
@ 2010-10-25  7:58         ` Kevin Wolf
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Wolf @ 2010-10-25  7:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Christian Brunner, Qemu-devel

On 22.10.2010 18:22, Sage Weil wrote:
> On Fri, 22 Oct 2010, Kevin Wolf wrote:
>> [ Adding qemu-devel to CC again ]
>>
>> Am 21.10.2010 20:59, schrieb Sage Weil:
>>> On Thu, 21 Oct 2010, Christian Brunner wrote:
>>>> Hi,
>>>>
>>>> is there a flush operation in librados? - I guess the only way to
>>>> handle this, would be waiting until all aio requests are finished?
>>
>> That's not the semantics of bdrv_flush, you don't need to wait for
>> running requests. You just need to make sure that all completed requests
>> are safe on disk so that they would persist even in case of a
>> crash/power failure.
> 
> Okay, in that case we're fine.  librados doesn't declare a write committed 
> until it is safely on disk on multiple backend nodes.  There is a 
> mechanism to get an ack sooner, but the qemu storage driver does not use 
> it.  
> 
>>> There is no flush currently.  But librados does no caching, so in this 
>>> case at least silently upgrading to cache=writethrough should work.
>>
>> You're making sure that the data can't be cached in the server's page
>> cache or volatile disk cache either, e.g. by using O_SYNC for the image
>> file? If so, upgrading would be safe.
> 
> Right.

Okay, implementing bdrv_flush as a nop is fine then.

>>> If that's a problem, we can implement a flush.  Just let us know.
>>
>> Presumably providing a writeback mode with explicit flushes could
>> improve performance. Upgrading to writethrough is not a correctness
>> problem, though, so it's your decision if you want to implement it.
> 
> So is a bdrv_flush generated when e.g. the guest filesystem issues a 
> barrier, or would otherwise normally ask a SATA disk to flush its cache?

Right, this is the implementation for things like the FLUSH CACHE
command in ATA. It's also used for ordering of writes to image metadata
in formats like qcow2, but that's probably an unusual scenario for the
Ceph backend.

Kevin


end of thread, other threads:[~2010-10-25  7:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-21 14:07 [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog Kevin Wolf
2010-10-21 15:07 ` Anthony Liguori
2010-10-21 19:32   ` Laurent Vivier
2010-10-22  8:29     ` Kevin Wolf
2010-10-22 12:58       ` Anthony Liguori
2010-10-22 13:35         ` Kevin Wolf
2010-10-22 13:45           ` Anthony Liguori
2010-10-22 13:57             ` Kevin Wolf
2010-10-22 14:01               ` Anthony Liguori
2010-10-22  5:43 ` MORITA Kazutaka
2010-10-22  8:47   ` Kevin Wolf
2010-10-25  5:31     ` MORITA Kazutaka
     [not found] ` <AANLkTikHAm7opg1TzUrUWis53ENT_z6DjfT9GPeBdqA0@mail.gmail.com>
     [not found]   ` <Pine.LNX.4.64.1010211155301.18946@cobra.newdream.net>
2010-10-22  8:39     ` Fwd: " Kevin Wolf
2010-10-22 16:22       ` Sage Weil
2010-10-25  7:58         ` Kevin Wolf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).