[Question] why not flush device cache at _vg_commit

* [Question] why not flush device cache at _vg_commit_raw
@ 2024-01-22 11:22 Su Yue
  2024-01-22 12:48 ` Zdenek Kabelac
  0 siblings, 1 reply; 14+ messages in thread
From: Su Yue @ 2024-01-22 11:22 UTC (permalink / raw)
  To: linux-lvm; +Cc: Heming Zhao, Anthony Iliopoulos, Lidong Zhong, martin.wilck

Hi lvm folks,
  Recently we received a report about a device cache issue after vgchange --deltag.
What confuses me is that lvm never calls fsync on block devices, even at the end of the commit phase.

IIRC, it's common practice for userspace tools to use fsync/O_SYNC/O_DSYNC when writing
critical data. Yes, lvm2 opens devices with O_DIRECT if they support it, but O_DIRECT doesn't
guarantee that data is persistent on storage when write returns. The data can still be in the device cache.
If a power failure happens in that window, critical metadata/data such as VG metadata could be lost.

Is there any particular reason not to flush data cache at VG commit time? 

Thanks
--
Su

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 11:22 [Question] why not flush device cache at _vg_commit_raw Su Yue
@ 2024-01-22 12:48 ` Zdenek Kabelac
  2024-01-22 13:46   ` Anthony Iliopoulos
  0 siblings, 1 reply; 14+ messages in thread
From: Zdenek Kabelac @ 2024-01-22 12:48 UTC (permalink / raw)
  To: Su Yue, linux-lvm
  Cc: Heming Zhao, Anthony Iliopoulos, Lidong Zhong, martin.wilck

Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
> Hi lvm folks,
>    Recently We received a report about the device cache issue after vgchange —deltag.
> What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
> 
> IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
> critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
> provide data was persistent to storage when write returns. The data can still be in the device cache,
> If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
> 
> Is there any particular reason not to flush data cache at VG commit time?
>

Hi

It seems the call to the 'dev_flush()' function somehow got lost during the
conversion to async aio usage - I'll investigate.

On the other hand, the chance of losing any data this way would be
very specific to some oddly behaving device.


Regards

Zdenek


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 12:48 ` Zdenek Kabelac
@ 2024-01-22 13:46   ` Anthony Iliopoulos
  2024-01-22 14:52     ` Zdenek Kabelac
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Iliopoulos @ 2024-01-22 13:46 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck

On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
> > Hi lvm folks,
> >    Recently We received a report about the device cache issue after vgchange —deltag.
> > What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
> > 
> > IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
> > critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
> > provide data was persistent to storage when write returns. The data can still be in the device cache,
> > If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
> > 
> > Is there any particular reason not to flush data cache at VG commit time?
> > 
> 
> Hi
> 
> It seems the call to 'dev_flush()' function got somehow lost over the time
> of conversion to async aio usage - I'll investigate.
> 
> On the other hand the chance here of losing any data this way would be
> really really very specific to some oddly behaving device.

There's no guarantee that data will be persisted to storage without
explicitly flushing the device data cache. Those are usually volatile
write-back caches, so the data aren't really protected against power
loss without fsyncing the blockdev.
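
As a rough illustration of the point above (this is not lvm2 code; the helper name write_and_flush is made up for this sketch, and Python is used only for brevity), a write followed by an explicit fsync on the same descriptor could look like this. On Linux, fsync on a block device descriptor is also what asks the kernel to send a cache flush to the device.

```python
import os


def write_and_flush(path: str, offset: int, payload: bytes) -> int:
    """Write payload at offset, then ask the kernel to make it durable.

    Hypothetical helper for illustration only. `path` may be a block
    device node or a regular file.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        # pwrite may write fewer bytes than requested, so loop until done.
        while written < len(payload):
            written += os.pwrite(fd, payload[written:], offset + written)
        # Without this call the data may still sit in a volatile cache
        # when the function returns.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

Whether the device honours the flush is a separate question; the sketch only shows what userspace can request.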

Regards,
Anthony

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 13:46   ` Anthony Iliopoulos
@ 2024-01-22 14:52     ` Zdenek Kabelac
  2024-01-22 15:26       ` Ilia Zykov
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Zdenek Kabelac @ 2024-01-22 14:52 UTC (permalink / raw)
  To: Anthony Iliopoulos
  Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck

Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
> On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
>> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
>>> Hi lvm folks,
>>>     Recently We received a report about the device cache issue after vgchange —deltag.
>>> What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
>>>
>>> IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
>>> critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
>>> provide data was persistent to storage when write returns. The data can still be in the device cache,
>>> If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
>>>
>>> Is there any particular reason not to flush data cache at VG commit time?
>>>
>>
>> Hi
>>
>> It seems the call to 'dev_flush()' function got somehow lost over the time
>> of conversion to async aio usage - I'll investigate.
>>
>> On the other hand the chance here of losing any data this way would be
>> really really very specific to some oddly behaving device.
> 
> There's no guarantee that data will be persisted to storage without
> explicitly flushing the device data cache. Those are usually volatile
> write-back caches, so the data aren't really protected against power
> loss without fsyncing the blockdev.

At a technical level, modern storage devices 'should' have enough energy held
internally to flush all their caches to persistent storage in an
emergency. So unless we deal with some 'virtual' storage that may
fake various responses to IO handling, this should not cause major trouble.

However, it's clearly a problem that was introduced while the code was being
shifted towards the use of libaio.

Zdenek

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 14:52     ` Zdenek Kabelac
@ 2024-01-22 15:26       ` Ilia Zykov
  2024-01-23  1:54         ` Su Yue
  2024-01-23  8:15         ` Martin Wilck
  2024-01-22 16:01       ` Anthony Iliopoulos
  2024-01-23 16:42       ` Demi Marie Obenour
  2 siblings, 2 replies; 14+ messages in thread
From: Ilia Zykov @ 2024-01-22 15:26 UTC (permalink / raw)
  To: Zdenek Kabelac, Anthony Iliopoulos
  Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck

On 22.01.2024 17:52, Zdenek Kabelac wrote:
> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
>> On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
>>> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
>>>> Hi lvm folks,
>>>>     Recently We received a report about the device cache issue after 
>>>> vgchange —deltag.
>>>> What confuses me is that lvm never calls fsync on block devices even 
>>>> at the end of commit phase.
>>>>
>>>> IIRC, it’s common operations for userspace tools to call 
>>>> fsync/O_SYNC/O_DSYNC while writing
>>>> critical data. Yes, lvm2 opens devices with O_DIRECT if they support 
>>>> , but O_DIRECT doesn't
>>>> provide data was persistent to storage when write returns. The data 
>>>> can still be in the device cache,
>>>> If power failure happens in the timing, such critical metadata/data 
>>>> like vg metadata could be lost.
>>>>
>>>> Is there any particular reason not to flush data cache at VG commit 
>>>> time?
>>>>
>>>
>>> Hi
>>>
>>> It seems the call to 'dev_flush()' function got somehow lost over the 
>>> time
>>> of conversion to async aio usage - I'll investigate.
>>>
>>> On the other hand the chance here of losing any data this way would be
>>> really really very specific to some oddly behaving device.
>>
>> There's no guarantee that data will be persisted to storage without
>> explicitly flushing the device data cache. Those are usually volatile
>> write-back caches, so the data aren't really protected against power
>> loss without fsyncing the blockdev.
> 
> At technical level modern storage devices 'should' have enough energy 
> held internally to be able to flush out all the caches in emergency 
> cases to the persistent storage. So unless we deal with some 'virtual' 
> storage that may fake various responses to IO handling - this should not 
> be causing major troubles.
> 
> However it's clearly a problem which happened while the code has been 
> shifted towards the use of libaio.
> 
> Zdenek
> 

Moreover, there is a very old post about fsync() lying:
https://brad.livejournal.com/2116715.html
I don’t know, maybe that post is also a lie ;) Or maybe devices have
become more truthful since then.
But many devices report that "Write cache" is enabled:

hdparm -I /dev/sda | grep 'Write cache'
              * Write cache

And in many cases fsync() flushes data only as far as the device's write cache.
But that cache can be persistent (SSD, flash). Or, as Zdenek wrote,
"devices 'should' have enough energy held internally to be able to flush
out all the caches in emergency cases".

However, in some cases devices may lose data on power failure when a
large amount of dirty data is in the cache, especially ordinary,
non-enterprise HDDs. IMHO.
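
Besides hdparm, the kernel's own view of a device's cache mode can be read from sysfs. The sketch below assumes the Linux block queue attribute /sys/block/<dev>/queue/write_cache, which reads as "write back" or "write through". The function names and the sysfs_root parameter are made up for this example, and the value only reflects what the device reported (or what an administrator set), which may not be the truth.

```python
import os


def write_cache_mode(dev_name: str, sysfs_root: str = "/sys/block") -> str:
    """Return the kernel's view of the cache mode for e.g. 'sda'.

    The result is expected to be 'write back' or 'write through'.
    """
    path = os.path.join(sysfs_root, dev_name, "queue", "write_cache")
    with open(path, "r", encoding="ascii") as f:
        return f.read().strip()


def needs_explicit_flush(dev_name: str, sysfs_root: str = "/sys/block") -> bool:
    """True when the kernel treats the device cache as volatile (write back)."""
    return write_cache_mode(dev_name, sysfs_root) == "write back"
```

For example, needs_explicit_flush("sda") would return True on a disk whose write cache is reported as write back.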

----

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 14:52     ` Zdenek Kabelac
  2024-01-22 15:26       ` Ilia Zykov
@ 2024-01-22 16:01       ` Anthony Iliopoulos
  2024-01-23 16:42       ` Demi Marie Obenour
  2 siblings, 0 replies; 14+ messages in thread
From: Anthony Iliopoulos @ 2024-01-22 16:01 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck

On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
> > On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
> > > Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
> > > > Hi lvm folks,
> > > >     Recently We received a report about the device cache issue after vgchange —deltag.
> > > > What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
> > > > 
> > > > IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
> > > > critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
> > > > provide data was persistent to storage when write returns. The data can still be in the device cache,
> > > > If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
> > > > 
> > > > Is there any particular reason not to flush data cache at VG commit time?
> > > > 
> > > 
> > > Hi
> > > 
> > > It seems the call to 'dev_flush()' function got somehow lost over the time
> > > of conversion to async aio usage - I'll investigate.
> > > 
> > > On the other hand the chance here of losing any data this way would be
> > > really really very specific to some oddly behaving device.
> > 
> > There's no guarantee that data will be persisted to storage without
> > explicitly flushing the device data cache. Those are usually volatile
> > write-back caches, so the data aren't really protected against power
> > loss without fsyncing the blockdev.
> 
> At technical level modern storage devices 'should' have enough energy held
> internally to be able to flush out all the caches in emergency cases to the
> persistent storage. So unless we deal with some 'virtual' storage that may
> fake various responses to IO handling - this should not be causing major
> troubles.

Sure but we cannot make any assumptions about storage device internals
in general, other than the worst-case scenario (which is not uncommon)
that without flushing the volatile caches, the devices provide no
guarantees of data persistence.

We cannot account for faulty firmware or devices that (for example)
indicate that they do write-through caching but in reality they don't or
devices that ignore the flushing ops etc., but that's another issue.

> However it's clearly a problem which happened while the code has been
> shifted towards the use of libaio.

I'm really not that familiar with the codebase, but from a brief look at
the history indeed it seems that dev_close() was calling dev_flush(),
although only for buffered-io (while O_DIRECT also requires flushing
storage caches).

Regards,
Anthony

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 15:26       ` Ilia Zykov
@ 2024-01-23  1:54         ` Su Yue
  2024-01-23  8:15         ` Martin Wilck
  1 sibling, 0 replies; 14+ messages in thread
From: Su Yue @ 2024-01-23  1:54 UTC (permalink / raw)
  To: Ilia Zykov
  Cc: Zdenek Kabelac, Anthony Iliopoulos, linux-lvm, Heming Zhao,
	Lidong Zhong, martin.wilck



> On Jan 22, 2024, at 23:26, Ilia Zykov <mail@service4.ru> wrote:
> 
> On 22.01.2024 17:52, Zdenek Kabelac wrote:
>> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
>>> On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
>>>> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
>>>>> Hi lvm folks,
>>>>>     Recently We received a report about the device cache issue after vgchange —deltag.
>>>>> What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
>>>>> 
>>>>> IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
>>>>> critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
>>>>> provide data was persistent to storage when write returns. The data can still be in the device cache,
>>>>> If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
>>>>> 
>>>>> Is there any particular reason not to flush data cache at VG commit time?
>>>>> 
>>>> 
>>>> Hi
>>>> 
>>>> It seems the call to 'dev_flush()' function got somehow lost over the time
>>>> of conversion to async aio usage - I'll investigate.
>>>> 
>>>> On the other hand the chance here of losing any data this way would be
>>>> really really very specific to some oddly behaving device.
>>> 
>>> There's no guarantee that data will be persisted to storage without
>>> explicitly flushing the device data cache. Those are usually volatile
>>> write-back caches, so the data aren't really protected against power
>>> loss without fsyncing the blockdev.
>> At technical level modern storage devices 'should' have enough energy held internally to be able to flush out all the caches in emergency cases to the persistent storage. So unless we deal with some 'virtual' storage that may fake various responses to IO handling - this should not be causing major troubles.
>> However it's clearly a problem which happened while the code has been shifted towards the use of libaio.
>> Zdenek
> 
> More over. There is a very old post about fsync() lying.
> https://brad.livejournal.com/2116715.html
> I don’t know, maybe this is also a post-lie) Or now the devices have become more truthful.
> But many devices report that "Write cache" is enabled:
> 
> hdparm -I /dev/sda | grep 'Write cache'
>             * Write cache
> 
> And in many cases fsync() flushes data to write cache only.
> But this can be persistent (ssd, flash) cache. Or as Zdenek has wrote,
> "devices 'should' have enough energy held internally to be able to flush out all the caches in  in emergency cases".
> 
> However, in some cases, they may lose some data due to power failure and large amount of dirty data in the cache, especially ordinary, non-enterprise HDD. IMHO.
> 
Yes... Write cache mechanisms vary between manufacturers and products.
Some implementations can lie about cache flush/FUA even in 2024.
For serious enterprise use, devices should be tested strictly before they go into production.

The point is that filesystems and lvm should trust the underlying devices' write barriers/flushing and make
a best effort to preserve data integrity.

--
Su
> ----


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 15:26       ` Ilia Zykov
  2024-01-23  1:54         ` Su Yue
@ 2024-01-23  8:15         ` Martin Wilck
  1 sibling, 0 replies; 14+ messages in thread
From: Martin Wilck @ 2024-01-23  8:15 UTC (permalink / raw)
  To: Ilia Zykov, Zdenek Kabelac, Anthony Iliopoulos
  Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong

On Mon, 2024-01-22 at 18:26 +0300, Ilia Zykov wrote:
> > 
> 
> More over. There is a very old post about fsync() lying.
> https://brad.livejournal.com/2116715.html
> I don’t know, maybe this is also a post-lie) Or now the devices have 
> become more truthful.
> But many devices report that "Write cache" is enabled:
> 
> hdparm -I /dev/sda | grep 'Write cache'
>               * Write cache
> 
> And in many cases fsync() flushes data to write cache only.
> But this can be persistent (ssd, flash) cache. Or as Zdenek has
> wrote,
> "devices 'should' have enough energy held internally to be able to
> flush 
> out all the caches in  in emergency cases".
> 
> However, in some cases, they may lose some data due to power failure
> and 
> large amount of dirty data in the cache, especially ordinary, 
> non-enterprise HDD. IMHO.

SCSI has had SYNCHRONIZE_CACHE and FUA at least since SCSI-2 in the
mid-90s. It's true that some devices lie about their actual behavior,
because bypassing the cache is bad for benchmark results, and many end
users care more about performance than data safety, but that's not LVM's
(or even the kernel's) business. As Su wrote already, the kernel has to
trust the hardware.

Martin


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-22 14:52     ` Zdenek Kabelac
  2024-01-22 15:26       ` Ilia Zykov
  2024-01-22 16:01       ` Anthony Iliopoulos
@ 2024-01-23 16:42       ` Demi Marie Obenour
  2024-01-23 17:50         ` Zdenek Kabelac
  2 siblings, 1 reply; 14+ messages in thread
From: Demi Marie Obenour @ 2024-01-23 16:42 UTC (permalink / raw)
  To: Zdenek Kabelac, Anthony Iliopoulos
  Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck


On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
> > On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
> > > Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
> > > > Hi lvm folks,
> > > >     Recently We received a report about the device cache issue after vgchange —deltag.
> > > > What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
> > > > 
> > > > IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
> > > > critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
> > > > provide data was persistent to storage when write returns. The data can still be in the device cache,
> > > > If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
> > > > 
> > > > Is there any particular reason not to flush data cache at VG commit time?
> > > > 
> > > 
> > > Hi
> > > 
> > > It seems the call to 'dev_flush()' function got somehow lost over the time
> > > of conversion to async aio usage - I'll investigate.
> > > 
> > > On the other hand the chance here of losing any data this way would be
> > > really really very specific to some oddly behaving device.
> > 
> > There's no guarantee that data will be persisted to storage without
> > explicitly flushing the device data cache. Those are usually volatile
> > write-back caches, so the data aren't really protected against power
> > loss without fsyncing the blockdev.
> 
> At technical level modern storage devices 'should' have enough energy held
> internally to be able to flush out all the caches in emergency cases to the
> persistent storage. So unless we deal with some 'virtual' storage that may
> fake various responses to IO handling - this should not be causing major
> troubles.

This is only true for enterprise storage with power loss protection.
The vast majority of Qubes OS users use LVM with consumer storage, which
does not have power loss protection.  If this is unsafe, then Qubes OS
should switch to a different storage pool that flushes drive caches as
needed.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-23 16:42       ` Demi Marie Obenour
@ 2024-01-23 17:50         ` Zdenek Kabelac
  2024-01-24 11:58           ` Anthony Iliopoulos
  0 siblings, 1 reply; 14+ messages in thread
From: Zdenek Kabelac @ 2024-01-23 17:50 UTC (permalink / raw)
  To: Demi Marie Obenour, Anthony Iliopoulos
  Cc: Su Yue, linux-lvm, Heming Zhao, Lidong Zhong, martin.wilck

Dne 23. 01. 24 v 17:42 Demi Marie Obenour napsal(a):
> On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
>> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
>>> On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
>>>> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
>>>>> Hi lvm folks,
>>>>>      Recently We received a report about the device cache issue after vgchange —deltag.
>>>>> What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
>>>>>
>>>>> IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
>>>>> critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
>>>>> provide data was persistent to storage when write returns. The data can still be in the device cache,
>>>>> If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
>>>>>
>>>>> Is there any particular reason not to flush data cache at VG commit time?
>>>>>
>>>>
>>>> Hi
>>>>
>>>> It seems the call to 'dev_flush()' function got somehow lost over the time
>>>> of conversion to async aio usage - I'll investigate.
>>>>
>>>> On the other hand the chance here of losing any data this way would be
>>>> really really very specific to some oddly behaving device.
>>>
>>> There's no guarantee that data will be persisted to storage without
>>> explicitly flushing the device data cache. Those are usually volatile
>>> write-back caches, so the data aren't really protected against power
>>> loss without fsyncing the blockdev.
>>
>> At technical level modern storage devices 'should' have enough energy held
>> internally to be able to flush out all the caches in emergency cases to the
>> persistent storage. So unless we deal with some 'virtual' storage that may
>> fake various responses to IO handling - this should not be causing major
>> troubles.
> 
> This is only true for enterprise storage with power loss protection.
> The vast majority of Qubes OS users use LVM with consumer storage, which
> does not have power loss protection.  If this is unsafe, then Qubes OS
> should switch to a different storage pool that flushes drive caches as
> needed.

From the lvm2 perspective: the metadata is written first, and then there is
usually a full flush of all I/O and a suspend of the actual device, if any
device is already active on that disk. So even if lvm2 itself initiates no
direct flush, one is going to happen whenever we update
existing LVs.

There is usually a stream of cache flushing operations anyway, e.g. whenever
a thin-pool is synchronizing its metadata or any app running on the device is
synchronizing its data.

So while lvm2 is using O_DIRECT writes, there is likely a tiny window of
opportunity in which a crash could make the device lose its caches.
If this happens, lvm2 still has its 'history' & archive, so in the worst
case it should see the older version of the metadata for possible recovery.

All that said, in so many years we have not seen a single reported issue
caused by such a mysterious crash event, and the potential 'risk of
failure' would likely arise only when a user creates some new empty
LV, so there shouldn't be a risk of losing any real data (unless I'm missing
something).

So while we figure out how to add a proper fsync call for device writes, as it
still seems to be required with direct I/O, it's IMHO not a reason to
stop using lvm2 ;)

Regards

Zdenek


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-23 17:50         ` Zdenek Kabelac
@ 2024-01-24 11:58           ` Anthony Iliopoulos
  2024-01-24 12:35             ` Zdenek Kabelac
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Iliopoulos @ 2024-01-24 11:58 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Demi Marie Obenour, Su Yue, linux-lvm, Heming Zhao, Lidong Zhong,
	martin.wilck

On Tue, Jan 23, 2024 at 06:50:01PM +0100, Zdenek Kabelac wrote:
> Dne 23. 01. 24 v 17:42 Demi Marie Obenour napsal(a):
> > On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
> > > Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
> > > > On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
> > > > > Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
> > > > > > Hi lvm folks,
> > > > > >      Recently We received a report about the device cache issue after vgchange —deltag.
> > > > > > What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
> > > > > > 
> > > > > > IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
> > > > > > critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
> > > > > > provide data was persistent to storage when write returns. The data can still be in the device cache,
> > > > > > If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
> > > > > > 
> > > > > > Is there any particular reason not to flush data cache at VG commit time?
> > > > > > 
> > > > > 
> > > > > Hi
> > > > > 
> > > > > It seems the call to 'dev_flush()' function got somehow lost over the time
> > > > > of conversion to async aio usage - I'll investigate.
> > > > > 
> > > > > On the other hand the chance here of losing any data this way would be
> > > > > really really very specific to some oddly behaving device.
> > > > 
> > > > There's no guarantee that data will be persisted to storage without
> > > > explicitly flushing the device data cache. Those are usually volatile
> > > > write-back caches, so the data aren't really protected against power
> > > > loss without fsyncing the blockdev.
> > > 
> > > At technical level modern storage devices 'should' have enough energy held
> > > internally to be able to flush out all the caches in emergency cases to the
> > > persistent storage. So unless we deal with some 'virtual' storage that may
> > > fake various responses to IO handling - this should not be causing major
> > > troubles.
> > 
> > This is only true for enterprise storage with power loss protection.
> > The vast majority of Qubes OS users use LVM with consumer storage, which
> > does not have power loss protection.  If this is unsafe, then Qubes OS
> > should switch to a different storage pool that flushes drive caches as
> > needed.
> 
> From lvm2 perspective - there are first written metadata - then there is
> usually a full flush of all I/O and suspend to the actual device - if there
> is any device already active on such disk -  so even if there would be no
> direct flush initiated by lvm2 itself - there is going to such on whenever
> we update existing LVs.

Can you elaborate on that? Flushing IO does not imply flushing of the
device cache, but it is not clear what you mean by "suspend" here.

> There is usually a stream of cache flushing operation whenever i.e.
> thin-pool is synchronizing metadata or any app running of device is
> synchronizing its data as well.

We cannot make any assumptions about what processes may be running and
whether they are actually doing fsync on the partition. Also, on devices that
support FUA, data integrity operations are optimized by leveraging it,
and the global device cache flush is elided.

> So while lvm2 is using O_DIRECT with write - there is likely a tiny window
> of opportunity where the user could 'crash' the device with lose of it's
> caches. If this happens - lvm2 still has 'history' & archive so it should be
> at worst case scenario see the older version of metadata for possible
> recovery.
> 
> All that said - for so many years - we have not seen a single reported issue
> caused by such mysterious crash event yet - and the potential 'risk of
> failure' could likely happen only in the case of user creating some new
> empty LV - so there shouldn't be a risk of losing any real data (unless I
> miss something).

In our case this came up because an LV tag manipulation wasn't properly
persisted in an HA failover scenario, but it definitely did not result in
actual data loss.

> So while we figure out how to add proper fsync call for device writes - as
> it seems to be still demanded with  direct i/o usage, it's IMHO not a reason
> to stop using of lvm2 ;)

An alternative to fsync on the blockdev would be to open the device
with O_DSYNC or submit io with RWF_DSYNC so that all writes are flushed
to the storage medium.

Regards,
Anthony

* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-24 11:58           ` Anthony Iliopoulos
@ 2024-01-24 12:35             ` Zdenek Kabelac
  2024-01-24 13:13               ` Anthony Iliopoulos
  0 siblings, 1 reply; 14+ messages in thread
From: Zdenek Kabelac @ 2024-01-24 12:35 UTC (permalink / raw)
  To: Anthony Iliopoulos
  Cc: Demi Marie Obenour, Su Yue, linux-lvm, Heming Zhao, Lidong Zhong,
	martin.wilck

Dne 24. 01. 24 v 12:58 Anthony Iliopoulos napsal(a):
> On Tue, Jan 23, 2024 at 06:50:01PM +0100, Zdenek Kabelac wrote:
>> Dne 23. 01. 24 v 17:42 Demi Marie Obenour napsal(a):
>>> On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
>>>> Dne 22. 01. 24 v 14:46 Anthony Iliopoulos napsal(a):
>>>>> On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
>>>>>> Dne 22. 01. 24 v 12:22 Su Yue napsal(a):
>>>>>>> Hi lvm folks,
>>>>>>>       Recently We received a report about the device cache issue after vgchange —deltag.
>>>>>>> What confuses me is that lvm never calls fsync on block devices even at the end of commit phase.
>>>>>>>
>>>>>>> IIRC, it’s common operations for userspace tools to call fsync/O_SYNC/O_DSYNC while writing
>>>>>>> critical data. Yes, lvm2 opens devices with O_DIRECT if they support , but O_DIRECT doesn't
>>>>>>> provide data was persistent to storage when write returns. The data can still be in the device cache,
>>>>>>> If power failure happens in the timing, such critical metadata/data like vg metadata could be lost.
>>>>>>>
>>>>>>> Is there any particular reason not to flush data cache at VG commit time?
>>>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> It seems the call to 'dev_flush()' function got somehow lost over the time
>>>>>> of conversion to async aio usage - I'll investigate.
>>>>>>
>>>>>> On the other hand the chance here of losing any data this way would be
>>>>>> really really very specific to some oddly behaving device.
>>>>>
>>>>> There's no guarantee that data will be persisted to storage without
>>>>> explicitly flushing the device data cache. Those are usually volatile
>>>>> write-back caches, so the data aren't really protected against power
>>>>> loss without fsyncing the blockdev.
>>>>
>>>> At technical level modern storage devices 'should' have enough energy held
>>>> internally to be able to flush out all the caches in emergency cases to the
>>>> persistent storage. So unless we deal with some 'virtual' storage that may
>>>> fake various responses to IO handling - this should not be causing major
>>>> troubles.
>>>
>>> This is only true for enterprise storage with power loss protection.
>>> The vast majority of Qubes OS users use LVM with consumer storage, which
>>> does not have power loss protection.  If this is unsafe, then Qubes OS
>>> should switch to a different storage pool that flushes drive caches as
>>> needed.
>>
>>  From lvm2 perspective - there are first written metadata - then there is
>> usually a full flush of all I/O and suspend to the actual device - if there
>> is any device already active on such disk -  so even if there would be no
>> direct flush initiated by lvm2 itself - there is going to such on whenever
>> we update existing LVs.
> 
> Can you elaborate on that? Flushing IO does not imply flushing of the
> device cache, but it is not clear what you mean by "suspend" here.

For example, when you create a snapshot of an LV, the origin LV is suspended,
so this operation comes with a 'flush & fsfreeze' request.
Basically we skip these suspend flags only for 'device extension', where we
intentionally do not want to flush all data - but we now need to think through
some cases and how to properly submit fsync() for them.

We will likely also need to extend this to some files maintained by lvm2,
where we would probably go with fdatasync().

>> There is usually a stream of cache flushing operation whenever i.e.
>> thin-pool is synchronizing metadata or any app running of device is
>> synchronizing its data as well.
> 
> We cannot make any assumptions about what processes may be running and
> if they are actually doing fsync on the partition. Also, on devices that
> support FUA, data integrity operations are optimized by leveraging that
> and global device cache is elided.

Note - it's not that we would want to depend on them. All I mean is that
in practice the race window in which the data remains only in the disk's
cache is very small - that's also likely the reason why we have not
spotted it yet.

> 
> In our case this came in because LV tag manipulation wasn't properly
> persisted in some HA failover scenario, but definitely not resulted to
> actual data loss.

I'd be very interested in a more detailed description of this scenario, how
it was observed, and whether we can manage to write a simulation of it for
our test suite, with monitoring via e.g. perf or something similar.

> An alternative to fsync on the blockdev would be to do open the device
> with O_DSYNC or submit io with RWF_DSYNC so that all writes are flushed
> to the storage medium.
> 

I guess our dev_flush() function mostly handles all those cases properly
with the use of ioctl(BLKFLSBUF).
The only problem is that its usage somehow vanished - and even in the past it
was basically used only for non-direct I/O, so it was likely still not correct.

Regards

Zdenek



* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-24 12:35             ` Zdenek Kabelac
@ 2024-01-24 13:13               ` Anthony Iliopoulos
  2024-01-24 23:17                 ` Heming Zhao
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Iliopoulos @ 2024-01-24 13:13 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Demi Marie Obenour, Su Yue, linux-lvm, Heming Zhao, Lidong Zhong,
	martin.wilck

On Wed, Jan 24, 2024 at 01:35:49PM +0100, Zdenek Kabelac wrote:
> I guess our dev_flush() function is mostly handling all those cases properly
> with the use of  ioctl(BLKFLSBUF).

This ioctl by itself will only flush the page cache and not device
caches, but it is indeed followed by a fsync on the blockdev which is
basically the only way for userspace to trigger a device cache flush
when operating directly on a block device.
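
As a rough illustration of that sequence, here is a sketch of a flush helper
that issues BLKFLSBUF and then fsync. The names flush_fd() and flush_path()
are hypothetical; this is not the actual lvm2 dev_flush() code. The ioctl is
skipped for non-block files so the helper can also be tried on a regular
file.

```c
#include <fcntl.h>
#include <linux/fs.h>    /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Flush an open device: page cache first, then the device write cache. */
int flush_fd(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    /* BLKFLSBUF is a block-device ioctl; skip it for anything else. */
    if (S_ISBLK(st.st_mode) && ioctl(fd, BLKFLSBUF, 0) < 0)
        return -1;
    /* On a block device, fsync() is the call that asks the kernel to
     * send a cache flush to the hardware. */
    return fsync(fd);
}

/* Convenience wrapper: open the device, flush it, close it again. */
int flush_path(const char *path)
{
    int fd = open(path, O_RDWR);
    int rc;

    if (fd < 0)
        return -1;
    rc = flush_fd(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```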

> The only problem is - it's usage somehow vanished - and even in the past
> it's been  basically used only for non-direct usage so likely still not
> correct.

Indeed, device cache flushing is required for data integrity
irrespective of the I/O mode (unless O_DSYNC/RWF_DSYNC is used); direct I/O
only obviates the need for flushing the page cache.
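
To make the ordering concrete, here is a sketch of a write that is treated as
durable only after an explicit flush. write_then_flush() is a hypothetical
helper, and the 4096-byte IO_ALIGN is an assumption; a real tool would query
the device's logical block size. The aligned bounce buffer is there so the
same code also works on a descriptor opened with O_DIRECT.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_ALIGN 4096    /* assumed alignment for buffer, length and offset */

/* Write len bytes at off, then flush so the data survives power loss.
 * Returns 0 only if both the write and the flush succeeded. */
int write_then_flush(int fd, const void *data, size_t len, off_t off)
{
    void *buf;
    ssize_t n;
    int rc = -1, saved;

    if (len == 0 || len % IO_ALIGN || off % IO_ALIGN) {
        errno = EINVAL;
        return -1;
    }
    if (posix_memalign(&buf, IO_ALIGN, len) != 0) {
        errno = ENOMEM;
        return -1;
    }
    memcpy(buf, data, len);

    n = pwrite(fd, buf, len, off);
    if (n == (ssize_t)len)
        rc = fdatasync(fd);   /* a completed write alone is not durable */
    else if (n >= 0)
        errno = EIO;          /* short write */

    saved = errno;            /* keep errno intact across free() */
    free(buf);
    errno = saved;
    return rc;
}
```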

Regards,
Anthony


* Re: [Question] why not flush device cache at _vg_commit_raw
  2024-01-24 13:13               ` Anthony Iliopoulos
@ 2024-01-24 23:17                 ` Heming Zhao
  0 siblings, 0 replies; 14+ messages in thread
From: Heming Zhao @ 2024-01-24 23:17 UTC (permalink / raw)
  To: Anthony Iliopoulos, Zdenek Kabelac
  Cc: Demi Marie Obenour, Su Yue, linux-lvm, Lidong Zhong, martin.wilck

On 1/24/24 21:13, Anthony Iliopoulos wrote:
> On Wed, Jan 24, 2024 at 01:35:49PM +0100, Zdenek Kabelac wrote:
>> I guess our dev_flush() function is mostly handling all those cases properly
>> with the use of  ioctl(BLKFLSBUF).
> 
> This ioctl by itself will only flush the page cache and not device
> caches, but it is indeed followed by a fsync on the blockdev which is
> basically the only way for userspace to trigger a device cache flush
> when operating directly on a block device.
> 
>> The only problem is - it's usage somehow vanished - and even in the past
>> it's been  basically used only for non-direct usage so likely still not
>> correct.
> 
> Indeed, the device cache flushing is required for data integrity
> irrespective of the io mode (unless O_DSYNC/RWF_DSYNC), direct-io only
> obviates the need for flushing the page cache.
> 

In my view, vg_commit() is a good place to call dev_flush(). This would
affect only the (important) metadata I/O and get those writes to persistent
storage as soon as possible.
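
A sketch of what flushing at commit time could look like, under the
assumption that the same metadata buffer goes to every device of the VG.
commit_metadata() and MAX_DEVS are hypothetical and do not reflect the real
vg_commit() code path; the point is only that the flush happens once per
device, after the metadata writes, and nowhere else.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_DEVS 16      /* arbitrary limit for this sketch */

/* Write the metadata to every device, then flush each device once.
 * Returns 0 only if every write and every flush succeeded. */
int commit_metadata(const char *const paths[], size_t ndevs,
                    const void *meta, size_t len, off_t off)
{
    int fds[MAX_DEVS];
    size_t opened = 0, i;
    int rc = 0;

    if (ndevs > MAX_DEVS)
        return -1;

    /* Phase 1: write the new metadata to every device. */
    for (i = 0; i < ndevs && rc == 0; i++) {
        fds[i] = open(paths[i], O_WRONLY);
        if (fds[i] < 0) {
            rc = -1;
            break;
        }
        opened++;
        if (pwrite(fds[i], meta, len, off) != (ssize_t)len)
            rc = -1;
    }

    /* Phase 2: flush each device that was opened, then close it. */
    for (i = 0; i < opened; i++) {
        if (rc == 0 && fsync(fds[i]) < 0)
            rc = -1;
        if (close(fds[i]) < 0)
            rc = -1;
    }
    return rc;
}
```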

Thanks,
Heming


Thread overview: 14+ messages
2024-01-22 11:22 [Question] why not flush device cache at _vg_commit_raw Su Yue
2024-01-22 12:48 ` Zdenek Kabelac
2024-01-22 13:46   ` Anthony Iliopoulos
2024-01-22 14:52     ` Zdenek Kabelac
2024-01-22 15:26       ` Ilia Zykov
2024-01-23  1:54         ` Su Yue
2024-01-23  8:15         ` Martin Wilck
2024-01-22 16:01       ` Anthony Iliopoulos
2024-01-23 16:42       ` Demi Marie Obenour
2024-01-23 17:50         ` Zdenek Kabelac
2024-01-24 11:58           ` Anthony Iliopoulos
2024-01-24 12:35             ` Zdenek Kabelac
2024-01-24 13:13               ` Anthony Iliopoulos
2024-01-24 23:17                 ` Heming Zhao
