linux-lvm.redhat.com archive mirror
* lvm2 deadlock
@ 2024-05-30 10:21 Jaco Kroon
  2024-05-31 12:34 ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-05-30 10:21 UTC (permalink / raw)
  To: linux-lvm

Hi,

Possible lvm2 command deadlock scenario:

crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/*
/run/lock/lvm/P_global: 17231
/run/lock/lvm/V_lvm: 16087 17231

crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231'
24437 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
16087 ?        S      0:00  |       |       \_ /sbin/lvcreate -kn -An -s -n fsck_cerberus /dev/lvm/backup_cerberus
17231 ?        S      0:00  |           \_ /sbin/lvs --noheadings --nameprefixes

crowsnest [12:17:40] /run/lvm # dmsetup udevcookies
Cookie       Semid      Value      Last semop time           Last change time
0xd4d2051    10         1          Thu May 30 02:34:05 2024  Thu May 30 02:32:22 2024

This was almost 10 hours ago.

crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051
DM_COOKIE_COMPLETED=0xd4d2051
crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231'
27252 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
crowsnest [12:18:45] /run/lvm #

This allows progress again.

I do not know how to troubleshoot this.

Kernel version 6.4.12 (in the process of upgrading to 6.9.3).

crowsnest [12:19:47] /run/lvm # udevadm --version
254

aka systemd-utils-254.10

lvm2-2.03.22

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-05-30 10:21 lvm2 deadlock Jaco Kroon
@ 2024-05-31 12:34 ` Zdenek Kabelac
  2024-06-03 12:56   ` Jaco Kroon
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-05-31 12:34 UTC (permalink / raw)
  To: Jaco Kroon, linux-lvm

Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a):
> Hi,
> 
> Possible lvm2 command deadlock scenario:
> 
> crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/*
> /run/lock/lvm/P_global: 17231
> /run/lock/lvm/V_lvm: 16087 17231
> 
> crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231'
> 24437 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
> 16087 ?        S      0:00  |       |       \_ /sbin/lvcreate -kn -An -s -n 
> fsck_cerberus /dev/lvm/backup_cerberus
> 17231 ?        S      0:00  |           \_ /sbin/lvs --noheadings --nameprefixes
> 
> crowsnest [12:17:40] /run/lvm # dmsetup udevcookies
> Cookie       Semid      Value      Last semop time           Last change time
> 0xd4d2051    10         1          Thu May 30 02:34:05 2024  Thu May 30 
> 02:32:22 2024
> 
> This was almost 10 hours ago.
> 
> crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051
> DM_COOKIE_COMPLETED=0xd4d2051
> crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231'
> 27252 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
> crowsnest [12:18:45] /run/lvm #
> 
> Allows progress again.

Hi

I'm kind of missing the 'deadlock' scenario in this description.

lvm2 takes the VG lock, creates the LV, waits for udev until it's finished with 
its job, and confirms all the udev work with dmsetup udevcomplete.

If something 'kills' your udev worker (which may eventually happen on a 
'very very very' busy system), you may need to set a longer timeout before 
systemd kills the udev worker (I believe it's just 30 seconds by default).
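
For reference, that timeout is the event_timeout setting in udev's configuration - 
a minimal sketch, assuming the stock /etc/udev/udev.conf location (udevd needs 
to be restarted for it to take effect):

# /etc/udev/udev.conf
event_timeout=300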

If it happens that a cookie blocks your lvm2 command, you can 'unblock' it 
with 'dmsetup udevcomplete_all' - but that's a sign your system is already 
in a very bad state.
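
A hedged sketch of that recovery path, reusing the cookie value from the 
report above:

dmsetup udevcookies              # list cookies still waiting on udev
dmsetup udevcomplete 0xd4d2051   # release one specific cookie
dmsetup udevcomplete_all         # or release every outstanding cookie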

It's also unclear which OS you are using - Debian, Fedora, ...?
And which versions of your packages?

Regards

Zdenek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-05-31 12:34 ` Zdenek Kabelac
@ 2024-06-03 12:56   ` Jaco Kroon
  2024-06-03 19:25     ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-03 12:56 UTC (permalink / raw)
  To: Zdenek Kabelac, linux-lvm

Hi,

Thanks for the insight.  Please refer below.

On 2024/05/31 14:34, Zdenek Kabelac wrote:
> Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a):
>> Hi,
>>
>> Possible lvm2 command deadlock scenario:
>>
>> crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/*
>> /run/lock/lvm/P_global: 17231
>> /run/lock/lvm/V_lvm: 16087 17231
>>
>> crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231'
>> 24437 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
>> 16087 ?        S      0:00  |       |       \_ /sbin/lvcreate -kn -An 
>> -s -n fsck_cerberus /dev/lvm/backup_cerberus
>> 17231 ?        S      0:00  |           \_ /sbin/lvs --noheadings 
>> --nameprefixes
>>
>> crowsnest [12:17:40] /run/lvm # dmsetup udevcookies
>> Cookie       Semid      Value      Last semop time Last change time
>> 0xd4d2051    10         1          Thu May 30 02:34:05 2024  Thu May 
>> 30 02:32:22 2024
>>
>> This was almost 10 hours ago.
>>
>> crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051
>> DM_COOKIE_COMPLETED=0xd4d2051
>> crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231'
>> 27252 pts/1    S+     0:00  |       \_ grep --colour=auto -E 16087|17231
>> crowsnest [12:18:45] /run/lvm #
>>
>> Allows progress again.
>
> Hi
>
> I'm kind of missing here to see your 'deadlock' scenario from this 
> description.
Well, stuff blocks until the cookie is released using the dmsetup 
udevcomplete command, so perhaps 'deadlock' is the wrong wording?
>
> Lvm2 takes the VG lock - creates LV - waits for udev till it's 
> finished with its job and confirms all the udev work with dmsetup 
> udevcomplete.

So what I understand from this is that udevcomplete ends up never 
executing?  Is there some way of confirming this?

>
> If something 'kills'  your udev worker  (which may eventually happen 
> on some 'very very very' busy system - you may need to set up longer 
> timeout for systemd to kill udev worker (I believe it's just 30seconds 
> by default).

Well, I guess upwards of 1GB/s of IO at times qualifies.  No systemd, so 
more likely the udev configuration.  When this happens we see load 
averages upwards of 50 ... so based on that I'd say busy is probably a 
reasonable assessment.  Normally the load average doesn't exceed 10-15 
at most.

Even so ... event_timeout = 180 should be a VERY, VERY long time in 
terms of udev event processing.

I'm not seeing anything for udev in /var/log/messages (which is 
explicitly configured to log everything sent to syslog).  But it may 
also be a case of not logging because LVM isn't processing any IO (/var 
is stored on its own LV).

>
> If it happens your cookies blocks your lvm2 command - you can 
> 'unblock' them with  'dmsetup udevcomplete_all'  -  but that's a sign 
> your system is already in very bad state.
>
> It's also unclear which OS are you using - Debian, Fedora, ???

Gentoo.

> Version of your packages ?

I thought I did provide this:

Kernel version was 6.4.12 when this happened, is now 6.9.3.

crowsnest [12:19:47] /run/lvm # udevadm --version
254

aka systemd-utils-254.10

lvm2-2.03.22

Thanks for the feedback, what you say makes perfect sense, and the 
implication is that there are only a few options:

1.  Something is causing the udev trigger to take longer than three 
minutes, so dmsetup udevcomplete never gets executed.
2.  Something goes horribly wrong during udev trigger processing (which 
invokes dmsetup a few times) and the process crashes, never executing 
dmsetup udevcomplete.

This could potentially be due to extremely heavy disk IO, or LVM itself 
freezing IO.

Given the rulesets, the only way I see this happening is if the dmsetup 
command takes very long to load - and even in the degraded (most 
filesystems blocked) state it was fast to execute.

Or if udevd itself has problems accessing /sys - which I find extremely 
unlikely.

I don't see the default value for udev_log in the config.  It's explicitly 
set to debug now, but I'm still not seeing anything logged to syslog.  
Running with udevd --debug, which logs to a ramdisk on /run.  Hopefully 
(if/when this happens again) that may shed some light.  There is 256GB 
of RAM available, so as long as the log doesn't grow too quickly it should 
be fine.

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-03 12:56   ` Jaco Kroon
@ 2024-06-03 19:25     ` Zdenek Kabelac
  2024-06-04  8:46       ` Jaco Kroon
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-06-03 19:25 UTC (permalink / raw)
  To: Jaco Kroon, linux-lvm@lists.linux.dev

Dne 03. 06. 24 v 14:56 Jaco Kroon napsal(a):
> Hi,
> 
> Thanks for the insight.  Please refer below.
> 
> On 2024/05/31 14:34, Zdenek Kabelac wrote:
>> Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a):
>>> Hi,
>>>
>> I'm kind of missing here to see your 'deadlock' scenario from this description.
> Well, stuff blocks until the cookie is released using the dmsetup 
> udevcomplete command, so perhaps 'deadlock' is the wrong wording?
>>
>> Lvm2 takes the VG lock - creates LV - waits for udev till it's finished with 
>> its job and confirms all the udev work with dmsetup udevcomplete.
> 
> So what I understand from this is that udevcomplete ends up never executing?  
> Is there some way of confirming this?

udevcomplete needs someone to have created the 'semaphore' for completion in 
the first place.
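
One way to check whether that semaphore is still around - a sketch, using the 
Semid column from the earlier 'dmsetup udevcookies' output (10 in this case):

dmsetup udevcookies   # Cookie and Semid of every pending synchronization point
ipcs -s               # all System V semaphore sets; the Semid should be listed
ipcs -s -i 10         # details of that particular semaphore set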


>>
>> It's also unclear which OS are you using - Debian, Fedora, ???
> 
> Gentoo.
> 
>> Version of your packages ?
> 
> I thought I did provide this:
> 
> Kernel version was 6.4.12 when this hapened, is now 6.9.3.
> 
> crowsnest [12:19:47] /run/lvm # udevadm --version
> 254
> 
> aka systemd-utils-254.10
> 
> lvm2-2.03.22

Since this is most likely your personal build - please provide the full output 
of the 'lvm version' command.

For 'udev' synchronization, the '--enable-udev_sync' configure option is 
needed. So let's check which configure/build options were used here,
and preferably also the upstream udev rules.
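
A quick way to check both - a sketch, assuming the common /lib/udev/rules.d 
location for the rules (adjust if your distribution uses a different path):

lvm version | tr ' ' '\n' | grep -i udev     # should list --enable-udev_sync and --enable-udev_rules
ls /lib/udev/rules.d/ | grep -Ei 'dm|lvm'    # the device-mapper/lvm rules installed by the package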

> 
> Thanks for the feedback, what you say makes perfect sense, and the implication 
> is that there are only a few options:
> 
> 1.  Something is resulting in the udev trigger to take longer than three 
> minutes, and the dmsetup udevcomplete never being executed.

systemd simply kills the udev worker if it takes too long.

However, on a properly running system it would be very very unusual to hit 
these timeouts - you would need to be working with thousands of devices....

> 
> This could potentially be due to extremely heavy disk IO, or LVM itself 
> freezing IO.

Well, reducing the percentage in '/proc/sys/vm/dirty_ratio' may possibly help
when your disk system is too slow and you build up very lengthy 'sync' IO 
queues...

> I don't see the default value for udev_log from the config. Explicitly set to 
> debug now, but still not seeing anything logged to syslog. Running with udevd 
> --debug, which logs to a ramdisk on /run.  Hopefully (if/when this happens 
> again) that may shed some light.  There is 256GB of RAM available, so as long 
> as the log doesn't grow too quickly should be fine.

A lot of RAM may possibly create a huge amount of dirty pages...

Regards

Zdenek



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-03 19:25     ` Zdenek Kabelac
@ 2024-06-04  8:46       ` Jaco Kroon
  2024-06-04 10:48         ` Roger Heflin
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-04  8:46 UTC (permalink / raw)
  To: Zdenek Kabelac, linux-lvm@lists.linux.dev

Hi,

Please refer below.

On 2024/06/03 21:25, Zdenek Kabelac wrote:
> Dne 03. 06. 24 v 14:56 Jaco Kroon napsal(a):
>> Hi,
>>
>> Thanks for the insight.  Please refer below.
>>
>> On 2024/05/31 14:34, Zdenek Kabelac wrote:
>>> Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a):
>>>> Hi,
>>>>
>>> I'm kind of missing here to see your 'deadlock' scenario from this 
>>> description.
>> Well, stuff blocks until the cookie is released using the dmsetup 
>> udevcomplete command, so perhaps 'deadlock' is the wrong wording?
>>>
>>> Lvm2 takes the VG lock - creates LV - waits for udev till it's 
>>> finished with its job and confirms all the udev work with dmsetup 
>>> udevcomplete.
>>
>> So what I understand from this is that udevcomplete ends up never 
>> executing?  Is there some way of confirming this?
>
> udevcomplete needs  someone to create 'semaphore' for completion in 
> the first place.


I'm not familiar with the LVM internals and the flows of the different 
processes, even though you can probably safely consider me a "semi power 
user".  I do have a compsci background, so I understand most of the 
principles of locking etc, but I'm clueless about how they are applied in 
the LVM environment.

>
>
>>>
>>> It's also unclear which OS are you using - Debian, Fedora, ???
>>
>> Gentoo.
>>
>>> Version of your packages ?
>>
>> I thought I did provide this:
>>
>> Kernel version was 6.4.12 when this hapened, is now 6.9.3.
>>
>> crowsnest [12:19:47] /run/lvm # udevadm --version
>> 254
>>
>> aka systemd-utils-254.10
>>
>> lvm2-2.03.22
>
> Since this is most likely your personal build - please provide full 
> output of
> 'lvm version'  command.


crowsnest [09:46:04] ~ # lvm version
   LVM version:     2.03.22(2) (2023-08-02)
   Library version: 1.02.196 (2023-08-02)
   Driver version:  4.48.0
   Configuration:   ./configure --prefix=/usr 
--build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu 
--mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share 
--sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share 
--disable-dependency-tracking --disable-silent-rules 
--docdir=/usr/share/doc/lvm2-2.03.22-r5 
--htmldir=/usr/share/doc/lvm2-2.03.22-r5/html --enable-dmfilemapd 
--enable-dmeventd --enable-cmdlib --enable-fsadm --enable-lvmpolld 
--with-mirrors=internal --with-snapshots=internal --with-thin=internal 
--with-cache=internal --with-thin-check=/usr/sbin/thin_check 
--with-cache-check=/usr/sbin/cache_check 
--with-thin-dump=/usr/sbin/thin_dump 
--with-cache-dump=/usr/sbin/cache_dump 
--with-thin-repair=/usr/sbin/thin_repair 
--with-cache-repair=/usr/sbin/cache_repair 
--with-thin-restore=/usr/sbin/thin_restore 
--with-cache-restore=/usr/sbin/cache_restore --with-symvers=gnu 
--enable-readline --disable-selinux --enable-pkgconfig 
--with-confdir=/etc --exec-prefix= --sbindir=/sbin 
--with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 
--with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm 
--with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run 
--enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d 
--disable-lvmlockd-sanlock --disable-notify-dbus --disable-app-machineid 
--disable-systemd-journal --without-systemd-run --disable-valgrind-pool 
--with-systemdsystemunitdir=/lib/systemd/system CLDFLAGS=-Wl,-O1 
-Wl,--as-needed


>
> For the 'udev' synchronization, there needs to be '--enable-udev_sync' 
> configure option. So let's check which configure/build option were 
> used here.
> And also preferably upstream udev rules.


--enable-udev_sync is all there.

To the best of my knowledge the udev rules are stock; certainly neither 
I nor any of my colleagues modified them.  They would generally 
defer to me, and I won't touch that unless I understand the 
implications, which in this case I just don't.

>
>>
>> Thanks for the feedback, what you say makes perfect sense, and the 
>> implication is that there are only a few options:
>>
>> 1.  Something is resulting in the udev trigger to take longer than 
>> three minutes, and the dmsetup udevcomplete never being executed.
>
> systemd simply kills udev worker if takes too long.
>
> However on properly running system, it would be very very unusual to 
> hit these timeouts  - you would need to work with thousands of 
> devices....


32 physical NL-SAS drives, combined into 3 RAID6 arrays using mdadm.

These three md devices serve as PVs for LVM, single VG.

73 LVs, just over half of which are mounted.  Most of those are thin 
volumes inside:

crowsnest [09:54:04] ~ # lvdisplay /dev/lvm/thin_pool
   --- Logical volume ---
   LV Name                thin_pool
   VG Name                lvm
   LV UUID                twLSE1-3ckG-WRSO-5eHc-G3fY-YS2v-as4ABC
   LV Write Access        read/write (activated read only)
   LV Creation host, time crowsnest, 2020-02-19 12:26:00 +0200
   LV Pool metadata       thin_pool_tmeta
   LV Pool data           thin_pool_tdata
   LV Status              available
   # open                 0
   LV Size                125.00 TiB
   Allocated pool data    73.57%
   Allocated metadata     9.05%
   Current LE             32768000
   Segments               1
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     1024
   Block device           253:11

The rest are snapshots of the LVs that are mounted, so that we have a 
roll-back destination in case of filesystem corruption.  These snaps are 
made in multiple steps: first a snap of the origin is made, this is then 
fsck'ed, and if that's successful it's fstrim'ed before being renamed into 
the final "save" location (any previously saved copy is first lvremove'd).

I'd describe that as a few tens of devices, not a few hundred and 
certainly not thousands of devices.


>
>>
>> This could potentially be due to extremely heavy disk IO, or LVM 
>> itself freezing IO.
>
> well reducing the percentage of '/proc/sys/vm/dirty_ratio' may 
> possibly help
> when your disk system is too slow and you create a very lengthy 'sync' 
> io queues...


crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio
20

Happy to lower that even more if it would help.

Internet (Redhat) states:

Starts active writeback of dirty data at this percentage of total memory 
for the generator of dirty data, via pdflush. The default value is 40.

I'm assuming the default is 20 though, not 40, since I can't find that 
I've reconfigured this value.

Should probably remain higher than dirty_background_ratio (which is 
currently 10), dirty_background_bytes is 0.
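
For completeness, all of the related knobs can be read in one go (the *_bytes 
variants take precedence over the *_ratio ones whenever they are non-zero):

sysctl vm.dirty_ratio vm.dirty_background_ratio \
       vm.dirty_bytes vm.dirty_background_bytes \
       vm.dirty_writeback_centisecs vm.dirty_expire_centisecs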


>
>> I don't see the default value for udev_log from the config. 
>> Explicitly set to debug now, but still not seeing anything logged to 
>> syslog. Running with udevd --debug, which logs to a ramdisk on /run.  
>> Hopefully (if/when this happens again) that may shed some light.  
>> There is 256GB of RAM available, so as long as the log doesn't grow 
>> too quickly should be fine.
>
> A lot of RAM may possibly create a huge amount of dirty pages...


May I safely interpret this as "lower the dirty_ratio even further"?

Given a value of 10 and 20 I'm assuming that pdflush will start flushing 
out in the background when >~26GB of in-memory data is dirty, or if the 
data has been dirty for more than 5 seconds (dirty_writeback_centisecs = 
500).

Don't mind lowering the dirty_background ratio as low as 1 even? But 
won't the primary dirty_ratio start blocking processes from writing if 
 >40% of the caches/buffers are considered dirty?

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04  8:46       ` Jaco Kroon
@ 2024-06-04 10:48         ` Roger Heflin
  2024-06-04 11:52           ` Jaco Kroon
  0 siblings, 1 reply; 20+ messages in thread
From: Roger Heflin @ 2024-06-04 10:48 UTC (permalink / raw)
  To: Jaco Kroon; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev

On Tue, Jun 4, 2024 at 4:06 AM Jaco Kroon <jaco@uls.co.za> wrote:
>
> Hi,
>
> Please refer below.
>
> >>
> >> This could potentially be due to extremely heavy disk IO, or LVM
> >> itself freezing IO.
> >
> > well reducing the percentage of '/proc/sys/vm/dirty_ratio' may
> > possibly help
> > when your disk system is too slow and you create a very lengthy 'sync'
> > io queues...
>
>
> crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio
> 20
>
> Happy to lower that even more if it would help.
>
> Internet (Redhat) states:
>
> Starts active writeback of dirty data at this percentage of total memory
> for the generator of dirty data, via pdflush. The default value is |40|.
>
> I'm assuming the default is 20 though, not 40, since I can't find that
> I've reconfigured this value.
>
> Should probably remain higher than dirty_background_ratio (which is
> currently 10), dirty_background_bytes is 0.
>
>
> >
> >> I don't see the default value for udev_log from the config.
> >> Explicitly set to debug now, but still not seeing anything logged to
> >> syslog. Running with udevd --debug, which logs to a ramdisk on /run.
> >> Hopefully (if/when this happens again) that may shed some light.
> >> There is 256GB of RAM available, so as long as the log doesn't grow
> >> too quickly should be fine.
> >
> > A lot of RAM may possibly create a huge amount of dirty pages...
>
>
> May I safely interpret this as "lower the dirty_ratio even further"?
>
> Given a value of 10 and 20 I'm assuming that pdflush will start flushing
> out in the background when >~26GB of in-memory data is dirty, or if the
> data has been dirty for more than 5 seconds (dirty_writeback_centisecs =
> 500).
>
> Don't mind lowering the dirty_background ratio as low as 1 even? But
> won't the primary dirty_ratio start blocking processes from writing if
>  >40% of the caches/buffers are considered dirty?
>
> Kind regards,
> Jaco
>
>

Use the *_bytes values.  If they are non-zero then they are used and
that allows setting even below 1% (quite large on anything with a lot
of ram).

I have been using this for quite a while:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000

i.e. 5M and 3M, such that I should never have a huge amount of writes outstanding.

And you can go lower.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 10:48         ` Roger Heflin
@ 2024-06-04 11:52           ` Jaco Kroon
  2024-06-04 13:30             ` Roger Heflin
  2024-06-04 16:07             ` Zdenek Kabelac
  0 siblings, 2 replies; 20+ messages in thread
From: Jaco Kroon @ 2024-06-04 11:52 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev

Hi,

On 2024/06/04 12:48, Roger Heflin wrote:

> Use the *_bytes values.  If they are non-zero then they are used and
> that allows setting even below 1% (quite large on anything with a lot
> of ram).
>
> I have been using this for quite a while:
> vm.dirty_background_bytes = 3000000
> vm.dirty_bytes = 5000000


crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000
vm.dirty_background_bytes = 3000000
crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000
vm.dirty_bytes = 500000000

And persisted via /etc/sysctl.conf
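
For reference, the persisted entries look something like this (values as 
applied above):

# /etc/sysctl.conf
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 500000000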

Thank you.  It must be noted this host doesn't do much else other than disk 
IO, so I'm hoping the 500MB value will be OK; this is just so IO won't 
block tasks that are CPU-heavy at the time.

The purpose of 256GB RAM was so that we could have ~250GB worth of disk 
cache (obviously we don't want all of that to be dirty; OS and "used" 
used to be below 4GB, now generally around 8-12GB; currently it's in 
"quiet" time, so a bit lower, just busy running some background 
compression).  As per iostat:

avg-cpu:  %user   %nice %system %iowait %steal   %idle
            7.73   18.43   18.96   37.86    0.00   17.01

Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
md2             392.13        10.00         5.11         0.00    4244888    2167644          0
md3            2270.12        43.88        56.82         0.00   18626309   24120982          0
md4            1406.06        30.47        16.83         0.00   12934654    7143330          0

That's a total of 35805851 MB (34.1TB) read and 33431956 MB (31.9TB) written 
in just under 5 days.

What I am noticing immediately is that the "free" value as per "free -m" 
is definitely much higher, which to me is indicative that we're not 
caching as aggressively as can be done.  Will monitor this for the time 
being:

crowsnest [13:50:09] ~ # free -m
               total        used        free      shared  buff/cache   available
Mem:          257661        6911      105313           7      145436      248246
Swap:              0           0           0

The Total DISK WRITE and Current DISK Write values in iotop seem to 
have a tighter correlation now (no longer seeing constant Total DISK 
WRITE with spikes in current; it seems to be more even now).

Kind regards,
Jaco

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 11:52           ` Jaco Kroon
@ 2024-06-04 13:30             ` Roger Heflin
  2024-06-04 13:46               ` Stuart D Gathman
  2024-06-04 14:07               ` Jaco Kroon
  2024-06-04 16:07             ` Zdenek Kabelac
  1 sibling, 2 replies; 20+ messages in thread
From: Roger Heflin @ 2024-06-04 13:30 UTC (permalink / raw)
  To: Jaco Kroon; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev

My experience is that heavy disk io/batch disk io systems work better
with these values being smallish.

I.e. both even under 10MB or so.  About all that having the number larger
has done is trick IO benchmarks that don't force a sync at the end,
and/or appear to make large saves happen faster.

There is also the freeze/pause of roughly (outstanding writes in MB) /
(IO rate in MB/s) seconds; smaller values shorten the freeze.

I don't see a use case for having large values.  It seems to have no
real upside and several downsides.  Get the buffer size small enough and
you will still get pauses to clear the writes, but the pauses will be
short enough to not be a problem.


On Tue, Jun 4, 2024 at 6:52 AM Jaco Kroon <jaco@uls.co.za> wrote:
>
> Hi,
>
> On 2024/06/04 12:48, Roger Heflin wrote:
>
> > Use the *_bytes values.  If they are non-zero then they are used and
> > that allows setting even below 1% (quite large on anything with a lot
> > of ram).
> >
> > I have been using this for quite a while:
> > vm.dirty_background_bytes = 3000000
> > vm.dirty_bytes = 5000000
>
>
> crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000
> vm.dirty_background_bytes = 3000000
> crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000
> vm.dirty_bytes = 500000000
>
> And persisted via /etc/sysctl.conf
>
> Thank you.  Must be noted this host doesn't do much else other than disk
> IO, so I'm hoping the 500MB value will be OK, this is just so IO won't
> block CPU heavy-at-the-time tasks.
>
> The purpose of 256GB RAM was so that we could have ~250GB worth of disk
> cache (obviously we don't want all of that to be dirty, OS and "used"
> used to be below 4GB, now generally around 8-12GB, currently it's in
> "quiet" time, so a bit lower, just busy running some background
> compression).  As per iostat:
>
> avg-cpu:  %user   %nice %system %iowait %steal   %idle
>             7.73   18.43   18.96   37.86    0.00   17.01
>
> Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s MB_read
> MB_wrtn    MB_dscd
> md2             392.13        10.00         5.11         0.00 4244888
> 2167644          0
> md3            2270.12        43.88        56.82         0.00 18626309
> 24120982          0
> md4            1406.06        30.47        16.83         0.00
> 12934654    7143330          0
>
> That's total 35805851 MB (34.1B) read and 33431956 MB (31.9TB) written
> in just under 5 days.
>
> What I am noticing immediately is that the "free" value as per "free -m"
> is definitely much higher, which to me is indicative that we're not
> caching as aggressively as can be done.  Will monitor this for the time
> being:
>
> crowsnest [13:50:09] ~ # free -m
>                 total        used        free      shared buff/cache
> available
> Mem:          257661        6911      105313           7 145436      248246
> Swap:              0           0           0
>
> The Total DISK WRITE and Current DISK Write values in in iotop seems to
> have a tighter correlation now (no longer seeing constant Total DISK
> WRITE with spikes in current, seems to be more even now).
>
> Kind regards,
> Jaco

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 13:30             ` Roger Heflin
@ 2024-06-04 13:46               ` Stuart D Gathman
  2024-06-04 14:49                 ` Jaco Kroon
  2024-06-04 14:07               ` Jaco Kroon
  1 sibling, 1 reply; 20+ messages in thread
From: Stuart D Gathman @ 2024-06-04 13:46 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Jaco Kroon, Zdenek Kabelac, linux-lvm@lists.linux.dev

On Tue, 4 Jun 2024, Roger Heflin wrote:

> My experience is that heavy disk io/batch disk io systems work better
> with these values being smallish.

> I don't see a use case for having large values.   It seems to have no
> real upside and several downsides.  Get the buffer size small enough
> and you will still get pauses to clear the writes the be pauses will
> be short enough to not be a problem.

Not a normal situation, but I should mention my recent experience.
One of the disks in an underlying RAID was going bad.  It still worked,
but the disk struggled manfully with multiple retries and recalibrates
to complete many reads/writes - i.e. it was extremely slow.  I was
running into all kinds of strange boundary conditions because of this.
E.g. VMs were getting timeouts on their virtio disk devices, leading
to file system corruption and other issues.

I was not modifying any LVM volumes, so did not run into any problems
with LVM - but that is a boundary condition to keep in mind.  You
don't necessarily need to fully work under such conditions, but need
to do something sane.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 13:30             ` Roger Heflin
  2024-06-04 13:46               ` Stuart D Gathman
@ 2024-06-04 14:07               ` Jaco Kroon
  1 sibling, 0 replies; 20+ messages in thread
From: Jaco Kroon @ 2024-06-04 14:07 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev

Hi,

On 2024/06/04 15:30, Roger Heflin wrote:
> My experience is that heavy disk io/batch disk io systems work better
> with these values being smallish.
>
> Ie both even under 10MB or so.    About all having the number larger
> has done is trick io benchmarks that don't force a sync at the end,
> and/or appear to make large saves happen faster.
>
> There is also the freeze/pause for outstandingwritesMB/<iorateMB>
> seconds, smaller shortens the freeze.
>
> I don't see a use case for having large values.   It seems to have no
> real upside and several downsides.  Get the buffer size small enough
> and you will still get pauses to clear the writes the be pauses will
> be short enough to not be a problem.

Thanks, this is extremely insightful.  So with the original values there 
could be "up to" ~50GB outstanding for write.  Let's assume that's all 
going to one disk (extremely unlikely, and assuming 100MB/s, which is 
optimistic if it's random access): that will take upwards of 500 seconds, 
which is a hellishly long time in our world.

I think with the 500MB value I've set now, a sync should almost never 
exceed 10s or so even if everything is targeted at a single drive.  I 
think we're OK with that on this specific host.

Kind regards,
Jaco

>
>
> On Tue, Jun 4, 2024 at 6:52 AM Jaco Kroon <jaco@uls.co.za> wrote:
>> Hi,
>>
>> On 2024/06/04 12:48, Roger Heflin wrote:
>>
>>> Use the *_bytes values.  If they are non-zero then they are used and
>>> that allows setting even below 1% (quite large on anything with a lot
>>> of ram).
>>>
>>> I have been using this for quite a while:
>>> vm.dirty_background_bytes = 3000000
>>> vm.dirty_bytes = 5000000
>>
>> crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000
>> vm.dirty_background_bytes = 3000000
>> crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000
>> vm.dirty_bytes = 500000000
>>
>> And persisted via /etc/sysctl.conf
>>
>> Thank you.  Must be noted this host doesn't do much else other than disk
>> IO, so I'm hoping the 500MB value will be OK, this is just so IO won't
>> block CPU heavy-at-the-time tasks.
>>
>> The purpose of 256GB RAM was so that we could have ~250GB worth of disk
>> cache (obviously we don't want all of that to be dirty, OS and "used"
>> used to be below 4GB, now generally around 8-12GB, currently it's in
>> "quiet" time, so a bit lower, just busy running some background
>> compression).  As per iostat:
>>
>> avg-cpu:  %user   %nice %system %iowait %steal   %idle
>>              7.73   18.43   18.96   37.86    0.00   17.01
>>
>> Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s MB_read
>> MB_wrtn    MB_dscd
>> md2             392.13        10.00         5.11         0.00 4244888
>> 2167644          0
>> md3            2270.12        43.88        56.82         0.00 18626309
>> 24120982          0
>> md4            1406.06        30.47        16.83         0.00
>> 12934654    7143330          0
>>
>> That's total 35805851 MB (34.1B) read and 33431956 MB (31.9TB) written
>> in just under 5 days.
>>
>> What I am noticing immediately is that the "free" value as per "free -m"
>> is definitely much higher, which to me is indicative that we're not
>> caching as aggressively as can be done.  Will monitor this for the time
>> being:
>>
>> crowsnest [13:50:09] ~ # free -m
>>                  total        used        free      shared buff/cache
>> available
>> Mem:          257661        6911      105313           7 145436      248246
>> Swap:              0           0           0
>>
>> The Total DISK WRITE and Current DISK Write values in in iotop seems to
>> have a tighter correlation now (no longer seeing constant Total DISK
>> WRITE with spikes in current, seems to be more even now).
>>
>> Kind regards,
>> Jaco

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 13:46               ` Stuart D Gathman
@ 2024-06-04 14:49                 ` Jaco Kroon
  2024-06-04 15:03                   ` Roger Heflin
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-04 14:49 UTC (permalink / raw)
  To: Stuart D Gathman, Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev

Hi,

On 2024/06/04 15:46, Stuart D Gathman wrote:
> On Tue, 4 Jun 2024, Roger Heflin wrote:
>
>> My experience is that heavy disk io/batch disk io systems work better
>> with these values being smallish.
>
>> I don't see a use case for having large values.   It seems to have no
>> real upside and several downsides.  Get the buffer size small enough
>> and you will still get pauses to clear the writes the be pauses will
>> be short enough to not be a problem.
>
> Not a normal situation, but I should mention my recent experience.
> One of the disks in an underlying RAID was going bad.  It still worked,
> but the disk struggled manfully with multiple retries and recalibrates
> to complete many reads/writes - i.e. it was extremely slow.  I was
> running into all kinds of strange boundary conditions because of this.
> E.g. VMs were getting timeouts on their virtio disk devices, leading
> to file system corruption and other issues.
>
> I was not modifying any LVM volumes, so did not run into any problems
> with LVM - but that is a boundary condition to keep in mind.  You
> don't necessarily need to fully work under such conditions, but need
> to do something sane.

On SAS or NL-SAS drives?

I've seen this before on SATA drives, and it's probably the single biggest 
reason why I have a major dislike for deploying SATA drives in any kind 
of high-reliability environment.

Regardless, we do monitor all cases using smartd and it *usually* picks 
up a bad drive before it gets to the above point of pain, but with SATA 
drives this isn't always the case: the drive will simply keep retrying 
indefinitely, blocking request slot numbers over time (the SATA protocol 
handles 32 outstanding requests IIRC, but for how long can the Linux 
kernel re-use a slot it has never received a response on?) and getting 
slower and slower until you power cycle the drive, after which it's fine 
again for a while.  Never had that crap with NL-SAS drives.

Specific host is all NL-SAS.  No VM involvement here.

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 14:49                 ` Jaco Kroon
@ 2024-06-04 15:03                   ` Roger Heflin
  0 siblings, 0 replies; 20+ messages in thread
From: Roger Heflin @ 2024-06-04 15:03 UTC (permalink / raw)
  To: Jaco Kroon; +Cc: Stuart D Gathman, Zdenek Kabelac, linux-lvm@lists.linux.dev

The SATA disks work ok if you use smartctl to set the SCTERC timeout
as low as possible (smartctl -l scterc,20,20 /dev/${drive} ).  I have
a set of commands that starts high and sets it lower, the idea being
that each manufacturer's disk will have a different minimum value and I
simply want it as low as it can go.
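
Not the actual script, just a minimal sketch of that idea, assuming /dev/sdX
(and assuming the drive rejects values below its minimum, so the last value
that sticks is the lowest it accepts):

for t in 200 100 70 50 30 20; do
    smartctl -q errorsonly -l scterc,$t,$t /dev/sdX
done
smartctl -l scterc /dev/sdX    # show the value that actually stuck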

Desktop/Green disks do not have a settable timeout and time out at 60
seconds or more.

Red/NAS/Video/Enterprise/Purple SATA and such typically time out at
10sec but can be set lower.  SAS timeouts are typically 1sec or lower
on a bad sector.

And I have personally dealt with "enterprise" vendors that, when using
SATA, leave the timeout at the default (10 seconds) rather than lowering it
so that the disks work reasonably when bad sectors are happening.

On Tue, Jun 4, 2024 at 9:50 AM Jaco Kroon <jaco@uls.co.za> wrote:
>
> Hi,
>
> On 2024/06/04 15:46, Stuart D Gathman wrote:
> > On Tue, 4 Jun 2024, Roger Heflin wrote:
> >
> >> My experience is that heavy disk io/batch disk io systems work better
> >> with these values being smallish.
> >
> >> I don't see a use case for having large values.   It seems to have no
> >> real upside and several downsides.  Get the buffer size small enough
> >> and you will still get pauses to clear the writes the be pauses will
> >> be short enough to not be a problem.
> >
> > Not a normal situation, but I should mention my recent experience.
> > One of the disks in an underlying RAID was going bad.  It still worked,
> > but the disk struggled manfully with multiple retries and recalibrates
> > to complete many reads/writes - i.e. it was extremely slow.  I was
> > running into all kinds of strange boundary conditions because of this.
> > E.g. VMs were getting timeouts on their virtio disk devices, leading
> > to file system corruption and other issues.
> >
> > I was not modifying any LVM volumes, so did not run into any problems
> > with LVM - but that is a boundary condition to keep in mind.  You
> > don't necessarily need to fully work under such conditions, but need
> > to do something sane.
>
> On SAS or NL-SAS drives?
>
> I've seen this before on SATA drives, and is probably the single biggest
> reason why I have a major dislike for deploying SATA drives to any kind
> of high-reliability environment.
>
> Regardless, we do monitor all cases using smartd and it *usually* picks
> up a bad drive before it gets to the above point of pain but with SATA
> drives this isn't always the case, and the drive will simply
> indefinitely keep retrying, blocking request slot numbers over time
> (SATA protocol handles 32 requests IIRC, but after how long can the
> Linux kernel re-use a number it has never received a response on kind of
> problem) and getting slower and slower until you power cycle the drive,
> after which it's fine again for a while.  Never had that crap with
> NL-SAS drives.
>
> Specific host is all NL-SAS.  No VM involvement here.
>
> Kind regards,
> Jaco
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 11:52           ` Jaco Kroon
  2024-06-04 13:30             ` Roger Heflin
@ 2024-06-04 16:07             ` Zdenek Kabelac
  2024-06-05  8:59               ` Jaco Kroon
  1 sibling, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-06-04 16:07 UTC (permalink / raw)
  To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a):
> Hi,
> 
> On 2024/06/04 12:48, Roger Heflin wrote:
> 
>> Use the *_bytes values.  If they are non-zero then they are used and
>> that allows setting even below 1% (quite large on anything with a lot
>> of ram).
>>
>> I have been using this for quite a while:
>> vm.dirty_background_bytes = 3000000
>> vm.dirty_bytes = 5000000
> 
> 
> What I am noticing immediately is that the "free" value as per "free -m" is 
> definitely much higher, which to me is indicative that we're not caching as 
> aggressively as can be done.  Will monitor this for the time being:
> 
> crowsnest [13:50:09] ~ # free -m
>                 total        used        free      shared buff/cache available
> Mem:          257661        6911      105313           7 145436      248246
> Swap:              0           0           0
> 
> The Total DISK WRITE and Current DISK Write values in in iotop seems to have a 
> tighter correlation now (no longer seeing constant Total DISK WRITE with 
> spikes in current, seems to be more even now).

Hi

So now, while we are solving various system settings - there are more things to 
think through.

The big 'range' of unwritten data may put it at risk in case of a 'power' failure.
On the other hand, large 'dirty pages' allow the system to 'optimize' and even 
bypass storing them on disk if they are frequently changed - so in this case 
a 'lower' dirty ratio may cause a significant performance impact - so please 
check what the typical workload is and what the result is...

It's worth mentioning that lvm2 supports the writecache target to kind of 
offload dirty pages to fast storage...

Last but not least - disk scheduling policies also have an impact - e.g. to 
ensure better fairness - at the price of lower throughput...
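
For reference, the scheduler can be inspected and switched per device at 
runtime - a sketch, assuming sda is one of the member disks (the scheduler 
on the md members is what matters here; the md device itself typically 
shows none):

cat /sys/block/sda/queue/scheduler                  # e.g. [mq-deadline] kyber bfq none
echo mq-deadline > /sys/block/sda/queue/scheduler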

So now let's get back to the lvm2 'possible' deadlock - which I'm still not 
fully certain we have deciphered in this thread yet.

So if you happen to 'spot' stuck commands - do you notice anything strange 
in the systemd journal?  Usually when systemd decides to kill a udevd worker 
task, it's briefly stated in the journal - with this check we would kind of 
know that the reason for your problems was a killed worker that was not able 
to 'finalize' the lvm command, which is waiting for confirmation from udev 
(currently without any timeout limits).

To unstick such a command, 'udevcomplete_all' is a cure - but as said - the 
system is already in a kind of 'damaged' state, since udev is failing and has 
'invalid' information about devices...

So maybe you could check whether your journal around the date & time of the 
problem has some 'interesting' 'killing action' record?
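
A hedged example of such a check, assuming systemd journald - on a syslog-only 
setup the same grep can be pointed at /var/log/messages instead:

journalctl --since "2024-05-30 02:00" --until "2024-05-30 03:00" | grep -iE 'udev|kill|timeout'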

Regards

Zdenek

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-04 16:07             ` Zdenek Kabelac
@ 2024-06-05  8:59               ` Jaco Kroon
  2024-06-06 22:14                 ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-05  8:59 UTC (permalink / raw)
  To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Hi,

On 2024/06/04 18:07, Zdenek Kabelac wrote:
> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a):
>> Hi,
>>
>> On 2024/06/04 12:48, Roger Heflin wrote:
>>
>>> Use the *_bytes values.  If they are non-zero then they are used and
>>> that allows setting even below 1% (quite large on anything with a lot
>>> of ram).
>>>
>>> I have been using this for quite a while:
>>> vm.dirty_background_bytes = 3000000
>>> vm.dirty_bytes = 5000000
>>
>>
>> What I am noticing immediately is that the "free" value as per "free 
>> -m" is definitely much higher, which to me is indicative that we're 
>> not caching as aggressively as can be done.  Will monitor this for 
>> the time being:
>>
>> crowsnest [13:50:09] ~ # free -m
>>                 total        used        free      shared buff/cache 
>> available
>> Mem:          257661        6911      105313           7 145436      
>> 248246
>> Swap:              0           0           0
>>
>> The Total DISK WRITE and Current DISK Write values in in iotop seems 
>> to have a tighter correlation now (no longer seeing constant Total 
>> DISK WRITE with spikes in current, seems to be more even now).
The free value has now dropped drastically anyway.  So it looks like the 
increase in free was a temporary situation.
>
> Hi
>
> So now while we are solving various system setting - there are more 
> things to think through.
Yea.  Realised we derailed, but given that the theory is that "stuff" is 
blocking the completion (probably due to backlogged IO?), it's not 
completely unrelated, is it?
>
> The big 'range' of unwritten data may put them in risk for the 'power' 
> failure.

I'd be more worried about host crash in this case to be honest (dual PSU 
and in several years we've not had a single phase or PDU failure).

> On the other hand large  'dirty pages'  allows system to 'optimize'  
> and even bypass storing them on disk if they are frequently changed - 
> so in this case 'lower' dirty ration may cause significant performance 
> impact - so please check whats the typical workload and what is result...

Based on observations from task timings last night I reckon workloads 
are around 25% faster on average.  Tasks that used to run just shy of 20 
hours (they would still have been busy right now) completed last night in 
just under 15 hours.  This would need to be monitored over time though, as a 
single run is definitely not authoritative.  This was with the _bytes 
settings as suggested by Roger.

For the specific use-case I doubt "frequently changed" applies, and it's 
probably best to get the data persisted as soon as possible, allowing 
for improved "future IO capacity" (hope my wording makes sense).

>
> It's worth to mention lvm2 support  writecache target to kind of 
> offload dirty pages to fast storage...
We normally use raid controller battery backup for this in other 
environments; that's not relevant in this specific case though.  In other 
environments we are using dm-cache mostly as a read-cache (i.e. a 
write-through strategy) on NVMe, because the raid controller, whilst 
buffering writes, really sucks at serving reads - which, given the 
nature of spinning drives, makes perfect sense - and given the amount of 
READ on those two hosts the NVMe setup more than quadrupled throughput 
there.
>
> Last but not least -  disk scheduling policies also do have impact - 
> to i.e. ensure better fairness - at the prices of lower throughput...
We normally use mq-deadline; in this setup I notice this has been 
changed to "none".  The plan was to revert; this was done following a 
discussion with Bart van Assche.  Happy to revert this to be honest. 
https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ 
relates.
>
> So now let's get back to lvm2  'possible' deadlock - which I'm still 
> not fully certain we deciphered in this thread yet.
>
> So if you happen to 'spot' stuck  commands -  do you notice anything 
> strange in systemd journal -  usually when  systemd decides to kill 
> udevd worker task - it's briefly stated in journal - with this check 
> we would kind of know that reason of your problems was killed worked 
> that was not able to 'finalize' lvm command which is waiting for 
> confirmation from udev (currently without any timeout limits).

Not using systemd, but udev does come from the systemd package.  Nothing 
in the logs at all for udev, as mentioned previously.  I don't seem to be 
able to get normal logs working, but I have set up the debug log now.  
This logs in great detail, except there are no timestamps.  So *if* 
this happens again, hopefully we'll be able to look for a worker that 
was killed rather than one that merely exited.  What I can see is that it 
looks like a single forked worker can perform multiple tasks and execute 
multiple other calls, so I believe the three minute timeout is 
*overall*, not just on a single RUN command, which implies that the 
theory that udevcomplete is never signalled is very much valid.

>
> To unstuck such command  'udevcomplete_all' is a cure - but as said - 
> the system is already kind of 'damaged' since udev is failing and has 
> 'invalid' information about devices...
Agreed.  It gets things going again, which really just allows for a 
cleaner reboot rather than echo b > /proc/sysrq-trigger or remotely 
yanking the power (which is where we normally end up if we don't 
catch it early enough).
>
> So maybe you could check whether your journal around date&time of 
> problem has some 'interesting'  'killing action' record ?

If we can get normal udev logging working correctly that would be great, 
but this is not your responsibility, so let me figure out how I can get 
udevd to log to syslog (if that is even possible given the way things 
seem to be moving with systemd).

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-05  8:59               ` Jaco Kroon
@ 2024-06-06 22:14                 ` Zdenek Kabelac
  2024-06-06 22:17                   ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-06-06 22:14 UTC (permalink / raw)
  To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a):
> Hi,
> 
> On 2024/06/04 18:07, Zdenek Kabelac wrote:
>> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a):
>> Last but not least -  disk scheduling policies also do have impact - to i.e. 
>> ensure better fairness - at the prices of lower throughput...
> We normally use mq-deadline, in this setup I notice this has been updated to 
> "none", the plan was to revert, this was done in collaboration with a 
> discussion with Bart van Assche.  Happy to revert this to be honest. 
> https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ 
> relates.

Hi

So I guess we can tell the story like this -

When you've created your 'snapshot' of a thin volume - this enforces a full 
flush (& fsfreeze) of the thin volume - so any dirty pages need to be written 
into the thin pool before the snapshot can be taken (and the thin pool should 
not run out of space) - this CAN potentially hold your system for a long time 
(depending on the performance of your storage) and may cause various lock-up 
states of your system if you are using this 'snapshotted' volume for anything 
else - as the volume is suspended, it blocks further operations on this 
device - eventually causing a full system circular deadlock (catch 22) - this 
is hard to analyze without the whole picture of the system.

We may eventually think about whether we can somehow minimize the amount of 
time holding the VG lock and suspending with flush & fsfreeze - but that's a 
possible future enhancement; for now, flush the disk upfront to minimize the 
dirty size.

For now, reducing the dirty page queue to minimize the blocking time associated 
with snapshotting is the right choice (although 500M is probably unnecessarily 
low...)


Regards

Zdenek


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-06 22:14                 ` Zdenek Kabelac
@ 2024-06-06 22:17                   ` Zdenek Kabelac
  2024-06-07  9:03                     ` Jaco Kroon
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-06-06 22:17 UTC (permalink / raw)
  To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a):
> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a):
>> Hi,
>>
>> On 2024/06/04 18:07, Zdenek Kabelac wrote:
>>> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a):
>>> Last but not least -  disk scheduling policies also do have impact - to 
>>> i.e. ensure better fairness - at the prices of lower throughput...
>> We normally use mq-deadline, in this setup I notice this has been updated to 
>> "none", the plan was to revert, this was done in collaboration with a 
>> discussion with Bart van Assche.  Happy to revert this to be honest. 
>> https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ 
>> relates.
> 
> Hi
> 
> So I guess we can tell the store like this -
> 
> When you've created your 'snapshot' of a thin-volume - this enforces full 
> flush (& fsfreeze) of a thin volume - so any dirty pages need to written in 
> thin pool before snapshot could be taken (and thin pool should not run out of 
> space) - this CAN potentially hold your system running for a long time 
> (depending on performance of your storage) and may cause various lock-ups 
> states of your system if you are using this 'snapshoted' volume for anything 
> else - as the volume is suspended - so it blocks further operations on this 
> device  - eventually causing full system circular deadlock  (catch 22) - this 
> is hard to analyze without whole picture of the system.
> 
> We may eventually think whether we can somehow minimize the amount of holding
> vglock and suspending with flush & fsfreeze -  but it's about some future 
> possible enhancement and flush disk upfront to minimize dirty size.

I forgot to mention that the 'simplest' way is just to run 'sync' before 
running the 'lvcreate -s ...' command...
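
In shell terms, something along these lines - reusing the lvcreate invocation 
from the start of the thread:

sync    # flush dirty pages first, so the suspend/fsfreeze window stays short
/sbin/lvcreate -kn -An -s -n fsck_cerberus /dev/lvm/backup_cerberus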

Regards

Zdenek



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-06 22:17                   ` Zdenek Kabelac
@ 2024-06-07  9:03                     ` Jaco Kroon
  2024-06-07  9:26                       ` Zdenek Kabelac
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-07  9:03 UTC (permalink / raw)
  To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Hi,

On 2024/06/07 00:17, Zdenek Kabelac wrote:
> Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a):
>> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a):
>>> Hi,
>>>
>>> On 2024/06/04 18:07, Zdenek Kabelac wrote:
>>>> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a):
>>>> Last but not least -  disk scheduling policies also do have impact 
>>>> - to i.e. ensure better fairness - at the prices of lower 
>>>> throughput...
>>> We normally use mq-deadline, in this setup I notice this has been 
>>> updated to "none", the plan was to revert, this was done in 
>>> collaboration with a discussion with Bart van Assche. Happy to 
>>> revert this to be honest. 
>>> https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ 
>>> relates.
>>
>> Hi
>>
>> So I guess we can tell the store like this -
>>
>> When you've created your 'snapshot' of a thin-volume - this enforces 
>> full flush (& fsfreeze) of a thin volume - so any dirty pages need to 
>> written in thin pool before snapshot could be taken (and thin pool 
>> should not run out of space) - this CAN potentially hold your system 
>> running for a long time (depending on performance of your storage) 
>> and may cause various lock-ups states of your system if you are using 
>> this 'snapshoted' volume for anything else - as the volume is 
>> suspended - so it blocks further operations on this device  - 
>> eventually causing full system circular deadlock  (catch 22) - this 
>> is hard to analyze without whole picture of the system.
>>
>> We may eventually think whether we can somehow minimize the amount of 
>> holding
>> vglock and suspending with flush & fsfreeze -  but it's about some 
>> future possible enhancement and flush disk upfront to minimize dirty 
>> size.
>
> I've forget to mention that a 'simplest' way is just to run 'sync' 
> before running 'lvcreate -s...' command...

Thanks.  I think all in all everything mentioned here makes a lot of 
sense, and (in my opinion at least) explains the symptoms we've been seeing.

Overall the system does "feel" more responsive with the lower dirty 
buffers, and most likely it helps with data persistence (as has been 
mentioned) in case of system crashes and/or loss of power.

The tasks during peak usage also does seem to run faster on average, I 
suspect this is because of the use-case for this host:

1.  Data is seldom overwritten (this was touched on).  Pretty much 
everything is WORM-type access (Write-Once, Read-Many).
2.  Caches are mostly needed to keep read bandwidth from consuming 
capacity needed for writing.
3.  It's thus beneficial to get writes out of the way as soon as 
possible, rather than having to block at a later stage to get many 
writes done for a flush() or sync() or lvcreate (snapshot).

Is 500MB needlessly low?  Probably.  But given the above I think this is 
acceptable.  Rather keep the disk writing *now* in order to free up 
*future* capacity.

I'm guessing your "simple way" is workable for the generic case as well; 
towards that end, wouldn't a relatively simple change to the lvm2 tools 
be to add a syncfs() call to lvcreate *just prior* to freezing?  
The hard part is probably figuring out whether the LV is mounted somewhere, 
and if it is, open()ing that path in order to have a file descriptor to 
pass to syncfs().  Obviously if the LV isn't mounted none of this is a 
concern and we can just proceed.

What would be more interesting is what happens if cluster-lvm is in play and 
the origin LV is active/open on an alternative node?  But that's well beyond 
the scope of our requirements (for now).

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-07  9:03                     ` Jaco Kroon
@ 2024-06-07  9:26                       ` Zdenek Kabelac
  2024-06-07  9:36                         ` Jaco Kroon
  0 siblings, 1 reply; 20+ messages in thread
From: Zdenek Kabelac @ 2024-06-07  9:26 UTC (permalink / raw)
  To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Dne 07. 06. 24 v 11:03 Jaco Kroon napsal(a):
> Hi,
> 
> On 2024/06/07 00:17, Zdenek Kabelac wrote:
>> Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a):
>>> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a):
>>>> Hi,
>>>
>> 
> I'm guessing your "simple way" is workable for the generic case as well, 
> towards that end, is a relatively simple change to the lvm2 tools not perhaps 
> to add an syncfs() call to lvcreate *just prior* to freezing? The hard part is 
> probably to figure out if the LV is mounted somewhere, and if it is, to open() 
> that path in order to have a file-descriptor to pass to syncfs()?  Obviously 
> if the LV isn't mounted none of this is a concern and we can just proceed.
>


Hi

There is no simple answer here -

a) 'sync' flushes all IO for all disks in the system - the user can also play 
with tools like hdparm -F /dev/xxxx - so everything is still within the 
admin's hands...

b) it's about the definition of the 'snapshot' moment - do you want to take 
the snapshot as of 'now', or after possibly X minutes when everything has been 
flushed and new data has meanwhile flowed in??

c) lvm2 needs some 'multi LV' atomic snapshot support...

d) with thin-pool and the out-of-space potential it gets more tricky....


> What would be more interesting is if cluster-lvm is in play and the origin LV 
> is active/open on an alternative node?  But that's well beyond the scope of 
> our requirements (for now).

Clearly, in the cluster case the user can use a multi-node active LV only if 
there is something that is able to 'manage' this storage - e.g. gfs2.  Surely 
using ext4/xfs this way is out of the question...

Regards

Zdenek



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: lvm2 deadlock
  2024-06-07  9:26                       ` Zdenek Kabelac
@ 2024-06-07  9:36                         ` Jaco Kroon
  2024-09-02  5:48                           ` Unsubscribe box, listen
  0 siblings, 1 reply; 20+ messages in thread
From: Jaco Kroon @ 2024-06-07  9:36 UTC (permalink / raw)
  To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev

Hi,

On 2024/06/07 11:26, Zdenek Kabelac wrote:
> Dne 07. 06. 24 v 11:03 Jaco Kroon napsal(a):
>> Hi,
>>
>> On 2024/06/07 00:17, Zdenek Kabelac wrote:
>>> Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a):
>>>> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a):
>>>>> Hi,
>>>>
>>>
>> I'm guessing your "simple way" is workable for the generic case as 
>> well, towards that end, is a relatively simple change to the lvm2 
>> tools not perhaps to add an syncfs() call to lvcreate *just prior* to 
>> freezing? The hard part is probably to figure out if the LV is 
>> mounted somewhere, and if it is, to open() that path in order to have 
>> a file-descriptor to pass to syncfs()? Obviously if the LV isn't 
>> mounted none of this is a concern and we can just proceed.
>>
>
>
> Hi
>
> There is no simple answer here -
>
> a) 'sync' flushes all io for all disk in the system - user can play 
> with tools like hdparm -F /dev/xxxx  - so still everything in range of 
> 'admin's hand'...
Fair.  Or sync -f /path/to/mountpoint.
>
> b) it's about the definition of the 'snapshot' moment - do you want to 
> take snapshot as of 'now'  or after possibly X minutes where 
> everything has been flushed and meanwhile new data flown-in ??
Oh yea, that's very valid: so instead of just lvcreate, the sysadmin 
should sync -f /path/to/mountpoint *before* issuing lvcreate in the case 
where "possibly X minutes from now" is acceptable.  Guessing this could be 
a --pre-sync argument for lvcreate, but obviously the sysadmin is 
perfectly capable (if aware of this caveat) of just running sync -f 
/path/to/mountpoint just before lvcreate.
>
> c) lvm2 needs some 'multi LV' atomic snapshot support...
>
> d) with thin-pool and out-of-space potential it gets more tricky....
>
>
>> What would be more interesting is if cluster-lvm is in play and the 
>> origin LV is active/open on an alternative node?  But that's well 
>> beyond the scope of our requirements (for now).
>
> Clearly in the cluster case user can use multi-node active LV only in 
> the case there is something that is able to 'manage' this storage - 
> i.g. gfs2.   Surely use of ext4/xfs this way is out of question...

I was referring to the case where an LV is only active on *one* node at a 
time, but it's on shared physical storage.  I'm not even sure if a thin pool 
can be active on more than one node at a time in such a case; this is 
research I've not yet done.  We tried gfs2 a few years back and the 
sheer number of unresolvable failure scenarios at the time just had us 
switch to glusterfs instead.

I think this can be considered closed now.

Thanks again for all the help and insight, I thoroughly enjoyed the 
discussion too, it was most insightful and I learned a lot from it.

Kind regards,
Jaco


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Unsubscribe
  2024-06-07  9:36                         ` Jaco Kroon
@ 2024-09-02  5:48                           ` box, listen
  0 siblings, 0 replies; 20+ messages in thread
From: box, listen @ 2024-09-02  5:48 UTC (permalink / raw)
  To: linux-lvm

Unsubscribe
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-09-02  5:55 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-30 10:21 lvm2 deadlock Jaco Kroon
2024-05-31 12:34 ` Zdenek Kabelac
2024-06-03 12:56   ` Jaco Kroon
2024-06-03 19:25     ` Zdenek Kabelac
2024-06-04  8:46       ` Jaco Kroon
2024-06-04 10:48         ` Roger Heflin
2024-06-04 11:52           ` Jaco Kroon
2024-06-04 13:30             ` Roger Heflin
2024-06-04 13:46               ` Stuart D Gathman
2024-06-04 14:49                 ` Jaco Kroon
2024-06-04 15:03                   ` Roger Heflin
2024-06-04 14:07               ` Jaco Kroon
2024-06-04 16:07             ` Zdenek Kabelac
2024-06-05  8:59               ` Jaco Kroon
2024-06-06 22:14                 ` Zdenek Kabelac
2024-06-06 22:17                   ` Zdenek Kabelac
2024-06-07  9:03                     ` Jaco Kroon
2024-06-07  9:26                       ` Zdenek Kabelac
2024-06-07  9:36                         ` Jaco Kroon
2024-09-02  5:48                           ` Unsubscribe box, listen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).