* lvm2 deadlock @ 2024-05-30 10:21 Jaco Kroon 2024-05-31 12:34 ` Zdenek Kabelac 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-05-30 10:21 UTC (permalink / raw) To: linux-lvm Hi, Possible lvm2 command deadlock scenario: crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/* /run/lock/lvm/P_global: 17231 /run/lock/lvm/V_lvm: 16087 17231 crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231' 24437 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 16087 ? S 0:00 | | \_ /sbin/lvcreate -kn -An -s -n fsck_cerberus /dev/lvm/backup_cerberus 17231 ? S 0:00 | \_ /sbin/lvs --noheadings --nameprefixes crowsnest [12:17:40] /run/lvm # dmsetup udevcookies Cookie Semid Value Last semop time Last change time 0xd4d2051 10 1 Thu May 30 02:34:05 2024 Thu May 30 02:32:22 2024 This was almost 10 hours ago. crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051 DM_COOKIE_COMPLETED=0xd4d2051 crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231' 27252 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 crowsnest [12:18:45] /run/lvm # Allows progress again. I do not know how to troubleshoot this. Kernel version 6.4.12 (in process of upgrading to 6.9.3). crowsnest [12:19:47] /run/lvm # udevadm --version 254 aka systemd-utils-254.10 lvm2-2.03.22 Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
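For anyone hitting the same state: the cookie reported by dmsetup udevcookies is backed by a System V semaphore, so the stuck state can also be inspected from the IPC side. A minimal sketch (the semid 10 is taken from the output above; actual values will differ):

  dmsetup udevcookies            # pending cookies and their semaphore IDs
  ipcs -s                        # all SysV semaphores, including the cookie's
  ipcs -s -i 10                  # detail for the Semid reported above
  fuser -v /run/lock/lvm/*       # which PIDs hold the LVM file locks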
* Re: lvm2 deadlock 2024-05-30 10:21 lvm2 deadlock Jaco Kroon @ 2024-05-31 12:34 ` Zdenek Kabelac 2024-06-03 12:56 ` Jaco Kroon 0 siblings, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-05-31 12:34 UTC (permalink / raw) To: Jaco Kroon, linux-lvm Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a): > Hi, > > Possible lvm2 command deadlock scenario: > > crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/* > /run/lock/lvm/P_global: 17231 > /run/lock/lvm/V_lvm: 16087 17231 > > crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231' > 24437 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 > 16087 ? S 0:00 | | \_ /sbin/lvcreate -kn -An -s -n > fsck_cerberus /dev/lvm/backup_cerberus > 17231 ? S 0:00 | \_ /sbin/lvs --noheadings --nameprefixes > > crowsnest [12:17:40] /run/lvm # dmsetup udevcookies > Cookie Semid Value Last semop time Last change time > 0xd4d2051 10 1 Thu May 30 02:34:05 2024 Thu May 30 > 02:32:22 2024 > > This was almost 10 hours ago. > > crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051 > DM_COOKIE_COMPLETED=0xd4d2051 > crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231' > 27252 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 > crowsnest [12:18:45] /run/lvm # > > Allows progress again. Hi I'm kind of missing here to see your 'deadlock' scenario from this description. Lvm2 takes the VG lock - creates LV - waits for udev till it's finished with its job and confirms all the udev work with dmsetup udevcomplete. If something 'kills' your udev worker (which may eventually happen on some 'very very very' busy system - you may need to set up longer timeout for systemd to kill udev worker (I believe it's just 30seconds by default). If it happens your cookies blocks your lvm2 command - you can 'unblock' them with 'dmsetup udevcomplete_all' - but that's a sign your system is already in very bad state. It's also unclear which OS are you using - Debian, Fedora, ??? Version of your packages ? Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
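A hedged sketch of the recovery and tuning knobs mentioned above, assuming a stock setup where the udev configuration lives in /etc/udev/udev.conf:

  # last-resort unblock of commands stuck waiting on udev cookies
  # (as noted above, udev's view of the devices is then suspect)
  dmsetup udevcomplete_all
  # wait for the udev event queue to drain, up to 30 seconds
  udevadm settle --timeout=30
  # worker timeout and log level used by udevd
  grep -E 'event_timeout|udev_log' /etc/udev/udev.conf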
* Re: lvm2 deadlock 2024-05-31 12:34 ` Zdenek Kabelac @ 2024-06-03 12:56 ` Jaco Kroon 2024-06-03 19:25 ` Zdenek Kabelac 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-03 12:56 UTC (permalink / raw) To: Zdenek Kabelac, linux-lvm Hi, Thanks for the insight. Please refer below. On 2024/05/31 14:34, Zdenek Kabelac wrote: > Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a): >> Hi, >> >> Possible lvm2 command deadlock scenario: >> >> crowsnest [12:15:47] /run/lvm # fuser //run/lock/lvm/* >> /run/lock/lvm/P_global: 17231 >> /run/lock/lvm/V_lvm: 16087 17231 >> >> crowsnest [12:15:54] /run/lvm # ps axf | grep -E '16087|17231' >> 24437 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 >> 16087 ? S 0:00 | | \_ /sbin/lvcreate -kn -An >> -s -n fsck_cerberus /dev/lvm/backup_cerberus >> 17231 ? S 0:00 | \_ /sbin/lvs --noheadings >> --nameprefixes >> >> crowsnest [12:17:40] /run/lvm # dmsetup udevcookies >> Cookie Semid Value Last semop time Last change time >> 0xd4d2051 10 1 Thu May 30 02:34:05 2024 Thu May >> 30 02:32:22 2024 >> >> This was almost 10 hours ago. >> >> crowsnest [12:17:44] /run/lvm # dmsetup udevcomplete 0xd4d2051 >> DM_COOKIE_COMPLETED=0xd4d2051 >> crowsnest [12:18:43] /run/lvm # ps axf | grep -E '16087|17231' >> 27252 pts/1 S+ 0:00 | \_ grep --colour=auto -E 16087|17231 >> crowsnest [12:18:45] /run/lvm # >> >> Allows progress again. > > Hi > > I'm kind of missing here to see your 'deadlock' scenario from this > description. Well, stuff blocks, until the cookie is released by using the dmset udevcomplete command, so wrong wording perhaps? > > Lvm2 takes the VG lock - creates LV - waits for udev till it's > finished with its job and confirms all the udev work with dmsetup > udevcomplete. So what I understand from this is that udevcomplete ends up never executing? Is there some way of confirming this? > > If something 'kills' your udev worker (which may eventually happen > on some 'very very very' busy system - you may need to set up longer > timeout for systemd to kill udev worker (I believe it's just 30seconds > by default). Well, I guess upwards of 1GB/s of IO at times qualify. No systemd, so more likely the udev configuration. When this happens we see load averages upwards of 50 ... so based on that I'd say busy is probably a reasonable assessment. Normally doesn't exceed load average values 10-15 at most. Even so ... event_timeout = 180 should be a VERY, VERY long time in terms of udev event processing. I'm not seeing anything for udev in /var/log/messages (which is explicitly configured to log everything logged to syslog). But it may also be a case of "not logging because LVM isn't processing any IO (/var is stored on it's own LV). > > If it happens your cookies blocks your lvm2 command - you can > 'unblock' them with 'dmsetup udevcomplete_all' - but that's a sign > your system is already in very bad state. > > It's also unclear which OS are you using - Debian, Fedora, ??? Gentoo. > Version of your packages ? I thought I did provide this: Kernel version was 6.4.12 when this hapened, is now 6.9.3. crowsnest [12:19:47] /run/lvm # udevadm --version 254 aka systemd-utils-254.10 lvm2-2.03.22 Thanks for the feedback, what you say makes perfect sense, and the implication is that there are only a few options: 1. Something is resulting in the udev trigger to take longer than three minutes, and the dmsetup udevcomplete never being executed. 2. 
Something goes horribly wrong during the udev trigger processing (which invokes dmsetup a few times) and the process crashes, never executing dmsetup udevcomplete. This could potentially be due to extremely heavy disk IO, or LVM itself freezing IO. Given the rulesets, the only way I see this happening is if the dmsetup command takes very long to load - and even in the degraded (most filesystems blocked) state it was fast to execute. Or if udevd itself has problems accessing /sys - which I find extremely unlikely. I don't see the default value for udev_log in the config. It is explicitly set to debug now, but I'm still not seeing anything logged to syslog. Running with udevd --debug, which logs to a ramdisk on /run. Hopefully (if/when this happens again) that may shed some light. There is 256GB of RAM available, so as long as the log doesn't grow too quickly it should be fine. Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
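One way to watch what udev is actually doing around snapshot creation, without waiting for the next lock-up (a sketch; the rules directory path is a typical install location and may differ):

  udevadm control --log-level=debug           # raise udevd verbosity at runtime
  udevadm monitor --property --udev           # print udev events as they are processed
  grep -l dmsetup /lib/udev/rules.d/*.rules   # which rules invoke dmsetup during events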
* Re: lvm2 deadlock 2024-06-03 12:56 ` Jaco Kroon @ 2024-06-03 19:25 ` Zdenek Kabelac 2024-06-04 8:46 ` Jaco Kroon 0 siblings, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-06-03 19:25 UTC (permalink / raw) To: Jaco Kroon, linux-lvm@lists.linux.dev Dne 03. 06. 24 v 14:56 Jaco Kroon napsal(a): > Hi, > > Thanks for the insight. Please refer below. > > On 2024/05/31 14:34, Zdenek Kabelac wrote: >> Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a): >>> Hi, >>> >> I'm kind of missing here to see your 'deadlock' scenario from this description. > Well, stuff blocks, until the cookie is released by using the dmset > udevcomplete command, so wrong wording perhaps? >> >> Lvm2 takes the VG lock - creates LV - waits for udev till it's finished with >> its job and confirms all the udev work with dmsetup udevcomplete. > > So what I understand from this is that udevcomplete ends up never executing? > Is there some way of confirming this? udevcomplete needs someone to create the 'semaphore' for completion in the first place. >> >> It's also unclear which OS are you using - Debian, Fedora, ??? > > Gentoo. > >> Version of your packages ? > > I thought I did provide this: > > Kernel version was 6.4.12 when this hapened, is now 6.9.3. > > crowsnest [12:19:47] /run/lvm # udevadm --version > 254 > > aka systemd-utils-254.10 > > lvm2-2.03.22 Since this is most likely your personal build - please provide the full output of the 'lvm version' command. For the 'udev' synchronization, there needs to be the '--enable-udev_sync' configure option. So let's check which configure/build options were used here. And also preferably upstream udev rules. > > Thanks for the feedback, what you say makes perfect sense, and the implication > is that there are only a few options: > > 1. Something is resulting in the udev trigger to take longer than three > minutes, and the dmsetup udevcomplete never being executed. systemd simply kills the udev worker if it takes too long. However, on a properly running system, it would be very very unusual to hit these timeouts - you would need to work with thousands of devices.... > > This could potentially be due to extremely heavy disk IO, or LVM itself > freezing IO. well, reducing the percentage of '/proc/sys/vm/dirty_ratio' may possibly help when your disk system is too slow and you create very lengthy 'sync' io queues... > I don't see the default value for udev_log from the config. Explicitly set to > debug now, but still not seeing anything logged to syslog. Running with udevd > --debug, which logs to a ramdisk on /run. Hopefully (if/when this happens > again) that may shed some light. There is 256GB of RAM available, so as long > as the log doesn't grow too quickly should be fine. A lot of RAM may possibly create a huge amount of dirty pages... Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
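A quick check along the lines requested above, assuming the rules were installed under /lib/udev/rules.d:

  lvm version | grep -o -- '--enable-udev_sync'   # confirm udev sync was compiled in
  ls /lib/udev/rules.d/ | grep -E 'dm|lvm'        # confirm the dm/lvm udev rules are present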
* Re: lvm2 deadlock 2024-06-03 19:25 ` Zdenek Kabelac @ 2024-06-04 8:46 ` Jaco Kroon 2024-06-04 10:48 ` Roger Heflin 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-04 8:46 UTC (permalink / raw) To: Zdenek Kabelac, linux-lvm@lists.linux.dev Hi, Please refer below. On 2024/06/03 21:25, Zdenek Kabelac wrote: > Dne 03. 06. 24 v 14:56 Jaco Kroon napsal(a): >> Hi, >> >> Thanks for the insight. Please refer below. >> >> On 2024/05/31 14:34, Zdenek Kabelac wrote: >>> Dne 30. 05. 24 v 12:21 Jaco Kroon napsal(a): >>>> Hi, >>>> >>> I'm kind of missing here to see your 'deadlock' scenario from this >>> description. >> Well, stuff blocks, until the cookie is released by using the dmset >> udevcomplete command, so wrong wording perhaps? >>> >>> Lvm2 takes the VG lock - creates LV - waits for udev till it's >>> finished with its job and confirms all the udev work with dmsetup >>> udevcomplete. >> >> So what I understand from this is that udevcomplete ends up never >> executing? Is there some way of confirming this? > > udevcomplete needs someone to create 'semaphore' for completion in > the first place. I'm not familiar with the LVM internals and the flows of different processes, even though you can probably safely consider me a "semi power user". I do have compsci background, so do understand most of the principles of locking etc, but how they are applied in the LVM environment I'm clueless. > > >>> >>> It's also unclear which OS are you using - Debian, Fedora, ??? >> >> Gentoo. >> >>> Version of your packages ? >> >> I thought I did provide this: >> >> Kernel version was 6.4.12 when this hapened, is now 6.9.3. >> >> crowsnest [12:19:47] /run/lvm # udevadm --version >> 254 >> >> aka systemd-utils-254.10 >> >> lvm2-2.03.22 > > Since this is most likely your personal build - please provide full > output of > 'lvm version' command. 
crowsnest [09:46:04] ~ # lvm version LVM version: 2.03.22(2) (2023-08-02) Library version: 1.02.196 (2023-08-02) Driver version: 4.48.0 Configuration: ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share --disable-dependency-tracking --disable-silent-rules --docdir=/usr/share/doc/lvm2-2.03.22-r5 --htmldir=/usr/share/doc/lvm2-2.03.22-r5/html --enable-dmfilemapd --enable-dmeventd --enable-cmdlib --enable-fsadm --enable-lvmpolld --with-mirrors=internal --with-snapshots=internal --with-thin=internal --with-cache=internal --with-thin-check=/usr/sbin/thin_check --with-cache-check=/usr/sbin/cache_check --with-thin-dump=/usr/sbin/thin_dump --with-cache-dump=/usr/sbin/cache_dump --with-thin-repair=/usr/sbin/thin_repair --with-cache-repair=/usr/sbin/cache_repair --with-thin-restore=/usr/sbin/thin_restore --with-cache-restore=/usr/sbin/cache_restore --with-symvers=gnu --enable-readline --disable-selinux --enable-pkgconfig --with-confdir=/etc --exec-prefix= --sbindir=/sbin --with-staticdir=/sbin --libdir=/lib64 --with-usrlibdir=/usr/lib64 --with-default-dm-run-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-default-pid-dir=/run --enable-udev_rules --enable-udev_sync --with-udevdir=/lib/udev/rules.d --disable-lvmlockd-sanlock --disable-notify-dbus --disable-app-machineid --disable-systemd-journal --without-systemd-run --disable-valgrind-pool --with-systemdsystemunitdir=/lib/systemd/system CLDFLAGS=-Wl,-O1 -Wl,--as-needed > > For the 'udev' synchronization, there needs to be '--enable-udev_sync' > configure option. So let's check which configure/build option were > used here. > And also preferably upstream udev rules. --enable-udev_sync all there. To the best of my knowledge the udev rules are stock, certainly neither myself nor any of my colleagues modified them. They would generally defer to me, and I won't touch that unless I understand the implications, which in this case I just don't. > >> >> Thanks for the feedback, what you say makes perfect sense, and the >> implication is that there are only a few options: >> >> 1. Something is resulting in the udev trigger to take longer than >> three minutes, and the dmsetup udevcomplete never being executed. > > systemd simply kills udev worker if takes too long. > > However on properly running system, it would be very very unusual to > hit these timeouts - you would need to work with thousands of > devices.... 32 physical NL-SAS drives, combined into 3 RAID6 arrays using mdadm. These three md devices serve as PVs for LVM, single VG. 73 LVs, just over half of which are mounted. 
Most of those are thin volumes inside: crowsnest [09:54:04] ~ # lvdisplay /dev/lvm/thin_pool --- Logical volume --- LV Name thin_pool VG Name lvm LV UUID twLSE1-3ckG-WRSO-5eHc-G3fY-YS2v-as4ABC LV Write Access read/write (activated read only) LV Creation host, time crowsnest, 2020-02-19 12:26:00 +0200 LV Pool metadata thin_pool_tmeta LV Pool data thin_pool_tdata LV Status available # open 0 LV Size 125.00 TiB Allocated pool data 73.57% Allocated metadata 9.05% Current LE 32768000 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 1024 Block device 253:11 The rest are snapshots of the LVs that are mounted so that we have a roll-back destination in case of filesystem corruption (these snaps are made in multiple steps, first a snap of the origin is made, this is then fsck'ed, if that's successful it's fstrim'ed before being renamed into the final "save" location - any previously saved copy is first lvremove'd). I'd describe that as a few tens of devices, not a few hundred and certainly not thousands of devices. > >> >> This could potentially be due to extremely heavy disk IO, or LVM >> itself freezing IO. > > well reducing the percentage of '/proc/sys/vm/dirty_ration' may > possibly help > when your disk system is too slow and you create a very lengthy 'sync' > io queues... crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio 20 Happy to lower that even more if it would help. Internet (Redhat) states: Starts active writeback of dirty data at this percentage of total memory for the generator of dirty data, via pdflush. The default value is |40|. I'm assuming the default is 20 though, not 40, since I can't find that I've reconfigured this value. Should probably remain higher than dirty_background_ratio (which is currently 10), dirty_background_bytes is 0. > >> I don't see the default value for udev_log from the config. >> Explicitly set to debug now, but still not seeing anything logged to >> syslog. Running with udevd --debug, which logs to a ramdisk on /run. >> Hopefully (if/when this happens again) that may shed some light. >> There is 256GB of RAM available, so as long as the log doesn't grow >> too quickly should be fine. > > A lot of RAM may possibly create a huge amount of dirty pages... May I safely interpret this as "lower the dirty_ratio even further"? Given a value of 10 and 20 I'm assuming that pdflush will start flushing out in the background when >~26GB of in-memory data is dirty, or if the data has been dirty for more than 5 seconds (dirty_writeback_centisecs = 500). Don't mind lowering the dirty_background ratio as low as 1 even? But won't the primary dirty_ratio start blocking processes from writing if >40% of the caches/buffers are considered dirty? Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
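A hedged sketch of the snapshot rotation described above; the VG/LV names, the mount handling and the fsck invocation are illustrative, not the actual production script:

  #!/bin/bash
  set -e
  vg=lvm
  origin=backup_cerberus
  lvcreate -kn -An -s -n "fsck_${origin}" "${vg}/${origin}"
  lvchange -ay "${vg}/fsck_${origin}"         # thin snapshots may need explicit activation
  fsck -fy "/dev/${vg}/fsck_${origin}"        # filesystem-appropriate check
  mnt=$(mktemp -d)
  mount "/dev/${vg}/fsck_${origin}" "${mnt}"
  fstrim "${mnt}"
  umount "${mnt}"
  lvremove -f "${vg}/save_${origin}" || true  # drop the previous roll-back copy, if any
  lvrename "${vg}" "fsck_${origin}" "save_${origin}"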
* Re: lvm2 deadlock 2024-06-04 8:46 ` Jaco Kroon @ 2024-06-04 10:48 ` Roger Heflin 2024-06-04 11:52 ` Jaco Kroon 0 siblings, 1 reply; 20+ messages in thread From: Roger Heflin @ 2024-06-04 10:48 UTC (permalink / raw) To: Jaco Kroon; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev On Tue, Jun 4, 2024 at 4:06 AM Jaco Kroon <jaco@uls.co.za> wrote: > > Hi, > > Please refer below. > > >> > >> This could potentially be due to extremely heavy disk IO, or LVM > >> itself freezing IO. > > > > well reducing the percentage of '/proc/sys/vm/dirty_ration' may > > possibly help > > when your disk system is too slow and you create a very lengthy 'sync' > > io queues... > > > crowsnest [09:52:21] /proc/sys/vm # cat dirty_ratio > 20 > > Happy to lower that even more if it would help. > > Internet (Redhat) states: > > Starts active writeback of dirty data at this percentage of total memory > for the generator of dirty data, via pdflush. The default value is |40|. > > I'm assuming the default is 20 though, not 40, since I can't find that > I've reconfigured this value. > > Should probably remain higher than dirty_background_ratio (which is > currently 10), dirty_background_bytes is 0. > > > > > >> I don't see the default value for udev_log from the config. > >> Explicitly set to debug now, but still not seeing anything logged to > >> syslog. Running with udevd --debug, which logs to a ramdisk on /run. > >> Hopefully (if/when this happens again) that may shed some light. > >> There is 256GB of RAM available, so as long as the log doesn't grow > >> too quickly should be fine. > > > > A lot of RAM may possibly create a huge amount of dirty pages... > > > May I safely interpret this as "lower the dirty_ratio even further"? > > Given a value of 10 and 20 I'm assuming that pdflush will start flushing > out in the background when >~26GB of in-memory data is dirty, or if the > data has been dirty for more than 5 seconds (dirty_writeback_centisecs = > 500). > > Don't mind lowering the dirty_background ratio as low as 1 even? But > won't the primary dirty_ratio start blocking processes from writing if > >40% of the caches/buffers are considered dirty? > > Kind regards, > Jaco > > Use the *_bytes values. If they are non-zero then they are used and that allows setting even below 1% (quite large on anything with a lot of ram). I have been using this for quite a while: vm.dirty_background_bytes = 3000000 vm.dirty_bytes = 5000000 ie 5M and 3M such that I should never have a huge amount of writes outstanding. And you can go lower. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-04 10:48 ` Roger Heflin @ 2024-06-04 11:52 ` Jaco Kroon 2024-06-04 13:30 ` Roger Heflin 2024-06-04 16:07 ` Zdenek Kabelac 1 sibling, 2 replies; 20+ messages in thread From: Jaco Kroon @ 2024-06-04 11:52 UTC (permalink / raw) To: Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev Hi, On 2024/06/04 12:48, Roger Heflin wrote: > Use the *_bytes values. If they are non-zero then they are used and > that allows setting even below 1% (quite large on anything with a lot > of ram). > > I have been using this for quite a while: > vm.dirty_background_bytes = 3000000 > vm.dirty_bytes = 5000000 crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000 vm.dirty_background_bytes = 3000000 crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000 vm.dirty_bytes = 500000000 And persisted via /etc/sysctl.conf Thank you. It must be noted that this host doesn't do much else other than disk IO, so I'm hoping the 500MB value will be OK; this is just so IO won't block CPU-heavy-at-the-time tasks. The purpose of 256GB RAM was so that we could have ~250GB worth of disk cache (obviously we don't want all of that to be dirty, OS and "used" used to be below 4GB, now generally around 8-12GB, currently it's in "quiet" time, so a bit lower, just busy running some background compression). As per iostat: avg-cpu: %user %nice %system %iowait %steal %idle 7.73 18.43 18.96 37.86 0.00 17.01 Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd md2 392.13 10.00 5.11 0.00 4244888 2167644 0 md3 2270.12 43.88 56.82 0.00 18626309 24120982 0 md4 1406.06 30.47 16.83 0.00 12934654 7143330 0 That's a total of 35805851 MB (34.1TB) read and 33431956 MB (31.9TB) written in just under 5 days. What I am noticing immediately is that the "free" value as per "free -m" is definitely much higher, which to me is indicative that we're not caching as aggressively as can be done. Will monitor this for the time being: crowsnest [13:50:09] ~ # free -m total used free shared buff/cache available Mem: 257661 6911 105313 7 145436 248246 Swap: 0 0 0 The Total DISK WRITE and Current DISK Write values in iotop seem to have a tighter correlation now (no longer seeing constant Total DISK WRITE with spikes in current, seems to be more even now). Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
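One simple way to confirm the effect of the new limits is to watch how much data is actually dirty or under writeback at any moment (a minimal sketch):

  watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"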
* Re: lvm2 deadlock 2024-06-04 11:52 ` Jaco Kroon @ 2024-06-04 13:30 ` Roger Heflin 2024-06-04 13:46 ` Stuart D Gathman 2024-06-04 14:07 ` Jaco Kroon 1 sibling, 2 replies; 20+ messages in thread From: Roger Heflin @ 2024-06-04 13:30 UTC (permalink / raw) To: Jaco Kroon; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev My experience is that heavy disk io/batch disk io systems work better with these values being smallish. I.e. both even under 10MB or so. About all having the number larger has done is trick io benchmarks that don't force a sync at the end, and/or appear to make large saves happen faster. There is also the freeze/pause for outstandingwritesMB/<iorateMB> seconds; smaller shortens the freeze. I don't see a use case for having large values. It seems to have no real upside and several downsides. Get the buffer size small enough and you will still get pauses to clear the writes, but the pauses will be short enough to not be a problem. On Tue, Jun 4, 2024 at 6:52 AM Jaco Kroon <jaco@uls.co.za> wrote: > > Hi, > > On 2024/06/04 12:48, Roger Heflin wrote: > > > Use the *_bytes values. If they are non-zero then they are used and > > that allows setting even below 1% (quite large on anything with a lot > > of ram). > > > > I have been using this for quite a while: > > vm.dirty_background_bytes = 3000000 > > vm.dirty_bytes = 5000000 > > > crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000 > vm.dirty_background_bytes = 3000000 > crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000 > vm.dirty_bytes = 500000000 > > And persisted via /etc/sysctl.conf > > Thank you. Must be noted this host doesn't do much else other than disk > IO, so I'm hoping the 500MB value will be OK, this is just so IO won't > block CPU heavy-at-the-time tasks. > > The purpose of 256GB RAM was so that we could have ~250GB worth of disk > cache (obviously we don't want all of that to be dirty, OS and "used" > used to be below 4GB, now generally around 8-12GB, currently it's in > "quiet" time, so a bit lower, just busy running some background > compression). As per iostat: > > avg-cpu: %user %nice %system %iowait %steal %idle > 7.73 18.43 18.96 37.86 0.00 17.01 > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read > MB_wrtn MB_dscd > md2 392.13 10.00 5.11 0.00 4244888 > 2167644 0 > md3 2270.12 43.88 56.82 0.00 18626309 > 24120982 0 > md4 1406.06 30.47 16.83 0.00 > 12934654 7143330 0 > > That's total 35805851 MB (34.1B) read and 33431956 MB (31.9TB) written > in just under 5 days. > > What I am noticing immediately is that the "free" value as per "free -m" > is definitely much higher, which to me is indicative that we're not > caching as aggressively as can be done. Will monitor this for the time > being: > > crowsnest [13:50:09] ~ # free -m > total used free shared buff/cache > available > Mem: 257661 6911 105313 7 145436 248246 > Swap: 0 0 0 > > The Total DISK WRITE and Current DISK Write values in in iotop seems to > have a tighter correlation now (no longer seeing constant Total DISK > WRITE with spikes in current, seems to be more even now). > > Kind regards, > Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-04 13:30 ` Roger Heflin @ 2024-06-04 13:46 ` Stuart D Gathman 2024-06-04 14:49 ` Jaco Kroon 2024-06-04 14:07 ` Jaco Kroon 1 sibling, 1 reply; 20+ messages in thread From: Stuart D Gathman @ 2024-06-04 13:46 UTC (permalink / raw) To: Roger Heflin; +Cc: Jaco Kroon, Zdenek Kabelac, linux-lvm@lists.linux.dev On Tue, 4 Jun 2024, Roger Heflin wrote: > My experience is that heavy disk io/batch disk io systems work better > with these values being smallish. > I don't see a use case for having large values. It seems to have no > real upside and several downsides. Get the buffer size small enough > and you will still get pauses to clear the writes the be pauses will > be short enough to not be a problem. Not a normal situation, but I should mention my recent experience. One of the disks in an underlying RAID was going bad. It still worked, but the disk struggled manfully with multiple retries and recalibrates to complete many reads/writes - i.e. it was extremely slow. I was running into all kinds of strange boundary conditions because of this. E.g. VMs were getting timeouts on their virtio disk devices, leading to file system corruption and other issues. I was not modifying any LVM volumes, so did not run into any problems with LVM - but that is a boundary condition to keep in mind. You don't necessarily need to fully work under such conditions, but need to do something sane. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-04 13:46 ` Stuart D Gathman @ 2024-06-04 14:49 ` Jaco Kroon 2024-06-04 15:03 ` Roger Heflin 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-04 14:49 UTC (permalink / raw) To: Stuart D Gathman, Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev Hi, On 2024/06/04 15:46, Stuart D Gathman wrote: > On Tue, 4 Jun 2024, Roger Heflin wrote: > >> My experience is that heavy disk io/batch disk io systems work better >> with these values being smallish. > >> I don't see a use case for having large values. It seems to have no >> real upside and several downsides. Get the buffer size small enough >> and you will still get pauses to clear the writes the be pauses will >> be short enough to not be a problem. > > Not a normal situation, but I should mention my recent experience. > One of the disks in an underlying RAID was going bad. It still worked, > but the disk struggled manfully with multiple retries and recalibrates > to complete many reads/writes - i.e. it was extremely slow. I was > running into all kinds of strange boundary conditions because of this. > E.g. VMs were getting timeouts on their virtio disk devices, leading > to file system corruption and other issues. > > I was not modifying any LVM volumes, so did not run into any problems > with LVM - but that is a boundary condition to keep in mind. You > don't necessarily need to fully work under such conditions, but need > to do something sane. On SAS or NL-SAS drives? I've seen this before on SATA drives, and it's probably the single biggest reason why I have a major dislike for deploying SATA drives in any kind of high-reliability environment. Regardless, we do monitor all cases using smartd and it *usually* picks up a bad drive before it gets to the above point of pain, but with SATA drives this isn't always the case: the drive will simply keep retrying indefinitely, blocking request slots over time (the SATA protocol handles 32 outstanding requests IIRC, and the question becomes how long before the Linux kernel can re-use a slot it never received a response for), getting slower and slower until you power cycle the drive, after which it's fine again for a while. Never had that crap with NL-SAS drives. This specific host is all NL-SAS. No VM involvement here. Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-04 14:49 ` Jaco Kroon @ 2024-06-04 15:03 ` Roger Heflin 0 siblings, 0 replies; 20+ messages in thread From: Roger Heflin @ 2024-06-04 15:03 UTC (permalink / raw) To: Jaco Kroon; +Cc: Stuart D Gathman, Zdenek Kabelac, linux-lvm@lists.linux.dev The SATA disks work ok if you use smartctl to set the SCTERC timeout as low as possible (smartctl -l scterc,20,20 /dev/${drive} ). I have a set of commands that starts high and sets it lower, with the idea that each different manufacturer's disk will have a different min value and I simply want it as low as I can go. Desktop/Green disks do not have a settable timeout and time out at 60+ seconds. Red/NAS/Video/Enterprise/Purple SATA and such typically time out at 10sec but can be set lower. SAS timeouts are typically 1sec or lower on a bad sector. And I have personally dealt with "enterprise" vendors that, when using SATA, leave the timeout at the default (10 seconds) rather than lowering it so that the disks work reasonably when bad sectors are happening. On Tue, Jun 4, 2024 at 9:50 AM Jaco Kroon <jaco@uls.co.za> wrote: > > Hi, > > On 2024/06/04 15:46, Stuart D Gathman wrote: > > On Tue, 4 Jun 2024, Roger Heflin wrote: > > > >> My experience is that heavy disk io/batch disk io systems work better > >> with these values being smallish. > > > >> I don't see a use case for having large values. It seems to have no > >> real upside and several downsides. Get the buffer size small enough > >> and you will still get pauses to clear the writes the be pauses will > >> be short enough to not be a problem. > > > > Not a normal situation, but I should mention my recent experience. > > One of the disks in an underlying RAID was going bad. It still worked, > > but the disk struggled manfully with multiple retries and recalibrates > > to complete many reads/writes - i.e. it was extremely slow. I was > > running into all kinds of strange boundary conditions because of this. > > E.g. VMs were getting timeouts on their virtio disk devices, leading > > to file system corruption and other issues. > > > > I was not modifying any LVM volumes, so did not run into any problems > > with LVM - but that is a boundary condition to keep in mind. You > > don't necessarily need to fully work under such conditions, but need > > to do something sane. > > On SAS or NL-SAS drives? > > I've seen this before on SATA drives, and is probably the single biggest > reason why I have a major dislike for deploying SATA drives to any kind > of high-reliability environment. > > Regardless, we do monitor all cases using smartd and it *usually* picks > up a bad drive before it gets to the above point of pain but with SATA > drives this isn't always the case, and the drive will simply > indefinitely keep retrying, blocking request slot numbers over time > (SATA protocol handles 32 requests IIRC, but after how long can the > Linux kernel re-use a number it has never received a response on kind of > problem) and getting slower and slower until you power cycle the drive, > after which it's fine again for a while. Never had that crap with > NL-SAS drives. > > Specific host is all NL-SAS. No VM involvement here. > > Kind regards, > Jaco > ^ permalink raw reply [flat|nested] 20+ messages in thread
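A hedged sketch of applying that across all SATA drives in a box; the 2.0 s value (units of 100 ms) follows the command above, and drives without SCT ERC support - typically desktop/green models - will simply report an error:

  for drive in /dev/sd?; do
      smartctl -l scterc "${drive}"          # read the current ERC setting
      smartctl -l scterc,20,20 "${drive}"    # request 2.0 s read/write recovery limits
  done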
* Re: lvm2 deadlock 2024-06-04 13:30 ` Roger Heflin 2024-06-04 13:46 ` Stuart D Gathman @ 2024-06-04 14:07 ` Jaco Kroon 1 sibling, 0 replies; 20+ messages in thread From: Jaco Kroon @ 2024-06-04 14:07 UTC (permalink / raw) To: Roger Heflin; +Cc: Zdenek Kabelac, linux-lvm@lists.linux.dev Hi, On 2024/06/04 15:30, Roger Heflin wrote: > My experience is that heavy disk io/batch disk io systems work better > with these values being smallish. > > Ie both even under 10MB or so. About all having the number larger > has done is trick io benchmarks that don't force a sync at the end, > and/or appear to make large saves happen faster. > > There is also the freeze/pause for outstandingwritesMB/<iorateMB> > seconds, smaller shortens the freeze. > > I don't see a use case for having large values. It seems to have no > real upside and several downsides. Get the buffer size small enough > and you will still get pauses to clear the writes the be pauses will > be short enough to not be a problem. Thanks, this is extremely insightful. So with original values there could be "up to" ~ 50GB outstanding for write, let's assume that's all to one disk (extremely unlikely, and assuming 100MB/s which is optimistic if it's random access) this will take upwards of 500 seconds, which is a hellishly long time in our world. I think the value of 500MB I've set now should almost never exceed 10s or so for a sync even if everything is targeted at a single drive. I think we're OK with that on this specific host. Kind regards, Jaco > > > On Tue, Jun 4, 2024 at 6:52 AM Jaco Kroon <jaco@uls.co.za> wrote: >> Hi, >> >> On 2024/06/04 12:48, Roger Heflin wrote: >> >>> Use the *_bytes values. If they are non-zero then they are used and >>> that allows setting even below 1% (quite large on anything with a lot >>> of ram). >>> >>> I have been using this for quite a while: >>> vm.dirty_background_bytes = 3000000 >>> vm.dirty_bytes = 5000000 >> >> crowsnest [13:32:48] ~ # sysctl vm.dirty_background_bytes=3000000 >> vm.dirty_background_bytes = 3000000 >> crowsnest [13:32:59] ~ # sysctl vm.dirty_bytes=500000000 >> vm.dirty_bytes = 500000000 >> >> And persisted via /etc/sysctl.conf >> >> Thank you. Must be noted this host doesn't do much else other than disk >> IO, so I'm hoping the 500MB value will be OK, this is just so IO won't >> block CPU heavy-at-the-time tasks. >> >> The purpose of 256GB RAM was so that we could have ~250GB worth of disk >> cache (obviously we don't want all of that to be dirty, OS and "used" >> used to be below 4GB, now generally around 8-12GB, currently it's in >> "quiet" time, so a bit lower, just busy running some background >> compression). As per iostat: >> >> avg-cpu: %user %nice %system %iowait %steal %idle >> 7.73 18.43 18.96 37.86 0.00 17.01 >> >> Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read >> MB_wrtn MB_dscd >> md2 392.13 10.00 5.11 0.00 4244888 >> 2167644 0 >> md3 2270.12 43.88 56.82 0.00 18626309 >> 24120982 0 >> md4 1406.06 30.47 16.83 0.00 >> 12934654 7143330 0 >> >> That's total 35805851 MB (34.1B) read and 33431956 MB (31.9TB) written >> in just under 5 days. >> >> What I am noticing immediately is that the "free" value as per "free -m" >> is definitely much higher, which to me is indicative that we're not >> caching as aggressively as can be done. 
Will monitor this for the time >> being: >> >> crowsnest [13:50:09] ~ # free -m >> total used free shared buff/cache >> available >> Mem: 257661 6911 105313 7 145436 248246 >> Swap: 0 0 0 >> >> The Total DISK WRITE and Current DISK Write values in in iotop seems to >> have a tighter correlation now (no longer seeing constant Total DISK >> WRITE with spikes in current, seems to be more even now). >> >> Kind regards, >> Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-04 11:52 ` Jaco Kroon 2024-06-04 13:30 ` Roger Heflin @ 2024-06-04 16:07 ` Zdenek Kabelac 2024-06-05 8:59 ` Jaco Kroon 1 sibling, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-06-04 16:07 UTC (permalink / raw) To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a): > Hi, > > On 2024/06/04 12:48, Roger Heflin wrote: > >> Use the *_bytes values. If they are non-zero then they are used and >> that allows setting even below 1% (quite large on anything with a lot >> of ram). >> >> I have been using this for quite a while: >> vm.dirty_background_bytes = 3000000 >> vm.dirty_bytes = 5000000 > > > What I am noticing immediately is that the "free" value as per "free -m" is > definitely much higher, which to me is indicative that we're not caching as > aggressively as can be done. Will monitor this for the time being: > > crowsnest [13:50:09] ~ # free -m > total used free shared buff/cache available > Mem: 257661 6911 105313 7 145436 248246 > Swap: 0 0 0 > > The Total DISK WRITE and Current DISK Write values in in iotop seems to have a > tighter correlation now (no longer seeing constant Total DISK WRITE with > spikes in current, seems to be more even now). Hi So now while we are solving various system settings - there are more things to think through. The big 'range' of unwritten data may put it at risk in case of a 'power' failure. On the other hand, large 'dirty pages' allow the system to 'optimize' and even bypass storing them on disk if they are frequently changed - so in this case a 'lower' dirty ratio may cause a significant performance impact - so please check what the typical workload is and what the result is... It's worth mentioning that lvm2 supports the writecache target to kind of offload dirty pages to fast storage... Last but not least - disk scheduling policies also do have an impact - to e.g. ensure better fairness - at the price of lower throughput... So now let's get back to the lvm2 'possible' deadlock - which I'm still not fully certain we deciphered in this thread yet. So if you happen to 'spot' stuck commands - do you notice anything strange in the systemd journal - usually when systemd decides to kill a udevd worker task - it's briefly stated in the journal - with this check we would kind of know that the reason for your problems was a killed worker that was not able to 'finalize' the lvm command which is waiting for confirmation from udev (currently without any timeout limits). To unstick such a command 'udevcomplete_all' is a cure - but as said - the system is already kind of 'damaged' since udev is failing and has 'invalid' information about devices... So maybe you could check whether your journal around the date&time of the problem has some 'interesting' 'killing action' record? Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
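The exact wording of the kill message varies between versions, but a search along these lines should find it (a sketch for both journald and plain-syslog setups):

  journalctl -b | grep -iE 'udev.*(kill|timeout|slow)'      # systemd-journald boxes
  grep -iE 'udev.*(kill|timeout|slow)' /var/log/messages    # syslog-only boxes like this one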
* Re: lvm2 deadlock 2024-06-04 16:07 ` Zdenek Kabelac @ 2024-06-05 8:59 ` Jaco Kroon 2024-06-06 22:14 ` Zdenek Kabelac 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-05 8:59 UTC (permalink / raw) To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Hi, On 2024/06/04 18:07, Zdenek Kabelac wrote: > Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a): >> Hi, >> >> On 2024/06/04 12:48, Roger Heflin wrote: >> >>> Use the *_bytes values. If they are non-zero then they are used and >>> that allows setting even below 1% (quite large on anything with a lot >>> of ram). >>> >>> I have been using this for quite a while: >>> vm.dirty_background_bytes = 3000000 >>> vm.dirty_bytes = 5000000 >> >> >> What I am noticing immediately is that the "free" value as per "free >> -m" is definitely much higher, which to me is indicative that we're >> not caching as aggressively as can be done. Will monitor this for >> the time being: >> >> crowsnest [13:50:09] ~ # free -m >> total used free shared buff/cache >> available >> Mem: 257661 6911 105313 7 145436 >> 248246 >> Swap: 0 0 0 >> >> The Total DISK WRITE and Current DISK Write values in in iotop seems >> to have a tighter correlation now (no longer seeing constant Total >> DISK WRITE with spikes in current, seems to be more even now). The free value how now dropped drastically anyway. So looks like the increase of free was a temporary situation. > > Hi > > So now while we are solving various system setting - there are more > things to think through. Yea. Realised we derailed, but given that the theory is that "stuff" is blocking the complete (probably due to backlogged IO?), it's not completely unrelated is it? > > The big 'range' of unwritten data may put them in risk for the 'power' > failure. I'd be more worried about host crash in this case to be honest (dual PSU and in several years we've not had a single phase or PDU failure). > On the other hand large 'dirty pages' allows system to 'optimize' > and even bypass storing them on disk if they are frequently changed - > so in this case 'lower' dirty ration may cause significant performance > impact - so please check whats the typical workload and what is result... Based on observations from task timings last night I reckon workloads are around 25% faster on average. Tasks that used to run just shy of 20 hours (would still have been busy right now) completed last night in just under 15 . This would need to be monitored over time though, as a single run is definitely not authoritative. This was with the _bytes settings as suggested by Roger. For the specific use-case I doubt "frequently changed" applies, and it's probably best to get the data persisted as soon as possible, allowing for improved "future IO capacity" (hope my wording makes sense). > > It's worth to mention lvm2 support writecache target to kind of > offload dirty pages to fast storage... We normally use raid controller battery backup for this in other environments, not relevant in this specific case though, we are using dm-cache in other environments mostly for a read-cache (ie, write-through strategy) on NVMe though because the raid controller whilst buffering writes really sucks at serving reads, which given the nature of spinning drives makes perfect sense, and given the amount of READ on those two hosts the NVMe setup more than quadrupled throughput there. > > Last but not least - disk scheduling policies also do have impact - > to i.e. ensure better fairness - at the prices of lower throughput... 
We normally use mq-deadline; in this setup I notice this has been updated to "none". The plan was to revert; this was done following a discussion with Bart van Assche. Happy to revert this, to be honest. https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ relates. > > So now let's get back to lvm2 'possible' deadlock - which I'm still > not fully certain we deciphered in this thread yet. > > So if you happen to 'spot' stuck commands - do you notice anything > strange in systemd journal - usually when systemd decides to kill > udevd worker task - it's briefly stated in journal - with this check > we would kind of know that reason of your problems was killed worked > that was not able to 'finalize' lvm command which is waiting for > confirmation from udev (currently without any timeout limits). Not using systemd, but udev does come from the systemd package. Nothing in the logs at all for udev, as mentioned previously. Don't seem to be able to get normal logs working, but I have set up the debuglog now. This does log in great detail, except there are no timestamps. So *if* this happens again hopefully we'll be able to look for some worker that was killed rather than merely exited. What I can see is that it looks like a single forked worker can perform multiple tasks and execute multiple other calls, so I believe that the three minute timeout is *overall*, not on just a single RUN command, which implies that the theory that udevcomplete is never signalled is very much valid. > > To unstuck such command 'udevcomplete_all' is a cure - but as said - > the system is already kind of 'damaged' since udev is failing and has > 'invalid' information about devices... Agreed. It gets things going again, which really just allows for a cleaner reboot rather than echo b > /proc/sysrq-trigger or remotely yanking the power (which is where we normally end up if we don't catch it early enough). > > So maybe you could check whether your journal around date&time of > problem has some 'interesting' 'killing action' record ? If we can get normal udev logging working correctly that would be great, but this is not your responsibility, so let me figure out how I can get udevd to log to syslog (if that is even possible given the way things seem to be moving with systemd). Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
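One way to add the missing timestamps to the udevd --debug output, assuming GNU awk is available for strftime():

  udevd --debug 2>&1 | awk '{ print strftime("%F %T"), $0; fflush() }' > /run/udevd-debug.log &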
* Re: lvm2 deadlock 2024-06-05 8:59 ` Jaco Kroon @ 2024-06-06 22:14 ` Zdenek Kabelac 2024-06-06 22:17 ` Zdenek Kabelac 0 siblings, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-06-06 22:14 UTC (permalink / raw) To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a): > Hi, > > On 2024/06/04 18:07, Zdenek Kabelac wrote: >> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a): >> Last but not least - disk scheduling policies also do have impact - to i.e. >> ensure better fairness - at the prices of lower throughput... > We normally use mq-deadline, in this setup I notice this has been updated to > "none", the plan was to revert, this was done in collaboration with a > discussion with Bart van Assche. Happy to revert this to be honest. > https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ > relates. Hi So I guess we can tell the story like this - When you've created your 'snapshot' of a thin volume - this enforces a full flush (& fsfreeze) of the thin volume - so any dirty pages need to be written to the thin pool before the snapshot can be taken (and the thin pool should not run out of space) - this CAN potentially hold up your system for a long time (depending on the performance of your storage) and may cause various lock-up states of your system if you are using this 'snapshotted' volume for anything else - as the volume is suspended - so it blocks further operations on this device - eventually causing a full system circular deadlock (catch 22) - this is hard to analyze without a whole picture of the system. We may eventually think about whether we can somehow minimize the amount of time holding the vglock and suspending with flush & fsfreeze - but that's about some possible future enhancement; for now, flush the disk upfront to minimize the dirty size. For now, reducing the dirty page queue to minimize the blocking time associated with snapshotting is the right choice (although 500M is probably unnecessarily low...) Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
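When a snapshot creation appears hung, the suspend described above is visible from the device-mapper side; a quick check (sketch):

  dmsetup info | grep -E '^Name|^State'    # a device stuck in SUSPENDED points at the freeze step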
* Re: lvm2 deadlock 2024-06-06 22:14 ` Zdenek Kabelac @ 2024-06-06 22:17 ` Zdenek Kabelac 2024-06-07 9:03 ` Jaco Kroon 0 siblings, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-06-06 22:17 UTC (permalink / raw) To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a): > Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a): >> Hi, >> >> On 2024/06/04 18:07, Zdenek Kabelac wrote: >>> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a): >>> Last but not least - disk scheduling policies also do have impact - to >>> i.e. ensure better fairness - at the prices of lower throughput... >> We normally use mq-deadline, in this setup I notice this has been updated to >> "none", the plan was to revert, this was done in collaboration with a >> discussion with Bart van Assche. Happy to revert this to be honest. >> https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ >> relates. > > Hi > > So I guess we can tell the store like this - > > When you've created your 'snapshot' of a thin-volume - this enforces full > flush (& fsfreeze) of a thin volume - so any dirty pages need to written in > thin pool before snapshot could be taken (and thin pool should not run out of > space) - this CAN potentially hold your system running for a long time > (depending on performance of your storage) and may cause various lock-ups > states of your system if you are using this 'snapshoted' volume for anything > else - as the volume is suspended - so it blocks further operations on this > device - eventually causing full system circular deadlock (catch 22) - this > is hard to analyze without whole picture of the system. > > We may eventually think whether we can somehow minimize the amount of holding > vglock and suspending with flush & fsfreeze - but it's about some future > possible enhancement and flush disk upfront to minimize dirty size. I forgot to mention that the 'simplest' way is just to run 'sync' before running the 'lvcreate -s...' command... Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-06 22:17 ` Zdenek Kabelac @ 2024-06-07 9:03 ` Jaco Kroon 2024-06-07 9:26 ` Zdenek Kabelac 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-07 9:03 UTC (permalink / raw) To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Hi, On 2024/06/07 00:17, Zdenek Kabelac wrote: > Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a): >> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a): >>> Hi, >>> >>> On 2024/06/04 18:07, Zdenek Kabelac wrote: >>>> Dne 04. 06. 24 v 13:52 Jaco Kroon napsal(a): >>>> Last but not least - disk scheduling policies also do have impact >>>> - to i.e. ensure better fairness - at the prices of lower >>>> throughput... >>> We normally use mq-deadline, in this setup I notice this has been >>> updated to "none", the plan was to revert, this was done in >>> collaboration with a discussion with Bart van Assche. Happy to >>> revert this to be honest. >>> https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@acm.org/ >>> relates. >> >> Hi >> >> So I guess we can tell the store like this - >> >> When you've created your 'snapshot' of a thin-volume - this enforces >> full flush (& fsfreeze) of a thin volume - so any dirty pages need to >> written in thin pool before snapshot could be taken (and thin pool >> should not run out of space) - this CAN potentially hold your system >> running for a long time (depending on performance of your storage) >> and may cause various lock-ups states of your system if you are using >> this 'snapshoted' volume for anything else - as the volume is >> suspended - so it blocks further operations on this device - >> eventually causing full system circular deadlock (catch 22) - this >> is hard to analyze without whole picture of the system. >> >> We may eventually think whether we can somehow minimize the amount of >> holding >> vglock and suspending with flush & fsfreeze - but it's about some >> future possible enhancement and flush disk upfront to minimize dirty >> size. > > I've forget to mention that a 'simplest' way is just to run 'sync' > before running 'lvcreate -s...' command... Thanks. I think all in all everything mentioned here makes a lot of sense, and (in my opinion at least) explains the symptoms we've been seeing. Overall the system does "feel" more responsive with the lower dirty buffers, and most likely it helps with data persistence (as has been mentioned) in case of system crashes and/or loss of power. The tasks during peak usage also does seem to run faster on average, I suspect this is because of the use-case for this host: 1. Data is seldomly overwritten (this was touched on). Pretty much everything is WORM-type access (Write-Once, Read-Many). 2. Caches are mostly needed to avoid read-bandwidth from consuming capacity for writing. 3. It's thus beneficial to get writes out of the way as soon as possible, rather than at a later stage having to block getting many writes done for a flush() or sync() or lvcreate (snapshot). Is 500MB needlessly low? Probably. But given the above I think this is acceptable. Rather keep the disk writing *now* in order to free up *future* capacity. I'm guessing your "simple way" is workable for the generic case as well, towards that end, is a relatively simple change to the lvm2 tools not perhaps to add an syncfs() call to lvcreate *just prior* to freezing? The hard part is probably to figure out if the LV is mounted somewhere, and if it is, to open() that path in order to have a file-descriptor to pass to syncfs()? 
Obviously if the LV isn't mounted none of this is a concern and we can just proceed. What would be more interesting is if cluster-lvm is in play and the origin LV is active/open on an alternative node? But that's well beyond the scope of our requirements (for now). Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
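A shell-level equivalent of the syncfs() idea, as a hedged sketch: findmnt resolves where the origin is mounted and GNU coreutils' sync -f issues syncfs() on just that filesystem (the names below reuse the earlier example and are illustrative):

  mnt=$(findmnt -n -o TARGET /dev/lvm/backup_cerberus | head -n1)
  [ -n "${mnt}" ] && sync -f "${mnt}"      # flush only the origin's filesystem
  lvcreate -kn -An -s -n fsck_cerberus lvm/backup_cerberus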
* Re: lvm2 deadlock 2024-06-07 9:03 ` Jaco Kroon @ 2024-06-07 9:26 ` Zdenek Kabelac 2024-06-07 9:36 ` Jaco Kroon 0 siblings, 1 reply; 20+ messages in thread From: Zdenek Kabelac @ 2024-06-07 9:26 UTC (permalink / raw) To: Jaco Kroon, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Dne 07. 06. 24 v 11:03 Jaco Kroon napsal(a): > Hi, > > On 2024/06/07 00:17, Zdenek Kabelac wrote: >> Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a): >>> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a): >>>> Hi, >>> >> > I'm guessing your "simple way" is workable for the generic case as well, > towards that end, is a relatively simple change to the lvm2 tools not perhaps > to add an syncfs() call to lvcreate *just prior* to freezing? The hard part is > probably to figure out if the LV is mounted somewhere, and if it is, to open() > that path in order to have a file-descriptor to pass to syncfs()? Obviously > if the LV isn't mounted none of this is a concern and we can just proceed. > Hi There is no simple answer here - a) 'sync' flushes all io for all disks in the system - the user can play with tools like hdparm -F /dev/xxxx - so everything is still within the admin's reach... b) it's about the definition of the 'snapshot' moment - do you want to take the snapshot as of 'now', or after possibly X minutes where everything has been flushed and new data has meanwhile flown in?? c) lvm2 needs some 'multi LV' atomic snapshot support... d) with thin-pool and out-of-space potential it gets more tricky.... > What would be more interesting is if cluster-lvm is in play and the origin LV > is active/open on an alternative node? But that's well beyond the scope of > our requirements (for now). Clearly in the cluster case the user can use a multi-node active LV only when there is something able to 'manage' this storage - e.g. gfs2. Surely use of ext4/xfs this way is out of the question... Regards Zdenek ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: lvm2 deadlock 2024-06-07 9:26 ` Zdenek Kabelac @ 2024-06-07 9:36 ` Jaco Kroon 2024-09-02 5:48 ` Unsubscribe box, listen 0 siblings, 1 reply; 20+ messages in thread From: Jaco Kroon @ 2024-06-07 9:36 UTC (permalink / raw) To: Zdenek Kabelac, Roger Heflin; +Cc: linux-lvm@lists.linux.dev Hi, On 2024/06/07 11:26, Zdenek Kabelac wrote: > Dne 07. 06. 24 v 11:03 Jaco Kroon napsal(a): >> Hi, >> >> On 2024/06/07 00:17, Zdenek Kabelac wrote: >>> Dne 07. 06. 24 v 0:14 Zdenek Kabelac napsal(a): >>>> Dne 05. 06. 24 v 10:59 Jaco Kroon napsal(a): >>>>> Hi, >>>> >>> >> I'm guessing your "simple way" is workable for the generic case as >> well, towards that end, is a relatively simple change to the lvm2 >> tools not perhaps to add an syncfs() call to lvcreate *just prior* to >> freezing? The hard part is probably to figure out if the LV is >> mounted somewhere, and if it is, to open() that path in order to have >> a file-descriptor to pass to syncfs()? Obviously if the LV isn't >> mounted none of this is a concern and we can just proceed. >> > > > Hi > > There is no simple answer here - > > a) 'sync' flushes all io for all disk in the system - user can play > with tools like hdparm -F /dev/xxxx - so still everything in range of > 'admin's hand'... Fair. Or sync -f /path/to/mountpoint. > > b) it's about the definition of the 'snapshot' moment - do you want to > take snapshot as of 'now' or after possibly X minutes where > everything has been flushed and meanwhile new data flown-in ?? Oh yea, that's very valid, so instead of just lvcreate the sysadmin should sync -f /path/to/mountpoint *before* issuing lvcreate in the case where "possibly X minutes from now" is acceptable. Guessing this can be a --pre-sync argument for lvcreate but obviously the sysadmin is perfectly capable (if he's aware of this caveat) just run sync -f /path/to/mountpoint just before lvcreate. > > c) lvm2 needs some 'multi LV' atomic snapshot support... > > d) with thin-pool and out-of-space potential it gets more tricky.... > > >> What would be more interesting is if cluster-lvm is in play and the >> origin LV is active/open on an alternative node? But that's well >> beyond the scope of our requirements (for now). > > Clearly in the cluster case user can use multi-node active LV only in > the case there is something that is able to 'manage' this storage - > i.g. gfs2. Surely use of ext4/xfs this way is out of question... Was referring to the case where an LV is only active on *one* node at a time, but it's on shared physical storage. Not even sure if a thin pool can be active on more than one node at a time in such a case. This is research I've not yet done. We tried gfs2 a few years back and the sheer number of unresolvable failure scenarios at the time just had us switch to glusterfs instead. I think this can be considered closed now. Thanks again for all the help and insight, I thoroughly enjoyed the discussion too, it was most insightful and I learned a lot from it. Kind regards, Jaco ^ permalink raw reply [flat|nested] 20+ messages in thread
* Unsubscribe 2024-06-07 9:36 ` Jaco Kroon @ 2024-09-02 5:48 ` box, listen 0 siblings, 0 replies; 20+ messages in thread From: box, listen @ 2024-09-02 5:48 UTC (permalink / raw) To: linux-lvm Unsubscribe -- ^ permalink raw reply [flat|nested] 20+ messages in thread