linux-block.vger.kernel.org archive mirror
* LVM kernel lockup scenario during lvcreate
@ 2023-06-09  7:29 Jaco Kroon
  2023-06-12 18:40 ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-06-09  7:29 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3774 bytes --]

Hi All,

We've been tracking a scenario where IO on one of our hosts locks up.

We were hopeful that https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.2.15, and specifically 62a4b137d8aff9ce9dc5e96685253190b4ec6e62 (aka 38d11da522aacaa05898c734a1cec86f1e611129), would fix this.

It did not.

What we consistently see, when we catch this in time (before we can no longer log in), is a stuck lvcreate process, not always for this specific LV but some variation of:

13469 ?        D<L    0:00  |       |       \_ /sbin/lvcreate -kn -An -s -n fsck_MDKServer /dev/lvm/backup_MDKServer

(The idea here is to create thin snapshots of the origin volume in order to run fsck on them, so that we can recover from filesystem corruption on this server, which acts as a backup server for others.  Once the fsck is done the snapshot is renamed to save_* after removing the previous save_* LVs.)
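
To make the sequence concrete, the nightly rotation amounts to something like the sketch below.  This is a hypothetical reconstruction, not the actual script; the VG/LV names are taken from the example above and the fsck options are illustrative.

#!/bin/bash
# Hypothetical sketch of the snapshot/fsck/rename rotation described above.
VG=lvm
NAME=MDKServer

# 1. Take a thin snapshot of the origin (this is the step that hangs).
lvcreate -kn -An -s -n "fsck_${NAME}" "/dev/${VG}/backup_${NAME}"
lvchange -ay "${VG}/fsck_${NAME}"        # activate it so it can be checked

# 2. Check the snapshot; anything worse than "errors corrected" is a failure.
fsck.ext4 -fp "/dev/${VG}/fsck_${NAME}"
status=$?
if [ "$status" -gt 1 ]; then             # 0 = clean, 1 = errors corrected
    echo "fsck of ${NAME} snapshot failed (exit ${status})" >&2
    exit 1
fi

# 3. Rotate: drop the previous save_* snapshot, rename the fresh one.
lvremove -f "${VG}/save_${NAME}" 2>/dev/null || true
lvrename "${VG}" "fsck_${NAME}" "save_${NAME}"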

This always seems to happen on lvcreate, not lvremove nor lvrename.

I'm attaching dmesg -T and ps axf.  dmesg in particular may provide 
clues as it provides a number of stack traces indicating stalling at IO 
time.

Once this has triggered, even commands such as "lvs" go into uninterruptible wait.  I unfortunately didn't test "dmsetup ls" this time and have already triggered a reboot (the system needs to be up).

To recover we have to "echo b > /proc/sysrq-trigger" since a normal reboot simply doesn't work (presumably due to being unable to close and unmount the filesystems; even an emergency sync never completes).

In terms of disk setup, there are a number of mdraids that serve as PVs for the VG.  /proc/mdstat and some details from LVM are attached as disksetup.txt.

The one thing from lvs that stands out for me is this one:

  66   fsck_MDKServer          lvm Vwi---tz--   4.00t thin_pool backup_MDKServer

To me this indicates that the lvcreate didn't complete (the LV isn't marked active above, whereas all the other snaps are active, even if not open).

Assistance would be greatly appreciated.  I notice that newer kernels are out, so I'm happy to upgrade if there are relevant changes that might affect this.  Nothing at https://kernelnewbies.org/Linux_6.3#Block_layer jumps out at me.

This host locks up approximately weekly.  It's worse if dmeventd is also running (it won't survive a day).  We do have the following "non-standard" udev rules, which may or may not be an influencing factor:

crowsnest [10:24:37] /etc/udev/rules.d (master) # cat 98-md-stripe-cache.rules
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="change|add", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="32768"
SUBSYSTEM=="block", ACTION=="change|add", TEST=="bdi/read_ahead_kb", ATTR{bdi/read_ahead_kb}="512"
SUBSYSTEM=="block", ACTION=="change|add", TEST=="device/timeout", ATTR{device/timeout}="180"

We're not aware of any other hosts being affected, at least not to this scale.  This is the host where we make the heaviest use of both mdraid and LVM (especially thin snaps).  In comparison most of our other hosts are toys w.r.t. disks.

A few things I find strange (perhaps due to ignorance):

1.  It's not just the LV that's being snapped that locks up; it's always more than just that one.

2.  Some filesystems are usually still usable (/var).  Once the lvcreate has triggered the issue these get fewer over time; once /var is affected we can no longer log in.

To be noted: when this host idles we see ~40-60MB/s worth of IO; this goes up to 400-800MB/s under load, and we saw it peak around 1.3GB/s about two weeks back whilst also investigating possible causes (Current DISK READ/WRITE as per iotop).

Filesystems are all ext4 with the dir_index feature disabled (standard options otherwise).
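
(For reference, dir_index can be inspected and toggled with tune2fs, roughly as below, using one of the LVs above as an example and ideally on an unmounted filesystem:)

tune2fs -l /dev/lvm/backup_MDKServer | grep -i 'features'    # shows whether dir_index is set
tune2fs -O ^dir_index /dev/lvm/backup_MDKServer              # disables dir_index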

Kind regards,
Jaco


[-- Attachment #2: disksetup.txt.gz --]
[-- Type: application/gzip, Size: 1517 bytes --]

[-- Attachment #3: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 22155 bytes --]

[-- Attachment #4: ps.txt.gz --]
[-- Type: application/gzip, Size: 4214 bytes --]


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-09  7:29 LVM kernel lockup scenario during lvcreate Jaco Kroon
@ 2023-06-12 18:40 ` Bart Van Assche
  2023-06-12 19:36   ` Jaco Kroon
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Bart Van Assche @ 2023-06-12 18:40 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 6/9/23 00:29, Jaco Kroon wrote:
> I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
> clues as it provides a number of stack traces indicating stalling at
> IO time.
> 
> Once this has triggered, even commands such as "lvs" goes into
> uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
> and triggered a reboot already (system needs to be up).

To me the call traces suggest that an I/O request got stuck. 
Unfortunately call traces are not sufficient to identify the root cause 
in case I/O gets stuck. Has debugfs been mounted? If so, how about 
dumping the contents of /sys/kernel/debug/block/ into a tar file after 
the lockup has been reproduced and sharing that information?

tar -czf- -C /sys/kernel/debug/block . >block.tgz

Thanks,

Bart.


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-12 18:40 ` Bart Van Assche
@ 2023-06-12 19:36   ` Jaco Kroon
  2023-06-18 23:28     ` Bart Van Assche
  2023-06-18 19:34   ` Jaco Kroon
  2023-08-24 17:29   ` Laurence Oberman
  2 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-06-12 19:36 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi Bart,

On 2023/06/12 20:40, Bart Van Assche wrote:
> On 6/9/23 00:29, Jaco Kroon wrote:
>> I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
>> clues as it provides a number of stack traces indicating stalling at
>> IO time.
>>
>> Once this has triggered, even commands such as "lvs" goes into
>> uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
>> and triggered a reboot already (system needs to be up).
>
> To me the call traces suggest that an I/O request got stuck. 
> Unfortunately call traces are not sufficient to identify the root 
> cause in case I/O gets stuck. Has debugfs been mounted? If so, how 
> about dumping the contents of /sys/kernel/debug/block/ into a tar file 
> after the lockup has been reproduced and sharing that information?

Looks to be mounted, at least I've got a /sys/kernel/debug/block/ folder 
on the relevant server.

>
> tar -czf- -C /sys/kernel/debug/block . >block.tgz

Definitely can do, I'm not sure how to interpret the data in this - is 
there anything specific you're looking for?  Would love to not just pass 
the information on but also learn from this.

Generally the lockup rate seems to be about once a week currently, so I expect (on average) to see this pop up again some time over the weekend.

Kind regards,
Jaco



* Re: LVM kernel lockup scenario during lvcreate
  2023-06-12 18:40 ` Bart Van Assche
  2023-06-12 19:36   ` Jaco Kroon
@ 2023-06-18 19:34   ` Jaco Kroon
  2023-06-18 23:56     ` Bart Van Assche
  2023-08-24 17:29   ` Laurence Oberman
  2 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-06-18 19:34 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

Hi,

On 2023/06/12 20:40, Bart Van Assche wrote:
> On 6/9/23 00:29, Jaco Kroon wrote:
>> I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
>> clues as it provides a number of stack traces indicating stalling at
>> IO time.
>>
>> Once this has triggered, even commands such as "lvs" goes into
>> uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
>> and triggered a reboot already (system needs to be up).
>
> To me the call traces suggest that an I/O request got stuck. 
> Unfortunately call traces are not sufficient to identify the root 
> cause in case I/O gets stuck. Has debugfs been mounted? If so, how 
> about dumping the contents of /sys/kernel/debug/block/ into a tar file 
> after the lockup has been reproduced and sharing that information?
>
> tar -czf- -C /sys/kernel/debug/block . >block.tgz
>
Right on queue ... please find attached.  Not seeing any content in any 
of the files from the tar czf so I doubt there is much use here ... 
perhaps you might be able to explain why all of these files under 
/sys/kernel/debug/block would be empty?

Kind Regards,
Jaco

[-- Attachment #2: block.tgz --]
[-- Type: application/gzip, Size: 25836 bytes --]

[-- Attachment #3: ps.txt.gz --]
[-- Type: application/gzip, Size: 5504 bytes --]

[-- Attachment #4: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 24757 bytes --]


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-12 19:36   ` Jaco Kroon
@ 2023-06-18 23:28     ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2023-06-18 23:28 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 6/12/23 12:36, Jaco Kroon wrote:
> On 2023/06/12 20:40, Bart Van Assche wrote:
>> tar -czf- -C /sys/kernel/debug/block . >block.tgz
> 
> Definitely can do, I'm not sure how to interpret the data in this - is 
> there anything specific you're looking for?  Would love to not just pass 
> the information on but also learn from this.

Most of the information in debugfs is needed.  It's easier to capture it all than to request only the specific attributes that are needed.

I'm not sure there is any better documentation available for these 
attributes than the information in the source file block/blk-mq-debugfs.c.
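
As a concrete example, the per-hctx "busy" attribute is usually the most interesting one when I/O appears stuck (the paths below assume the default blk-mq debugfs layout, using sda as an example):

cat /sys/kernel/debug/block/sda/hctx0/busy        # requests currently in flight in the driver
cat /sys/kernel/debug/block/sda/hctx0/sched_tags  # scheduler tag usage for that hardware queue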

Thanks,

Bart.



* Re: LVM kernel lockup scenario during lvcreate
  2023-06-18 19:34   ` Jaco Kroon
@ 2023-06-18 23:56     ` Bart Van Assche
  2023-06-19  6:06       ` Jaco Kroon
  2023-06-26  8:30       ` Jaco Kroon
  0 siblings, 2 replies; 26+ messages in thread
From: Bart Van Assche @ 2023-06-18 23:56 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 6/18/23 12:34, Jaco Kroon wrote:
> On 2023/06/12 20:40, Bart Van Assche wrote:
>> tar -czf- -C /sys/kernel/debug/block . >block.tgz
>>
> Right on queue ... please find attached.  Not seeing any content in any 
> of the files from the tar czf so I doubt there is much use here ... 
> perhaps you might be able to explain why all of these files under 
> /sys/kernel/debug/block would be empty?

Apparently the tar command is incompatible with debugfs :-( I should 
have tested the command before I sent it to you.

Does this work better?

(cd /sys/kernel/debug/block/ && grep -r . .) | gzip -9 > block.gz

Thanks,

Bart.


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-18 23:56     ` Bart Van Assche
@ 2023-06-19  6:06       ` Jaco Kroon
  2023-06-26  8:30       ` Jaco Kroon
  1 sibling, 0 replies; 26+ messages in thread
From: Jaco Kroon @ 2023-06-19  6:06 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi,

On 2023/06/19 01:56, Bart Van Assche wrote:
> On 6/18/23 12:34, Jaco Kroon wrote:
>> On 2023/06/12 20:40, Bart Van Assche wrote:
>>> tar -czf- -C /sys/kernel/debug/block . >block.tgz
>>>
>> Right on queue ... please find attached.  Not seeing any content in 
>> any of the files from the tar czf so I doubt there is much use here 
>> ... perhaps you might be able to explain why all of these files under 
>> /sys/kernel/debug/block would be empty?
>
> Apparently the tar command is incompatible with debugfs :-( I should 
> have tested the command before I sent it to you.
>
> Does this work better?
>
> (cd /sys/kernel/debug/block/ && grep -r . .) | gzip -9 > block.gz

Yes it will, but now we wait again.  And I really don't like waiting.

Wonder why tar won't work though ... but alas, it is what it is.

Kind Regards,
Jaco


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-18 23:56     ` Bart Van Assche
  2023-06-19  6:06       ` Jaco Kroon
@ 2023-06-26  8:30       ` Jaco Kroon
  2023-06-26 16:42         ` Bart Van Assche
  1 sibling, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-06-26  8:30 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

Hi Bart,

Thank you for your patience with this ...

Please find attached updated ps and dmesg too, diskinfo wasn't 
regenerated but this doesn't generally change.

Not sure how this all works; according to what I can see the only disk with pending activity (queued) is sdw?  Yet a large number of processes are blocking on IO, and yet again the stack traces in dmesg point at __schedule.  For a change we do not have lvcreate in the process list!  This time that particular script's got a fsck in uninterruptible wait ...

Kind regards,
Jaco

On 2023/06/19 01:56, Bart Van Assche wrote:
> On 6/18/23 12:34, Jaco Kroon wrote:
>> On 2023/06/12 20:40, Bart Van Assche wrote:
>>> tar -czf- -C /sys/kernel/debug/block . >block.tgz
>>>
>> Right on queue ... please find attached.  Not seeing any content in 
>> any of the files from the tar czf so I doubt there is much use here 
>> ... perhaps you might be able to explain why all of these files under 
>> /sys/kernel/debug/block would be empty?
>
> Apparently the tar command is incompatible with debugfs :-( I should 
> have tested the command before I sent it to you.
>
> Does this work better?
>
> (cd /sys/kernel/debug/block/ && grep -r . .) | gzip -9 > block.gz
>
> Thanks,
>
> Bart.

[-- Attachment #2: disksetup.txt.gz --]
[-- Type: application/gzip, Size: 3551 bytes --]

[-- Attachment #3: ps.txt.gz --]
[-- Type: application/gzip, Size: 6296 bytes --]

[-- Attachment #4: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 28777 bytes --]

[-- Attachment #5: block.gz --]
[-- Type: application/gzip, Size: 9631 bytes --]


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-26  8:30       ` Jaco Kroon
@ 2023-06-26 16:42         ` Bart Van Assche
  2023-06-26 23:29           ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2023-06-26 16:42 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 6/26/23 01:30, Jaco Kroon wrote:
> Please find attached updated ps and dmesg too, diskinfo wasn't regenerated but this doesn't generally change.
> 
> Not sure how this all works, according to what I can see the only disk with pending activity (queued) is sdw?  Yet, a large number of processes is blocking on IO, and yet again the stack traces in dmesg points at __schedule.  For a change we do not have lvcreate in the 
> process list! This time that particular script's got a fsck in uninterruptable wait ...

Hi Jaco,

I see pending commands for five different SCSI disks:

$ zgrep /busy: block.gz
./sdh/hctx0/busy:00000000affe2ba0 {.op=WRITE, .cmd_flags=SYNC|FUA, .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, .state=in_flight, .tag=2055, .internal_tag=214, .cmd=Write(16) 8a 08 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
.flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
./sda/hctx0/busy:00000000987bb7c7 {.op=WRITE, .cmd_flags=SYNC|FUA, .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, .state=in_flight, .tag=2050, .internal_tag=167, .cmd=Write(16) 8a 08 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
.flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}
./sdw/hctx0/busy:00000000aec61b17 {.op=READ, .cmd_flags=META|PRIO, .rq_flags=STARTED|MQ_INFLIGHT|DONTPREP|ELVPRIV|IO_STAT|ELV, .state=in_flight, .tag=2056, .internal_tag=8, .cmd=Read(16) 88 00 00 00 00 00 00 1c 01 a8 00 00 00 08 00 00, .retries=0, .result = 0x0, 
.flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
./sdw/hctx0/busy:0000000087e9a58e {.op=WRITE, .cmd_flags=SYNC|FUA, .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, .state=in_flight, .tag=2058, .internal_tag=102, .cmd=Write(16) 8a 08 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
.flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
./sdaf/hctx0/busy:00000000d8751601 {.op=WRITE, .cmd_flags=SYNC|FUA, .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, .state=in_flight, .tag=2057, .internal_tag=51, .cmd=Write(16) 8a 08 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
.flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}

All requests have the flag "ELV". So my follow-up questions are:
* Which I/O scheduler has been configured? If it is BFQ, please try whether
   mq-deadline or "none" work better.
* Have any of the cgroup I/O controllers been activated?
* Are the disks directly connected to the motherboard of the server or are
   the disks perhaps controlled by a HBA? If so, which HBA? There are multiple
   lines in dmesg that start with "mpt3sas". Is the firmware of this HBA
   up-to-date?

Thanks,

Bart.


* Re: LVM kernel lockup scenario during lvcreate
  2023-06-26 16:42         ` Bart Van Assche
@ 2023-06-26 23:29           ` Jaco Kroon
  2023-07-11 13:22             ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-06-26 23:29 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi Bart,

On 2023/06/26 18:42, Bart Van Assche wrote:
> On 6/26/23 01:30, Jaco Kroon wrote:
>> Please find attached updated ps and dmesg too, diskinfo wasn't 
>> regenerated but this doesn't generally change.
>>
>> Not sure how this all works, according to what I can see the only 
>> disk with pending activity (queued) is sdw?  Yet, a large number of 
>> processes is blocking on IO, and yet again the stack traces in dmesg 
>> points at __schedule.  For a change we do not have lvcreate in the 
>> process list! This time that particular script's got a fsck in 
>> uninterruptable wait ...
>
> Hi Jaco,
>
> I see pending commands for five different SCSI disks:
>
> $ zgrep /busy: block.gz
> ./sdh/hctx0/busy:00000000affe2ba0 {.op=WRITE, .cmd_flags=SYNC|FUA, 
> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
> .state=in_flight, .tag=2055, .internal_tag=214, .cmd=Write(16) 8a 08 
> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
> ./sda/hctx0/busy:00000000987bb7c7 {.op=WRITE, .cmd_flags=SYNC|FUA, 
> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
> .state=in_flight, .tag=2050, .internal_tag=167, .cmd=Write(16) 8a 08 
> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}
> ./sdw/hctx0/busy:00000000aec61b17 {.op=READ, .cmd_flags=META|PRIO, 
> .rq_flags=STARTED|MQ_INFLIGHT|DONTPREP|ELVPRIV|IO_STAT|ELV, 
> .state=in_flight, .tag=2056, .internal_tag=8, .cmd=Read(16) 88 00 00 
> 00 00 00 00 1c 01 a8 00 00 00 08 00 00, .retries=0, .result = 0x0, 
> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
> ./sdw/hctx0/busy:0000000087e9a58e {.op=WRITE, .cmd_flags=SYNC|FUA, 
> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
> .state=in_flight, .tag=2058, .internal_tag=102, .cmd=Write(16) 8a 08 
> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
> ./sdaf/hctx0/busy:00000000d8751601 {.op=WRITE, .cmd_flags=SYNC|FUA, 
> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
> .state=in_flight, .tag=2057, .internal_tag=51, .cmd=Write(16) 8a 08 00 
> 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}
>
> All requests have the flag "ELV". So my follow-up questions are:
> * Which I/O scheduler has been configured? If it is BFQ, please try 
> whether
>   mq-deadline or "none" work better.

crowsnest [00:34:58] /sys/class/block/sda/device/block/sda/queue # cat scheduler
none [mq-deadline] kyber bfq

crowsnest [00:35:31] /sys/class/block # for i in */device/block/sd*/queue/scheduler; do echo none > $i; done
crowsnest [00:35:45] /sys/class/block #

So let's see if that perhaps relates.  Neither CFQ nor BFQ has ever given me anywhere near the performance of deadline, so that's our default go-to.
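
If "none" does turn out to behave better, a udev rule in the same spirit as the 98-md-stripe-cache.rules quoted earlier would make the choice persistent across reboots (untested sketch):

SUBSYSTEM=="block", KERNEL=="sd*", ACTION=="change|add", TEST=="queue/scheduler", ATTR{queue/scheduler}="none"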

> * Have any of the cgroup I/O controllers been activated?

No.  Kernel is compiled without CGROUPS support on this specific host.

> * Are the disks directly connected to the motherboard of the server or 
> are
>   the disks perhaps controlled by a HBA? If so, which HBA? There are 
> multiple
>   lines in dmesg that start with "mpt3sas". Is the firmware of this HBA
>   up-to-date?

Very good question.  The host is a 36-drive bay Supermicro host 
(SSG-5048R-E1CR36L to be precise).

Internally it may be using some form of SAS expander/HBA, but there are no externally connected drives, and in my experience HBAs are (in all cases where I've seen them in use) used to connect external storage arrays to a host.  So to the best of my knowledge all drives are directly connected to the mpt3sas controller over the SAS bus.  Unfortunately there are some SATA drives remaining in the mix, but as these fail they're being replaced with NL-SAS drives.  This may (obviously) have some negative influence.

These include (the drives that were busy per your assessment):

/dev/sda    4001GB SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)   p3:md2 p2:md1 p1:md0
/dev/sdh    4001GB SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)   p3:md2 p2:md1 p1:md0
/dev/sdw    4001GB SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)   p3:md2 p2:md1 p1:md0

But not sdaf:

/dev/sdaf   4001GB SAS (SPL-4) p3:md2 p2:md1 p1:md0

If I understand correctly from what you're indicating, 4 of the 5 operations were SYNC writes, with the fifth a read op for metadata?
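
Side note: the four WRITE(16) CDBs above are byte-for-byte identical, i.e. the same 8-sector FUA write to LBA 0x1d1c0b788 issued to four different member disks, which to me looks like a single 4 KiB metadata/flush write fanned out across the RAID members rather than four unrelated writes.  A quick conversion of the LBA:

printf '%d\n' 0x1d1c0b788    # 7814035336, within about 1 MiB of the end of a standard 4TB drive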

                 description: Serial Attached SCSI controller
                 product: SAS3008 PCI-Express Fusion-MPT SAS-3
                 vendor: Broadcom / LSI
                 physical id: 0
                 bus info: pci@0000:01:00.0
                 logical name: scsi0
                 version: 02
                 width: 64 bits
                 clock: 33MHz
                 capabilities: sas pm pciexpress vpd msi msix bus_master 
cap_list rom
                 configuration: driver=mpt3sas latency=0
                 resources: irq:24 ioport:e000(size=256) 
memory:fb200000-fb20ffff memory:fb100000-fb1fffff

lshw reports all drives directly below this, I'm not sure how much trust 
to put in that.

 From a fresh boot:

crowsnest [00:47:53] ~ # dmesg | grep -i mpt3sas
[    2.844696] mpt3sas version 43.100.00.00 loaded
[    2.844977] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, 
total mem (263761276 kB)
[    2.900278] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k
[    2.900290] mpt3sas_cm0: MSI-X vectors supported: 96
[    2.900294] mpt3sas_cm0:  0 6 6
[    2.900490] mpt3sas_cm0: High IOPs queues : disabled
[    2.900493] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 45
[    2.900494] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 46
[    2.900496] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 47
[    2.900497] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 49
[    2.900498] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 50
[    2.900499] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 51
[    2.900501] mpt3sas_cm0: iomem(0x00000000fb200000), 
mapped(0x0000000047ba3e2a), size(65536)
[    2.900504] mpt3sas_cm0: ioport(0x000000000000e000), size(256)
[    2.956709] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k
[    2.956715] mpt3sas_cm0: sending message unit reset !!
[    2.958301] mpt3sas_cm0: message unit reset: SUCCESS
[    2.986679] mpt3sas_cm0: scatter gather: sge_in_main_msg(1), 
sge_per_chain(7), sge_per_io(128), chains_per_io(19)
[    2.986892] mpt3sas_cm0: request pool(0x00000000ea746c3a) - 
dma(0xfff00000): depth(3200), frame_size(128), pool_size(400 kB)
[    3.008136] mpt3sas_cm0: sense pool(0x00000000f479118e) - 
dma(0xff780000): depth(2939), element_size(96), pool_size (275 kB)
[    3.008214] mpt3sas_cm0: reply pool(0x000000009bfc1ea9) - 
dma(0xff700000): depth(3264), frame_size(128), pool_size(408 kB)
[    3.008227] mpt3sas_cm0: config page(0x0000000088a84930) - 
dma(0xff6fa000): size(512)
[    3.008229] mpt3sas_cm0: Allocated physical memory: size(8380 kB)
[    3.008231] mpt3sas_cm0: Current Controller Queue Depth(2936),Max 
Controller Queue Depth(3072)
[    3.008233] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[    3.173296] mpt3sas_cm0: _base_display_fwpkg_version: complete
[    3.173585] mpt3sas_cm0: LSISAS3008: FWVersion(03.00.06.136), 
ChipRevision(0x02), BiosVersion(08.07.00.00)
[    3.173589] mpt3sas_cm0: Protocol=(Initiator,Target), 
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[    3.174415] mpt3sas_cm0: sending port enable !!
[    3.174806] mpt3sas_cm0: hba_port entry: 000000007848ca98, port: 255 
is added to hba_port list
[    3.175771] mpt3sas_cm0: host_add: handle(0x0001), 
sas_addr(0x5003048016846300), phys(8)
[    3.177028] mpt3sas_cm0: expander_add: handle(0x0009), 
parent(0x0001), sas_addr(0x500304800175f0bf), phys(51)
[    3.183121] mpt3sas_cm0: handle(0xa) sas_address(0x500304800175f081) 
port_type(0x1)
[    3.183221] mpt3sas_cm0: handle(0xb) sas_address(0x5000c500e23ccd9d) 
port_type(0x1)
[    3.183319] mpt3sas_cm0: handle(0xc) sas_address(0x5000c500d7e872cd) 
port_type(0x1)
[    3.183417] mpt3sas_cm0: handle(0xd) sas_address(0x5000c500d1800469) 
port_type(0x1)
[    3.183515] mpt3sas_cm0: handle(0xe) sas_address(0x5000c500e2bca685) 
port_type(0x1)
[    3.183612] mpt3sas_cm0: handle(0xf) sas_address(0x5000c500e23ccd3d) 
port_type(0x1)
[    3.183710] mpt3sas_cm0: handle(0x10) sas_address(0x5000c500a7b26111) 
port_type(0x1)
[    3.183807] mpt3sas_cm0: handle(0x11) sas_address(0x500304800175f088) 
port_type(0x1)
[    3.183905] mpt3sas_cm0: handle(0x12) sas_address(0x5000c500e23cd02d) 
port_type(0x1)
[    3.184002] mpt3sas_cm0: handle(0x13) sas_address(0x5000c500a7389019) 
port_type(0x1)
[    3.184100] mpt3sas_cm0: handle(0x14) sas_address(0x500304800175f08b) 
port_type(0x1)
[    3.185511] mpt3sas_cm0: handle(0x16) sas_address(0x500304800175f09c) 
port_type(0x1)
[    3.185609] mpt3sas_cm0: handle(0x17) sas_address(0x500304800175f09d) 
port_type(0x1)
[    3.185706] mpt3sas_cm0: handle(0x18) sas_address(0x5000c500a79abd8d) 
port_type(0x1)
[    3.185804] mpt3sas_cm0: handle(0x19) sas_address(0x5000c500a7bb5add) 
port_type(0x1)
[    3.185901] mpt3sas_cm0: handle(0x1a) sas_address(0x5000c500cae2efc9) 
port_type(0x1)
[    3.185998] mpt3sas_cm0: handle(0x1b) sas_address(0x5000c500ca9fd351) 
port_type(0x1)
[    3.186096] mpt3sas_cm0: handle(0x1c) sas_address(0x500304800175f0a2) 
port_type(0x1)
[    3.186194] mpt3sas_cm0: handle(0x1d) sas_address(0x500304800175f0a3) 
port_type(0x1)
[    3.186291] mpt3sas_cm0: handle(0x1e) sas_address(0x5000c500ef616e61) 
port_type(0x1)
[    3.186389] mpt3sas_cm0: handle(0x1f) sas_address(0x5000c500ef616d3d) 
port_type(0x1)
[    3.186486] mpt3sas_cm0: handle(0x20) sas_address(0x5000c500cae2f5a1) 
port_type(0x1)
[    3.186584] mpt3sas_cm0: handle(0x21) sas_address(0x500304800175f0bd) 
port_type(0x1)
[    3.187276] mpt3sas_cm0: expander_add: handle(0x0015), 
parent(0x0009), sas_addr(0x500304800174b8bf), phys(31)
[    3.191107] mpt3sas_cm0: handle(0x22) sas_address(0x500304800174b880) 
port_type(0x1)
[    3.191206] mpt3sas_cm0: handle(0x23) sas_address(0x500304800174b881) 
port_type(0x1)
[    3.191303] mpt3sas_cm0: handle(0x24) sas_address(0x5000c50094aca809) 
port_type(0x1)
[    3.191400] mpt3sas_cm0: handle(0x25) sas_address(0x5000c500955eb465) 
port_type(0x1)
[    3.191498] mpt3sas_cm0: handle(0x26) sas_address(0x500304800174b884) 
port_type(0x1)
[    3.191595] mpt3sas_cm0: handle(0x27) sas_address(0x5000c500a6782f2d) 
port_type(0x1)
[    3.191692] mpt3sas_cm0: handle(0x28) sas_address(0x5000c500e21f30ad) 
port_type(0x1)
[    3.191790] mpt3sas_cm0: handle(0x29) sas_address(0x5000c500955ea88d) 
port_type(0x1)
[    3.191887] mpt3sas_cm0: handle(0x2a) sas_address(0x5000c500a67837a1) 
port_type(0x1)
[    3.191984] mpt3sas_cm0: handle(0x2b) sas_address(0x5000c500a61aefb1) 
port_type(0x1)
[    3.192877] mpt3sas_cm0: handle(0x2c) sas_address(0x500304800174b8bd) 
port_type(0x1)
[    3.200396] mpt3sas_cm0: port enable: SUCCESS

That at least tells me there is an HBA and an expander involved, and 
we're looking at firmware/bios versions:

LSISAS3008: FWVersion(03.00.06.136), ChipRevision(0x02), 
BiosVersion(08.07.00.00)

https://www.broadcom.com/products/storage/sas-sata-controllers/sas-3008

Seeing that the card only has 8 ports I'm guessing an HBA and/or 
expander is a given :).

And there are various hints that newer firmware exists ... but Broadcom is not making it easy to find the download, nor is Supermicro's website of much help ... will try again during more sane hours.

https://www.broadcom.com/support/download-search?pg=Storage+Adapters,+Controllers,+and+ICs&pf=Storage+Adapters,+Controllers,+and+ICs&pn=SAS3008+I/O+Controller&pa=Firmware&po=&dk=&pl=&l=false

I do make note that in the preceding week we had a lockup on another host that at face value looked similar, which runs AHCI and SATA rather than mpt3sas + a mix of SATA + NL-SAS.  Unfortunately at that point in time the priority on getting the host up was extremely high and we were precluded from taking time to gather debug information of any kind.

>
> Thanks,

No, thank you!

Kind Regards,
Jaco



* Re: LVM kernel lockup scenario during lvcreate
  2023-06-26 23:29           ` Jaco Kroon
@ 2023-07-11 13:22             ` Jaco Kroon
  2023-07-11 14:45               ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-07-11 13:22 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi Bart,

On 2023/06/27 01:29, Jaco Kroon wrote:
> Hi Bart,
>
> On 2023/06/26 18:42, Bart Van Assche wrote:
>> On 6/26/23 01:30, Jaco Kroon wrote:
>>> Please find attached updated ps and dmesg too, diskinfo wasn't 
>>> regenerated but this doesn't generally change.
>>>
>>> Not sure how this all works, according to what I can see the only 
>>> disk with pending activity (queued) is sdw?  Yet, a large number of 
>>> processes is blocking on IO, and yet again the stack traces in dmesg 
>>> points at __schedule.  For a change we do not have lvcreate in the 
>>> process list! This time that particular script's got a fsck in 
>>> uninterruptable wait ...
>>
>> Hi Jaco,
>>
>> I see pending commands for five different SCSI disks:
>>
>> $ zgrep /busy: block.gz
>> ./sdh/hctx0/busy:00000000affe2ba0 {.op=WRITE, .cmd_flags=SYNC|FUA, 
>> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
>> .state=in_flight, .tag=2055, .internal_tag=214, .cmd=Write(16) 8a 08 
>> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
>> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
>> ./sda/hctx0/busy:00000000987bb7c7 {.op=WRITE, .cmd_flags=SYNC|FUA, 
>> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
>> .state=in_flight, .tag=2050, .internal_tag=167, .cmd=Write(16) 8a 08 
>> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
>> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}
>> ./sdw/hctx0/busy:00000000aec61b17 {.op=READ, .cmd_flags=META|PRIO, 
>> .rq_flags=STARTED|MQ_INFLIGHT|DONTPREP|ELVPRIV|IO_STAT|ELV, 
>> .state=in_flight, .tag=2056, .internal_tag=8, .cmd=Read(16) 88 00 00 
>> 00 00 00 00 1c 01 a8 00 00 00 08 00 00, .retries=0, .result = 0x0, 
>> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
>> ./sdw/hctx0/busy:0000000087e9a58e {.op=WRITE, .cmd_flags=SYNC|FUA, 
>> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
>> .state=in_flight, .tag=2058, .internal_tag=102, .cmd=Write(16) 8a 08 
>> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
>> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.000 s ago}
>> ./sdaf/hctx0/busy:00000000d8751601 {.op=WRITE, .cmd_flags=SYNC|FUA, 
>> .rq_flags=FLUSH_SEQ|MQ_INFLIGHT|DONTPREP|IO_STAT|ELV, 
>> .state=in_flight, .tag=2057, .internal_tag=51, .cmd=Write(16) 8a 08 
>> 00 00 00 01 d1 c0 b7 88 00 00 00 08 00 00, .retries=0, .result = 0x0, 
>> .flags=TAGGED|INITIALIZED|LAST, .timeout=180.000, allocated 0.010 s ago}
>>
>> All requests have the flag "ELV". So my follow-up questions are:
>> * Which I/O scheduler has been configured? If it is BFQ, please try 
>> whether
>>   mq-deadline or "none" work better.
>
> crowsnest [00:34:58] /sys/class/block/sda/device/block/sda/queue # cat 
> scheduler
> none [mq-deadline] kyber bfq
>
> crowsnest [00:35:31] /sys/class/block # for i in 
> */device/block/sd*/queue/scheduler; do echo none > $i; done
> crowsnest [00:35:45] /sys/class/block #
>
> So let's see if that perhaps relates.  Neither CFQ nor BFQ has ever 
> given me anywhere near the performance of deadline, so that's our 
> default goto.

crowsnest [15:03:34] ~ # uptime
  15:07:52 up 15 days,  4:47,  3 users,  load average: 10.26, 9.88, 9.68

"how long is a piece of string?" comes to mind whether looking to decide 
if that's sufficiently long to call it success?  Normally died after 
about 7 days.

So *suspected* mq-deadline bug?

> And various hints that newer firmware exists ... but broadcom is not 
> making it easy to find the download nor is supermicro's website of 
> much help ... will try again during more sane hours.
>
> https://www.broadcom.com/support/download-search?pg=Storage+Adapters,+Controllers,+and+ICs&pf=Storage+Adapters,+Controllers,+and+ICs&pn=SAS3008+I/O+Controller&pa=Firmware&po=&dk=&pl=&l=false 
>

Supermicro has responded with appropriate upgrade options which we've 
not executed yet, but I'll make time for that over the coming weekend, 
or perhaps I should wait a bit longer to give more time for a similar 
lockup with the none scheduler?

Kind Regards,
Jaco



* Re: LVM kernel lockup scenario during lvcreate
  2023-07-11 13:22             ` Jaco Kroon
@ 2023-07-11 14:45               ` Bart Van Assche
  2023-07-12 10:12                 ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2023-07-11 14:45 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 7/11/23 06:22, Jaco Kroon wrote:
> So *suspected* mq-deadline bug?

That seems unlikely to me. I have not yet seen mq-deadline causing an
I/O lockup. I'm not claiming that it would be impossible that there is a
bug in mq-deadline but it seems unlikely to me. However, I have seen it
many times that an I/O lockup was caused by a buggy HBA driver and/or
HBA firmware so I recommend to start with checking these thoroughly.
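
For example, the driver and firmware versions currently in use can be checked with something like the following (the sysfs attribute is mpt3sas-specific as far as I recall, so treat the exact path as a best guess):

modinfo -F version mpt3sas                   # driver version
cat /sys/class/scsi_host/host*/version_fw    # controller firmware as reported by the driver
dmesg | grep -i fwversion                    # same information from the probe messages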

> Supermicro has responded with appropriate upgrade options which
> we've not executed yet, but I'll make time for that over the coming 
> weekend, or perhaps I should wait a bit longer to give more time for 
> a similar lockup with the none scheduler?

If this is possible, verifying whether the lockup can be reproduced 
without I/O scheduler sounds like a good idea to me.

Bart.



* Re: LVM kernel lockup scenario during lvcreate
  2023-07-11 14:45               ` Bart Van Assche
@ 2023-07-12 10:12                 ` Jaco Kroon
  2023-07-12 13:43                   ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-07-12 10:12 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi,

On 2023/07/11 16:45, Bart Van Assche wrote:

> On 7/11/23 06:22, Jaco Kroon wrote:
>> So *suspected* mq-deadline bug?
>
> That seems unlikely to me. I have not yet seen mq-deadline causing an
> I/O lockup. I'm not claiming that it would be impossible that there is a
> bug in mq-deadline but it seems unlikely to me. However, I have seen it
> many times that an I/O lockup was caused by a buggy HBA driver and/or
> HBA firmware so I recommend to start with checking these thoroughly.
Care to share how or at least point me in a direction please?
>
>> Supermicro has responded with appropriate upgrade options which
>> we've not executed yet, but I'll make time for that over the coming 
>> weekend, or perhaps I should wait a bit longer to give more time for 
>> a similar lockup with the none scheduler?
>
> If this is possible, verifying whether the lockup can be reproduced 
> without I/O scheduler sounds like a good idea to me.

If I had a reliable way to trigger this it would help a great deal.  Currently we're at 14 days of uptime with the scheduler set to none rather than mq-deadline; previously we had to hard reboot approximately every 7 days ... again, how long is a piece of string?  A lockup proves the presence of an issue, but how long must it not lock up to prove the absence of one?

Ideas/Suggestions?

Kind regards,
Jaco




* Re: LVM kernel lockup scenario during lvcreate
  2023-07-12 10:12                 ` Jaco Kroon
@ 2023-07-12 13:43                   ` Bart Van Assche
       [not found]                     ` <ef2812b4-7853-9dda-85dd-210636840a59@uls.co.za>
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2023-07-12 13:43 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 7/12/23 03:12, Jaco Kroon wrote:
> Ideas/Suggestions?

How about manually increasing the workload, e.g. by using fio to 
randomly read 4 KiB fragments with a high queue depth?
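
For example, a minimal job file along these lines (ioengine, paths and sizes are placeholders to be tuned to the system):

[global]
; io_uring can be swapped for libaio if unavailable
ioengine=io_uring
direct=1
rw=randread
bs=4k
iodepth=256
numjobs=8
runtime=600
time_based=1
; scratch directory and per-job file size are placeholders
directory=/home/fio
size=4g

[randread-stress]

and then run it with "fio randread.fio".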

Bart.



* Re: LVM kernel lockup scenario during lvcreate
       [not found]                     ` <ef2812b4-7853-9dda-85dd-210636840a59@uls.co.za>
@ 2023-08-24  7:29                       ` Jaco Kroon
  2023-08-24 17:13                         ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-08-24  7:29 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi Bart,


Just a follow up on this.


It seems that even with the "none" scheduler we have now had an occurrence of this.  Unfortunately I could not get to the host quickly enough to confirm ongoing IO, although based on the activity LEDs there were disks with IO.  I believe the disk controller controls these LEDs, but I'm not sure of the pattern used to switch them on/off, and this could vary from controller to controller (ie, do they go off only once the host has confirmed receipt of the data, or once the data has been sent to the host?).  This does seem to support your theory of a controller firmware issue.


It definitely happens more often with mq-deadline compared to none.


We're definitely seeing the same thing on another host using an ahci 
controller.  This seems to hint that it's not a firmware issue, as does 
the fact that this happens much less frequently with the none scheduler.


I will make a plan to action the firmware updates on the raid controller 
over the weekend regardless, just in order to eliminate that.  I will 
then revert to mq-deadline.  Assuming this does NOT fix it, how would I 
go about assessing if this is a controller firmware issue or a Linux 
kernel issue?


Come to think of it (it may or may not be related), we've long since switched off dmeventd, as running dmeventd causes this to happen on all hosts the moment any form of snapshot is involved.  With dmeventd combined with "heavy" use of the lv commands we could pretty much guarantee some level of lockup within a couple of days.


Kind regards,
Jaco


On 2023/07/13 17:07, Jaco Kroon wrote:
>
> Hi Bart,
>
>
> Not familiar at all with fio, so hoping this was OK.
>
>
> On 2023/07/12 15:43, Bart Van Assche wrote:
>> On 7/12/23 03:12, Jaco Kroon wrote:
>>> Ideas/Suggestions?
>>
>> How about manually increasing the workload, e.g. by using fio to 
>> randomly read 4 KiB fragments with a high queue depth?
>>
>> Bart.
>
>
> [global]
> kb_base=1024
> unit_base=8
> loops=10000
> runtime=7200
> time_based=1
> directory=/home/fio
> nrfiles=1
> size=4194304
> iodepth=256
> ioengine=io_uring
> numjobs=512
> create_fsync=1
>
> [reader]
>
>
>
> crowsnest [17:01:35] ~ # fio --alloc-size=$(( 32 * 1024 )) fio.ini
>
> Load average went up to 1200+, IO was consistently 1GB/s read 
> throughput, and IOPs anywhere between 100k and 500k, mostly around the 
> 150k region.
>
>
> Guessing the next step would be to restore mq-deadline as scheduler 
> and re-do?
>
>
> I've neglected to capture the output unfortunately, will do next run 
> with --output if needed.  Can definitely initiate another run around 
> 6:00am GMT in the morning.
>
>
> Kind Regards,
> Jaco
>


* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24  7:29                       ` Jaco Kroon
@ 2023-08-24 17:13                         ` Bart Van Assche
  2023-08-24 20:16                           ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2023-08-24 17:13 UTC (permalink / raw)
  To: Jaco Kroon, linux-block@vger.kernel.org

On 8/24/23 00:29, Jaco Kroon wrote:
> We're definitely seeing the same thing on another host using an ahci 
> controller.  This seems to hint that it's not a firmware issue, as does 
> the fact that this happens much less frequently with the none scheduler.

That is unexpected. I don't think there is enough data available yet to
conclude whether these issues are identical or not?

> I will make a plan to action the firmware updates on the raid controller 
> over the weekend regardless, just in order to eliminate that.  I will 
> then revert to mq-deadline.  Assuming this does NOT fix it, how would I 
> go about assessing if this is a controller firmware issue or a Linux 
> kernel issue?

If the root cause would be an issue in the mq-deadline scheduler or in
the core block layer then there would be many more reports about I/O
lockups. For this case I think that it's very likely that the root cause 
is either the I/O controller driver or the I/O controller firmware.

Thanks,

Bart.





* Re: LVM kernel lockup scenario during lvcreate
  2023-06-12 18:40 ` Bart Van Assche
  2023-06-12 19:36   ` Jaco Kroon
  2023-06-18 19:34   ` Jaco Kroon
@ 2023-08-24 17:29   ` Laurence Oberman
  2023-08-24 20:01     ` Jaco Kroon
  2 siblings, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2023-08-24 17:29 UTC (permalink / raw)
  To: Bart Van Assche, Jaco Kroon, linux-block@vger.kernel.org

On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
> On 6/9/23 00:29, Jaco Kroon wrote:
> > I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
> > clues as it provides a number of stack traces indicating stalling
> > at
> > IO time.
> > 
> > Once this has triggered, even commands such as "lvs" goes into
> > uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
> > and triggered a reboot already (system needs to be up).
> 
> To me the call traces suggest that an I/O request got stuck. 
> Unfortunately call traces are not sufficient to identify the root
> cause 
> in case I/O gets stuck. Has debugfs been mounted? If so, how about 
> dumping the contents of /sys/kernel/debug/block/ into a tar file
> after 
> the lockup has been reproduced and sharing that information?
> 
> tar -czf- -C /sys/kernel/debug/block . >block.tgz
> 
> Thanks,
> 
> Bart.
> 

One I am aware of is this
commit 106397376c0369fcc01c58dd189ff925a2724a57
Author: David Jeffery <djeffery@redhat.com>

Can we try to get a vmcore (assuming it's not a secure site)?

Add these to /etc/sysctl.conf

kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1

Run sysctl -p
Ensure kdump is running and can capture a vmcore

When it locks up again, send an NMI via the SuperMicro Web Management interface.

Share the vmcore, or we can have you capture some specifics from it to
triage.
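
For reference, assuming the BMC is reachable over IPMI, the NMI can also be injected from another machine with something like (address and credentials are placeholders):

ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power diag

and, if the console still responds, "echo c > /proc/sysrq-trigger" will panic the kernel and hand control to kdump directly.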

Thanks
Laurence




* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24 17:29   ` Laurence Oberman
@ 2023-08-24 20:01     ` Jaco Kroon
  2023-08-24 20:19       ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-08-24 20:01 UTC (permalink / raw)
  To: Laurence Oberman, Bart Van Assche, linux-block@vger.kernel.org

Hi,

On 2023/08/24 19:29, Laurence Oberman wrote:

> On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
>> On 6/9/23 00:29, Jaco Kroon wrote:
>>> I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
>>> clues as it provides a number of stack traces indicating stalling
>>> at
>>> IO time.
>>>
>>> Once this has triggered, even commands such as "lvs" goes into
>>> uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
>>> and triggered a reboot already (system needs to be up).
>> To me the call traces suggest that an I/O request got stuck.
>> Unfortunately call traces are not sufficient to identify the root
>> cause
>> in case I/O gets stuck. Has debugfs been mounted? If so, how about
>> dumping the contents of /sys/kernel/debug/block/ into a tar file
>> after
>> the lockup has been reproduced and sharing that information?
>>
>> tar -czf- -C /sys/kernel/debug/block . >block.tgz
>>
>> Thanks,
>>
>> Bart.
>>
> One I am aware of is this
> commit 106397376c0369fcc01c58dd189ff925a2724a57
> Author: David Jeffery <djeffery@redhat.com>
>
> Can we try get a vmcore (assuming its not a secure site)

Certainly.  Obviously on any host handling any kind of sensitive data there is a likelihood that sensitive data may be present in the vmcore; that said, I'm more than happy to create one.  I'm assuming this will create a kernel version of a core dump ... with 256GB of RAM (most of which goes towards disk caches) I'm further assuming this file could potentially be large.  Where will it get stored should the capture be made?  (I need to ensure that the filesystem has sufficient storage available.)

>
> Add these to /etc/sysctl.conf
>
> kernel.panic_on_io_nmi = 1
> kernel.panic_on_unrecovered_nmi = 1
> kernel.unknown_nmi_panic = 1
>
> Run sysctl -p
> Ensure kdump is running and can capture a vmcore
Done.  Had to enable a few extra kernel options to get all the other 
requirements, so scheduled a reboot to activate the new kernel. This 
will happen on Saturday morning very early.
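Presumably the usual kdump prerequisites are what's needed here; on a self-built kernel that would be options along the lines of the following (listed from memory, so treat it as a rough checklist rather than gospel):

CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PROC_VMCORE=y
CONFIG_RELOCATABLE=y
# plus CONFIG_DEBUG_INFO=y so the crash utility can interpret the vmcore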
>
> When it locks up again
> send an NMI via the SuperMicro Web Managemnt interface

Possible to send from sysrq at the keyboard?  Otherwise I'll just need to set up the RMI; it would just be easier to do this from the keyboard if possible, as that's not always an option if it's left too late.

>
> Share the vmcore, or we can have you capture some specifics from it to
> triage.

I'd prefer you let me know what you need ... security concerns and all 
... frankly, I highly doubt there is any data that is really so 
sensitive that it can be classified as "top secret" but we do have NDAs 
in place prohibiting me from sharing anything that may potentially 
contain customer related data ...

Kind regards,
Jaco


* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24 17:13                         ` Bart Van Assche
@ 2023-08-24 20:16                           ` Jaco Kroon
  0 siblings, 0 replies; 26+ messages in thread
From: Jaco Kroon @ 2023-08-24 20:16 UTC (permalink / raw)
  To: Bart Van Assche, linux-block@vger.kernel.org

Hi,

On 2023/08/24 19:13, Bart Van Assche wrote:
> On 8/24/23 00:29, Jaco Kroon wrote:
>> We're definitely seeing the same thing on another host using an ahci 
>> controller.  This seems to hint that it's not a firmware issue, as 
>> does the fact that this happens much less frequently with the none 
>> scheduler.
>
> That is unexpected. I don't think there is enough data available yet to
> conclude whether these issues are identical or not?
It's hard for me to conclude that even two consecutive crashes are exactly the same issue ... however, there's a strong correlation in that there generally are lvcreate commands in D state, which to me hints that it's something to do with LVM snapshot creation (both traditional, on the AHCI controller host, and thin, on the Supermicro).
>
>> I will make a plan to action the firmware updates on the raid 
>> controller over the weekend regardless, just in order to eliminate 
>> that.  I will then revert to mq-deadline. Assuming this does NOT fix 
>> it, how would I go about assessing if this is a controller firmware 
>> issue or a Linux kernel issue?
>
> If the root cause would be an issue in the mq-deadline scheduler or in
> the core block layer then there would be many more reports about I/O
> lockups. For this case I think that it's very likely that the root 
> cause is either the I/O controller driver or the I/O controller firmware.

I tend to agree with that.  Given that we probably have in excess of 50 hosts and it generally just seems to be these two hosts that run into this ... I agree with your assessment.  Except that at least the AHCI host never *used* to do this and only fairly recently started with this behaviour.

So here's what I personally *think* makes these two hosts unique:

1.  The AHCI controller host was unfortunately set up ~15 years back with "thick" volumes and uses traditional snapshots (the hardware has been replaced piecemeal over the years so none of the original hardware is still in use).  It started exhibiting the same behaviour when, for reasons I can't go into, we started making multiple snapshots of the same origin LV simultaneously - this is unfortunate, as thin snaps would be far more performant during the few hours where these two snaps are required.

2.  The LSI controller on the SM host uses a thin pool of 125TB and contains 27 "origins", 26 of which follow this pattern on a daily basis:
2.1  Create a thin snap of ${name} as fsck_${name}.
2.2  fsck gets run on the snapshot to ensure consistency.  If this fails, bail out and report an error to the management systems.
2.3  If save_${name} exists, remove it.
2.4  Rename fsck_${name} to save_${name}.

3.  IO on the SM host often goes in excess of 1GB/s and often "idles" 
around 400MB/s, which I'm sure in the bigger scheme of things isn't 
really that heavy of a load, but considering most of our other hosts 
barely peak at 150MB/s and generally don't do more than 10MB/s it's 
significant for us.  Right now as I'm typing this we're doing between 
1500 and 3000 reads/s (saw it peak just over 6000 now) and 500-1000 
writes/s (and peaked just over 3000).  I'm well aware there are systems 
with much higher IOPs values, but for us this is fairly high, even a few 
years back I saw statistics on systems doing 10k+ IOPs.

4.  The majority of our hosts with RAID controllers are megaraid; I can't think of any other hosts off the top of my head also using mpt3sas, but we do have a number with AHCI.  This again supports the theory that it's the firmware on the controller, so I'll be sure to do that on Saturday morning too when I've got a reboot slot.  Hopefully that'll just make the problem go away.

Thanks for all the help with this, it's really appreciated.  I know we seem to be running in circles, but I believe we are making progress, even if slowly; at a minimum I'm learning quite a bit, which in and of itself puts us in a better position to figure this out.  I do think that it could be the controller, but as I've stated before, we've seen issues with snapshot creation for many years now, and killing dmeventd sorted that out except on these two hosts.  And they are special in that they create multiple snapshots of the same origin.  Perhaps that's the clue, since frankly that's the one thing they share, and the one thing that makes them distinct from the other hosts we run.

Kind regards,
Jaco




* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24 20:01     ` Jaco Kroon
@ 2023-08-24 20:19       ` Laurence Oberman
  2023-08-24 23:40         ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2023-08-24 20:19 UTC (permalink / raw)
  To: Jaco Kroon, Bart Van Assche, linux-block@vger.kernel.org

On Thu, 2023-08-24 at 22:01 +0200, Jaco Kroon wrote:
> Hi,
> 
> On 2023/08/24 19:29, Laurence Oberman wrote:
> 
> > On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
> > > On 6/9/23 00:29, Jaco Kroon wrote:
> > > > I'm attaching dmesg -T and ps axf.  dmesg in particular may
> > > > provide
> > > > clues as it provides a number of stack traces indicating
> > > > stalling
> > > > at
> > > > IO time.
> > > > 
> > > > Once this has triggered, even commands such as "lvs" goes into
> > > > uninterruptable wait, I unfortunately didn't test "dmsetup ls"
> > > > now
> > > > and triggered a reboot already (system needs to be up).
> > > To me the call traces suggest that an I/O request got stuck.
> > > Unfortunately call traces are not sufficient to identify the root
> > > cause
> > > in case I/O gets stuck. Has debugfs been mounted? If so, how
> > > about
> > > dumping the contents of /sys/kernel/debug/block/ into a tar file
> > > after
> > > the lockup has been reproduced and sharing that information?
> > > 
> > > tar -czf- -C /sys/kernel/debug/block . >block.tgz
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > > 
> > One I am aware of is this
> > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > Author: David Jeffery <djeffery@redhat.com>
> > 
> > Can we try get a vmcore (assuming its not a secure site)
> 
> Certainly.  Obviously on any host handling any kind of sensitive data
> there is a likelihood that sensitive data may be present in the
> vmcore, 
> as such I more than happy to create a vmcore, I'm assuming this will 
> create a kernel version of a core dump ... with 256GB of RAM (most of
> which goes towards disk caches) I'm further assuming this file can be
> potentially large.  Where will this get stored should the capture be 
> made?  (I need to ensure that the filesystem has sufficient storage 
> available)
> 
> > 
> > Add these to /etc/sysctl.conf
> > 
> > kernel.panic_on_io_nmi = 1
> > kernel.panic_on_unrecovered_nmi = 1
> > kernel.unknown_nmi_panic = 1
> > 
> > Run sysctl -p
> > Ensure kdump is running and can capture a vmcore
> Done.  Had to enable a few extra kernel options to get all the other 
> requirements, so scheduled a reboot to activate the new kernel. This 
> will happen on Saturday morning very early.
> > 
> > When it locks up again
> > send an NMI via the SuperMicro Web Managemnt interface
> 
> Possible to send from sysrq at the keyboard?  Otherwise I'll just
> need 
> to set up the RMI, will just be easier to do this from the keyboard
> if 
> possible, it's not always if it's left too late.
> 
> > 
> > Share the vmcore, or we can have you capture some specifics from it
> > to
> > triage.
> 
> I'd prefer you let me know what you need ... security concerns and
> all 
> ... frankly, I highly doubt there is any data that is really so 
> sensitive that it can be classified as "top secret" but we do have
> NDAs 
> in place prohibiting me from sharing anything that may potentially 
> contain customer related data ...
> 
> Kind regards,
> Jaco
> 

Hello, this would usually need an NMI sent from a management interface, as with the system locked up there is no guarantee a sysrq-c will get there from the keyboard.  You could try it though.

As long as you have in /etc/kdump.conf 

path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31

This will get kernel only pages and would not be very big.

I could work with you privately to get what we need out of the vmcore
and we would avoid transferring it.

Thanks
Laurence



* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24 20:19       ` Laurence Oberman
@ 2023-08-24 23:40         ` Jaco Kroon
  2023-08-25 12:01           ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-08-24 23:40 UTC (permalink / raw)
  To: Laurence Oberman, Bart Van Assche, linux-block@vger.kernel.org

Hi Laurence,
>>> One I am aware of is this
>>> commit 106397376c0369fcc01c58dd189ff925a2724a57
>>> Author: David Jeffery <djeffery@redhat.com>

I should have held off on replying until I finished looking into this.
This looks very interesting indeed.  That said, this is my first serious
venture into the block layer of the kernel :), so the essay below is
more for my own understanding than anything else; it would be great to
get a better grasp of the underlying principles here, and your feedback
on my understanding would be much appreciated.

If I understand this correctly (and that's a BIG IF) then it's possible
that a bunch of IO requests go into a wait queue for whatever reason
(pending some other event?).  It's then possible that some of them
should get woken up, and previously (prior to the patch above) it could
happen that only a single request got woken up, and that request would
then go straight back to the wait queue ... with the patch, isn't it
still possible that all the woken-up requests just go straight back to
the wait queue (albeit less likely)?

Creating a snapshot should, based on my understanding of what a
snapshot is, block writes whilst the snapshot is being created (i.e.
make them go to the wait queue).  Could it be that the process of
setting up the snapshot (which itself involves writes) then blocks
because of this?  I.e., the write request that needs to get into the
next batch to allow other writes to proceed is itself blocked?

And as I write that it stops making sense to me, because most likely
the IO for creating a snapshot would only block writes to the LV, not
to the underlying PVs which contain the metadata for the VG that is
being updated.

But still ... if we think about this, the probability of that "bug"
hitting would increase as the number of outstanding IO requests
increases?  With iostat reporting r_await values upwards of 100 and
w_await values periodically going up to 5000 (both generally in the
20-50 range for the last few minutes that I've been watching them), the
number of requests blocked in-kernel could well be much higher than
that, so it makes perfect sense to me that it could be related to this.
On the other hand, IIRC iostat -dmx 1 usually showed only minimal
requests, if any, in either [rw]_await during lockups.

Consider the AHCI controller on the other hand, where we've got 7200
RPM SATA drives which are slow to begin with.  There we've also got
traditional snapshots, which cause an IO bottleneck of their own and
artificially raise IO demand (much more so than thin snaps; I really
wish I could figure out the migration process to convert this whole
host to thin pools, but lvconvert scares me something crazy -- a rough
sketch of what I mean is below).  Having that first snapshot already
causes an IO bottleneck: ignoring the relevant metadata updates, every
write to a not-yet-duplicated segment becomes a read + write + write to
clone the written-to segment into the snapshot (thin pools are just a
read + write for the same), so IO is already more demanding, and now we
try to create another snapshot.
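
To be concrete about the migration I keep putting off, this
out-of-place approach is roughly what I have in mind (purely a sketch
with made-up names and sizes, assuming enough free space in the VG, and
deliberately avoiding an in-place lvconvert):

  # create a thin pool plus a thin LV of matching virtual size
  lvcreate --type thin-pool -L 10T -n pool0 vg0
  lvcreate --type thin -V 4T --thinpool pool0 -n origin_new vg0
  # copy the (quiesced/unmounted) origin across, skipping zero runs
  dd if=/dev/vg0/origin of=/dev/vg0/origin_new bs=64M conv=sparse status=progress
  # then lvrename origin_new into place and lvremove the old LV once happy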

What if some IO fails to finish (due to continually being put back into 
the wait queue), thus blocking the process of creating the snapshot to 
begin with?

I know there are numerous other people using snapshots, but I've often
wondered how many use them quite as heavily as we do on this specific
host.  Given the massive amount of virtual machine infrastructure out
there I think there must be quite a lot, but then I also think many of
them use "enterprise" (for whatever your definition of that is) storage
or something like ceph, so not based on LVM.  And more and more of it
is SSD/flash or even NVMe, which, given the faster response times,
would also lower the risk of IO-related problems showing themselves.

The risk seems to be during the creation of snapshots, so IO not making 
progress makes sense.

I've back-ported the referenced patch onto 6.4.12 now, which will go
live Saturday morning.  Perhaps we'll be sorted now.  Will also revert
to mq-deadline, which has been shown to more regularly trigger this, so
let's see.
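
For reference, the runtime switch I have in mind is just the usual
sysfs knob, per device or via a udev rule (sdX below is a placeholder):

  echo mq-deadline > /sys/block/sdX/queue/scheduler
  cat /sys/block/sdX/queue/scheduler    # should now show [mq-deadline]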

>
> Hello, this would usually need an NMI sent from a management interface
> as with it locked up no guarantee a sysrq c will get there from the
> keyboard.
> You could try though.
>
> As long as you have in /etc/kdump.conf
>
> path /var/crash
> core_collector makedumpfile -l --message-level 7 -d 31
>
> This will get kernel only pages and would not be very big.
>
> I could work with you privately to get what we need out of the vmcore
> and we would avoid transferring it.
Thanks.  This helps.  Let's get a core first (if it's going to happen 
again) and then take it from there.

Kind regards,
Jaco


* Re: LVM kernel lockup scenario during lvcreate
  2023-08-24 23:40         ` Jaco Kroon
@ 2023-08-25 12:01           ` Laurence Oberman
  2023-08-26 18:18             ` Jaco Kroon
  2023-09-06 21:03             ` Jaco Kroon
  0 siblings, 2 replies; 26+ messages in thread
From: Laurence Oberman @ 2023-08-25 12:01 UTC (permalink / raw)
  To: Jaco Kroon, Bart Van Assche, linux-block@vger.kernel.org

On Fri, 2023-08-25 at 01:40 +0200, Jaco Kroon wrote:
> Hi Laurence,
> > > > One I am aware of is this
> > > > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > > > Author: David Jeffery <djeffery@redhat.com>
> 
> I should have held off on replying until I finished looking into
> this.  
> This looks very interesting indeed, that said, this is my first
> serious 
> venture into the block layers of the kernel :), so the essay below is
> more for my understanding than anything else, would be great to have
> a 
> better understanding of the underlying principles here and your
> feedback 
> on my understanding thereof would be much appreciated.
> 
> If I understand this correctly (and that's a BIG IF) then it's
> possible 
> that a bunch of IO requests goes into a wait queue for whatever
> reason 
> (pending some other event?).  It's then possible that some of them 
> should get woken up, and previously (prior to above) it could happen 
> that only a single request gets woken up, and then that request would
> go 
> straight back to the wait queue ... with the patch, isn't it still 
> possible that all woken up requests could just go straight back to
> the 
> wait queue (albeit less likely)?
> 
> Could the creation of a snapshot (which should based on my
> understanding 
> of what a snapshot is block writes whilst the snapshot is being
> created, 
> ie, make them go to the wait queue), and could it be that the process
> of 
> setting up the snapshot (which itself involves writes) then
> potentially 
> block due to this?  Ie, the write request that needs to get into the 
> next batch to allow other writes to proceed gets blocked?
> 
> And as I write that it stops making sense to me because most likely
> the 
> IO for creating a snapshot would only result in blocking writes to
> the 
> LV, not to the underlying PVs which contains the metadata for the VG 
> which is being updated.
> 
> But still ... if we think about this, the probability of that "bug" 
> hitting would increase as the number of outstanding IO requests 
> increase?  With iostat reporting r_await values upwards of 100 and 
> w_await values periodically going up to 5000 (both generally in the 
> 20-50 range for the last few minutes that I've been watching them),
> it 
> would make sense for me that the number of requests blocking in-
> kernel 
> could be much higher than that, it makes perfect sense for me that it
> could be related to this.  On the other hand, IIRC iostat -dmx 1
> usually 
> showed only minimal if any requests in either [rw]_await during
> lockups.
> 
> Consider the AHCI controller on the other hand where we've got 7200
> RPM 
> SATA drives which are slow to begin with, now we've got traditional 
> snapshots, which are also causing an IO bottleneck and artificially 
> raising IO demand (much more so than thin snaps, really wish I could 
> figure out the migration process to convert this whole host to thin 
> pools but lvconvert scares me something crazy), so now having that
> first 
> snapshot causes IO bottleneck (ignoring relevant metadata updates,
> every 
> write to a not yet duplicated segment becomes a read + write + write
> to 
> clone the written to segment to the snapshot - thin pools just a read
> + 
> write for same), so already IO is more demanding, and now we try to 
> create another snapshot.
> 
> What if some IO fails to finish (due to continually being put back
> into 
> the wait queue), thus blocking the process of creating the snapshot
> to 
> begin with?
> 
> I know there are numerous other people using snapshots, but I've
> often 
> wondered how many use it quite as heavily as we do on this specific 
> host?  Given the massive amount of virtual machine infrastructure on
> the 
> one hand I think there must be quite a lot, but then I also think
> many 
> of them use "enterprise" (for whatever your definition of that is) 
> storage or something like ceph, so not based on LVM.  And more and
> more 
> so either SSD/flash or even NVMe, which given the faster response
> times 
> would also lower the risks of IO related problems from showing
> themselves.
> 
> The risk seems to be during the creation of snapshots, so IO not
> making 
> progress makes sense.
> 
> I've back-ported the referenced path onto 6.4.12 now, which will go 
> alive Saturday morning.  Perhaps we'll be sorted now.  Will also
> revert 
> to mq-deadline which has been shown to more regularly trigger this,
> so 
> let's see.
> 
> > 
> > Hello, this would usually need an NMI sent from a management
> > interface
> > as with it locked up no guarantee a sysrq c will get there from the
> > keyboard.
> > You could try though.
> > 
> > As long as you have in /etc/kdump.conf
> > 
> > path /var/crash
> > core_collector makedumpfile -l --message-level 7 -d 31
> > 
> > This will get kernel only pages and would not be very big.
> > 
> > I could work with you privately to get what we need out of the
> > vmcore
> > and we would avoid transferring it.
> Thanks.  This helps.  Let's get a core first (if it's going to happen
> again) and then take it from there.
> 
> Kind regards,
> Jaco
> 

Hello Jaco
These hangs usually require looking at the stacks to see where and why
we are blocked.  The vmcore will definitely help in that regard.
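
Once there is a vmcore, something along these lines is usually where I
would start (a sketch, assuming the crash utility and a vmlinux with
debug info for the running kernel; paths are examples only):

  crash /path/to/vmlinux /var/crash/<dir>/vmcore
  crash> log              # dmesg from the crashed kernel
  crash> ps | grep UN     # which tasks are stuck in D state
  crash> foreach UN bt    # stacks of all uninterruptible tasks
  crash> dev -d           # in-flight I/O per disk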

Regards
Laurence



* Re: LVM kernel lockup scenario during lvcreate
  2023-08-25 12:01           ` Laurence Oberman
@ 2023-08-26 18:18             ` Jaco Kroon
  2023-09-06 21:03             ` Jaco Kroon
  1 sibling, 0 replies; 26+ messages in thread
From: Jaco Kroon @ 2023-08-26 18:18 UTC (permalink / raw)
  To: Laurence Oberman, Bart Van Assche, linux-block@vger.kernel.org

Hi Laurence,

On 2023/08/25 14:01, Laurence Oberman wrote:
>>> Hello, this would usually need an NMI sent from a management
>>> interface
>>> as with it locked up no guarantee a sysrq c will get there from the
>>> keyboard.
>>> You could try though.
>>>
>>> As long as you have in /etc/kdump.conf
>>>
>>> path /var/crash
>>> core_collector makedumpfile -l --message-level 7 -d 31
>>>
>>> This will get kernel only pages and would not be very big.
>>>
>>> I could work with you privately to get what we need out of the
>>> vmcore
>>> and we would avoid transferring it.
>> Thanks.  This helps.  Let's get a core first (if it's going to happen
>> again) and then take it from there.
>>
>> Kind regards,
>> Jaco
>>
> Hello Jaco
> These hangs usually require the stacks to see where and why we are
> blocked. The vmcore will definitely help in that regard.

Linux crowsnest 6.4.12-uls #1 SMP PREEMPT_DYNAMIC Fri Aug 25 02:46:44 
SAST 2023 x86_64 Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz GenuineIntel 
GNU/Linux

With the patch you referenced.

/proc/vmcore exists post-kexec in the "new" kernel; if I just copy
that, do we need anything else?  Once I've copied /proc/vmcore and
rebooted back into a more "normal" system, how do I start extracting
information out of that core?

I don't have a kdump binary, or any other seemingly useful tooling,
even though I've got kexec-tools installed (which is where this comes
from as far as I can tell) ... no /etc/kdump.conf either.  I followed
the instructions here (with help from other sources):

https://www.kernel.org/doc/Documentation/kdump/kdump.txt

The kdump references I can find w.r.t. /etc/kdump.conf all seem to be
related to Red Hat and Fedora ... neither of which applies (directly)
to my Gentoo environment.

With 256G of RAM I'm assuming crashkernel=512M should be sufficient?
crashkernel=auto doesn't work.
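
With no kdump service on Gentoo, my guess (please correct me) is that
the capture step ends up being done by hand from the capture kernel,
roughly along these lines:

  # after the crash kernel has booted:
  makedumpfile -l --message-level 7 -d 31 /proc/vmcore /var/crash/vmcore
  # or, worst case, a raw (much larger) copy:
  cp --sparse=always /proc/vmcore /var/crash/vmcore
  reboot -f

and I can confirm the reservation after a normal boot via
/sys/kernel/kexec_crash_size.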

The firmware upgrade on the controller killed the reboot though ... the
BIOS no longer speaks to the controller, even though when performing
the update the running kernel immediately noticed that the firmware had
been upgraded.  So dead in the water at the moment.

Kind regards,
Jaco




* Re: LVM kernel lockup scenario during lvcreate
  2023-08-25 12:01           ` Laurence Oberman
  2023-08-26 18:18             ` Jaco Kroon
@ 2023-09-06 21:03             ` Jaco Kroon
  2023-09-06 21:22               ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Jaco Kroon @ 2023-09-06 21:03 UTC (permalink / raw)
  To: Laurence Oberman, Bart Van Assche, linux-block@vger.kernel.org

Hi,

Just some feedback:

After an initial nightmare getting the system bootable again (it failed
to boot after the MPT3SAS firmware was hot-upgraded), this has been the
most stable the system has been in a long time.  Not to mention more
performant.

Thanks for all the assistance; I'm hoping that this can now be put to rest.

The AHCI system, as it turns out, had a pending disk failure that is
still not picked up by SMART.  After realising that drive was not on
par with the other drives and kicking it out of the array, that system
too seems much happier.  So this is possibly related in that it's
"underlying hardware" causing the problems, but it could also be not
related at all.  For the time being we're just happy that everything is
working as intended.  Ten days is by far the longest we've managed on
this host with mq-deadline as IO scheduler.


crowsnest [22:52:07] ~ # uptime
  22:52:08 up 10 days, 12:29,  2 users,  load average: 10.61, 10.74, 13.29

crowsnest [22:53:17] ~ # cat /sys/class/block/*/queue/scheduler | sort | uniq -c
      70 none
      32 none [mq-deadline] kyber bfq

crowsnest [22:54:12] ~ # iostat -dmx /dev/sd[a-z] /dev/sda[a-z]
Linux 6.4.12-uls (crowsnest)     09/06/23     _x86_64_    (6 CPU)

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     
w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz d/s     dMB/s   drqm/s  
%drqm d_await dareq-sz     f/s f_await aqu-sz  %util
sda             50.56      2.98   218.07  81.18   69.31    60.41 
55.99      2.93   698.08  92.57   65.58    53.59    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    8.39    2.50 29.06
sdaa            43.59      2.95   214.92  83.14   58.59    69.38 
36.40      2.46   594.10  94.23   36.06    69.18    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.12    3.89 17.73
sdab            44.97      2.92   206.44  82.11   51.87    66.55 
38.25      2.46   592.74  93.94   23.42    65.89    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   16.99    3.26 17.98
sdac            44.52      2.90   200.94  81.86   42.48    66.71 
38.03      2.45   591.35  93.96   15.54    66.09    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   12.51    2.50 17.08
sdad            44.73      2.94   208.61  82.34   47.51    67.26 
37.48      2.45   591.42  94.04   18.53    67.01    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   16.59    2.85 18.12
sdae            44.19      2.90   200.65  81.95   49.42    67.19 
37.78      2.45   591.21  93.99   19.63    66.49    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   16.55    2.95 17.90
sdaf            54.23      3.02   219.60  80.20   46.23    56.96 
59.82      2.95   698.69  92.11   21.86    50.46    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    6.82    3.86 28.22
sdb             54.29      3.11   244.19  81.81   38.22    58.60 
60.49      2.99   708.72  92.14   14.89    50.60    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    5.83    3.01 27.55
sdc             65.14      4.23   196.15  75.07   45.48    66.55 
53.74      3.43   830.25  93.92   30.79    65.35    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    6.66    4.66 27.23
sdd             52.41      2.99   216.72  80.53   34.92    58.33 
59.28      2.94   697.06  92.16   12.50    50.77    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    5.20    2.60 26.97
sde             54.82      3.01   219.59  80.02   40.83    56.28 
61.64      2.96   699.08  91.90   16.62    49.11    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    5.57    3.30 27.39
sdf             54.27      3.11   244.50  81.83   35.74    58.61 
59.98      2.99   709.02  92.20   13.20    51.03    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    5.30    2.76 26.88
sdg             71.33      4.35   211.99  74.82   61.63    62.42 
59.98      3.49   837.89  93.32   65.93    59.50    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    8.57    3.68 28.50
sdh             50.62      2.98   218.18  81.17   71.07    60.35 
56.47      2.93   698.11  92.52   66.92    53.18    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    8.77    2.71 29.09
sdi             54.55      3.01   219.40  80.09   39.98    56.52 
60.81      2.95   699.33  92.00   15.53    49.75    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    5.58    3.16 27.26
sdj             53.87      3.09   241.35  81.75   51.94    58.69 
55.36      2.99   709.64  92.76   74.17    55.27    0.00 0.00     0.00   
0.00    0.00     0.00    0.00    0.00    2.18 26.72
sdk             53.31      3.09   242.76  81.99   53.54    59.40 
59.90      3.00   712.64  92.25   31.00    51.33    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    6.17    0.03 27.61
sdl             43.73      2.96   214.65  83.08   52.00    69.25 
36.42      2.46   593.91  94.22   34.34    69.13    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.72    3.55 17.67
sdm             43.58      2.94   209.23  82.76   50.34    69.08 
36.21      2.45   592.52  94.24   30.80    69.34    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.58    3.33 17.81
sdn             66.85      4.35   224.69  77.07   51.15    66.67 
54.92      3.31   797.51  93.56   43.70    61.67    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    7.57    1.15 28.54
sdo             71.51      4.35   211.60  74.74   61.11    62.25 
60.45      3.49   837.74  93.27   60.46    59.06    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    8.33    3.36 28.38
sdp             66.89      4.36   224.98  77.08   64.98    66.77 
54.75      3.31   797.56  93.58   55.14    61.83    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    9.22    2.70 28.08
sdq             68.06      4.22   184.72  73.08    2.88    63.45 
58.85      3.61   871.54  93.67   68.10    62.83    0.00 0.00     0.00   
0.00    0.00     0.00    6.66   10.21    4.27 28.96
sdr             43.88      2.96   214.93  83.05   52.68    68.96 
36.73      2.46   593.97  94.18   35.18    68.59    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.51    3.63 17.61
sds             43.61      2.94   209.03  82.74   50.47    68.97 
36.23      2.45   592.11  94.23   31.04    69.26    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.28    3.35 17.75
sdt             68.21      4.22   184.70  73.03   49.23    63.33 
59.34      3.61   870.36  93.62   23.89    62.26    0.00 0.00     0.00   
0.00    0.00     0.00    6.67    6.78    0.10 26.44
sdu             44.79      2.92   206.04  82.14   36.88    66.83 
37.95      2.46   593.06  93.99   10.00    66.41    0.00 0.00     0.00   
0.00    0.00     0.00    1.56    8.72    2.05 16.08
sdv             65.07      4.23   195.93  75.07   51.74    66.58 
53.77      3.43   830.24  93.92   36.03    65.32    0.00 0.00     0.00   
0.00    0.00     0.00    6.66    7.96    0.63 27.63
sdw             53.09      3.09   242.20  82.02   53.06    59.51 
59.61      3.00   711.79  92.27   30.23    51.50    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    6.05    4.66 27.56
sdx             43.76      2.92   207.55  82.59   64.77    68.37 
36.96      2.46   593.68  94.14   50.43    68.16    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   15.86    4.72 17.88
sdy             53.98      3.10   244.09  81.89   41.78    58.87 
60.11      2.99   709.16  92.19   15.81    50.93    0.00 0.00     0.00   
0.00    0.00     0.00    6.22    7.10    3.25 28.12
sdz             44.19      2.90   200.84  81.97   48.94    67.20 
37.61      2.45   591.62  94.02   19.10    66.81    0.00 0.00     0.00   
0.00    0.00     0.00    1.56   16.49    2.91 17.88

crowsnest [22:54:33] ~ # dmesg | grep mpt3sas
[    2.860853] mpt3sas version 43.100.00.00 loaded
[    2.861232] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED, 
total mem (263572916 kB)
[    2.920414] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k
[    2.920537] mpt3sas_cm0: MSI-X vectors supported: 96
[    2.920723] mpt3sas_cm0:  0 6 6
[    2.921071] mpt3sas_cm0: High IOPs queues : disabled
[    2.921163] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 45
[    2.921254] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 46
[    2.921345] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 47
[    2.921436] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 49
[    2.921526] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 50
[    2.921617] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 51
[    2.921707] mpt3sas_cm0: iomem(0x00000000fb240000), 
mapped(0x000000009b390d95), size(65536)
[    2.921808] mpt3sas_cm0: ioport(0x000000000000e000), size(256)
[    2.981165] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default 
host page size to 4k
[    2.981267] mpt3sas_cm0: sending message unit reset !!
[    2.982929] mpt3sas_cm0: message unit reset: SUCCESS
[    3.011138] mpt3sas_cm0: scatter gather: sge_in_main_msg(1), 
sge_per_chain(7), sge_per_io(128), chains_per_io(19)
[    3.011450] mpt3sas_cm0: request pool(0x00000000192e269b) - 
dma(0xfff00000): depth(3200), frame_size(128), pool_size(400 kB)
[    3.017719] mpt3sas_cm0: sense pool(0x000000004e7d07f8) - 
dma(0xff780000): depth(2939), element_size(96), pool_size (275 kB)
[    3.017919] mpt3sas_cm0: reply pool(0x0000000031a98fd2) - 
dma(0xff700000): depth(3264), frame_size(128), pool_size(408 kB)
[    3.018055] mpt3sas_cm0: config page(0x000000003932e626) - 
dma(0xff6fa000): size(512)
[    3.018153] mpt3sas_cm0: Allocated physical memory: size(8380 kB)
[    3.018247] mpt3sas_cm0: Current Controller Queue Depth(2936),Max 
Controller Queue Depth(3072)
[    3.018365] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[    3.186266] mpt3sas_cm0: _base_display_fwpkg_version: complete
[    3.186634] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.10.00), 
ChipRevision(0x02)
[    3.186734] mpt3sas_cm0: Protocol=(Initiator,Target), 
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[    3.187767] mpt3sas_cm0: sending port enable !!
[    3.188250] mpt3sas_cm0: hba_port entry: 000000007c8cd935, port: 255 
is added to hba_port list
[    3.189417] mpt3sas_cm0: host_add: handle(0x0001), 
sas_addr(0x5003048016846300), phys(8)
[    3.190895] mpt3sas_cm0: expander_add: handle(0x0009), 
parent(0x0001), sas_addr(0x500304800175f0bf), phys(51)

crowsnest [22:55:00] ~ # uname -a
Linux crowsnest 6.4.12 #2 SMP PREEMPT_DYNAMIC Sat Aug 26 08:10:42 SAST 
2023 x86_64 Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz GenuineIntel GNU/Linux

With the only extra patch being the "sbitmap: fix batching wakeup" patch 
from David Jeffery.

Kind regards,
Jaco

On 2023/08/25 14:01, Laurence Oberman wrote:

> On Fri, 2023-08-25 at 01:40 +0200, Jaco Kroon wrote:
>> Hi Laurence,
>>>>> One I am aware of is this
>>>>> commit 106397376c0369fcc01c58dd189ff925a2724a57
>>>>> Author: David Jeffery <djeffery@redhat.com>
>> I should have held off on replying until I finished looking into
>> this.
>> This looks very interesting indeed, that said, this is my first
>> serious
>> venture into the block layers of the kernel :), so the essay below is
>> more for my understanding than anything else, would be great to have
>> a
>> better understanding of the underlying principles here and your
>> feedback
>> on my understanding thereof would be much appreciated.
>>
>> If I understand this correctly (and that's a BIG IF) then it's
>> possible
>> that a bunch of IO requests goes into a wait queue for whatever
>> reason
>> (pending some other event?).  It's then possible that some of them
>> should get woken up, and previously (prior to above) it could happen
>> that only a single request gets woken up, and then that request would
>> go
>> straight back to the wait queue ... with the patch, isn't it still
>> possible that all woken up requests could just go straight back to
>> the
>> wait queue (albeit less likely)?
>>
>> Could the creation of a snapshot (which should based on my
>> understanding
>> of what a snapshot is block writes whilst the snapshot is being
>> created,
>> ie, make them go to the wait queue), and could it be that the process
>> of
>> setting up the snapshot (which itself involves writes) then
>> potentially
>> block due to this?  Ie, the write request that needs to get into the
>> next batch to allow other writes to proceed gets blocked?
>>
>> And as I write that it stops making sense to me because most likely
>> the
>> IO for creating a snapshot would only result in blocking writes to
>> the
>> LV, not to the underlying PVs which contains the metadata for the VG
>> which is being updated.
>>
>> But still ... if we think about this, the probability of that "bug"
>> hitting would increase as the number of outstanding IO requests
>> increase?  With iostat reporting r_await values upwards of 100 and
>> w_await values periodically going up to 5000 (both generally in the
>> 20-50 range for the last few minutes that I've been watching them),
>> it
>> would make sense for me that the number of requests blocking in-
>> kernel
>> could be much higher than that, it makes perfect sense for me that it
>> could be related to this.  On the other hand, IIRC iostat -dmx 1
>> usually
>> showed only minimal if any requests in either [rw]_await during
>> lockups.
>>
>> Consider the AHCI controller on the other hand where we've got 7200
>> RPM
>> SATA drives which are slow to begin with, now we've got traditional
>> snapshots, which are also causing an IO bottleneck and artificially
>> raising IO demand (much more so than thin snaps, really wish I could
>> figure out the migration process to convert this whole host to thin
>> pools but lvconvert scares me something crazy), so now having that
>> first
>> snapshot causes IO bottleneck (ignoring relevant metadata updates,
>> every
>> write to a not yet duplicated segment becomes a read + write + write
>> to
>> clone the written to segment to the snapshot - thin pools just a read
>> +
>> write for same), so already IO is more demanding, and now we try to
>> create another snapshot.
>>
>> What if some IO fails to finish (due to continually being put back
>> into
>> the wait queue), thus blocking the process of creating the snapshot
>> to
>> begin with?
>>
>> I know there are numerous other people using snapshots, but I've
>> often
>> wondered how many use it quite as heavily as we do on this specific
>> host?  Given the massive amount of virtual machine infrastructure on
>> the
>> one hand I think there must be quite a lot, but then I also think
>> many
>> of them use "enterprise" (for whatever your definition of that is)
>> storage or something like ceph, so not based on LVM.  And more and
>> more
>> so either SSD/flash or even NVMe, which given the faster response
>> times
>> would also lower the risks of IO related problems from showing
>> themselves.
>>
>> The risk seems to be during the creation of snapshots, so IO not
>> making
>> progress makes sense.
>>
>> I've back-ported the referenced path onto 6.4.12 now, which will go
>> alive Saturday morning.  Perhaps we'll be sorted now.  Will also
>> revert
>> to mq-deadline which has been shown to more regularly trigger this,
>> so
>> let's see.
>>
>>> Hello, this would usually need an NMI sent from a management
>>> interface
>>> as with it locked up no guarantee a sysrq c will get there from the
>>> keyboard.
>>> You could try though.
>>>
>>> As long as you have in /etc/kdump.conf
>>>
>>> path /var/crash
>>> core_collector makedumpfile -l --message-level 7 -d 31
>>>
>>> This will get kernel only pages and would not be very big.
>>>
>>> I could work with you privately to get what we need out of the
>>> vmcore
>>> and we would avoid transferring it.
>> Thanks.  This helps.  Let's get a core first (if it's going to happen
>> again) and then take it from there.
>>
>> Kind regards,
>> Jaco
>>
> Hello Jaco
> These hangs usually require the stacks to see where and why we are
> blocked. The vmcore will definitely help in that regard.
>
> Regards
> Laurence
>


* Re: LVM kernel lockup scenario during lvcreate
  2023-09-06 21:03             ` Jaco Kroon
@ 2023-09-06 21:22               ` Jens Axboe
  2023-09-06 22:05                 ` Jaco Kroon
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2023-09-06 21:22 UTC (permalink / raw)
  To: Jaco Kroon, Laurence Oberman, Bart Van Assche,
	linux-block@vger.kernel.org

On 9/6/23 3:03 PM, Jaco Kroon wrote:
> With the only extra patch being the "sbitmap: fix batching wakeup"
> patch from David Jeffery.

Should we get that sent to stable, then? It applies cleanly.

-- 
Jens Axboe



* Re: LVM kernel lockup scenario during lvcreate
  2023-09-06 21:22               ` Jens Axboe
@ 2023-09-06 22:05                 ` Jaco Kroon
  0 siblings, 0 replies; 26+ messages in thread
From: Jaco Kroon @ 2023-09-06 22:05 UTC (permalink / raw)
  To: Jens Axboe, Laurence Oberman, Bart Van Assche,
	linux-block@vger.kernel.org

Hi,

On 2023/09/06 23:22, Jens Axboe wrote:
> On 9/6/23 3:03 PM, Jaco Kroon wrote:
>> With the only extra patch being the "sbitmap: fix batching wakeup"
>> patch from David Jeffery.
> Should we get that sent to stable, then? It applies cleanly.

It's debatable whether that was what fixed the underlying problem.  
Personally it makes sense to me that it could well have been.

There are three things here that *could* have been the resolution:

1.  firmware bug on the MPT3SAS controller that was fixed by the 
firmware upgrade.
2.  the relevant patch (imho likely).
3.  other changes from 6.4.6 to 6.4.12 (I didn't spot anything that I 
think relates).
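
As a side note on (1): if my memory of the mpt3sas sysfs attributes is
right (treat this as an assumption), the running firmware version can
be cross-checked with something like:

  cat /sys/class/scsi_host/host*/version_fw
  dmesg | grep -i fwversion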

I'm also not convinced it was the controller; as such I think it likely
that the patch in question is what did the trick.

Kind regards,
Jaco

>


end of thread, other threads:[~2023-09-06 22:06 UTC | newest]

Thread overview: 26+ messages
2023-06-09  7:29 LVM kernel lockup scenario during lvcreate Jaco Kroon
2023-06-12 18:40 ` Bart Van Assche
2023-06-12 19:36   ` Jaco Kroon
2023-06-18 23:28     ` Bart Van Assche
2023-06-18 19:34   ` Jaco Kroon
2023-06-18 23:56     ` Bart Van Assche
2023-06-19  6:06       ` Jaco Kroon
2023-06-26  8:30       ` Jaco Kroon
2023-06-26 16:42         ` Bart Van Assche
2023-06-26 23:29           ` Jaco Kroon
2023-07-11 13:22             ` Jaco Kroon
2023-07-11 14:45               ` Bart Van Assche
2023-07-12 10:12                 ` Jaco Kroon
2023-07-12 13:43                   ` Bart Van Assche
     [not found]                     ` <ef2812b4-7853-9dda-85dd-210636840a59@uls.co.za>
2023-08-24  7:29                       ` Jaco Kroon
2023-08-24 17:13                         ` Bart Van Assche
2023-08-24 20:16                           ` Jaco Kroon
2023-08-24 17:29   ` Laurence Oberman
2023-08-24 20:01     ` Jaco Kroon
2023-08-24 20:19       ` Laurence Oberman
2023-08-24 23:40         ` Jaco Kroon
2023-08-25 12:01           ` Laurence Oberman
2023-08-26 18:18             ` Jaco Kroon
2023-09-06 21:03             ` Jaco Kroon
2023-09-06 21:22               ` Jens Axboe
2023-09-06 22:05                 ` Jaco Kroon
