* FSTRIM timeout/errors on WD RED SA500 NAS SSD
@ 2022-06-24 15:23 Forza
2022-06-24 18:37 ` Roman Mamedov
2022-06-25 6:44 ` Qu Wenruo
0 siblings, 2 replies; 6+ messages in thread
From: Forza @ 2022-06-24 15:23 UTC (permalink / raw)
To: linux-btrfs
Hi,
I have discovered an odd issue where "fstrim" on a Btrfs filesystem
consistently fails, while "mkfs.btrfs" always succeeds with full device
discard.
Hardware:
* SuperMicro server
* LSI/Broadcom HBA 9500-8i SAS/SATA controller
* WD RED SA500 NAS SATA SSD 2TB (WDS200T1R0A-68A4W0)
Drive FW: 411000WR
* Alpine Linux kernel 5.15.48
* /sys/block/sdf/queue/
discard_granularity:512
discard_max_bytes:134217216
discard_max_hw_bytes:134217216
# btrfs fi us -T /mnt/nas_ssd/
Overall:
Device size: 7.13TiB
Device allocated: 90.06GiB
Device unallocated: 7.04TiB
Device missing: 0.00B
Used: 86.73GiB
Free (estimated): 3.52TiB (min: 3.52TiB)
Free (statfs, df): 3.52TiB
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 94.58MiB (used: 0.00B)
Multiple profiles: no
Data Metadata System
Id Path RAID1 RAID1 RAID1 Unallocated
-- --------- -------- --------- -------- -----------
9 /dev/sdc1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
10 /dev/sdf1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
11 /dev/sdd1 - - - 894.25GiB
12 /dev/sde1 - - - 894.25GiB
13 /dev/sdg1 - - - 1.82TiB
14 /dev/sdh1 - - - 1.82TiB
-- --------- -------- --------- -------- -----------
Total 44.00GiB 1.00GiB 32.00MiB 8.94TiB
Used 43.23GiB 133.44MiB 16.00KiB
The root cause I believe is that the WD drives take 1.5-2.5 minutes to
do a full device discard. The Kingston DC500 drives only take 6-7
seconds for the same. I have 4 identical WD drives and 2 Kingston
drives. All WD drives have the same issue.
When issuing 'fstrim -v /mnt/btrfs' I get the following message in dmesg
after about 30 seconds:
# time fstrim -v /mnt/nas_ssd/
/mnt/nas_ssd/: 6.2 TiB (6834839748608 bytes) trimmed
real 4m21.356s
user 0m0.001s
sys 0m0.241s
[ +0.000003] scsi target6:0:4: handle(0x0029), sas_address(0x5003048020db4543), phy(3)
[ +0.000003] scsi target6:0:4: enclosure logical id(0x5003048020db457f), slot(3)
[ +0.000003] scsi target6:0:4: enclosure level(0x0000), connector name( C0.1)
[ +0.000003] sd 6:0:4:0: No reference found at driver, assuming scmd(0x00000000eb0d9438) might have completed
[ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x00000000eb0d9438)
[ +0.000012] sd 6:0:4:0: attempting task abort!scmd(0x0000000075f63919), outstanding for 30397 ms & timeout 30000 ms
[ +0.000003] sd 6:0:4:0: [sdg] tag#2762 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
[ +0.000002] scsi target6:0:4: handle(0x0029), sas_address(0x5003048020db4543), phy(3)
[ +0.000004] scsi target6:0:4: enclosure logical id(0x5003048020db457f), slot(3)
[ +0.000002] scsi target6:0:4: enclosure level(0x0000), connector name( C0.1)
[ +0.000003] sd 6:0:4:0: No reference found at driver, assuming scmd(0x0000000075f63919) might have completed
[ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x0000000075f63919)
[ +0.255021] sd 6:0:4:0: Power-on or device reset occurred
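For reference, the CDB printed in the log above can be decoded by hand.
The following is a minimal sketch, assuming the standard 10-byte SCSI
UNMAP layout (opcode 0x42, big-endian parameter list length in bytes
7-8, parameter list made of an 8-byte header plus 16-byte block
descriptors). The function name is made up for illustration.

```python
import struct

def decode_unmap_cdb(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI UNMAP CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x42:
        raise ValueError("not a 10-byte UNMAP (0x42) CDB")
    # Bytes 7-8 hold the big-endian parameter list length.
    (param_len,) = struct.unpack_from(">H", cdb, 7)
    # Assumed layout: 8-byte header followed by 16-byte block descriptors.
    descriptors = max(param_len - 8, 0) // 16
    return {
        "opcode": cdb[0],
        "param_list_len": param_len,
        "descriptors": descriptors,
    }

# CDB bytes copied from the dmesg output above.
info = decode_unmap_cdb("42 00 00 00 00 00 00 00 18 00")
print(info)
```

Under these assumptions the timed-out command is an UNMAP carrying a
24-byte parameter list, i.e. a single block descriptor. The CDB alone
does not show how large the unmapped range was.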
An interesting observation is that "fstrim" works on the same device if
it is mounted as ext4. There are no errors in dmesg.
To sum up:
Works:
* "mkfs.btrfs"
* "btrfs replace"
* "btrfs device add"
* "fstrim" on ext4 mounted device.
Does not work:
* "fstrim" on Btrfs mounted device
* "blkdiscard" on /dev/sdX
The btrfs-progs code seems to do 'BLKDISCARD' in 1GiB chunks. This may
explain why "mkfs.btrfs", "btrfs replace" and "btrfs device add"
work, while the "fstrim" and "blkdiscard" tools do not.
https://github.com/kdave/btrfs-progs/blob/c0ad9bde429196db7e8710ea1abfab7a2bca2e43/common/device-utils.c#L79
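To illustrate the chunking idea, here is a minimal sketch of issuing
BLKDISCARD over a range in 1GiB pieces. It is not the btrfs-progs
code; the function names are hypothetical, and the ioctl number is
assumed to be the usual _IO(0x12, 119) from <linux/fs.h>. Whether
chunking alone avoids the timeout is only the hypothesis stated above.

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277   # assumed: _IO(0x12, 119) from <linux/fs.h>
CHUNK = 1 << 30       # 1 GiB, the chunk size mentioned above

def split_range(start: int, length: int, chunk: int = CHUNK):
    """Yield (offset, size) pairs covering [start, start + length)."""
    end = start + length
    while start < end:
        size = min(chunk, end - start)
        yield start, size
        start += size

def discard_in_chunks(path: str, start: int, length: int,
                      chunk: int = CHUNK) -> None:
    """Issue one BLKDISCARD ioctl per chunk. This DESTROYS data in the range."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset, size in split_range(start, length, chunk):
            # The ioctl argument is a uint64_t[2]: {offset, length} in bytes.
            fcntl.ioctl(fd, BLKDISCARD, struct.pack("QQ", offset, size))
    finally:
        os.close(fd)
```

Only split_range() is safe to experiment with; discard_in_chunks()
should be pointed at a scratch device only.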
Not exactly sure how ext4 handles the "fstrim" case, but it seems to
group trim requests in smaller batches, which may explain why the SSD
returns status before the 30s timeout of the HBA.
https://github.com/torvalds/linux/blob/92f20ff72066d8d7e2ffb655c2236259ac9d1c5d/fs/ext4/mballoc.c#L6467
Can we work around the Btrfs fstrim issue, for example by splitting up
fstrim requests in "discard_max_bytes" sized chunks?
Thanks,
Forza
* Re: FSTRIM timeout/errors on WD RED SA500 NAS SSD
2022-06-24 15:23 FSTRIM timeout/errors on WD RED SA500 NAS SSD Forza
@ 2022-06-24 18:37 ` Roman Mamedov
2022-06-24 22:45 ` Forza
2022-06-25 6:44 ` Qu Wenruo
1 sibling, 1 reply; 6+ messages in thread
From: Roman Mamedov @ 2022-06-24 18:37 UTC (permalink / raw)
To: Forza; +Cc: linux-btrfs
On Fri, 24 Jun 2022 17:23:27 +0200
Forza <forza@tnonline.net> wrote:
> Can we work around the Btrfs fstrim issue, for example by splitting up
> fstrim requests in "discard_max_bytes" sized chunks?
If I'm not mistaken, those discard_max_* are honoured automatically by a lower
level than the submitting filesystem (i.e. the block layer).
It seems like the linux-ide list (or is it linux-scsi for SAS?) could be better
suited to hunt down this issue. Especially if aside from Btrfs even just a
simple blkdiscard also shows trouble.
Btw, did you try lowering discard_max_bytes in sysfs, and then retrying
blkdiscard?
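A small sketch of how those sysfs limits could be read and lowered
from a script. The helper names and the sysfs_root parameter are
illustrative assumptions; only the three attribute names come from the
report above. Writing discard_max_bytes needs root, and the kernel
will not accept a value above discard_max_hw_bytes.

```python
from pathlib import Path

NAMES = ("discard_granularity", "discard_max_bytes", "discard_max_hw_bytes")

def read_discard_limits(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the discard-related queue limits for a block device, in bytes."""
    queue = Path(sysfs_root) / dev / "queue"
    return {name: int((queue / name).read_text().strip()) for name in NAMES}

def set_discard_max_bytes(dev: str, value: int,
                          sysfs_root: str = "/sys/block") -> None:
    """Write a new discard_max_bytes value (requires root on a real system)."""
    path = Path(sysfs_root) / dev / "queue" / "discard_max_bytes"
    path.write_text(f"{value}\n")
```

For example, read_discard_limits("sdf") would return the three values
listed in the original report, and set_discard_max_bytes("sdf", 1 << 20)
would lower the software limit to 1MiB before retrying blkdiscard.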
--
With respect,
Roman
* Re: FSTRIM timeout/errors on WD RED SA500 NAS SSD
2022-06-24 18:37 ` Roman Mamedov
@ 2022-06-24 22:45 ` Forza
2022-06-24 22:55 ` Roman Mamedov
0 siblings, 1 reply; 6+ messages in thread
From: Forza @ 2022-06-24 22:45 UTC (permalink / raw)
To: linux-btrfs; +Cc: Roman Mamedov
On 6/24/22 20:37, Roman Mamedov wrote:
> On Fri, 24 Jun 2022 17:23:27 +0200
> Forza <forza@tnonline.net> wrote:
>
>> Can we work around the Btrfs fstrim issue, for example by splitting up
>> fstrim requests in "discard_max_bytes" sized chunks?
>
> If I'm not mistaken, those discard_max_* are honoured automatically by a lower
> level than the submitting filesystem (i.e. the block layer).
>
> It seems like the linux-ide list (or is it linux-scsi for SAS?) could be better
> suited to hunt down this issue. Especially if aside from Btrfs even just a
> simple blkdiscard also shows trouble.
>
> Btw, did you try lowering discard_max_bytes in sysfs, and then retrying
> blkdiscard?
>
I have tried lowering the discard_max_bytes, but it did not help - on
the contrary it takes much longer to do the blkdiscard /dev/sdf and does
not solve the problem.
However, since btrfs-progs does split discard ranges into smaller chunks,
and ext4 seems to handle this as well, I think it is worth looking
into handling this in Btrfs too.
The mpt3sas[1] driver seems to have a lot of hardcoded 30 second
timeouts[2], and fixing that might be a much bigger task. I will bring
this up on the linux-scsi mailing list to see if they have any suggestions.
Thanks,
Forza
[1]
https://github.com/torvalds/linux/blob/master/drivers/scsi/mpt3sas/mpt3sas_scsih.c
[2]
https://github.com/torvalds/linux/blob/6a0a17e6c6d1091ada18d43afd87fb26a82a9823/drivers/scsi/mpt3sas/mpt3sas_scsih.c#L3303-L3306
* Re: FSTRIM timeout/errors on WD RED SA500 NAS SSD
2022-06-24 22:45 ` Forza
@ 2022-06-24 22:55 ` Roman Mamedov
0 siblings, 0 replies; 6+ messages in thread
From: Roman Mamedov @ 2022-06-24 22:55 UTC (permalink / raw)
To: Forza; +Cc: linux-btrfs
On Sat, 25 Jun 2022 00:45:55 +0200
Forza <forza@tnonline.net> wrote:
> I have tried lowering the discard_max_bytes, but it did not help - on
> the contrary it takes much longer to do the blkdiscard /dev/sdf and does
> not solve the problem.
It should be possible to verify exactly what size of discard requests is
being submitted to the device, using https://linux.die.net/man/8/blktrace
In no event should requests larger than discard_max_bytes be seen.
If lowering it does not help, perhaps after a sustained stream of those,
some individual requests, even small, start to take longer than 30s?
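A rough sketch of such a check, assuming blkparse's default text
output (device, CPU, sequence, timestamp, PID, action, RWBS, then
"sector + count" in 512-byte sectors). The function name is made up;
a different blkparse format string would need a different pattern.

```python
import re

SECTOR = 512  # blktrace reports offsets and sizes in 512-byte sectors

# dev  cpu  seq  time  pid  action  rwbs  sector + nsectors
LINE = re.compile(
    r"^\s*(\d+,\d+)\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+(\w+)\s+(\w+)"
    r"\s+(\d+)\s+\+\s+(\d+)"
)

def oversized_discards(lines, max_bytes):
    """Return (sector, size_in_bytes) for issued discards above max_bytes."""
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        _dev, action, rwbs, sector, nsect = m.groups()
        # Action 'D' = issued to the driver; 'D' in RWBS = discard request.
        if action == "D" and "D" in rwbs:
            size = int(nsect) * SECTOR
            if size > max_bytes:
                hits.append((int(sector), size))
    return hits
```

Feeding it the lines of a blkparse dump together with the device's
discard_max_bytes should return an empty list if the block layer is
splitting requests as expected.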
> However, since btrfs-progs do split discard ranges into smaller chunks,
> and that ext4 seems to handle this as well, I think it is worth looking
> into handling.
That's not something to be handled on the FS side, even if one of them
happens to work, by luck. I suggest focusing on diagnosing with blkdiscard
only, and proceeding to FSes only after that has been made to work reliably.
--
With respect,
Roman
* Re: FSTRIM timeout/errors on WD RED SA500 NAS SSD
2022-06-24 15:23 FSTRIM timeout/errors on WD RED SA500 NAS SSD Forza
2022-06-24 18:37 ` Roman Mamedov
@ 2022-06-25 6:44 ` Qu Wenruo
2022-06-25 10:34 ` Forza
1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2022-06-25 6:44 UTC (permalink / raw)
To: Forza, linux-btrfs
On 2022/6/24 23:23, Forza wrote:
> Hi,
>
> I have discovered an odd issue where "fstrim" on a Btrfs filesystem
> consistently fails, while "mkfs.btrfs" always succeeds with full device
> discard.
>
> Hardware:
> * SuperMicro server
> * LSI/Broadcom HBA 9500-8i SAS/SATA controller
> * WD RED SA500 NAS SATA SSD 2TB (WDS200T1R0A-68A4W0)
> Drive FW: 411000WR
> * Alpine Linux kernel 5.15.48
>
> * /sys/block/sdf/queue/
> discard_granularity:512
> discard_max_bytes:134217216
> discard_max_hw_bytes:134217216
Weird, it's 128M - 512, not sure if this is related.
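The arithmetic can be checked in a couple of lines (illustrative only;
the 512-byte sector size is an assumption taken from the reported
discard_granularity):

```python
MiB = 1 << 20
discard_max_bytes = 134217216   # value reported in sysfs above

# 128 MiB minus one 512-byte sector
assert discard_max_bytes == 128 * MiB - 512

sectors = discard_max_bytes // 512
print(sectors, hex(sectors))
```

So the limit is exactly one sector short of 128MiB, i.e. a whole
number of 512-byte sectors.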
>
> # btrfs fi us -T /mnt/nas_ssd/
> Overall:
> Device size: 7.13TiB
> Device allocated: 90.06GiB
> Device unallocated: 7.04TiB
Btrfs will definitely try to submit a large bio to discard all
that unallocated space.
Considering EXT4 has its block group headers taking up space, I guess it
will not submit a large enough discard bio to trigger the problem.
> Device missing: 0.00B
> Used: 86.73GiB
> Free (estimated): 3.52TiB (min: 3.52TiB)
> Free (statfs, df): 3.52TiB
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 94.58MiB (used: 0.00B)
> Multiple profiles: no
>
> Data Metadata System
> Id Path RAID1 RAID1 RAID1 Unallocated
> -- --------- -------- --------- -------- -----------
> 9 /dev/sdc1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
> 10 /dev/sdf1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
> 11 /dev/sdd1 - - - 894.25GiB
> 12 /dev/sde1 - - - 894.25GiB
> 13 /dev/sdg1 - - - 1.82TiB
> 14 /dev/sdh1 - - - 1.82TiB
> -- --------- -------- --------- -------- -----------
> Total 44.00GiB 1.00GiB 32.00MiB 8.94TiB
> Used 43.23GiB 133.44MiB 16.00KiB
>
>
>
>
> The root cause I believe is that the WD drives take 1.5-2.5 minutes to
> do a full device discard. The Kingston DC500 drives only take 6-7
> seconds for the same. I have 4 identical WD drives and 2 Kingston
> drives. All WD drives have the same issue.
>
> When issuing 'fstrim -v /mnt/btrfs' I get the following message in dmesg
> after about 30 seconds:
>
> # time fstrim -v /mnt/nas_ssd/
> /mnt/nas_ssd/: 6.2 TiB (6834839748608 bytes) trimmed
>
> real 4m21.356s
> user 0m0.001s
> sys 0m0.241s
>
> [ +0.000003] scsi target6:0:4: handle(0x0029),
> sas_address(0x5003048020db4543), phy(3)
> [ +0.000003] scsi target6:0:4: enclosure logical
> id(0x5003048020db457f), slot(3)
> [ +0.000003] scsi target6:0:4: enclosure level(0x0000), connector name(
> C0.1)
> [ +0.000003] sd 6:0:4:0: No reference found at driver, assuming
> scmd(0x00000000eb0d9438) might have completed
> [ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x00000000eb0d9438)
> [ +0.000012] sd 6:0:4:0: attempting task
> abort!scmd(0x0000000075f63919), outstanding for 30397 ms & timeout 30000 ms
> [ +0.000003] sd 6:0:4:0: [sdg] tag#2762 CDB: opcode=0x42 42 00 00 00 00
> 00 00 00 18 00
> [ +0.000002] scsi target6:0:4: handle(0x0029),
> sas_address(0x5003048020db4543), phy(3)
> [ +0.000004] scsi target6:0:4: enclosure logical
> id(0x5003048020db457f), slot(3)
> [ +0.000002] scsi target6:0:4: enclosure level(0x0000), connector name(
> C0.1)
> [ +0.000003] sd 6:0:4:0: No reference found at driver, assuming
> scmd(0x0000000075f63919) might have completed
> [ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x0000000075f63919)
> [ +0.255021] sd 6:0:4:0: Power-on or device reset occurred
Just want to make sure it's not btrfs screwing things up. Would you mind
using blktrace to trace the bios submitted, so we can make sure btrfs is
doing its work correctly?
Thanks,
Qu
>
>
>
>
> An interesting observation is that "fstrim" works on the same device if
> it is mounted as ext4. There are no errors in dmesg.
>
> To sum up:
>
> Works:
> * "mkfs.btrfs"
> * "btrfs replace"
> * "btrfs device add"
> * "fstrim" on ext4 mounted device.
>
> Does not work:
> * "fstrim" on Btrfs mounted device
> * "blkdiscard" on /dev/sdX
>
> The btrfs-progs code seems to do 'BLKDISCARD' on 1GiB chunks. This may
> explain why "mkfs.btrfs" and "btrfs replace" and "btrfs device add"
> work, while "fstrim" and "blkdiscard" tools do not.
> https://github.com/kdave/btrfs-progs/blob/c0ad9bde429196db7e8710ea1abfab7a2bca2e43/common/device-utils.c#L79
>
>
> Not exactly sure how ext4 handles the "fstrim" case, but it seems to
> group trim requests in smaller batches, which may explain why the SSD
> returns status before the 30s timeout of the HBA.
> https://github.com/torvalds/linux/blob/92f20ff72066d8d7e2ffb655c2236259ac9d1c5d/fs/ext4/mballoc.c#L6467
>
>
> Can we work around the Btrfs fstrim issue, for example by splitting up
> fstrim requests in "discard_max_bytes" sized chunks?
>
> Thanks,
> Forza
* Re: FSTRIM timeout/errors on WD RED SA500 NAS SSD
2022-06-25 6:44 ` Qu Wenruo
@ 2022-06-25 10:34 ` Forza
0 siblings, 0 replies; 6+ messages in thread
From: Forza @ 2022-06-25 10:34 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 6/25/22 08:44, Qu Wenruo wrote:
>
>
> On 2022/6/24 23:23, Forza wrote:
>> Hi,
>>
>> I have discovered an odd issue where "fstrim" on a Btrfs filesystem
>> consistently fails, while "mkfs.btrfs" always succeeds with full device
>> discard.
>>
>> Hardware:
>> * SuperMicro server
>> * LSI/Broadcom HBA 9500-8i SAS/SATA controller
>> * WD RED SA500 NAS SATA SSD 2TB (WDS200T1R0A-68A4W0)
>> Drive FW: 411000WR
>> * Alpine Linux kernel 5.15.48
>>
>> * /sys/block/sdf/queue/
>> discard_granularity:512
>> discard_max_bytes:134217216
>> discard_max_hw_bytes:134217216
>
> Weird, it's 128M - 512, not sure if this is related.
The Kingston SEDC500R960G SSD has the same values and there is no issue
with that one.
>
>>
>> # btrfs fi us -T /mnt/nas_ssd/
>> Overall:
>> Device size: 7.13TiB
>> Device allocated: 90.06GiB
>> Device unallocated: 7.04TiB
>
> Btrfs will definitely try to submit a large bio to do discard on all
> those unallocated space.
>
> Considering EXT4 has its block group headers taking up space, I guess it
> will not submit large enough discard bio to trigger the problem.
>
>> Device missing: 0.00B
>> Used: 86.73GiB
>> Free (estimated): 3.52TiB (min: 3.52TiB)
>> Free (statfs, df): 3.52TiB
>> Data ratio: 2.00
>> Metadata ratio: 2.00
>> Global reserve: 94.58MiB (used: 0.00B)
>> Multiple profiles: no
>>
>> Data Metadata System
>> Id Path RAID1 RAID1 RAID1 Unallocated
>> -- --------- -------- --------- -------- -----------
>> 9 /dev/sdc1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
>> 10 /dev/sdf1 44.00GiB 1.00GiB 32.00MiB 1.77TiB
>> 11 /dev/sdd1 - - - 894.25GiB
>> 12 /dev/sde1 - - - 894.25GiB
>> 13 /dev/sdg1 - - - 1.82TiB
>> 14 /dev/sdh1 - - - 1.82TiB
>> -- --------- -------- --------- -------- -----------
>> Total 44.00GiB 1.00GiB 32.00MiB 8.94TiB
>> Used 43.23GiB 133.44MiB 16.00KiB
>>
>>
>>
>>
>> The root cause I believe is that the WD drives take 1.5-2.5 minutes to
>> do a full device discard. The Kingston DC500 drives only take 6-7
>> seconds for the same. I have 4 identical WD drives and 2 Kingston
>> drives. All WD drives have the same issue.
>>
>> When issuing 'fstrim -v /mnt/btrfs' I get the following message in dmesg
>> after about 30 seconds:
>>
>> # time fstrim -v /mnt/nas_ssd/
>> /mnt/nas_ssd/: 6.2 TiB (6834839748608 bytes) trimmed
>>
>> real 4m21.356s
>> user 0m0.001s
>> sys 0m0.241s
>>
>> [ +0.000003] scsi target6:0:4: handle(0x0029),
>> sas_address(0x5003048020db4543), phy(3)
>> [ +0.000003] scsi target6:0:4: enclosure logical
>> id(0x5003048020db457f), slot(3)
>> [ +0.000003] scsi target6:0:4: enclosure level(0x0000), connector name(
>> C0.1)
>> [ +0.000003] sd 6:0:4:0: No reference found at driver, assuming
>> scmd(0x00000000eb0d9438) might have completed
>> [ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x00000000eb0d9438)
>> [ +0.000012] sd 6:0:4:0: attempting task
>> abort!scmd(0x0000000075f63919), outstanding for 30397 ms & timeout
>> 30000 ms
>> [ +0.000003] sd 6:0:4:0: [sdg] tag#2762 CDB: opcode=0x42 42 00 00 00 00
>> 00 00 00 18 00
>> [ +0.000002] scsi target6:0:4: handle(0x0029),
>> sas_address(0x5003048020db4543), phy(3)
>> [ +0.000004] scsi target6:0:4: enclosure logical
>> id(0x5003048020db457f), slot(3)
>> [ +0.000002] scsi target6:0:4: enclosure level(0x0000), connector name(
>> C0.1)
>> [ +0.000003] sd 6:0:4:0: No reference found at driver, assuming
>> scmd(0x0000000075f63919) might have completed
>> [ +0.000003] sd 6:0:4:0: task abort: SUCCESS scmd(0x0000000075f63919)
>> [ +0.255021] sd 6:0:4:0: Power-on or device reset occurred
>
> Just want to make sure it's not btrfs screwing up things, mind to use
> blktrace to trace the bio submitted so we can make sure btrfs is doing
> its work correct?
Sure. Not an expert in making dev builds on Alpine but I think I got it
correct. Output is too large to be attached. See links to download the
trace files.
1) running blkdiscard /dev/sdf1
https://paste.tnonline.net/files/MYymVOfP4ROQ_blktrace.tar.xz
2) running mkfs.btrfs /dev/sdf1
https://paste.tnonline.net/files/5UUvxYlQD46Q_blktrace_mkfs.tar.xz
3) running fstrim on mounted btrfs
https://paste.tnonline.net/files/gultxaa7OP8P_blktrace_fstrim.tar.xz
https://paste.tnonline.net/files/8Vy7D0uR2Mk5_dmesg_fstrim.txt
Thanks,
Forza
>
> Thanks,
> Qu
>>
>>
>>
>>
>> An interesting observation is that "fstrim" works on the same device if
>> it is mounted as ext4. There are no errors in dmesg.
>>
>> To sum up:
>>
>> Works:
>> * "mkfs.btrfs"
>> * "btrfs replace"
>> * "btrfs device add"
>> * "fstrim" on ext4 mounted device.
>>
>> Does not work:
>> * "fstrim" on Btrfs mounted device
>> * "blkdiscard" on /dev/sdX
>>
>> The btrfs-progs code seems to do 'BLKDISCARD' on 1GiB chunks. This may
>> explain why "mkfs.btrfs" and "btrfs replace" and "btrfs device add"
>> work, while "fstrim" and "blkdiscard" tools do not.
>> https://github.com/kdave/btrfs-progs/blob/c0ad9bde429196db7e8710ea1abfab7a2bca2e43/common/device-utils.c#L79
>>
>>
>>
>> Not exactly sure how ext4 handles the "fstrim" case, but it seems to
>> group trim requests in smaller batches, which may explain why the SSD
>> returns status before the 30s timeout of the HBA.
>> https://github.com/torvalds/linux/blob/92f20ff72066d8d7e2ffb655c2236259ac9d1c5d/fs/ext4/mballoc.c#L6467
>>
>>
>>
>> Can we work around the Btrfs fstrim issue, for example by splitting up
>> fstrim requests in "discard_max_bytes" sized chunks?
>>
>> Thanks,
>> Forza