* [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
@ 2024-07-06 14:30 Mateusz Jończyk
2024-07-07 19:50 ` Mateusz Jończyk
2024-07-08 1:54 ` Yu Kuai
0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-06 14:30 UTC (permalink / raw)
To: linux-raid, linux-kernel
Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni,
Mateusz Jończyk
Hello,
Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
device has the write-mostly flag set. Linux 6.8.0 works fine, as does
6.1.96.
#regzbot introduced: v6.8.0..v6.9.0
In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA
SSD drives: /dev/md0 for /boot, /dev/md1 for remaining data. For
performance, I have marked the RAID component devices on the SATA SSD
drive write-mostly, which "means that the 'md' driver will avoid reading
from these devices if at all possible".
Recently, the NVMe drive started failing, so I removed it from the arrays:
$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb5[1](W)
471727104 blocks super 1.2 [2/1] [_U]
bitmap: 4/4 pages [16KB], 65536KB chunk
md0 : active raid1 sdb4[1](W)
2094080 blocks super 1.2 [2/1] [_U]
unused devices: <none>
and wiped it. Since then, Linux 6.9+ fails to assemble the arrays on startup
with the following stacktraces in dmesg:
md/raid1:md0: active with 1 out of 2 mirrors
md0: detected capacity change from 0 to 4188160
------------[ cut here ]------------
kernel BUG at block/bio.c:1659!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 174 Comm: mdadm Not tainted 6.10.0-rc6unif33 #493
Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.72 05/31/2024
RIP: 0010:bio_split+0x96/0xb0
Code: df ff ff 41 f6 45 14 80 74 08 66 41 81 4c 24 14 80 00 5b 4c 89 e0 41 5c 41 5d 5d c3 cc cc cc cc 41 c7 45 28 00 00 00 00 eb d9 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb dd 66 66 2e 0f 1f 84 00 00 00 00 00
RSP: 0018:ffffa7588041b330 EFLAGS: 00010246
RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffff9f22cb08f938
RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9f22c1199400
RBP: ffffa7588041b420 R08: ffff9f22c3587b30 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000008 R12: ffff9f22cc9da700
R13: ffff9f22cb08f800 R14: ffff9f22c6a35fa0 R15: ffff9f22c1846800
FS: 00007f5f88404740(0000) GS:ffff9f2621e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056299cb95000 CR3: 000000010c82a002 CR4: 00000000003706f0
Call Trace:
<TASK>
? show_regs+0x67/0x70
? __die_body+0x20/0x70
? die+0x3e/0x60
? do_trap+0xd6/0xf0
? do_error_trap+0x71/0x90
? bio_split+0x96/0xb0
? exc_invalid_op+0x53/0x70
? bio_split+0x96/0xb0
? asm_exc_invalid_op+0x1b/0x20
? bio_split+0x96/0xb0
? raid1_read_request+0x890/0xd20
? __call_rcu_common.constprop.0+0x97/0x260
raid1_make_request+0x81/0xce0
? __get_random_u32_below+0x17/0x70 // is not present in other stacktraces
? new_slab+0x2b3/0x580 // is not present in other stacktraces
md_handle_request+0x77/0x210
md_submit_bio+0x62/0xa0
__submit_bio+0x17b/0x230
submit_bio_noacct_nocheck+0x18e/0x3c0
submit_bio_noacct+0x244/0x670
submit_bio+0xac/0xe0
submit_bh_wbc+0x168/0x190
block_read_full_folio+0x203/0x420
? __mod_memcg_lruvec_state+0xcd/0x210
? __pfx_blkdev_get_block+0x10/0x10
? __lruvec_stat_mod_folio+0x63/0xb0
? __filemap_add_folio+0x24d/0x450
? __pfx_blkdev_read_folio+0x10/0x10
blkdev_read_folio+0x18/0x20
filemap_read_folio+0x45/0x290
? __pfx_workingset_update_node+0x10/0x10
? folio_add_lru+0x5a/0x80
? filemap_add_folio+0xba/0xe0
? __pfx_blkdev_read_folio+0x10/0x10
do_read_cache_folio+0x10a/0x3c0
read_cache_folio+0x12/0x20
read_part_sector+0x36/0xc0
read_lba+0x96/0x1b0
find_valid_gpt+0xe8/0x770
? get_page_from_freelist+0x615/0x12e0
? __pfx_efi_partition+0x10/0x10
efi_partition+0x80/0x4e0
? vsnprintf+0x297/0x4f0
? snprintf+0x49/0x70
? __pfx_efi_partition+0x10/0x10
bdev_disk_changed+0x270/0x760
blkdev_get_whole+0x8b/0xb0
bdev_open+0x2bd/0x390
? __pfx_blkdev_open+0x10/0x10
blkdev_open+0x8f/0xc0
do_dentry_open+0x174/0x570
vfs_open+0x2b/0x40
path_openat+0xb20/0x1150
do_filp_open+0xa8/0x120
? alloc_fd+0xc2/0x180
do_sys_openat2+0x250/0x2a0
do_sys_open+0x46/0x80
__x64_sys_openat+0x20/0x30
x64_sys_call+0xe55/0x20d0
do_syscall_64+0x47/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5f88514f5b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
RSP: 002b:00007ffd8839cbe0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007ffd8839dbe0 RCX: 00007f5f88514f5b
RDX: 0000000000004000 RSI: 00007ffd8839cc70 RDI: 00000000ffffff9c
RBP: 00007ffd8839cc70 R08: 0000000000000000 R09: 00007ffd8839cae0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000004000
R13: 0000000000004000 R14: 00007ffd8839cc68 R15: 000055942d9dabe0
</TASK>
Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 drm_buddy r8169 i2c_algo_bit psmouse i2c_i801 drm_display_helper i2c_mux video i2c_smbus
xhci_pci realtek cec xhci_pci_renesas i2c_hid_acpi i2c_hid hid wmi aesni_intel crypto_simd cryptd
---[ end trace 0000000000000000 ]---
which were logged twice (for two arrays).
The line
kernel BUG at block/bio.c:1659!
corresponds to
BUG_ON(sectors <= 0);
in bio_split().
After some investigation, I have determined that the bug is most likely in
choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
before returning early. A test patch (below) seems to fix this issue (Linux
boots and appears to be working correctly with it, but I didn't do any more
advanced experiments yet).
This points to
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
as the most likely culprit. However, I was running into other bugs in mdadm when
trying to test this commit directly.
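The suspected failure mode (an output parameter left unwritten on an early return) can be sketched outside the kernel. The following is a minimal userspace model, not the code in drivers/md/raid1.c: struct model_request, pick_slow_buggy() and pick_slow_fixed() are hypothetical names, and the caller presetting the output to 0 is an assumption made only so that the unwritten value is observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a read request; not the kernel's struct r1bio. */
struct model_request {
	int sectors;   /* sectors the caller wants to read */
	int readable;  /* sectors readable from the chosen device */
};

/*
 * Buggy shape: when the whole range is readable, return the chosen disk
 * without writing *max_sectors, so the caller keeps whatever value it had.
 */
static int pick_slow_buggy(const struct model_request *req, int *max_sectors)
{
	assert(req != NULL && max_sectors != NULL);

	if (req->readable == req->sectors)
		return 0;                 /* early return, output left unset */

	*max_sectors = req->readable;
	return req->readable > 0 ? 0 : -1;
}

/* Fixed shape: every successful return path writes *max_sectors first. */
static int pick_slow_fixed(const struct model_request *req, int *max_sectors)
{
	assert(req != NULL && max_sectors != NULL);

	if (req->readable == req->sectors) {
		*max_sectors = req->readable; /* the one-line fix, in model form */
		return 0;
	}

	*max_sectors = req->readable;
	return req->readable > 0 ? 0 : -1;
}
```

For a fully readable 8-sector request with the output preset to 0, the buggy variant reports success and leaves the output at 0, while the fixed variant sets it to 8.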
Distribution: Ubuntu 20.04, hardware: a HP 17-by0001nw laptop.
Greetings,
Mateusz
---------------------------------------------------
From e19348bc62eea385459ca1df67bd7c7c2afd7538 Mon Sep 17 00:00:00 2001
From: Mateusz Jończyk <mat.jonczyk@o2.pl>
Date: Sat, 6 Jul 2024 11:21:03 +0200
Subject: [RFC PATCH] md/raid1: fill in max_sectors
Not yet fully tested or carefully investigated.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
---
drivers/md/raid1.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
len = r1_bio->sectors;
read_len = raid1_check_read_range(rdev, this_sector, &len);
if (read_len == r1_bio->sectors) {
+ *max_sectors = read_len;
update_read_sectors(conf, disk, this_sector, read_len);
return disk;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
@ 2024-07-07 19:50 ` Mateusz Jończyk
2024-07-08 1:54 ` Yu Kuai
1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-07 19:50 UTC (permalink / raw)
To: linux-raid, linux-kernel
Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni
On 6.07.2024 at 16:30, Mateusz Jończyk wrote:
> Hello,
>
> Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
> device has the write-mostly flag set. Linux 6.8.0 works fine, as does
> 6.1.96.
[snip]
> After some investigation, I have determined that the bug is most likely in
> choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
> before returning early. A test patch (below) seems to fix this issue (Linux
> boots and appears to be working correctly with it, but I didn't do any more
> advanced experiments yet).
>
> This points to
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> as the most likely culprit. However, I was running into other bugs in mdadm when
> trying to test this commit directly.
>
> Distribution: Ubuntu 20.04, hardware: a HP 17-by0001nw laptop.
I have been testing this patch carefully:
1. I have been reliably getting deadlocks when adding / removing devices
on an array that contains a component with the write-mostly flag set
- while the array was loaded with fsstress. When the array was idle,
no such deadlocks happened. This occurred also on Linux 6.8.0
though, but not on 6.1.97-rc1, so this is likely an independent regression.
2. When adding a device to the array (/dev/sda1), I once got the following warnings in dmesg on patched 6.10-rc6:
[ 8253.337816] md: could not open device unknown-block(8,1).
[ 8253.337832] md: md_import_device returned -16
[ 8253.338152] md: could not open device unknown-block(8,1).
[ 8253.338169] md: md_import_device returned -16
[ 8253.674751] md: recovery of RAID array md2
(/dev/sda1 has device major/minor numbers = 8,1). This may be caused by some interaction with udev, though.
I have also seen this on Linux 6.8.
Additionally, on an unpatched 6.1.97-rc1 (which was handy for testing), I got a deadlock
when removing a bitmap from such an array while it was loaded with fsstress.
I'll file independent reports, but wanted to give a heads-up.
Greetings,
Mateusz
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
2024-07-07 19:50 ` Mateusz Jończyk
@ 2024-07-08 1:54 ` Yu Kuai
2024-07-08 20:09 ` Mateusz Jończyk
1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-08 1:54 UTC (permalink / raw)
To: Mateusz Jończyk, linux-raid, linux-kernel
Cc: regressions, Song Liu, Paul Luse, yukuai (C)
Hi,
On 2024/07/06 22:30, Mateusz Jończyk wrote:
> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>
> Not yet fully tested or carefully investigated.
>
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> ---
> drivers/md/raid1.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> len = r1_bio->sectors;
> read_len = raid1_check_read_range(rdev, this_sector, &len);
> if (read_len == r1_bio->sectors) {
> + *max_sectors = read_len;
> update_read_sectors(conf, disk, this_sector, read_len);
> return disk;
> }
This looks correct, can you give it a test and cook a patch?
Thanks,
Kuai
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
2024-07-08 1:54 ` Yu Kuai
@ 2024-07-08 20:09 ` Mateusz Jończyk
2024-07-09 2:57 ` Yu Kuai
2024-07-09 6:49 ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-08 20:09 UTC (permalink / raw)
To: Yu Kuai, linux-raid, linux-kernel; +Cc: regressions, Song Liu, Paul Luse
On 8.07.2024 at 03:54, Yu Kuai wrote:
> Hi,
>
> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>
>> Not yet fully tested or carefully investigated.
>>
>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>> ---
>> drivers/md/raid1.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>> len = r1_bio->sectors;
>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>> if (read_len == r1_bio->sectors) {
>> + *max_sectors = read_len;
>> update_read_sectors(conf, disk, this_sector, read_len);
>> return disk;
>> }
>
> This looks correct, can you give it a test and cook a patch?
>
> Thanks,
> Kuai
Hello,
Yes, I'm working on it. Patch description is nearly done.
Kernel with this patch works well with normal usage and
fsstress, except when modifying the array, as I have written
in my previous email. Will test some more.
I'm feeling nervous working on such sensitive code as md, though.
I'm not an experienced kernel dev.
Greetings,
Mateusz
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
2024-07-08 20:09 ` Mateusz Jończyk
@ 2024-07-09 2:57 ` Yu Kuai
2024-07-11 20:23 ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
2024-07-09 6:49 ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-09 2:57 UTC (permalink / raw)
To: Mateusz Jończyk, linux-raid, linux-kernel
Cc: regressions, Song Liu, Paul Luse, yukuai (C)
Hi,
On 2024/07/09 4:09, Mateusz Jończyk wrote:
> On 8.07.2024 at 03:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>>
>>> Not yet fully tested or carefully investigated.
>>>
>>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>> ---
>>> drivers/md/raid1.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>> len = r1_bio->sectors;
>>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>>> if (read_len == r1_bio->sectors) {
>>> + *max_sectors = read_len;
>>> update_read_sectors(conf, disk, this_sector, read_len);
>>> return disk;
>>> }
>>
>> This looks correct, can you give it a test and cook a patch?
>>
>> Thanks,
>> Kuai
> Hello,
>
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.
Please run mdadm tests at least. And we may need to add a new test.
https://kernel.googlesource.com/pub/scm/utils/mdadm/mdadm.git
./test --dev=loop
Thanks,
Kuai
>
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
>
> Greetings,
>
> Mateusz
>
> .
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
2024-07-09 2:57 ` Yu Kuai
@ 2024-07-11 20:23 ` Mateusz Jończyk
2024-07-11 21:14 ` Paul E Luse
2024-07-12 1:16 ` Yu Kuai
0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-11 20:23 UTC (permalink / raw)
To: linux-raid, linux-kernel
Cc: Mateusz Jończyk, stable, Song Liu, Yu Kuai, Paul Luse,
Xiao Ni, Mariusz Tkaczyk
Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:
BUG_ON(sectors <= 0);
Call Trace:
? bio_split+0x96/0xb0
? exc_invalid_op+0x53/0x70
? bio_split+0x96/0xb0
? asm_exc_invalid_op+0x1b/0x20
? bio_split+0x96/0xb0
? raid1_read_request+0x890/0xd20
? __call_rcu_common.constprop.0+0x97/0x260
raid1_make_request+0x81/0xce0
? __get_random_u32_below+0x17/0x70
? new_slab+0x2b3/0x580
md_handle_request+0x77/0x210
md_submit_bio+0x62/0xa0
__submit_bio+0x17b/0x230
submit_bio_noacct_nocheck+0x18e/0x3c0
submit_bio_noacct+0x244/0x670
After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request calls bio_split with sectors == 0.
Fix it by filling in this variable.
This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.
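The caller-side effect described above can be modeled in isolation. This is a sketch under stated assumptions, not kernel code: classify_split() and enum split_action are hypothetical names, and the comparison against the bio size is assumed to be how the caller decides whether a split is needed. Only the `sectors <= 0` check is taken from this report.

```c
#include <assert.h>

/* Hypothetical outcome of the caller's split decision. */
enum split_action {
	NO_SPLIT,       /* whole bio can be read in one go */
	SPLIT_OK,       /* split at a positive sector count */
	SPLIT_INVALID   /* split length would trip BUG_ON(sectors <= 0) */
};

/*
 * Model of the decision made after a device has been chosen:
 * bio_sectors is the size of the incoming read, max_sectors is the
 * value reported back through the output parameter.
 */
static enum split_action classify_split(int bio_sectors, int max_sectors)
{
	assert(bio_sectors > 0);

	if (max_sectors >= bio_sectors)
		return NO_SPLIT;
	if (max_sectors <= 0)
		return SPLIT_INVALID;   /* the case hit with max_sectors == 0 */
	return SPLIT_OK;
}
```

In this model, an 8-sector read with max_sectors left at 0 lands in SPLIT_INVALID, whereas max_sectors set to 8 needs no split at all.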
Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
--
Tested on both Linux 6.10 and 6.9.8.
Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).
Notes:
- I was reliably getting deadlocks when adding / removing devices
on such an array - while the array was loaded with fsstress with 20
concurrent processes. When the array was idle or loaded with fsstress
with 8 processes, no such deadlocks happened in my tests.
This occurred also on unpatched Linux 6.8.0 though, but not on
6.1.97-rc1, so this is likely an independent regression (to be
investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
array in similar conditions - this happened on Linux 6.1.97-rc1
also though. fsstress with 8 concurrent processes did cause it only
once during many tests.
- in my testing, there was once a problem with hot adding an
internal bitmap to the array:
mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
mdadm: failed to set internal bitmap.
even though no such reshaping was happening according to /proc/mdstat.
This seems unrelated, though.
---
drivers/md/raid1.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
len = r1_bio->sectors;
read_len = raid1_check_read_range(rdev, this_sector, &len);
if (read_len == r1_bio->sectors) {
+ *max_sectors = read_len;
update_read_sectors(conf, disk, this_sector, read_len);
return disk;
}
base-commit: 256abd8e550ce977b728be79a74e1729438b4948
--
2.25.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
2024-07-11 20:23 ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
@ 2024-07-11 21:14 ` Paul E Luse
2024-07-12 1:16 ` Yu Kuai
1 sibling, 0 replies; 11+ messages in thread
From: Paul E Luse @ 2024-07-11 21:14 UTC (permalink / raw)
To: Mateusz Jończyk
Cc: linux-raid, linux-kernel, stable, Song Liu, Yu Kuai, Xiao Ni,
Mariusz Tkaczyk
On Thu, 11 Jul 2024 22:23:16 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
>
Nice catch and good patch :) Kuai?
-Paul
> BUG_ON(sectors <= 0);
>
> Call Trace:
> ? bio_split+0x96/0xb0
> ? exc_invalid_op+0x53/0x70
> ? bio_split+0x96/0xb0
> ? asm_exc_invalid_op+0x1b/0x20
> ? bio_split+0x96/0xb0
> ? raid1_read_request+0x890/0xd20
> ? __call_rcu_common.constprop.0+0x97/0x260
> raid1_make_request+0x81/0xce0
> ? __get_random_u32_below+0x17/0x70
> ? new_slab+0x2b3/0x580
> md_handle_request+0x77/0x210
> md_submit_bio+0x62/0xa0
> __submit_bio+0x17b/0x230
> submit_bio_noacct_nocheck+0x18e/0x3c0
> submit_bio_noacct+0x244/0x670
>
> After investigation, it turned out that choose_slow_rdev() does not
> set the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
>
> Fix it by filling in this variable.
>
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
>
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
>
> --
>
> Tested on both Linux 6.10 and 6.9.8.
>
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> ./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
>
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
> on such an array - while the array was loaded with fsstress with 20
> concurrent processes. When the array was idle or loaded with
> fsstress with 8 processes, no such deadlocks happened in my tests.
> This occurred also on unpatched Linux 6.8.0 though, but not on
> 6.1.97-rc1, so this is likely an independent regression (to be
> investigated).
> - I was also getting deadlocks when adding / removing the bitmap on
> the array in similar conditions - this happened on Linux 6.1.97-rc1
> also though. fsstress with 8 concurrent processes did cause it only
> once during many tests.
> - in my testing, there was once a problem with hot adding an
> internal bitmap to the array:
> mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> mdadm: failed to set internal bitmap.
> even though no such reshaping was happening according to /proc/mdstat.
> This seems unrelated, though.
> ---
> drivers/md/raid1.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> len = r1_bio->sectors;
> read_len = raid1_check_read_range(rdev, this_sector, &len);
> if (read_len == r1_bio->sectors) {
> + *max_sectors = read_len;
> update_read_sectors(conf, disk, this_sector, read_len);
> return disk;
> }
>
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
2024-07-11 20:23 ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
2024-07-11 21:14 ` Paul E Luse
@ 2024-07-12 1:16 ` Yu Kuai
2024-07-12 15:11 ` Song Liu
2024-07-13 12:40 ` Mateusz Jończyk
1 sibling, 2 replies; 11+ messages in thread
From: Yu Kuai @ 2024-07-12 1:16 UTC (permalink / raw)
To: Mateusz Jończyk, linux-raid, linux-kernel
Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)
Hi,
On 2024/07/12 4:23, Mateusz Jończyk wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
>
> BUG_ON(sectors <= 0);
>
> Call Trace:
> ? bio_split+0x96/0xb0
> ? exc_invalid_op+0x53/0x70
> ? bio_split+0x96/0xb0
> ? asm_exc_invalid_op+0x1b/0x20
> ? bio_split+0x96/0xb0
> ? raid1_read_request+0x890/0xd20
> ? __call_rcu_common.constprop.0+0x97/0x260
> raid1_make_request+0x81/0xce0
> ? __get_random_u32_below+0x17/0x70
> ? new_slab+0x2b3/0x580
> md_handle_request+0x77/0x210
> md_submit_bio+0x62/0xa0
> __submit_bio+0x17b/0x230
> submit_bio_noacct_nocheck+0x18e/0x3c0
> submit_bio_noacct+0x244/0x670
>
> After investigation, it turned out that choose_slow_rdev() does not set
> the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
>
> Fix it by filling in this variable.
>
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
>
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
>
> --
Thanks for the patch!
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
BTW, do you have plans to add a new test to mdadm tests? I'll
pick it up if you don't, just let me know.
Thanks,
Kuai
>
> Tested on both Linux 6.10 and 6.9.8.
>
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> ./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
>
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
> on such an array - while the array was loaded with fsstress with 20
> concurrent processes. When the array was idle or loaded with fsstress
> with 8 processes, no such deadlocks happened in my tests.
> This occurred also on unpatched Linux 6.8.0 though, but not on
> 6.1.97-rc1, so this is likely an independent regression (to be
> investigated).
> - I was also getting deadlocks when adding / removing the bitmap on the
> array in similar conditions - this happened on Linux 6.1.97-rc1
> also though. fsstress with 8 concurrent processes did cause it only
> once during many tests.
> - in my testing, there was once a problem with hot adding an
> internal bitmap to the array:
> mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> mdadm: failed to set internal bitmap.
> even though no such reshaping was happening according to /proc/mdstat.
> This seems unrelated, though.
> ---
> drivers/md/raid1.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> len = r1_bio->sectors;
> read_len = raid1_check_read_range(rdev, this_sector, &len);
> if (read_len == r1_bio->sectors) {
> + *max_sectors = read_len;
> update_read_sectors(conf, disk, this_sector, read_len);
> return disk;
> }
>
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
2024-07-12 1:16 ` Yu Kuai
@ 2024-07-12 15:11 ` Song Liu
2024-07-13 12:40 ` Mateusz Jończyk
1 sibling, 0 replies; 11+ messages in thread
From: Song Liu @ 2024-07-12 15:11 UTC (permalink / raw)
To: Yu Kuai
Cc: Mateusz Jończyk, linux-raid, linux-kernel, stable, Paul Luse,
Xiao Ni, Mariusz Tkaczyk, yukuai (C)
On Fri, Jul 12, 2024 at 9:17 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
[...]
> >
> > After investigation, it turned out that choose_slow_rdev() does not set
> > the value of max_sectors in some cases and because of it,
> > raid1_read_request calls bio_split with sectors == 0.
> >
> > Fix it by filling in this variable.
> >
> > This bug was introduced in
> > commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > but apparently hidden until
> > commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> > shortly thereafter.
> >
> > Cc: stable@vger.kernel.org # 6.9.x+
> > Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> > Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > Cc: Song Liu <song@kernel.org>
> > Cc: Yu Kuai <yukuai3@huawei.com>
> > Cc: Paul Luse <paul.e.luse@linux.intel.com>
> > Cc: Xiao Ni <xni@redhat.com>
> > Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> > Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> >
> > --
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Applied to md-6.11. Thanks!
Song
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
2024-07-12 1:16 ` Yu Kuai
2024-07-12 15:11 ` Song Liu
@ 2024-07-13 12:40 ` Mateusz Jończyk
1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-13 12:40 UTC (permalink / raw)
To: Yu Kuai, linux-raid, linux-kernel
Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)
On 12.07.2024 at 03:16, Yu Kuai wrote:
> Hi,
>
> On 2024/07/12 4:23, Mateusz Jończyk wrote:
>> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
>> when that drive has a write-mostly flag set. During such an attempt,
>> the following assertion in bio_split() is hit:
>>
[snip]
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
> BTW, do you have plans to add a new test to mdadm tests? I'll
> pick it up if you don't, just let me know.
>
> Thanks,
> Kuai
Yes, I'm working on it.
Greetings,
Mateusz
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
2024-07-08 20:09 ` Mateusz Jończyk
2024-07-09 2:57 ` Yu Kuai
@ 2024-07-09 6:49 ` Mariusz Tkaczyk
1 sibling, 0 replies; 11+ messages in thread
From: Mariusz Tkaczyk @ 2024-07-09 6:49 UTC (permalink / raw)
To: Mateusz Jończyk; +Cc: linux-raid
On Mon, 8 Jul 2024 22:09:51 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> > This looks correct, can you give it a test and cook a patch?
> >
> > Thanks,
> > Kuai
> Hello,
>
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.
>
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
>
> Greetings,
>
> Mateusz
>
>
Hi Mateusz,
If there is something I can help with, feel free to ask (even in Polish).
You can reach me at the address I sent this from, or at mariusz.tkaczyk@intel.com
I cannot answer you directly (this is the first problem you have to solve):
The following message to <mat.jonczyk@o2.pl> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 554-'sorry, refused mailfrom because return MX
does not exist'
Please consider using a different mail provider (as far as I know, Gmail works well).
Thanks,
Mariusz
^ permalink raw reply [flat|nested] 11+ messages in thread