[REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag

* [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
@ 2024-07-06 14:30 Mateusz Jończyk
  2024-07-07 19:50 ` Mateusz Jończyk
  2024-07-08  1:54 ` Yu Kuai
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-06 14:30 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni,
	Mateusz Jończyk

Hello,

Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
device has the write-mostly flag set. Linux 6.8.0 works fine, as does
6.1.96.

#regzbot introduced: v6.8.0..v6.9.0

On my laptop, I had two RAID1 arrays on top of NVMe and SATA
SSD drives: /dev/md0 for /boot and /dev/md1 for the remaining data. For
performance, I marked the RAID component devices on the SATA SSD
drive write-mostly, which "means that the 'md' driver will avoid reading
from these devices if at all possible".

Recently, the NVMe drive started failing, so I removed it from the arrays:

    $ cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb5[1](W)
          471727104 blocks super 1.2 [2/1] [_U]
          bitmap: 4/4 pages [16KB], 65536KB chunk

    md0 : active raid1 sdb4[1](W)
          2094080 blocks super 1.2 [2/1] [_U]
         
    unused devices: <none>

and wiped it. Since then, Linux 6.9+ has failed to assemble the arrays on startup,
with the following stack traces in dmesg:

    md/raid1:md0: active with 1 out of 2 mirrors
    md0: detected capacity change from 0 to 4188160
    ------------[ cut here ]------------
    kernel BUG at block/bio.c:1659!
    Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 174 Comm: mdadm Not tainted 6.10.0-rc6unif33 #493
    Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.72 05/31/2024
    RIP: 0010:bio_split+0x96/0xb0
    Code: df ff ff 41 f6 45 14 80 74 08 66 41 81 4c 24 14 80 00 5b 4c 89 e0 41 5c 41 5d 5d c3 cc cc cc cc 41 c7 45 28 00 00 00 00 eb d9 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb dd 66 66 2e 0f 1f 84 00 00 00 00 00
    RSP: 0018:ffffa7588041b330 EFLAGS: 00010246
    RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffff9f22cb08f938
    RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9f22c1199400
    RBP: ffffa7588041b420 R08: ffff9f22c3587b30 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000008 R12: ffff9f22cc9da700
    R13: ffff9f22cb08f800 R14: ffff9f22c6a35fa0 R15: ffff9f22c1846800
    FS:  00007f5f88404740(0000) GS:ffff9f2621e00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000056299cb95000 CR3: 000000010c82a002 CR4: 00000000003706f0
    Call Trace:
     <TASK>
     ? show_regs+0x67/0x70
     ? __die_body+0x20/0x70
     ? die+0x3e/0x60
     ? do_trap+0xd6/0xf0
     ? do_error_trap+0x71/0x90
     ? bio_split+0x96/0xb0
     ? exc_invalid_op+0x53/0x70
     ? bio_split+0x96/0xb0
     ? asm_exc_invalid_op+0x1b/0x20
     ? bio_split+0x96/0xb0
     ? raid1_read_request+0x890/0xd20
     ? __call_rcu_common.constprop.0+0x97/0x260
     raid1_make_request+0x81/0xce0
     ? __get_random_u32_below+0x17/0x70    // is not present in other stacktraces
     ? new_slab+0x2b3/0x580            // is not present in other stacktraces
     md_handle_request+0x77/0x210
     md_submit_bio+0x62/0xa0
     __submit_bio+0x17b/0x230
     submit_bio_noacct_nocheck+0x18e/0x3c0
     submit_bio_noacct+0x244/0x670
     submit_bio+0xac/0xe0
     submit_bh_wbc+0x168/0x190
     block_read_full_folio+0x203/0x420
     ? __mod_memcg_lruvec_state+0xcd/0x210
     ? __pfx_blkdev_get_block+0x10/0x10
     ? __lruvec_stat_mod_folio+0x63/0xb0
     ? __filemap_add_folio+0x24d/0x450
     ? __pfx_blkdev_read_folio+0x10/0x10
     blkdev_read_folio+0x18/0x20
     filemap_read_folio+0x45/0x290
     ? __pfx_workingset_update_node+0x10/0x10
     ? folio_add_lru+0x5a/0x80
     ? filemap_add_folio+0xba/0xe0
     ? __pfx_blkdev_read_folio+0x10/0x10
     do_read_cache_folio+0x10a/0x3c0
     read_cache_folio+0x12/0x20
     read_part_sector+0x36/0xc0
     read_lba+0x96/0x1b0
     find_valid_gpt+0xe8/0x770
     ? get_page_from_freelist+0x615/0x12e0
     ? __pfx_efi_partition+0x10/0x10
     efi_partition+0x80/0x4e0
     ? vsnprintf+0x297/0x4f0
     ? snprintf+0x49/0x70
     ? __pfx_efi_partition+0x10/0x10
     bdev_disk_changed+0x270/0x760
     blkdev_get_whole+0x8b/0xb0
     bdev_open+0x2bd/0x390
     ? __pfx_blkdev_open+0x10/0x10
     blkdev_open+0x8f/0xc0
     do_dentry_open+0x174/0x570
     vfs_open+0x2b/0x40
     path_openat+0xb20/0x1150
     do_filp_open+0xa8/0x120
     ? alloc_fd+0xc2/0x180
     do_sys_openat2+0x250/0x2a0
     do_sys_open+0x46/0x80
     __x64_sys_openat+0x20/0x30
     x64_sys_call+0xe55/0x20d0
     do_syscall_64+0x47/0x110
     entry_SYSCALL_64_after_hwframe+0x76/0x7e
    RIP: 0033:0x7f5f88514f5b
    Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
    RSP: 002b:00007ffd8839cbe0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
    RAX: ffffffffffffffda RBX: 00007ffd8839dbe0 RCX: 00007f5f88514f5b
    RDX: 0000000000004000 RSI: 00007ffd8839cc70 RDI: 00000000ffffff9c
    RBP: 00007ffd8839cc70 R08: 0000000000000000 R09: 00007ffd8839cae0
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000004000 R14: 00007ffd8839cc68 R15: 000055942d9dabe0
     </TASK>
    Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 drm_buddy r8169 i2c_algo_bit psmouse i2c_i801 drm_display_helper i2c_mux video i2c_smbus xhci_pci realtek cec xhci_pci_renesas i2c_hid_acpi i2c_hid hid wmi aesni_intel crypto_simd cryptd
    ---[ end trace 0000000000000000 ]---

This trace was logged twice (once for each array).

The line
    kernel BUG at block/bio.c:1659!
corresponds to
    BUG_ON(sectors <= 0);
in bio_split().

After some investigation, I have determined that the bug is most likely in
choose_slow_rdev() in drivers/md/raid1.c, which does not set max_sectors
before returning early. A test patch (below) seems to fix the issue: Linux
boots and appears to work correctly with it, but I have not run any more
advanced experiments yet.
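
The failure mode described above can be illustrated with a minimal userspace
sketch. This is not kernel code: the struct, every `model_*` name and the
return-value conventions are hypothetical, and `max_sectors` starts at 0 in
the model only so that it reproduces a value that would trip the
`BUG_ON(sectors <= 0)` check quoted above. The point is simply that a
fallback path which returns early without writing its out-parameter leaves
the caller asking for a split of zero sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for an md member device. */
struct model_rdev {
	bool present;		/* false models a removed mirror */
	bool write_mostly;	/* models the write-mostly flag */
};

/* Models the preferred path: write-mostly devices do not qualify. */
static int model_choose_best(const struct model_rdev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].present && !devs[i].write_mostly)
			return (int)i;
	return -1;
}

/*
 * Models the fallback that accepts write-mostly devices. When
 * set_on_early_return is false, the early return leaves *max_sectors
 * untouched, which is the behaviour reported in this thread.
 */
static int model_choose_slow(const struct model_rdev *devs, size_t n,
			     int sectors, int *max_sectors,
			     bool set_on_early_return)
{
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].present)
			continue;
		/* The model has no bad blocks: the whole range is readable. */
		if (set_on_early_return)
			*max_sectors = sectors;
		return (int)i;
	}
	return -1;
}

/*
 * Models the caller. Returns -2 if no device is usable, -1 if no split
 * is needed, otherwise the sector count that would be passed to a split.
 */
int model_read_request(const struct model_rdev *devs, size_t n,
		       int sectors, bool fixed)
{
	int max_sectors = 0;	/* assumption of the model, see lead-in */
	int disk;

	assert(devs != NULL && sectors > 0);

	disk = model_choose_best(devs, n);
	if (disk >= 0)
		max_sectors = sectors;
	else
		disk = model_choose_slow(devs, n, sectors, &max_sectors, fixed);

	if (disk < 0)
		return -2;
	if (max_sectors < sectors)
		return max_sectors;	/* a split of this size is requested */
	return -1;
}
```

With one missing mirror and one write-mostly mirror, calling
`model_read_request()` with `fixed == false` requests a split of 0 sectors,
while `fixed == true` requests no split at all, which is the effect the
one-line test patch below is meant to have.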

This points to
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
as the most likely culprit. However, I ran into other bugs in mdadm when
I tried to test this commit directly.

Distribution: Ubuntu 20.04, hardware: an HP 17-by0001nw laptop.

Greetings,

Mateusz

---------------------------------------------------

From e19348bc62eea385459ca1df67bd7c7c2afd7538 Mon Sep 17 00:00:00 2001
From: Mateusz Jończyk <mat.jonczyk@o2.pl>
Date: Sat, 6 Jul 2024 11:21:03 +0200
Subject: [RFC PATCH] md/raid1: fill in max_sectors

Not yet fully tested or carefully investigated.

Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>

---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}
-- 
2.25.1



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
@ 2024-07-07 19:50 ` Mateusz Jończyk
  2024-07-08  1:54 ` Yu Kuai
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-07 19:50 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni

On 6.07.2024 at 16:30, Mateusz Jończyk wrote:
> Hello,
>
> Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
> device has the write-mostly flag set. Linux 6.8.0 works fine, as does
> 6.1.96.
[snip]
> After some investigation, I have determined that the bug is most likely in
> choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
> before returning early. A test patch (below) seems to fix this issue (Linux
> boots and appears to be working correctly with it, but I didn't do any more
> advanced experiments yet).
>
> This points to
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> as the most likely culprit. However, I was running into other bugs in mdadm when
> trying to test this commit directly.
>
> Distribution: Ubuntu 20.04, hardware: an HP 17-by0001nw laptop.

I have been testing this patch carefully:

1. I have been reliably getting deadlocks when adding / removing devices
on an array that contains a component with the write-mostly flag set,
while the array was loaded with fsstress. When the array was idle,
no such deadlocks happened. This also occurred on Linux 6.8.0,
but not on 6.1.97-rc1, so it is likely an independent regression.

2. When adding a device to the array (/dev/sda1), I once got the following warnings in dmesg on patched 6.10-rc6:

        [ 8253.337816] md: could not open device unknown-block(8,1).
        [ 8253.337832] md: md_import_device returned -16
        [ 8253.338152] md: could not open device unknown-block(8,1).
        [ 8253.338169] md: md_import_device returned -16
        [ 8253.674751] md: recovery of RAID array md2

(/dev/sda1 has device major/minor numbers = 8,1). This may be caused by some interaction with udev, though.
I have also seen this on Linux 6.8.

Additionally, on an unpatched 6.1.97-rc1 (which was handy for testing), I got a deadlock
when removing a bitmap from such an array while it was loaded with fsstress.

I'll file separate reports, but wanted to give a heads-up.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
  2024-07-07 19:50 ` Mateusz Jończyk
@ 2024-07-08  1:54 ` Yu Kuai
  2024-07-08 20:09   ` Mateusz Jończyk
  1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-08  1:54 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: regressions, Song Liu, Paul Luse, yukuai (C)

Hi,

On 2024/07/06 22:30, Mateusz Jończyk wrote:
> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>
> Not yet fully tested or carefully investigated.
>
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>
> ---
>  drivers/md/raid1.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>  		len = r1_bio->sectors;
>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>  		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>  			update_read_sectors(conf, disk, this_sector, read_len);
>  			return disk;
>  		}

This looks correct, can you give it a test and cook a patch?

Thanks,
Kuai


* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08  1:54 ` Yu Kuai
@ 2024-07-08 20:09   ` Mateusz Jończyk
  2024-07-09  2:57     ` Yu Kuai
  2024-07-09  6:49     ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-08 20:09 UTC (permalink / raw)
  To: Yu Kuai, linux-raid, linux-kernel; +Cc: regressions, Song Liu, Paul Luse

On 8.07.2024 at 03:54, Yu Kuai wrote:
> Hi,
>
> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>
>> Not yet fully tested or carefully investigated.
>>
>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>
>> ---
>>  drivers/md/raid1.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>  		len = r1_bio->sectors;
>>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>>  		if (read_len == r1_bio->sectors) {
>> +			*max_sectors = read_len;
>>  			update_read_sectors(conf, disk, this_sector, read_len);
>>  			return disk;
>>  		}
>
> This looks correct, can you give it a test and cook a patch?
>
> Thanks,
> Kuai
Hello,

Yes, I'm working on it. The patch description is nearly done.
A kernel with this patch works well under normal usage and
fsstress, except when modifying the array, as I wrote
in my previous email. I will test some more.

I'm feeling nervous working on such sensitive code as md, though.
I'm not an experienced kernel dev.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08 20:09   ` Mateusz Jończyk
@ 2024-07-09  2:57     ` Yu Kuai
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
  2024-07-09  6:49     ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
  1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-09  2:57 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: regressions, Song Liu, Paul Luse, yukuai (C)

Hi,

On 2024/07/09 4:09, Mateusz Jończyk wrote:
> On 8.07.2024 at 03:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>>
>>> Not yet fully tested or carefully investigated.
>>>
>>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>>
>>> ---
>>>  drivers/md/raid1.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>>  		len = r1_bio->sectors;
>>>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>>>  		if (read_len == r1_bio->sectors) {
>>> +			*max_sectors = read_len;
>>>  			update_read_sectors(conf, disk, this_sector, read_len);
>>>  			return disk;
>>>  		}
>>
>> This looks correct, can you give it a test and cook a patch?
>>
>> Thanks,
>> Kuai
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.

Please run at least the mdadm tests. We may also need to add a new test.

https://kernel.googlesource.com/pub/scm/utils/mdadm/mdadm.git

./test --dev=loop

Thanks,
Kuai

> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz

* [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-09  2:57     ` Yu Kuai
@ 2024-07-11 20:23       ` Mateusz Jończyk
  2024-07-11 21:14         ` Paul E Luse
  2024-07-12  1:16         ` Yu Kuai
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-11 20:23 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: Mateusz Jończyk, stable, Song Liu, Yu Kuai, Paul Luse,
	Xiao Ni, Mariusz Tkaczyk

Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:

	BUG_ON(sectors <= 0);

Call Trace:
	? bio_split+0x96/0xb0
	? exc_invalid_op+0x53/0x70
	? bio_split+0x96/0xb0
	? asm_exc_invalid_op+0x1b/0x20
	? bio_split+0x96/0xb0
	? raid1_read_request+0x890/0xd20
	? __call_rcu_common.constprop.0+0x97/0x260
	raid1_make_request+0x81/0xce0
	? __get_random_u32_below+0x17/0x70
	? new_slab+0x2b3/0x580
	md_handle_request+0x77/0x210
	md_submit_bio+0x62/0xa0
	__submit_bio+0x17b/0x230
	submit_bio_noacct_nocheck+0x18e/0x3c0
	submit_bio_noacct+0x244/0x670

Investigation showed that choose_slow_rdev() does not set the value of
max_sectors in some cases; as a result, raid1_read_request() calls
bio_split() with sectors == 0.

Fix it by filling in this variable.

This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.

Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/

--

Tested on both Linux 6.10 and 6.9.8.

Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
	./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).

Notes:
- I was reliably getting deadlocks when adding / removing devices
  on such an array while it was loaded with fsstress running 20
  concurrent processes. When the array was idle or loaded with fsstress
  running 8 processes, no such deadlocks happened in my tests.
  This also occurred on unpatched Linux 6.8.0, but not on
  6.1.97-rc1, so it is likely an independent regression (to be
  investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array in similar conditions, although this also happened on Linux
  6.1.97-rc1. fsstress with 8 concurrent processes caused it only
  once in many tests.
- In my testing, hot adding an internal bitmap to the array failed once:
	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
	mdadm: failed to set internal bitmap.
  even though no reshaping was happening according to /proc/mdstat.
  This seems unrelated, though.
---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}

base-commit: 256abd8e550ce977b728be79a74e1729438b4948
-- 
2.25.1



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
@ 2024-07-11 21:14         ` Paul E Luse
  2024-07-12  1:16         ` Yu Kuai
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E Luse @ 2024-07-11 21:14 UTC (permalink / raw)
  To: Mateusz Jończyk
  Cc: linux-raid, linux-kernel, stable, Song Liu, Yu Kuai, Xiao Ni,
	Mariusz Tkaczyk

On Thu, 11 Jul 2024 22:23:16 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:

> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
> 

Nice catch and good patch :)  Kuai?

-Paul

> 	BUG_ON(sectors <= 0);
> 
> Call Trace:
> 	? bio_split+0x96/0xb0
> 	? exc_invalid_op+0x53/0x70
> 	? bio_split+0x96/0xb0
> 	? asm_exc_invalid_op+0x1b/0x20
> 	? bio_split+0x96/0xb0
> 	? raid1_read_request+0x890/0xd20
> 	? __call_rcu_common.constprop.0+0x97/0x260
> 	raid1_make_request+0x81/0xce0
> 	? __get_random_u32_below+0x17/0x70
> 	? new_slab+0x2b3/0x580
> 	md_handle_request+0x77/0x210
> 	md_submit_bio+0x62/0xa0
> 	__submit_bio+0x17b/0x230
> 	submit_bio_noacct_nocheck+0x18e/0x3c0
> 	submit_bio_noacct+0x244/0x670
> 
> After investigation, it turned out that choose_slow_rdev() does not
> set the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
> 
> Fix it by filling in this variable.
> 
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
> 
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> 
> --
> 
> Tested on both Linux 6.10 and 6.9.8.
> 
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> 	./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
> 
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
>   on such an array - while the array was loaded with fsstress with 20
>   concurrent processes. When the array was idle or loaded with fsstress
>   with 8 processes, no such deadlocks happened in my tests.
>   This occurred also on unpatched Linux 6.8.0 though, but not on
>   6.1.97-rc1, so this is likely an independent regression (to be
>   investigated).
> - I was also getting deadlocks when adding / removing the bitmap on the
>   array in similar conditions - this happened on Linux 6.1.97-rc1
>   also though. fsstress with 8 concurrent processes did cause it only
>   once during many tests.
> - in my testing, there was once a problem with hot adding an
>   internal bitmap to the array:
> 	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> 	mdadm: failed to set internal bitmap.
>   even though no such reshaping was happening according to /proc/mdstat.
>   This seems unrelated, though.
> ---
>  drivers/md/raid1.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>  		len = r1_bio->sectors;
>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>  		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>  			update_read_sectors(conf, disk, this_sector, read_len);
>  			return disk;
>  		}
> 
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
  2024-07-11 21:14         ` Paul E Luse
@ 2024-07-12  1:16         ` Yu Kuai
  2024-07-12 15:11           ` Song Liu
  2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 2 replies; 11+ messages in thread
From: Yu Kuai @ 2024-07-12  1:16 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)

Hi,

On 2024/07/12 4:23, Mateusz Jończyk wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
> 
> 	BUG_ON(sectors <= 0);
> 
> Call Trace:
> 	? bio_split+0x96/0xb0
> 	? exc_invalid_op+0x53/0x70
> 	? bio_split+0x96/0xb0
> 	? asm_exc_invalid_op+0x1b/0x20
> 	? bio_split+0x96/0xb0
> 	? raid1_read_request+0x890/0xd20
> 	? __call_rcu_common.constprop.0+0x97/0x260
> 	raid1_make_request+0x81/0xce0
> 	? __get_random_u32_below+0x17/0x70
> 	? new_slab+0x2b3/0x580
> 	md_handle_request+0x77/0x210
> 	md_submit_bio+0x62/0xa0
> 	__submit_bio+0x17b/0x230
> 	submit_bio_noacct_nocheck+0x18e/0x3c0
> 	submit_bio_noacct+0x244/0x670
> 
> After investigation, it turned out that choose_slow_rdev() does not set
> the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
> 
> Fix it by filling in this variable.
> 
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
> 
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> 
> --

Thanks for the patch!

Reviewed-by: Yu Kuai <yukuai3@huawei.com>

BTW, do you have plans to add a new test to mdadm tests? I'll
pick it up if you don't, just let me know.

Thanks,
Kuai

> 
> Tested on both Linux 6.10 and 6.9.8.
> 
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> 	./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
> 
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
>    on such an array - while the array was loaded with fsstress with 20
>    concurrent processes. When the array was idle or loaded with fsstress
>    with 8 processes, no such deadlocks happened in my tests.
>    This occurred also on unpatched Linux 6.8.0 though, but not on
>    6.1.97-rc1, so this is likely an independent regression (to be
>    investigated).
> - I was also getting deadlocks when adding / removing the bitmap on the
>    array in similar conditions - this happened on Linux 6.1.97-rc1
>    also though. fsstress with 8 concurrent processes did cause it only
>    once during many tests.
> - in my testing, there was once a problem with hot adding an
>    internal bitmap to the array:
> 	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> 	mdadm: failed to set internal bitmap.
>    even though no such reshaping was happening according to /proc/mdstat.
>    This seems unrelated, though.
> ---
>   drivers/md/raid1.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>   		len = r1_bio->sectors;
>   		read_len = raid1_check_read_range(rdev, this_sector, &len);
>   		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>   			update_read_sectors(conf, disk, this_sector, read_len);
>   			return disk;
>   		}
> 
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948
> 



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-12  1:16         ` Yu Kuai
@ 2024-07-12 15:11           ` Song Liu
  2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 0 replies; 11+ messages in thread
From: Song Liu @ 2024-07-12 15:11 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Mateusz Jończyk, linux-raid, linux-kernel, stable, Paul Luse,
	Xiao Ni, Mariusz Tkaczyk, yukuai (C)

On Fri, Jul 12, 2024 at 9:17 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
[...]
> >
> > After investigation, it turned out that choose_slow_rdev() does not set
> > the value of max_sectors in some cases and because of it,
> > raid1_read_request calls bio_split with sectors == 0.
> >
> > Fix it by filling in this variable.
> >
> > This bug was introduced in
> > commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > but apparently hidden until
> > commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> > shortly thereafter.
> >
> > Cc: stable@vger.kernel.org # 6.9.x+
> > Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> > Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > Cc: Song Liu <song@kernel.org>
> > Cc: Yu Kuai <yukuai3@huawei.com>
> > Cc: Paul Luse <paul.e.luse@linux.intel.com>
> > Cc: Xiao Ni <xni@redhat.com>
> > Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> > Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> >
> > --
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>

Applied to md-6.11. Thanks!

Song


* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-12  1:16         ` Yu Kuai
  2024-07-12 15:11           ` Song Liu
@ 2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-13 12:40 UTC (permalink / raw)
  To: Yu Kuai, linux-raid, linux-kernel
  Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)

On 12.07.2024 at 03:16, Yu Kuai wrote:
> Hi,
>
> On 2024/07/12 4:23, Mateusz Jończyk wrote:
>> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
>> when that drive has a write-mostly flag set. During such an attempt,
>> the following assertion in bio_split() is hit:
>>
[snip]
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
> BTW, do you have plans to add a new test to mdadm tests? I'll
> pick it up if you don't, just let me know.
>
> Thanks,
> Kuai

Yes, I'm working on it.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08 20:09   ` Mateusz Jończyk
  2024-07-09  2:57     ` Yu Kuai
@ 2024-07-09  6:49     ` Mariusz Tkaczyk
  1 sibling, 0 replies; 11+ messages in thread
From: Mariusz Tkaczyk @ 2024-07-09  6:49 UTC (permalink / raw)
  To: Mateusz Jończyk; +Cc: linux-raid

On Mon, 8 Jul 2024 22:09:51 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> > This looks correct, can you give it a test and cook a patch?
> >
> > Thanks,
> > Kuai  
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.
> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz
> 
> 

Hi Mateusz,
If there is something I can help with, feel free to ask (even in Polish).
You can reach me at the address I sent this from or at mariusz.tkaczyk@intel.com

I cannot answer you directly (this is the first problem you have to solve):
The following message to <mat.jonczyk@o2.pl> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 554-'sorry, refused mailfrom because return MX
does not exist'

Please consider using a different mail provider (as far as I know, Gmail works well).

Thanks,
Mariusz

Thread overview: 11+ messages
2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
2024-07-07 19:50 ` Mateusz Jończyk
2024-07-08  1:54 ` Yu Kuai
2024-07-08 20:09   ` Mateusz Jończyk
2024-07-09  2:57     ` Yu Kuai
2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
2024-07-11 21:14         ` Paul E Luse
2024-07-12  1:16         ` Yu Kuai
2024-07-12 15:11           ` Song Liu
2024-07-13 12:40           ` Mateusz Jończyk
2024-07-09  6:49     ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
