[REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag

* [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
@ 2024-07-06 14:30 Mateusz Jończyk
  2024-07-07 19:50 ` Mateusz Jończyk
  2024-07-08  1:54 ` Yu Kuai
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-06 14:30 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni,
	Mateusz Jończyk

Hello,

Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
device has the write-mostly flag set. Linux 6.8.0 works fine, as does
6.1.96.

#regzbot introduced: v6.8.0..v6.9.0

On my laptop, I used to have two RAID1 arrays on top of NVMe and SATA
SSD drives: /dev/md0 for /boot and /dev/md1 for the remaining data. For
performance, I marked the RAID component devices on the SATA SSD
drive write-mostly, which "means that the 'md' driver will avoid reading
from these devices if at all possible".
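
As a rough illustration of that behaviour, the sketch below models read
selection in user space. It is not the md driver's code: `struct model_member`
and `model_pick_read_member()` are hypothetical names, and the policy (prefer
any in-sync member without the flag, fall back to a write-mostly one only when
nothing else is available) is a simplification of the quoted description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one RAID1 member; not a kernel structure. */
struct model_member {
	bool in_sync;		/* member is present and up to date */
	bool write_mostly;	/* member carries the write-mostly flag */
};

/*
 * Return the index of the member a read would go to, or -1 if no member
 * can serve it. Members without the write-mostly flag win; a write-mostly
 * member is used only as a last resort.
 */
static int model_pick_read_member(const struct model_member *members,
				  size_t count)
{
	int fallback = -1;

	assert(members != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (!members[i].in_sync)
			continue;
		if (!members[i].write_mostly)
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}
```

In this model, a healthy two-member array sends reads to the unflagged member.
Once only the write-mostly member is left, every read has to take the fallback
path.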

Recently, the NVMe drive started failing, so I removed it from the arrays:

    $ cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb5[1](W)
          471727104 blocks super 1.2 [2/1] [_U]
          bitmap: 4/4 pages [16KB], 65536KB chunk

    md0 : active raid1 sdb4[1](W)
          2094080 blocks super 1.2 [2/1] [_U]
         
    unused devices: <none>

and wiped it. Since then, Linux 6.9+ fails to assemble the arrays on startup
with the following stacktraces in dmesg:

    md/raid1:md0: active with 1 out of 2 mirrors
    md0: detected capacity change from 0 to 4188160
    ------------[ cut here ]------------
    kernel BUG at block/bio.c:1659!
    Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 174 Comm: mdadm Not tainted 6.10.0-rc6unif33 #493
    Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.72 05/31/2024
    RIP: 0010:bio_split+0x96/0xb0
    Code: df ff ff 41 f6 45 14 80 74 08 66 41 81 4c 24 14 80 00 5b 4c 89 e0 41 5c 41 5d 5d c3 cc cc cc cc 41 c7 45 28 00 00 00 00 eb d9 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb dd 66 66 2e 0f 1f 84 00 00 00 00 00
    RSP: 0018:ffffa7588041b330 EFLAGS: 00010246
    RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffff9f22cb08f938
    RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9f22c1199400
    RBP: ffffa7588041b420 R08: ffff9f22c3587b30 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000008 R12: ffff9f22cc9da700
    R13: ffff9f22cb08f800 R14: ffff9f22c6a35fa0 R15: ffff9f22c1846800
    FS:  00007f5f88404740(0000) GS:ffff9f2621e00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000056299cb95000 CR3: 000000010c82a002 CR4: 00000000003706f0
    Call Trace:
     <TASK>
     ? show_regs+0x67/0x70
     ? __die_body+0x20/0x70
     ? die+0x3e/0x60
     ? do_trap+0xd6/0xf0
     ? do_error_trap+0x71/0x90
     ? bio_split+0x96/0xb0
     ? exc_invalid_op+0x53/0x70
     ? bio_split+0x96/0xb0
     ? asm_exc_invalid_op+0x1b/0x20
     ? bio_split+0x96/0xb0
     ? raid1_read_request+0x890/0xd20
     ? __call_rcu_common.constprop.0+0x97/0x260
     raid1_make_request+0x81/0xce0
     ? __get_random_u32_below+0x17/0x70    // is not present in other stacktraces
     ? new_slab+0x2b3/0x580            // is not present in other stacktraces
     md_handle_request+0x77/0x210
     md_submit_bio+0x62/0xa0
     __submit_bio+0x17b/0x230
     submit_bio_noacct_nocheck+0x18e/0x3c0
     submit_bio_noacct+0x244/0x670
     submit_bio+0xac/0xe0
     submit_bh_wbc+0x168/0x190
     block_read_full_folio+0x203/0x420
     ? __mod_memcg_lruvec_state+0xcd/0x210
     ? __pfx_blkdev_get_block+0x10/0x10
     ? __lruvec_stat_mod_folio+0x63/0xb0
     ? __filemap_add_folio+0x24d/0x450
     ? __pfx_blkdev_read_folio+0x10/0x10
     blkdev_read_folio+0x18/0x20
     filemap_read_folio+0x45/0x290
     ? __pfx_workingset_update_node+0x10/0x10
     ? folio_add_lru+0x5a/0x80
     ? filemap_add_folio+0xba/0xe0
     ? __pfx_blkdev_read_folio+0x10/0x10
     do_read_cache_folio+0x10a/0x3c0
     read_cache_folio+0x12/0x20
     read_part_sector+0x36/0xc0
     read_lba+0x96/0x1b0
     find_valid_gpt+0xe8/0x770
     ? get_page_from_freelist+0x615/0x12e0
     ? __pfx_efi_partition+0x10/0x10
     efi_partition+0x80/0x4e0
     ? vsnprintf+0x297/0x4f0
     ? snprintf+0x49/0x70
     ? __pfx_efi_partition+0x10/0x10
     bdev_disk_changed+0x270/0x760
     blkdev_get_whole+0x8b/0xb0
     bdev_open+0x2bd/0x390
     ? __pfx_blkdev_open+0x10/0x10
     blkdev_open+0x8f/0xc0
     do_dentry_open+0x174/0x570
     vfs_open+0x2b/0x40
     path_openat+0xb20/0x1150
     do_filp_open+0xa8/0x120
     ? alloc_fd+0xc2/0x180
     do_sys_openat2+0x250/0x2a0
     do_sys_open+0x46/0x80
     __x64_sys_openat+0x20/0x30
     x64_sys_call+0xe55/0x20d0
     do_syscall_64+0x47/0x110
     entry_SYSCALL_64_after_hwframe+0x76/0x7e
    RIP: 0033:0x7f5f88514f5b
    Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
    RSP: 002b:00007ffd8839cbe0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
    RAX: ffffffffffffffda RBX: 00007ffd8839dbe0 RCX: 00007f5f88514f5b
    RDX: 0000000000004000 RSI: 00007ffd8839cc70 RDI: 00000000ffffff9c
    RBP: 00007ffd8839cc70 R08: 0000000000000000 R09: 00007ffd8839cae0
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000004000 R14: 00007ffd8839cc68 R15: 000055942d9dabe0
     </TASK>
    Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 drm_buddy r8169 i2c_algo_bit psmouse i2c_i801 drm_display_helper i2c_mux video i2c_smbus xhci_pci realtek cec xhci_pci_renesas i2c_hid_acpi i2c_hid hid wmi aesni_intel crypto_simd cryptd
    ---[ end trace 0000000000000000 ]---

which were logged twice (once for each array).

The line
    kernel BUG at block/bio.c:1659!
corresponds to
    BUG_ON(sectors <= 0);
in bio_split().

After some investigation, I have determined that the bug is most likely in
choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
before returning early. A test patch (below) seems to fix this issue: Linux
boots and appears to work correctly with it, but I have not yet run any more
advanced experiments.

This points to
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
as the most likely culprit. However, I was running into other bugs in mdadm when
trying to test this commit directly.

Distribution: Ubuntu 20.04, hardware: a HP 17-by0001nw laptop.

Greetings,

Mateusz

---------------------------------------------------

From e19348bc62eea385459ca1df67bd7c7c2afd7538 Mon Sep 17 00:00:00 2001
From: Mateusz Jończyk <mat.jonczyk@o2.pl>
Date: Sat, 6 Jul 2024 11:21:03 +0200
Subject: [RFC PATCH] md/raid1: fill in max_sectors

Not yet fully tested or carefully investigated.

Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>

---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}
-- 
2.25.1



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
@ 2024-07-07 19:50 ` Mateusz Jończyk
  2024-07-08  1:54 ` Yu Kuai
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-07 19:50 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: regressions, Song Liu, Yu Kuai, Paul Luse, Xiao Ni

On 6.07.2024 at 16:30, Mateusz Jończyk wrote:
> Hello,
>
> Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
> device has the write-mostly flag set. Linux 6.8.0 works fine, as does
> 6.1.96.
[snip]
> After some investigation, I have determined that the bug is most likely in
> choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
> before returning early. A test patch (below) seems to fix this issue (Linux
> boots and appears to be working correctly with it, but I didn't do any more
> advanced experiments yet).
>
> This points to
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> as the most likely culprit. However, I was running into other bugs in mdadm when
> trying to test this commit directly.
>
> Distribution: Ubuntu 20.04, hardware: a HP 17-by0001nw laptop.

I have been testing this patch carefully:

1. I have been reliably getting deadlocks when adding / removing devices
on an array that contains a component with the write-mostly flag set
- while the array was loaded with fsstress. When the array was idle,
no such deadlocks happened. This occurred also on Linux 6.8.0
though, but not on 6.1.97-rc1, so this is likely an independent regression.

2. When adding a device to the array (/dev/sda1), I once got the following warnings in dmesg on patched 6.10-rc6:

        [ 8253.337816] md: could not open device unknown-block(8,1).
        [ 8253.337832] md: md_import_device returned -16
        [ 8253.338152] md: could not open device unknown-block(8,1).
        [ 8253.338169] md: md_import_device returned -16
        [ 8253.674751] md: recovery of RAID array md2

(/dev/sda1 has device major/minor numbers = 8,1). This may be caused by some interaction with udev, though.
I have also seen this on Linux 6.8.

Additionally, on an unpatched 6.1.97-rc1 (which was handy for testing), I got a deadlock
when removing a bitmap from such an array while it was loaded with fsstress.

I'll file independent reports, but wanted to give a heads-up.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-06 14:30 [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mateusz Jończyk
  2024-07-07 19:50 ` Mateusz Jończyk
@ 2024-07-08  1:54 ` Yu Kuai
  2024-07-08 20:09   ` Mateusz Jończyk
  1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-08  1:54 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: regressions, Song Liu, Paul Luse, yukuai (C)

Hi,

On 2024/07/06 22:30, Mateusz Jończyk wrote:
> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>
> Not yet fully tested or carefully investigated.
>
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>
> ---
>  drivers/md/raid1.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>  		len = r1_bio->sectors;
>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>  		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>  			update_read_sectors(conf, disk, this_sector, read_len);
>  			return disk;
>  		}

This looks correct, can you give it a test and cook a patch?

Thanks,
Kuai


* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08  1:54 ` Yu Kuai
@ 2024-07-08 20:09   ` Mateusz Jończyk
  2024-07-09  2:57     ` Yu Kuai
  2024-07-09  6:49     ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-08 20:09 UTC (permalink / raw)
  To: Yu Kuai, linux-raid, linux-kernel; +Cc: regressions, Song Liu, Paul Luse

On 8.07.2024 at 03:54, Yu Kuai wrote:
> Hi,
>
> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>
>> Not yet fully tested or carefully investigated.
>>
>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>
>> ---
>>  drivers/md/raid1.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>  		len = r1_bio->sectors;
>>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>>  		if (read_len == r1_bio->sectors) {
>> +			*max_sectors = read_len;
>>  			update_read_sectors(conf, disk, this_sector, read_len);
>>  			return disk;
>>  		}
>
> This looks correct, can you give it a test and cook a patch?
>
> Thanks,
> Kuai
Hello,

Yes, I'm working on it. Patch description is nearly done.
Kernel with this patch works well with normal usage and
fsstress, except when modifying the array, as I have written
in my previous email. Will test some more.

I'm feeling nervous working on such sensitive code as md, though.
I'm not an experienced kernel dev.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08 20:09   ` Mateusz Jończyk
@ 2024-07-09  2:57     ` Yu Kuai
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
  2024-07-09  6:49     ` [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag Mariusz Tkaczyk
  1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2024-07-09  2:57 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: regressions, Song Liu, Paul Luse, yukuai (C)

Hi,

On 2024/07/09 4:09, Mateusz Jończyk wrote:
> On 8.07.2024 at 03:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>>
>>> Not yet fully tested or carefully investigated.
>>>
>>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>>
>>> ---
>>>  drivers/md/raid1.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>>  		len = r1_bio->sectors;
>>>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>>>  		if (read_len == r1_bio->sectors) {
>>> +			*max_sectors = read_len;
>>>  			update_read_sectors(conf, disk, this_sector, read_len);
>>>  			return disk;
>>>  		}
>>
>> This looks correct, can you give it a test and cook a patch?
>>
>> Thanks,
>> Kuai
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.

Please run mdadm tests at least. And we may need to add a new test.

https://kernel.googlesource.com/pub/scm/utils/mdadm/mdadm.git

./test --dev=loop

Thanks,
Kuai

> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz
> 
> .
> 



* [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-09  2:57     ` Yu Kuai
@ 2024-07-11 20:23       ` Mateusz Jończyk
  2024-07-11 21:14         ` Paul E Luse
  2024-07-12  1:16         ` Yu Kuai
  0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-11 20:23 UTC (permalink / raw)
  To: linux-raid, linux-kernel
  Cc: Mateusz Jończyk, stable, Song Liu, Yu Kuai, Paul Luse,
	Xiao Ni, Mariusz Tkaczyk

Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:

	BUG_ON(sectors <= 0);

Call Trace:
	? bio_split+0x96/0xb0
	? exc_invalid_op+0x53/0x70
	? bio_split+0x96/0xb0
	? asm_exc_invalid_op+0x1b/0x20
	? bio_split+0x96/0xb0
	? raid1_read_request+0x890/0xd20
	? __call_rcu_common.constprop.0+0x97/0x260
	raid1_make_request+0x81/0xce0
	? __get_random_u32_below+0x17/0x70
	? new_slab+0x2b3/0x580
	md_handle_request+0x77/0x210
	md_submit_bio+0x62/0xa0
	__submit_bio+0x17b/0x230
	submit_bio_noacct_nocheck+0x18e/0x3c0
	submit_bio_noacct+0x244/0x670

After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request calls bio_split with sectors == 0.

Fix it by filling in this variable.

This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.

Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/

--

Tested on both Linux 6.10 and 6.9.8.

Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
	./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).

Notes:
- I was reliably getting deadlocks when adding / removing devices
  on such an array - while the array was loaded with fsstress with 20
  concurrent processes. When the array was idle or loaded with fsstress
  with 8 processes, no such deadlocks happened in my tests.
  This occurred also on unpatched Linux 6.8.0 though, but not on
  6.1.97-rc1, so this is likely an independent regression (to be
  investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array in similar conditions - this happened on Linux 6.1.97-rc1
  also though. fsstress with 8 concurrent processes did cause it only
  once during many tests.
- in my testing, there was once a problem with hot adding an
  internal bitmap to the array:
	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
	mdadm: failed to set internal bitmap.
  even though no such reshaping was happening according to /proc/mdstat.
  This seems unrelated, though.
---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}

base-commit: 256abd8e550ce977b728be79a74e1729438b4948
-- 
2.25.1



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
@ 2024-07-11 21:14         ` Paul E Luse
  2024-07-12  1:16         ` Yu Kuai
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E Luse @ 2024-07-11 21:14 UTC (permalink / raw)
  To: Mateusz Jończyk
  Cc: linux-raid, linux-kernel, stable, Song Liu, Yu Kuai, Xiao Ni,
	Mariusz Tkaczyk

On Thu, 11 Jul 2024 22:23:16 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:

> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
> 

Nice catch and good patch :)  Kuai?

-Paul

> 	BUG_ON(sectors <= 0);
> 
> Call Trace:
> 	? bio_split+0x96/0xb0
> 	? exc_invalid_op+0x53/0x70
> 	? bio_split+0x96/0xb0
> 	? asm_exc_invalid_op+0x1b/0x20
> 	? bio_split+0x96/0xb0
> 	? raid1_read_request+0x890/0xd20
> 	? __call_rcu_common.constprop.0+0x97/0x260
> 	raid1_make_request+0x81/0xce0
> 	? __get_random_u32_below+0x17/0x70
> 	? new_slab+0x2b3/0x580
> 	md_handle_request+0x77/0x210
> 	md_submit_bio+0x62/0xa0
> 	__submit_bio+0x17b/0x230
> 	submit_bio_noacct_nocheck+0x18e/0x3c0
> 	submit_bio_noacct+0x244/0x670
> 
> After investigation, it turned out that choose_slow_rdev() does not
> set the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
> 
> Fix it by filling in this variable.
> 
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
> 
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> 
> --
> 
> Tested on both Linux 6.10 and 6.9.8.
> 
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> 	./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
> 
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
>   on such an array - while the array was loaded with fsstress with 20
>   concurrent processes. When the array was idle or loaded with fsstress
>   with 8 processes, no such deadlocks happened in my tests.
>   This occurred also on unpatched Linux 6.8.0 though, but not on
>   6.1.97-rc1, so this is likely an independent regression (to be
>   investigated).
> - I was also getting deadlocks when adding / removing the bitmap on the
>   array in similar conditions - this happened on Linux 6.1.97-rc1
>   also though. fsstress with 8 concurrent processes did cause it only
>   once during many tests.
> - in my testing, there was once a problem with hot adding an
>   internal bitmap to the array:
> 	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> 	mdadm: failed to set internal bitmap.
>   even though no such reshaping was happening according to /proc/mdstat.
>   This seems unrelated, though.
> ---
>  drivers/md/raid1.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>  		len = r1_bio->sectors;
>  		read_len = raid1_check_read_range(rdev, this_sector, &len);
>  		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>  			update_read_sectors(conf, disk, this_sector, read_len);
>  			return disk;
>  		}
> 
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-11 20:23       ` [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev() Mateusz Jończyk
  2024-07-11 21:14         ` Paul E Luse
@ 2024-07-12  1:16         ` Yu Kuai
  2024-07-12 15:11           ` Song Liu
  2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 2 replies; 11+ messages in thread
From: Yu Kuai @ 2024-07-12  1:16 UTC (permalink / raw)
  To: Mateusz Jończyk, linux-raid, linux-kernel
  Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)

Hi,

On 2024/07/12 4:23, Mateusz Jończyk wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
> when that drive has a write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
> 
> 	BUG_ON(sectors <= 0);
> 
> Call Trace:
> 	? bio_split+0x96/0xb0
> 	? exc_invalid_op+0x53/0x70
> 	? bio_split+0x96/0xb0
> 	? asm_exc_invalid_op+0x1b/0x20
> 	? bio_split+0x96/0xb0
> 	? raid1_read_request+0x890/0xd20
> 	? __call_rcu_common.constprop.0+0x97/0x260
> 	raid1_make_request+0x81/0xce0
> 	? __get_random_u32_below+0x17/0x70
> 	? new_slab+0x2b3/0x580
> 	md_handle_request+0x77/0x210
> 	md_submit_bio+0x62/0xa0
> 	__submit_bio+0x17b/0x230
> 	submit_bio_noacct_nocheck+0x18e/0x3c0
> 	submit_bio_noacct+0x244/0x670
> 
> After investigation, it turned out that choose_slow_rdev() does not set
> the value of max_sectors in some cases and because of it,
> raid1_read_request calls bio_split with sectors == 0.
> 
> Fix it by filling in this variable.
> 
> This bug was introduced in
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> but apparently hidden until
> commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> shortly thereafter.
> 
> Cc: stable@vger.kernel.org # 6.9.x+
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> Cc: Song Liu <song@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Paul Luse <paul.e.luse@linux.intel.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> 
> --

Thanks for the patch!

Reviewed-by: Yu Kuai <yukuai3@huawei.com>

BTW, do you have plans to add a new test to mdadm tests? I'll
pick it up if you don't, just let me know.

Thanks,
Kuai

> 
> Tested on both Linux 6.10 and 6.9.8.
> 
> Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
> 	./test --dev=loop --no-error --raidtype=raid1
> (on 6.9.8 there was one failure, caused by external bitmap support not
> compiled in).
> 
> Notes:
> - I was reliably getting deadlocks when adding / removing devices
>    on such an array - while the array was loaded with fsstress with 20
>    concurrent processes. When the array was idle or loaded with fsstress
>    with 8 processes, no such deadlocks happened in my tests.
>    This occurred also on unpatched Linux 6.8.0 though, but not on
>    6.1.97-rc1, so this is likely an independent regression (to be
>    investigated).
> - I was also getting deadlocks when adding / removing the bitmap on the
>    array in similar conditions - this happened on Linux 6.1.97-rc1
>    also though. fsstress with 8 concurrent processes did cause it only
>    once during many tests.
> - in my testing, there was once a problem with hot adding an
>    internal bitmap to the array:
> 	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
> 	mdadm: failed to set internal bitmap.
>    even though no such reshaping was happening according to /proc/mdstat.
>    This seems unrelated, though.
> ---
>   drivers/md/raid1.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>   		len = r1_bio->sectors;
>   		read_len = raid1_check_read_range(rdev, this_sector, &len);
>   		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>   			update_read_sectors(conf, disk, this_sector, read_len);
>   			return disk;
>   		}
> 
> base-commit: 256abd8e550ce977b728be79a74e1729438b4948
> 



* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-12  1:16         ` Yu Kuai
@ 2024-07-12 15:11           ` Song Liu
  2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 0 replies; 11+ messages in thread
From: Song Liu @ 2024-07-12 15:11 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Mateusz Jończyk, linux-raid, linux-kernel, stable, Paul Luse,
	Xiao Ni, Mariusz Tkaczyk, yukuai (C)

On Fri, Jul 12, 2024 at 9:17 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
[...]
> >
> > After investigation, it turned out that choose_slow_rdev() does not set
> > the value of max_sectors in some cases and because of it,
> > raid1_read_request calls bio_split with sectors == 0.
> >
> > Fix it by filling in this variable.
> >
> > This bug was introduced in
> > commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > but apparently hidden until
> > commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
> > shortly thereafter.
> >
> > Cc: stable@vger.kernel.org # 6.9.x+
> > Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> > Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> > Cc: Song Liu <song@kernel.org>
> > Cc: Yu Kuai <yukuai3@huawei.com>
> > Cc: Paul Luse <paul.e.luse@linux.intel.com>
> > Cc: Xiao Ni <xni@redhat.com>
> > Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> > Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
> >
> > --
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>

Applied to md-6.11. Thanks!

Song


* Re: [PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()
  2024-07-12  1:16         ` Yu Kuai
  2024-07-12 15:11           ` Song Liu
@ 2024-07-13 12:40           ` Mateusz Jończyk
  1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-13 12:40 UTC (permalink / raw)
  To: Yu Kuai, linux-raid, linux-kernel
  Cc: stable, Song Liu, Paul Luse, Xiao Ni, Mariusz Tkaczyk, yukuai (C)

On 12.07.2024 at 03:16, Yu Kuai wrote:
> Hi,
>
> On 2024/07/12 4:23, Mateusz Jończyk wrote:
>> Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
>> when that drive has a write-mostly flag set. During such an attempt,
>> the following assertion in bio_split() is hit:
>>
[snip]
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
> BTW, do you have plans to add a new test to mdadm tests? I'll
> pick it up if you don't, just let me know.
>
> Thanks,
> Kuai

Yes, I'm working on it.

Greetings,

Mateusz



* Re: [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag
  2024-07-08 20:09   ` Mateusz Jończyk
  2024-07-09  2:57     ` Yu Kuai
@ 2024-07-09  6:49     ` Mariusz Tkaczyk
  1 sibling, 0 replies; 11+ messages in thread
From: Mariusz Tkaczyk @ 2024-07-09  6:49 UTC (permalink / raw)
  To: Mateusz Jończyk; +Cc: linux-raid

On Mon, 8 Jul 2024 22:09:51 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> > This looks correct, can you give it a test and cook a patch?
> >
> > Thanks,
> > Kuai  
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.
> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz
> 
> 

Hi Mateusz,
If there is something I can help with, feel free to ask (even in Polish).
You can reach me at the address I sent this from or at mariusz.tkaczyk@intel.com

I cannot answer you directly (this is the first problem you have to solve):
The following message to <mat.jonczyk@o2.pl> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 554-'sorry, refused mailfrom because return MX
does not exist'

Please consider using a different mail provider (as far as I know, gmail works well).

Thanks,
Mariusz
