* Filesystem corruption when adding a new device (delayed-resync, write-mostly)
@ 2024-07-20 14:47 Mateusz Jończyk
2024-07-22 5:39 ` Mateusz Jończyk
2024-07-28 10:36 ` [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range Mateusz Jończyk
0 siblings, 2 replies; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-20 14:47 UTC (permalink / raw)
To: Yu Kuai, linux-raid, linux-kernel; +Cc: Song Liu, Paul Luse
Hello,
In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA SSD
drives: /dev/md0 for /boot (not partitioned), /dev/md1 for remaining data (LUKS
+ LVM + ext4). For performance, I have marked the RAID component device for
/dev/md1 on the SATA SSD drive write-mostly, which "means that the 'md' driver
will avoid reading from these devices if at all possible" (man mdadm).
Recently, the NVMe drive started having problems (PCI AER errors and the
controller disappearing), so I removed it from the arrays and wiped it.
However, I have reseated the drive in the M.2 socket and this apparently fixed
it (verified with tests).
$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb5[1](W)
471727104 blocks super 1.2 [2/1] [_U]
bitmap: 4/4 pages [16KB], 65536KB chunk
md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
3142656 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md0 : active raid1 sdb4[3]
2094080 blocks super 1.2 [2/1] [_U]
unused devices: <none>
(md2 was used just for testing, ignore it).
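[Editor's note: the degraded state visible above (the [_U] entries for md0 and md1) can be detected mechanically before re-adding devices. A minimal sketch; degraded_arrays is a hypothetical helper, not part of mdadm, and it reads mdstat-format text on stdin so it can be tested offline (against a live system one would feed it /proc/mdstat):]

```shell
# Hypothetical helper: list arrays that an mdstat-style listing reports
# as degraded, i.e. whose status field contains an underscore, like the
# [_U] entries above.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember current array
        /\[[U_]+\]/   { if ($0 ~ /_/) print name }   # degraded status like [_U]
    '
}

# Demo on the listing quoted above:
degraded_arrays <<'EOF'
md1 : active raid1 sdb5[1](W)
      471727104 blocks super 1.2 [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
      3142656 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sdb4[3]
      2094080 blocks super 1.2 [2/1] [_U]
EOF
# prints:
# md1
# md0
```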
Today, I have tried to add the drive back to the arrays by using a script that
executed in quick succession:
mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
This was on Linux 6.10.0, patched with my previous patch:
https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
(which fixed a regression in the kernel and allows it to start /dev/md1 with a
single drive in write-mostly mode).
In the background, I was running "rdiff-backup --compare" that was comparing
data between my array contents and a backup attached via USB.
This, however, resulted in mayhem - I was unable to start any program (every
attempt failed with an input/output error), etc. I used SysRq + C to save a kernel log:
[ 4940.891490] md: could not open device unknown-block(259,1).
[ 4940.891498] md: md_import_device returned -16
[ 4940.920440] md: could not open device unknown-block(259,1).
[ 4940.920449] md: md_import_device returned -16
[ 4940.934462] md: recovery of RAID array md0
[ 4941.003041] EXT4-fs warning (device dm-4): ext4_dirblock_csum_verify:405: inode #16676515: comm rdiff-backup: No space for directory leaf checksum. Please run e2fsck -D.
[ 4941.003053] EXT4-fs error (device dm-4): htree_dirblock_to_tree:1082: inode #16676515: comm rdiff-backup: Directory block failed checksum
[ 4941.003061] Aborting journal on device dm-4-8.
[ 4941.066131] md: delaying recovery of md1 until md0 has finished (they share one or more physical units)
[ 4941.092374] EXT4-fs (dm-4): Remounting filesystem read-only
[ 4942.301499] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4942.301508] EXT4-fs error (device dm-2): __ext4_find_entry:1693: inode #156120: comm gnome-terminal-: checksumming directory block 0
[ 4942.301515] Aborting journal on device dm-2-8.
[ 4942.332393] EXT4-fs (dm-2): Remounting filesystem read-only
[ 4942.333839] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4942.765422] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4942.766884] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4947.364466] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4947.366435] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4947.849698] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4947.851860] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4949.226094] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4949.227982] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4949.696095] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4949.697468] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm gnome-terminal-: No space for directory leaf checksum. Please run e2fsck -D.
[ 4950.105158] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm pool-gnome-shel: No space for directory leaf checksum. Please run e2fsck -D.
[ 4950.105645] EXT4-fs warning (device dm-2): ext4_dirblock_csum_verify:405: inode #156120: comm pool-gnome-shel: No space for directory leaf checksum. Please run e2fsck -D.
[ 4951.559184] md: md0: recovery done.
[ 4951.562154] md: recovery of RAID array md1
[ 4952.636764] EXT4-fs warning: 6 callbacks suppressed
The interesting fact is that both dm-4 and dm-2 are on /dev/md1, which was
modified second, and should not have been touched (or read from) before the
resync of /dev/md0 was complete.
I have restarted the laptop on the same kernel, and this apparently
made things worse (perhaps fsck at boot corrupted my root filesystem further).
All in all, the other filesystems were relatively unaffected, but I had to
recover the root filesystem (mounted at /) from backups.
Greetings,
Mateusz
* Re: Filesystem corruption when adding a new device (delayed-resync, write-mostly)
2024-07-20 14:47 Filesystem corruption when adding a new device (delayed-resync, write-mostly) Mateusz Jończyk
@ 2024-07-22 5:39 ` Mateusz Jończyk
2024-07-24 20:35 ` Filesystem corruption when adding a new RAID " Mateusz Jończyk
2024-07-28 10:36 ` [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range Mateusz Jończyk
1 sibling, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-22 5:39 UTC (permalink / raw)
To: Yu Kuai, linux-raid, linux-kernel; +Cc: Song Liu, Paul Luse
W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
> Hello,
>
> In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA SSD
> drives: /dev/md0 for /boot (not partitioned), /dev/md1 for remaining data (LUKS
> + LVM + ext4). For performance, I have marked the RAID component device for
> /dev/md1 on the SATA SSD drive write-mostly, which "means that the 'md' driver
> will avoid reading from these devices if at all possible" (man mdadm).
>
> Recently, the NVMe drive started having problems (PCI AER errors and the
> controller disappearing), so I removed it from the arrays and wiped it.
> However, I have reseated the drive in the M.2 socket and this apparently fixed
> it (verified with tests).
>
> $ cat /proc/mdstat
> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> md1 : active raid1 sdb5[1](W)
> 471727104 blocks super 1.2 [2/1] [_U]
> bitmap: 4/4 pages [16KB], 65536KB chunk
>
> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> 3142656 blocks super 1.2 [2/2] [UU]
> bitmap: 0/1 pages [0KB], 65536KB chunk
>
> md0 : active raid1 sdb4[3]
> 2094080 blocks super 1.2 [2/1] [_U]
>
> unused devices: <none>
>
> (md2 was used just for testing, ignore it).
>
> Today, I have tried to add the drive back to the arrays by using a script that
> executed in quick succession:
>
> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>
> This was on Linux 6.10.0, patched with my previous patch:
>
> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>
> (which fixed a regression in the kernel and allows it to start /dev/md1 with a
> single drive in write-mostly mode).
> In the background, I was running "rdiff-backup --compare" that was comparing
> data between my array contents and a backup attached via USB.
>
> This, however resulted in mayhem - I was unable to start any program with an
> input-output error, etc. I used SysRQ + C to save a kernel log:
>
Hello,
It is possible that my second SSD has some problems and high read activity
during RAID resync triggered it. Reads from that drive are now very slow (between
10 - 30 MB/s) and this suggests that something is not OK.
Greetings,
Mateusz
* Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-22 5:39 ` Mateusz Jończyk
@ 2024-07-24 20:35 ` Mateusz Jończyk
2024-07-24 21:19 ` Paul E Luse
0 siblings, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-24 20:35 UTC (permalink / raw)
To: Yu Kuai, linux-raid, linux-kernel; +Cc: Song Liu, Paul Luse, regressions
W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
> W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
>> Hello,
>>
>> In my laptop, I used to have two RAID1 arrays on top of NVMe and SATA SSD
>> drives: /dev/md0 for /boot (not partitioned), /dev/md1 for remaining data (LUKS
>> + LVM + ext4). For performance, I have marked the RAID component device for
>> /dev/md1 on the SATA SSD drive write-mostly, which "means that the 'md' driver
>> will avoid reading from these devices if at all possible" (man mdadm).
>>
>> Recently, the NVMe drive started having problems (PCI AER errors and the
>> controller disappearing), so I removed it from the arrays and wiped it.
>> However, I have reseated the drive in the M.2 socket and this apparently fixed
>> it (verified with tests).
>>
>> $ cat /proc/mdstat
>> Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
>> md1 : active raid1 sdb5[1](W)
>> 471727104 blocks super 1.2 [2/1] [_U]
>> bitmap: 4/4 pages [16KB], 65536KB chunk
>>
>> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>> 3142656 blocks super 1.2 [2/2] [UU]
>> bitmap: 0/1 pages [0KB], 65536KB chunk
>>
>> md0 : active raid1 sdb4[3]
>> 2094080 blocks super 1.2 [2/1] [_U]
>>
>> unused devices: <none>
>>
>> (md2 was used just for testing, ignore it).
>>
>> Today, I have tried to add the drive back to the arrays by using a script that
>> executed in quick succession:
>>
>> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>
>> This was on Linux 6.10.0, patched with my previous patch:
>>
>> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>>
>> (which fixed a regression in the kernel and allows it to start /dev/md1 with a
>> single drive in write-mostly mode).
>> In the background, I was running "rdiff-backup --compare" that was comparing
>> data between my array contents and a backup attached via USB.
>>
>> This, however resulted in mayhem - I was unable to start any program with an
>> input-output error, etc. I used SysRQ + C to save a kernel log:
>>
> Hello,
>
> It is possible that my second SSD has some problems and high read activity
> during RAID resync triggered it. Reads from that drive are now very slow (between
> 10 - 30 MB/s) and this suggests that something is not OK.
Hello,
Unfortunately, hardware failure seems not to be the case.
I did test it again on 6.10, twice, and in both cases I got filesystem corruption (but not as severe).
On Linux 6.1.96 it seems to be working well (also did two tries).
Please note: in my tests, I was using a RAID component device with
a write-mostly bit set. This setup does not work on 6.9+ out of the
box and requires the following patch:
commit 36a5c03f23271 ("md/raid1: set max_sectors during early return from choose_slow_rdev()")
that is in master now.
It is also heading into stable, which I'm going to interrupt.
Greetings,
Mateusz
* Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-24 20:35 ` Filesystem corruption when adding a new RAID " Mateusz Jończyk
@ 2024-07-24 21:19 ` Paul E Luse
2024-07-25 7:15 ` Mateusz Jończyk
0 siblings, 1 reply; 11+ messages in thread
From: Paul E Luse @ 2024-07-24 21:19 UTC (permalink / raw)
To: Mateusz Jończyk
Cc: Yu Kuai, linux-raid, linux-kernel, Song Liu, regressions
On Wed, 24 Jul 2024 22:35:49 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
> > W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
> >> Hello,
> >>
> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
> >> for remaining data (LUKS
> >> + LVM + ext4). For performance, I have marked the RAID component
> >> device for /dev/md1 on the SATA SSD drive write-mostly, which
> >> "means that the 'md' driver will avoid reading from these devices
> >> if at all possible" (man mdadm).
> >>
> >> Recently, the NVMe drive started having problems (PCI AER errors
> >> and the controller disappearing), so I removed it from the arrays
> >> and wiped it. However, I have reseated the drive in the M.2 socket
> >> and this apparently fixed it (verified with tests).
> >>
> >> $ cat /proc/mdstat
> >> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
> >> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
> >> 471727104 blocks super 1.2 [2/1] [_U]
> >> bitmap: 4/4 pages [16KB], 65536KB chunk
> >>
> >> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> >> 3142656 blocks super 1.2 [2/2] [UU]
> >> bitmap: 0/1 pages [0KB], 65536KB chunk
> >>
> >> md0 : active raid1 sdb4[3]
> >> 2094080 blocks super 1.2 [2/1] [_U]
> >>
> >> unused devices: <none>
> >>
> >> (md2 was used just for testing, ignore it).
> >>
> >> Today, I have tried to add the drive back to the arrays by using a
> >> script that executed in quick succession:
> >>
> >> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> >> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
> >>
> >> This was on Linux 6.10.0, patched with my previous patch:
> >>
> >> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
> >>
> >> (which fixed a regression in the kernel and allows it to start
> >> /dev/md1 with a single drive in write-mostly mode).
> >> In the background, I was running "rdiff-backup --compare" that was
> >> comparing data between my array contents and a backup attached via
> >> USB.
> >>
> >> This, however resulted in mayhem - I was unable to start any
> >> program with an input-output error, etc. I used SysRQ + C to save
> >> a kernel log:
> >>
> > Hello,
> >
> > It is possible that my second SSD has some problems and high read
> > activity during RAID resync triggered it. Reads from that drive are
> > now very slow (between 10 - 30 MB/s) and this suggests that
> > something is not OK.
>
> Hello,
>
> Unfortunately, hardware failure seems not to be the case.
>
> I did test it again on 6.10, twice, and in both cases I got
> filesystem corruption (but not as severe).
>
> On Linux 6.1.96 it seems to be working well (also did two tries).
>
> Please note: in my tests, I was using a RAID component device with
> a write-mostly bit set. This setup does not work on 6.9+ out of the
> box and requires the following patch:
>
> commit 36a5c03f23271 ("md/raid1: set max_sectors during early return
> from choose_slow_rdev()")
>
> that is in master now.
>
> It is also heading into stable, which I'm going to interrupt.
Hi Mateusz,
I'm pretty interested in what is happening here especially as it
relates to write-mostly. Couple of questions for you:
1) Are you able to find a simpler reproduction for this, for example
without mixing SATA and NVMe. Maybe just using two known good NVMe
SSDs and follow your steps to repro?
2) I don't fully understand your last two statements, maybe you can
clarify? With your max_sectors patch does it pass or fail? If it passes,
what do you mean by "I'm going to interrupt"? It sounds like you mean the
patch doesn't work and you are trying to stop it??
thanks
Paul
>
> Greetings,
> Mateusz
>
>
* Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-24 21:19 ` Paul E Luse
@ 2024-07-25 7:15 ` Mateusz Jończyk
2024-07-25 14:27 ` Paul E Luse
0 siblings, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-25 7:15 UTC (permalink / raw)
To: Paul E Luse; +Cc: Yu Kuai, linux-raid, linux-kernel, Song Liu, regressions
Dnia 24 lipca 2024 23:19:06 CEST, Paul E Luse <paul.e.luse@linux.intel.com> napisał/a:
>On Wed, 24 Jul 2024 22:35:49 +0200
>Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
>
>> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
>> > W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
>> >> Hello,
>> >>
>> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>> >> for remaining data (LUKS
>> >> + LVM + ext4). For performance, I have marked the RAID component
>> >> device for /dev/md1 on the SATA SSD drive write-mostly, which
>> >> "means that the 'md' driver will avoid reading from these devices
>> >> if at all possible" (man mdadm).
>> >>
>> >> Recently, the NVMe drive started having problems (PCI AER errors
>> >> and the controller disappearing), so I removed it from the arrays
>> >> and wiped it. However, I have reseated the drive in the M.2 socket
>> >> and this apparently fixed it (verified with tests).
>> >>
>> >> $ cat /proc/mdstat
>> >> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
>> >> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
>> >> 471727104 blocks super 1.2 [2/1] [_U]
>> >> bitmap: 4/4 pages [16KB], 65536KB chunk
>> >>
>> >> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>> >> 3142656 blocks super 1.2 [2/2] [UU]
>> >> bitmap: 0/1 pages [0KB], 65536KB chunk
>> >>
>> >> md0 : active raid1 sdb4[3]
>> >> 2094080 blocks super 1.2 [2/1] [_U]
>> >>
>> >> unused devices: <none>
>> >>
>> >> (md2 was used just for testing, ignore it).
>> >>
>> >> Today, I have tried to add the drive back to the arrays by using a
>> >> script that executed in quick succession:
>> >>
>> >> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>> >> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>> >>
>> >> This was on Linux 6.10.0, patched with my previous patch:
>> >>
>> >> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>> >>
>> >> (which fixed a regression in the kernel and allows it to start
>> >> /dev/md1 with a single drive in write-mostly mode).
>> >> In the background, I was running "rdiff-backup --compare" that was
>> >> comparing data between my array contents and a backup attached via
>> >> USB.
>> >>
>> >> This, however resulted in mayhem - I was unable to start any
>> >> program with an input-output error, etc. I used SysRQ + C to save
>> >> a kernel log:
>> >>
>> > Hello,
>> >
>> > It is possible that my second SSD has some problems and high read
>> > activity during RAID resync triggered it. Reads from that drive are
>> > now very slow (between 10 - 30 MB/s) and this suggests that
>> > something is not OK.
>>
>> Hello,
>>
>> Unfortunately, hardware failure seems not to be the case.
>>
>> I did test it again on 6.10, twice, and in both cases I got
>> filesystem corruption (but not as severe).
>>
>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>
>> Please note: in my tests, I was using a RAID component device with
>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>> box and requires the following patch:
>>
>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early return
>> from choose_slow_rdev()")
>>
>> that is in master now.
>>
>> It is also heading into stable, which I'm going to interrupt.
>
>Hi Mateusz,
>
>I'm pretty interested in what is happening here especially as it
>relates to write-mostly. Couple of questions for you:
>
>1) Are you able to find a simpler reproduction for this, for example
>without mixing SATA and NVMe. Maybe just using two known good NVMe
>SSDs and follow your steps to repro?
Hello,
Well, I have three drives in my laptop: NVMe, SATA SSD (in the DVD bay) and SATA HDD (platter). I could do tests on top of these two SATA drives.
But maybe it would be easier for me to bisect (or guess-bisect) in the current setup, I haven't made up my mind yet.
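[Editor's note: the bisection mentioned above can be automated with
`git bisect run`. A toy illustration on a throwaway repository rather than
the kernel tree (for the real regression, the run script would rebuild and
boot a kernel at each step); the "bug" here is simply that the tracked file
reaches 4:]

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name "bisect demo"

for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

# HEAD (commit 5) shows the bug; HEAD~4 (commit 1) is known good.
git bisect start HEAD HEAD~4 > /dev/null
# The run script exits 0 for good commits and non-zero for bad ones.
git bisect run sh -c 'test "$(cat state)" -lt 4' > /dev/null 2>&1

# refs/bisect/bad now names the first bad commit:
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "$first_bad"
```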
>
>2) I don't fully understand your last two statements, maybe you can
>clarify? With your max_sectors patch does it pass or fail? If pass,
>what do mean by "I'm going to interrupt"? It sounds like you mean the
>patch doesn't work and you are trying to stop it??
Without this patch I wouldn't be able to do the tests: a degraded RAID1 with a single drive in write-mostly mode doesn't start at all.
With my last statement I meant that I was going to stop this patch from going to stable kernels. At this point, it doesn't seem to me
that my patch is the direct cause of the problems, or that I missed something in it. However, I think that it is currently better to fail
this setup outright rather than risk somebody's data.
I have made further tests:
- vanilla 6.8.0 with a write-mostly drive works correctly,
- vanilla 6.10-rc6 without the write-mostly bit set also works correctly.
So it seems that the problem happens only with the write-mostly mode and after 6.8.0.
Greetings,
Mateusz
* Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-25 7:15 ` Mateusz Jończyk
@ 2024-07-25 14:27 ` Paul E Luse
2024-07-28 10:30 ` [REGRESSION] " Mateusz Jończyk
0 siblings, 1 reply; 11+ messages in thread
From: Paul E Luse @ 2024-07-25 14:27 UTC (permalink / raw)
To: Mateusz Jończyk
Cc: Yu Kuai, linux-raid, linux-kernel, Song Liu, regressions
On Thu, 25 Jul 2024 09:15:40 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> Dnia 24 lipca 2024 23:19:06 CEST, Paul E Luse
> <paul.e.luse@linux.intel.com> napisał/a:
> >On Wed, 24 Jul 2024 22:35:49 +0200
> >Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> >
> >> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
> >> > W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
> >> >> Hello,
> >> >>
> >> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
> >> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
> >> >> for remaining data (LUKS
> >> >> + LVM + ext4). For performance, I have marked the RAID component
> >> >> device for /dev/md1 on the SATA SSD drive write-mostly, which
> >> >> "means that the 'md' driver will avoid reading from these
> >> >> devices if at all possible" (man mdadm).
> >> >>
> >> >> Recently, the NVMe drive started having problems (PCI AER errors
> >> >> and the controller disappearing), so I removed it from the
> >> >> arrays and wiped it. However, I have reseated the drive in the
> >> >> M.2 socket and this apparently fixed it (verified with tests).
> >> >>
> >> >> $ cat /proc/mdstat
> >> >> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
> >> >> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
> >> >> 471727104 blocks super 1.2 [2/1] [_U]
> >> >> bitmap: 4/4 pages [16KB], 65536KB chunk
> >> >>
> >> >> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> >> >> 3142656 blocks super 1.2 [2/2] [UU]
> >> >> bitmap: 0/1 pages [0KB], 65536KB chunk
> >> >>
> >> >> md0 : active raid1 sdb4[3]
> >> >> 2094080 blocks super 1.2 [2/1] [_U]
> >> >>
> >> >> unused devices: <none>
> >> >>
> >> >> (md2 was used just for testing, ignore it).
> >> >>
> >> >> Today, I have tried to add the drive back to the arrays by
> >> >> using a script that executed in quick succession:
> >> >>
> >> >> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> >> >> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
> >> >>
> >> >> This was on Linux 6.10.0, patched with my previous patch:
> >> >>
> >> >> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
> >> >>
> >> >> (which fixed a regression in the kernel and allows it to start
> >> >> /dev/md1 with a single drive in write-mostly mode).
> >> >> In the background, I was running "rdiff-backup --compare" that
> >> >> was comparing data between my array contents and a backup
> >> >> attached via USB.
> >> >>
> >> >> This, however resulted in mayhem - I was unable to start any
> >> >> program with an input-output error, etc. I used SysRQ + C to
> >> >> save a kernel log:
> >> >>
> >> > Hello,
> >> >
> >> > It is possible that my second SSD has some problems and high read
> >> > activity during RAID resync triggered it. Reads from that drive
> >> > are now very slow (between 10 - 30 MB/s) and this suggests that
> >> > something is not OK.
> >>
> >> Hello,
> >>
> >> Unfortunately, hardware failure seems not to be the case.
> >>
> >> I did test it again on 6.10, twice, and in both cases I got
> >> filesystem corruption (but not as severe).
> >>
> >> On Linux 6.1.96 it seems to be working well (also did two tries).
> >>
> >> Please note: in my tests, I was using a RAID component device with
> >> a write-mostly bit set. This setup does not work on 6.9+ out of the
> >> box and requires the following patch:
> >>
> >> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
> >> return from choose_slow_rdev()")
> >>
> >> that is in master now.
> >>
> >> It is also heading into stable, which I'm going to interrupt.
> >
> >Hi Mateusz,
> >
> >I'm pretty interested in what is happening here especially as it
> >relates to write-mostly. Couple of questions for you:
> >
> >1) Are you able to find a simpler reproduction for this, for example
> >without mixing SATA and NVMe. Maybe just using two known good NVMe
> >SSDs and follow your steps to repro?
>
> Hello,
>
> Well, I have three drives in my laptop: NVMe, SATA SSD (in the DVD
> bay) and SATA HDD (platter). I could do tests on top of these two
> SATA drives. But maybe it would be easier for me to bisect (or
> guess-bisect) in the current setup, I haven't made up my mind yet.
>
OK, thanks.
> >
> >2) I don't fully understand your last two statements, maybe you can
> >clarify? With your max_sectors patch does it pass or fail? If pass,
> >what do mean by "I'm going to interrupt"? It sounds like you mean the
> >patch doesn't work and you are trying to stop it??
>
> Without this patch I wouldn't be able to do the tests. Without it,
> degraded RAID1 with a single drive in write-mostly mode doesn’t start
> at all.
>
> With my last statement I meant that I was going to stop this patch
> from going to stable kernels. At this point, it doesn’t seem to me
> that my patch is the direct cause of the problems, that I missed
> something. However, I think that it is currently better to fail this
> setup outright rather than risk somebody's data.
>
OK, I would say please do not try to stop the patch; it is a good fix and,
although it may not completely solve your problem, it should land.
Unless Kuai has another opinion.
-Paul
> I have made further tests:
>
> - vanilla 6.8.0 with a write-mostly drive works correctly,
>
> - vanilla 6.10-rc6 without the write mostly bit set also works
> correctly.
>
> So it seems that the problem happens only with the write-mostly mode
> and after 6.8.0.
>
> Greetings,
>
> Mateusz
>
>
* Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-25 14:27 ` Paul E Luse
@ 2024-07-28 10:30 ` Mateusz Jończyk
2024-07-30 20:35 ` Mateusz Jończyk
0 siblings, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-28 10:30 UTC (permalink / raw)
To: Paul E Luse
Cc: linux-raid, Yu Kuai, linux-kernel, Song Liu, regressions,
Mariusz Tkaczyk
W dniu 25.07.2024 o 16:27, Paul E Luse pisze:
> On Thu, 25 Jul 2024 09:15:40 +0200
> Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
>
>> Dnia 24 lipca 2024 23:19:06 CEST, Paul E Luse
>> <paul.e.luse@linux.intel.com> napisał/a:
>>> On Wed, 24 Jul 2024 22:35:49 +0200
>>> Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
>>>
>>>> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
>>>>> W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
>>>>>> Hello,
>>>>>>
>>>>>> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>>>>>> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>>>>>> for remaining data (LUKS
>>>>>> + LVM + ext4). For performance, I have marked the RAID component
>>>>>> device for /dev/md1 on the SATA SSD drive write-mostly, which
>>>>>> "means that the 'md' driver will avoid reading from these
>>>>>> devices if at all possible" (man mdadm).
>>>>>>
>>>>>> Recently, the NVMe drive started having problems (PCI AER errors
>>>>>> and the controller disappearing), so I removed it from the
>>>>>> arrays and wiped it. However, I have reseated the drive in the
>>>>>> M.2 socket and this apparently fixed it (verified with tests).
>>>>>>
>>>>>> $ cat /proc/mdstat
>>>>>> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
>>>>>> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
>>>>>> 471727104 blocks super 1.2 [2/1] [_U]
>>>>>> bitmap: 4/4 pages [16KB], 65536KB chunk
>>>>>>
>>>>>> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>>>>>> 3142656 blocks super 1.2 [2/2] [UU]
>>>>>> bitmap: 0/1 pages [0KB], 65536KB chunk
>>>>>>
>>>>>> md0 : active raid1 sdb4[3]
>>>>>> 2094080 blocks super 1.2 [2/1] [_U]
>>>>>>
>>>>>> unused devices: <none>
>>>>>>
>>>>>> (md2 was used just for testing, ignore it).
>>>>>>
>>>>>> Today, I have tried to add the drive back to the arrays by
>>>>>> using a script that executed in quick succession:
>>>>>>
>>>>>> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>>>>>> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>>>>>
>>>>>> This was on Linux 6.10.0, patched with my previous patch:
>>>>>>
>>>>>> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>>>>>>
>>>>>> (which fixed a regression in the kernel and allows it to start
>>>>>> /dev/md1 with a single drive in write-mostly mode).
>>>>>> In the background, I was running "rdiff-backup --compare" that
>>>>>> was comparing data between my array contents and a backup
>>>>>> attached via USB.
>>>>>>
>>>>>> This, however resulted in mayhem - I was unable to start any
>>>>>> program with an input-output error, etc. I used SysRQ + C to
>>>>>> save a kernel log:
>>>>>>
>>>> Hello,
>>>>
>>>> Unfortunately, hardware failure seems not to be the case.
>>>>
>>>> I did test it again on 6.10, twice, and in both cases I got
>>>> filesystem corruption (but not as severe).
>>>>
>>>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>>>
>>>> Please note: in my tests, I was using a RAID component device with
>>>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>>>> box and requires the following patch:
>>>>
>>>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
>>>> return from choose_slow_rdev()")
>>>>
>>>> that is in master now.
>>>>
>>>> It is also heading into stable, which I'm going to interrupt.
Hello,
With much effort (challenging to reproduce reliably) I think have nailed down the issue to the read_balance refactoring series in 6.9:
86b1e613eb3b Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
e81faa91a580 Merge branch 'raid1-read_balance' into md-6.9
0091c5a269ec md/raid1: factor out helpers to choose the best rdev from read_balance()
ba58f57fdf98 md/raid1: factor out the code to manage sequential IO
9f3ced792203 md/raid1: factor out choose_bb_rdev() from read_balance()
dfa8ecd167c1 md/raid1: factor out choose_slow_rdev() from read_balance()
31a73331752d md/raid1: factor out read_first_rdev() from read_balance()
f10920762955 md/raid1-10: factor out a new helper raid1_should_read_first()
f29841ff3b27 md/raid1-10: add a helper raid1_check_read_range()
257ac239ffcf md/raid1: fix choose next idle in read_balance()
2c27d09d3a76 md/raid1: record nonrot rdevs while adding/removing rdevs to conf
969d6589abcb md/raid1: factor out helpers to add rdev to conf
3a0f007b6979 md: add a new helper rdev_has_badblock()
In particular, 86b1e613eb3b is definitely bad, and 13fe8e6825e4 is 95% good.
I was testing with the following two commits on top of the series to make this setup work for me:
commit 36a5c03f23271 ("md/raid1: set max_sectors during early return from choose_slow_rdev()")
commit b561ea56a264 ("block: allow device to have both virt_boundary_mask and max segment size")
After code analysis, I have noticed that the following check, which was present in the old
read_balance(), is not present (in equivalent form) in the new code:
if (!test_bit(In_sync, &rdev->flags) &&
rdev->recovery_offset < this_sector + sectors)
continue;
Its absence (in choose_slow_rdev() and choose_first_rdev(), and possibly other functions)
would cause the kernel to read from the device being synced to before it is ready.
In my debug patch (which I'll send in a while), I have copied the check into
raid1_check_read_range(), and with it the problems no longer occur.
I'm not so sure now that this bug is limited to write-mostly though - previous tests may have been unreliable.
#regzbot introduced: 13fe8e6825e4..86b1e613eb3b
#regzbot monitor: https://lore.kernel.org/lkml/20240724141906.10b4fc4e@peluse-desk5/T/#m671d6d3a7eda44d39d0882864a98824f52c52917
Greetings,
Mateusz
* [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range
2024-07-20 14:47 Filesystem corruption when adding a new device (delayed-resync, write-mostly) Mateusz Jończyk
2024-07-22 5:39 ` Mateusz Jończyk
@ 2024-07-28 10:36 ` Mateusz Jończyk
2024-07-29 1:30 ` Yu Kuai
1 sibling, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-28 10:36 UTC (permalink / raw)
To: linux-raid, linux-kernel; +Cc: Mateusz Jończyk, Yu Kuai, Song Liu, stable
This should fix the filesystem corruption during RAID resync.
Checking this condition in raid1_check_read_range is not ideal, but this
is only a debug patch.
Link: https://lore.kernel.org/lkml/20240724141906.10b4fc4e@peluse-desk5/T/#m671d6d3a7eda44d39d0882864a98824f52c52917
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Song Liu <song@kernel.org>
Cc: stable@vger.kernel.org
---
drivers/md/raid1-10.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 2ea1710a3b70..4ab896e8cb12 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -252,6 +252,10 @@ static inline int raid1_check_read_range(struct md_rdev *rdev,
sector_t first_bad;
int bad_sectors;
+ if (!test_bit(In_sync, &rdev->flags) &&
+ rdev->recovery_offset < this_sector + *len)
+ return 0;
+
/* no bad block overlap */
if (!is_badblock(rdev, this_sector, *len, &first_bad, &bad_sectors))
return *len;
--
2.25.1
* Re: [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range
2024-07-28 10:36 ` [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range Mateusz Jończyk
@ 2024-07-29 1:30 ` Yu Kuai
0 siblings, 0 replies; 11+ messages in thread
From: Yu Kuai @ 2024-07-29 1:30 UTC (permalink / raw)
To: Mateusz Jończyk, linux-raid, linux-kernel
Cc: Song Liu, stable, yukuai (C)
Hi,
在 2024/07/28 18:36, Mateusz Jończyk 写道:
> This should fix the filesystem corruption during RAID resync.
>
> Checking this condition in raid1_check_read_range is not ideal, but this
> is only a debug patch.
>
> Link: https://lore.kernel.org/lkml/20240724141906.10b4fc4e@peluse-desk5/T/#m671d6d3a7eda44d39d0882864a98824f52c52917
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Song Liu <song@kernel.org>
> Cc: stable@vger.kernel.org
> ---
> drivers/md/raid1-10.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> index 2ea1710a3b70..4ab896e8cb12 100644
> --- a/drivers/md/raid1-10.c
> +++ b/drivers/md/raid1-10.c
> @@ -252,6 +252,10 @@ static inline int raid1_check_read_range(struct md_rdev *rdev,
> sector_t first_bad;
> int bad_sectors;
>
> + if (!test_bit(In_sync, &rdev->flags) &&
> + rdev->recovery_offset < this_sector + *len)
> + return 0;
> +
You're right that this check is necessary: we must not read from an
rdev while the array is still in recovery and the read range has not
been recovered yet.
However, choose_first_rdev() is called during array resync, hence
the array will not be in recovery, and the check is just dead code
there.
And choose_bb_rdev() will only be called when the read range has bad
blocks, which means the read range must have been accessed before and
an IO error must have occurred first, which indicates this rdev can't
still be in recovery for the read range.
So I think this check is only needed for slow disks.
BTW, it looks like we don't have many tests for the slow-disk case in
raid1.
Thanks,
Kuai
> /* no bad block overlap */
> if (!is_badblock(rdev, this_sector, *len, &first_bad, &bad_sectors))
> return *len;
>
* Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-28 10:30 ` [REGRESSION] " Mateusz Jończyk
@ 2024-07-30 20:35 ` Mateusz Jończyk
2024-07-31 1:10 ` Yu Kuai
0 siblings, 1 reply; 11+ messages in thread
From: Mateusz Jończyk @ 2024-07-30 20:35 UTC (permalink / raw)
To: Paul E Luse
Cc: linux-raid, Yu Kuai, linux-kernel, Song Liu, regressions,
Mariusz Tkaczyk
W dniu 28.07.2024 o 12:30, Mateusz Jończyk pisze:
> [snip]
> Hello,
>
> With much effort (challenging to reproduce reliably) I think have nailed down the issue to the read_balance refactoring series in 6.9:
[snip]
> After code analysis, I have noticed that the following check that was present in old
> read_balance() is not present (in equivalent form in the new code):
>
> if (!test_bit(In_sync, &rdev->flags) &&
> rdev->recovery_offset < this_sector + sectors)
> continue;
>
> (in choose_slow_rdev() and choose_first_rdev() and possibly other functions)
>
> which would cause the kernel to read from the device being synced to before
> it is ready.
Hello,
I think I have made a reliable (and safe) reproducer for this bug:
Prerequisite: create an array on top of 2 devices 1GB+ large:
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/nvme0n1p5 --write-mostly /dev/sdb8
The script:
-------------------------------8<------------------------
#!/bin/bash
mdadm /dev/md4 --fail /dev/nvme0n1p5
sleep 1
mdadm /dev/md4 --remove failed
sleep 1
# fill with random data
shred -n1 -v /dev/md4
# fill with zeros
shred -n0 -zv /dev/nvme0n1p5
sha256sum /dev/md4
echo 1 > /proc/sys/vm/drop_caches
date
# calculate a shasum while the array is being synced
( sha256sum /dev/md4; date ) &
mdadm /dev/md4 --add --readwrite /dev/nvme0n1p5
date
-------------------------------8<------------------------
The two shasums should be equal, but they were different in my tests on affected kernels.
Also, in my tests with the script, *without* a write-mostly device in the array, the problems did not happen.
Greetings,
Mateusz
* Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
2024-07-30 20:35 ` Mateusz Jończyk
@ 2024-07-31 1:10 ` Yu Kuai
0 siblings, 0 replies; 11+ messages in thread
From: Yu Kuai @ 2024-07-31 1:10 UTC (permalink / raw)
To: Mateusz Jończyk, Paul E Luse
Cc: linux-raid, linux-kernel, Song Liu, regressions, Mariusz Tkaczyk,
yukuai (C)
Hi,
在 2024/07/31 4:35, Mateusz Jończyk 写道:
> W dniu 28.07.2024 o 12:30, Mateusz Jończyk pisze:
>> [snip]
>> Hello,
>>
>> With much effort (challenging to reproduce reliably) I think have nailed down the issue to the read_balance refactoring series in 6.9:
> [snip]
>> After code analysis, I have noticed that the following check that was present in old
>> read_balance() is not present (in equivalent form in the new code):
>>
>> if (!test_bit(In_sync, &rdev->flags) &&
>> rdev->recovery_offset < this_sector + sectors)
>> continue;
>>
>> (in choose_slow_rdev() and choose_first_rdev() and possibly other functions)
>>
>> which would cause the kernel to read from the device being synced to before
>> it is ready.
>
> Hello,
>
> I think have made a reliable (and safe) reproducer for this bug:
>
> Prerequisite: create an array on top of 2 devices 1GB+ large:
>
> mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/nvme0n1p5 --write-mostly /dev/sdb8
> The script:
> -------------------------------8<------------------------
>
> #!/bin/bash
>
> mdadm /dev/md4 --fail /dev/nvme0n1p5
> sleep 1
> mdadm /dev/md4 --remove failed
> sleep 1
>
> # fill with random data
> shred -n1 -v /dev/md4
> # fill with zeros
> shred -n0 -zv /dev/nvme0n1p5
>
> sha256sum /dev/md4
>
> echo 1 > /proc/sys/vm/drop_caches
>
> date
>
> # calculate a shasum while the array is being synced
> ( sha256sum /dev/md4; date ) &
> mdadm /dev/md4 --add --readwrite /dev/nvme0n1p5
> date
>
> -------------------------------8<------------------------
>
> The two shasums should be equal, but they were different in my tests on affected kernels.
>
> Also, in my tests with the script, *without* a write-mostly device in the array, the problems did not happen.
Thanks for the test.
Can you send a new version of the patch, and this test to mdadm?
Kuai
>
> Greetings,
>
> Mateusz
>
Thread overview: 11+ messages
2024-07-20 14:47 Filesystem corruption when adding a new device (delayed-resync, write-mostly) Mateusz Jończyk
2024-07-22 5:39 ` Mateusz Jończyk
2024-07-24 20:35 ` Filesystem corruption when adding a new RAID " Mateusz Jończyk
2024-07-24 21:19 ` Paul E Luse
2024-07-25 7:15 ` Mateusz Jończyk
2024-07-25 14:27 ` Paul E Luse
2024-07-28 10:30 ` [REGRESSION] " Mateusz Jończyk
2024-07-30 20:35 ` Mateusz Jończyk
2024-07-31 1:10 ` Yu Kuai
2024-07-28 10:36 ` [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range Mateusz Jończyk
2024-07-29 1:30 ` Yu Kuai