From: "Mateusz Jończyk" <mat.jonczyk@o2.pl>
To: Paul E Luse <paul.e.luse@linux.intel.com>
Cc: linux-raid@vger.kernel.org, Yu Kuai <yukuai3@huawei.com>,
linux-kernel@vger.kernel.org, Song Liu <song@kernel.org>,
regressions@lists.linux.dev,
Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Subject: Re: [REGRESSION] Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
Date: Sun, 28 Jul 2024 12:30:02 +0200 [thread overview]
Message-ID: <713ce46b-8751-49fb-b61f-2eb3e19459dc@o2.pl> (raw)
In-Reply-To: <20240725072742.1664beec@peluse-desk5>
W dniu 25.07.2024 o 16:27, Paul E Luse pisze:
> On Thu, 25 Jul 2024 09:15:40 +0200
> Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
>
>> Dnia 24 lipca 2024 23:19:06 CEST, Paul E Luse
>> <paul.e.luse@linux.intel.com> napisał/a:
>>> On Wed, 24 Jul 2024 22:35:49 +0200
>>> Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
>>>
>>>> W dniu 22.07.2024 o 07:39, Mateusz Jończyk pisze:
>>>>> W dniu 20.07.2024 o 16:47, Mateusz Jończyk pisze:
>>>>>> Hello,
>>>>>>
>>>>>> In my laptop, I used to have two RAID1 arrays on top of NVMe and
>>>>>> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
>>>>>> for remaining data (LUKS
>>>>>> + LVM + ext4). For performance, I have marked the RAID component
>>>>>> device for /dev/md1 on the SATA SSD drive write-mostly, which
>>>>>> "means that the 'md' driver will avoid reading from these
>>>>>> devices if at all possible" (man mdadm).
>>>>>>
>>>>>> Recently, the NVMe drive started having problems (PCI AER errors
>>>>>> and the controller disappearing), so I removed it from the
>>>>>> arrays and wiped it. However, I have reseated the drive in the
>>>>>> M.2 socket and this apparently fixed it (verified with tests).
>>>>>>
>>>>>> $ cat /proc/mdstat
>>>>>> Personalities : [raid1] [linear] [multipath] [raid0] [raid6]
>>>>>> [raid5] [raid4] [raid10] md1 : active raid1 sdb5[1](W)
>>>>>> 471727104 blocks super 1.2 [2/1] [_U]
>>>>>> bitmap: 4/4 pages [16KB], 65536KB chunk
>>>>>>
>>>>>> md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
>>>>>> 3142656 blocks super 1.2 [2/2] [UU]
>>>>>> bitmap: 0/1 pages [0KB], 65536KB chunk
>>>>>>
>>>>>> md0 : active raid1 sdb4[3]
>>>>>> 2094080 blocks super 1.2 [2/1] [_U]
>>>>>>
>>>>>> unused devices: <none>
>>>>>>
>>>>>> (md2 was used just for testing, ignore it).
>>>>>>
>>>>>> Today, I have tried to add the drive back to the arrays by
>>>>>> using a script that executed in quick succession:
>>>>>>
>>>>>> mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
>>>>>> mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
>>>>>>
>>>>>> This was on Linux 6.10.0, patched with my previous patch:
>>>>>>
>>>>>> https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
>>>>>>
>>>>>> (which fixed a regression in the kernel and allows it to start
>>>>>> /dev/md1 with a single drive in write-mostly mode).
>>>>>> In the background, I was running "rdiff-backup --compare" that
>>>>>> was comparing data between my array contents and a backup
>>>>>> attached via USB.
>>>>>>
>>>>>> This, however resulted in mayhem - I was unable to start any
>>>>>> program with an input-output error, etc. I used SysRQ + C to
>>>>>> save a kernel log:
>>>>>>
>>>> Hello,
>>>>
>>>> Unfortunately, hardware failure seems not to be the case.
>>>>
>>>> I did test it again on 6.10, twice, and in both cases I got
>>>> filesystem corruption (but not as severe).
>>>>
>>>> On Linux 6.1.96 it seems to be working well (also did two tries).
>>>>
>>>> Please note: in my tests, I was using a RAID component device with
>>>> a write-mostly bit set. This setup does not work on 6.9+ out of the
>>>> box and requires the following patch:
>>>>
>>>> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
>>>> return from choose_slow_rdev()")
>>>>
>>>> that is in master now.
>>>>
>>>> It is also heading into stable, which I'm going to interrupt.
Hello,
With much effort (challenging to reproduce reliably) I think have nailed down the issue to the read_balance refactoring series in 6.9:
86b1e613eb3b Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block e81faa91a580 Merge branch 'raid1-read_balance' into md-6.9 0091c5a269ec md/raid1:
factor out helpers to choose the best rdev from read_balance() ba58f57fdf98 md/raid1: factor out the code to manage sequential IO 9f3ced792203 md/raid1: factor out choose_bb_rdev() from read_balance()
dfa8ecd167c1 md/raid1: factor out choose_slow_rdev() from read_balance() 31a73331752d md/raid1: factor out read_first_rdev() from read_balance() f10920762955 md/raid1-10: factor out a new helper
raid1_should_read_first() f29841ff3b27 md/raid1-10: add a helper raid1_check_read_range() 257ac239ffcf md/raid1: fix choose next idle in read_balance() 2c27d09d3a76 md/raid1: record nonrot rdevs while
adding/removing rdevs to conf 969d6589abcb md/raid1: factor out helpers to add rdev to conf 3a0f007b6979 md: add a new helper rdev_has_badblock()
In particular, 86b1e613eb3b is definitely bad, and 13fe8e6825e4 is 95% good.
I was testing with the following two commits on top of the series to make this setup work for me:
commit 36a5c03f23271 ("md/raid1: set max_sectors during early return from choose_slow_rdev()") commit b561ea56a264 ("block: allow device to have both virt_boundary_mask and max segment size")
After code analysis, I have noticed that the following check that was present in old
read_balance() is not present (in equivalent form in the new code):
if (!test_bit(In_sync, &rdev->flags) &&
rdev->recovery_offset < this_sector + sectors)
continue;
(in choose_slow_rdev() and choose_first_rdev() and possibly other functions)
which would cause the kernel to read from the device being synced to before
it is ready.
In my debug patch (I'll send in a while), I have copied the check to raid1_check_read_range and it seems
that the problems do not happen any longer with it.
I'm not so sure now that this bug is limited to write-mostly though - previous tests may have been unreliable.
#regzbot introduced: 13fe8e6825e4..86b1e613eb3b
#regzbot monitor: https://lore.kernel.org/lkml/20240724141906.10b4fc4e@peluse-desk5/T/#m671d6d3a7eda44d39d0882864a98824f52c52917
Greetings,
Mateusz
next prev parent reply other threads:[~2024-07-28 10:30 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-20 14:47 Filesystem corruption when adding a new device (delayed-resync, write-mostly) Mateusz Jończyk
2024-07-22 5:39 ` Mateusz Jończyk
2024-07-24 20:35 ` Filesystem corruption when adding a new RAID " Mateusz Jończyk
2024-07-24 21:19 ` Paul E Luse
2024-07-25 7:15 ` Mateusz Jończyk
2024-07-25 14:27 ` Paul E Luse
2024-07-28 10:30 ` Mateusz Jończyk [this message]
2024-07-30 20:35 ` [REGRESSION] " Mateusz Jończyk
2024-07-31 1:10 ` Yu Kuai
2024-07-28 10:36 ` [PATCH] [DEBUG] md/raid1: check recovery_offset in raid1_check_read_range Mateusz Jończyk
2024-07-29 1:30 ` Yu Kuai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=713ce46b-8751-49fb-b61f-2eb3e19459dc@o2.pl \
--to=mat.jonczyk@o2.pl \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=mariusz.tkaczyk@linux.intel.com \
--cc=paul.e.luse@linux.intel.com \
--cc=regressions@lists.linux.dev \
--cc=song@kernel.org \
--cc=yukuai3@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox