Linux RAID subsystem development
From: Paul E Luse <paul.e.luse@linux.intel.com>
To: "Mateusz Jończyk" <mat.jonczyk@o2.pl>
Cc: Yu Kuai <yukuai3@huawei.com>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	Song Liu <song@kernel.org>,
	regressions@lists.linux.dev
Subject: Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)
Date: Thu, 25 Jul 2024 07:27:42 -0700
Message-ID: <20240725072742.1664beec@peluse-desk5>
In-Reply-To: <2123BF84-5F16-4938-915B-B1EE0931AC03@o2.pl>

On Thu, 25 Jul 2024 09:15:40 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:

> On 24 July 2024 23:19:06 CEST, Paul E Luse
> <paul.e.luse@linux.intel.com> wrote:
> >On Wed, 24 Jul 2024 22:35:49 +0200
> >Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> >
> >> On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
> >> > On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
> >> >> Hello,
> >> >>
> >> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
> >> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
> >> >> for remaining data (LUKS + LVM + ext4). For performance, I have
> >> >> marked the RAID component device for /dev/md1 on the SATA SSD
> >> >> drive write-mostly, which "means that the 'md' driver will avoid
> >> >> reading from these devices if at all possible" (man mdadm).
> >> >>
> >> >> Recently, the NVMe drive started having problems (PCI AER errors
> >> >> and the controller disappearing), so I removed it from the
> >> >> arrays and wiped it. However, I have reseated the drive in the
> >> >> M.2 socket and this apparently fixed it (verified with tests).
> >> >>
> >> >>     $ cat /proc/mdstat
> >> >>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> >> >>     md1 : active raid1 sdb5[1](W)
> >> >>           471727104 blocks super 1.2 [2/1] [_U]
> >> >>           bitmap: 4/4 pages [16KB], 65536KB chunk
> >> >>
> >> >>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> >> >>           3142656 blocks super 1.2 [2/2] [UU]
> >> >>           bitmap: 0/1 pages [0KB], 65536KB chunk
> >> >>
> >> >>     md0 : active raid1 sdb4[3]
> >> >>           2094080 blocks super 1.2 [2/1] [_U]
> >> >>          
> >> >>     unused devices: <none>
> >> >>
> >> >> (md2 was used just for testing, ignore it).
> >> >>
> >> >> Today, I have tried to add the drive back to the arrays by
> >> >> using a script that executed in quick succession:
> >> >>
> >> >>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> >> >>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
> >> >>
> >> >> This was on Linux 6.10.0, patched with my previous patch:
> >> >>
> >> >>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@o2.pl/
> >> >>
> >> >> (which fixed a regression in the kernel and allows it to start
> >> >> /dev/md1 with a single drive in write-mostly mode).
> >> >> In the background, I was running "rdiff-backup --compare" that
> >> >> was comparing data between my array contents and a backup
> >> >> attached via USB.
> >> >>
> >> >> This, however, resulted in mayhem: I was unable to start any
> >> >> program (they failed with input/output errors), etc. I used
> >> >> SysRq + C to save a kernel log:
> >> >>
> >> > Hello,
> >> >
> >> > It is possible that my second SSD has some problems and that high
> >> > read activity during the RAID resync triggered them. Reads from
> >> > that drive are now very slow (between 10 and 30 MB/s), which
> >> > suggests that something is not OK.
> >> 
> >> Hello,
> >> 
> >> Unfortunately, hardware failure seems not to be the case.
> >> 
> >> I did test it again on 6.10, twice, and in both cases I got
> >> filesystem corruption (but not as severe).
> >> 
> >> On Linux 6.1.96 it seems to be working well (also did two tries).
> >> 
> >> Please note: in my tests, I was using a RAID component device with
> >> a write-mostly bit set. This setup does not work on 6.9+ out of the
> >> box and requires the following patch:
> >> 
> >> commit 36a5c03f23271 ("md/raid1: set max_sectors during early
> >> return from choose_slow_rdev()")
> >> 
> >> that is in master now.
> >> 
> >> It is also heading into stable, which I'm going to interrupt.
> >
> >Hi Mateusz,
> >
> >I'm pretty interested in what is happening here especially as it
> >relates to write-mostly.  Couple of questions for you:
> >
> >1) Are you able to find a simpler reproduction for this, for example
> >without mixing SATA and NVMe?  Maybe just use two known-good NVMe
> >SSDs and follow your steps to repro?
> 
> Hello,
> 
> Well, I have three drives in my laptop: NVMe, SATA SSD (in the DVD
> bay) and SATA HDD (platter). I could do tests on top of these two
> SATA drives. But maybe it would be easier for me to bisect (or
> guess-bisect) in the current setup, I haven't made up my mind yet.
> 

OK, thanks.

> >
> >2) I don't fully understand your last two statements, maybe you can
> >clarify?  With your max_sectors patch does it pass or fail?  If pass,
> >what do you mean by "I'm going to interrupt"? It sounds like you mean
> >the patch doesn't work and you are trying to stop it??
> 
> Without this patch I wouldn't be able to do the tests. Without it,
> degraded RAID1 with a single drive in write-mostly mode doesn’t start
> at all.
> 
> With my last statement I meant that I was going to stop this patch
> from going to stable kernels. At this point, it doesn’t seem to me
> that my patch is the direct cause of the problems or that I missed
> something. However, I think that it is currently better to fail this
> setup outright rather than risk somebody's data.
> 

OK, I would say please do not try to stop the patch. It is a good fix
and, although it may not completely solve your problem, it should land.

Unless Kuai has another opinion.

-Paul

> I have made further tests:
> 
> - vanilla 6.8.0 with a write-mostly drive works correctly,
> 
> - vanilla 6.10-rc6 without the write-mostly bit set also works
> correctly.
> 
> So it seems that the problem happens only with the write-mostly mode
> and after 6.8.0.
> 
> Greetings,
> 
> Mateusz
> 
> 
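The array states discussed above can be checked mechanically before and
after each test run. What follows is a minimal sketch, assuming the
/proc/mdstat layout quoted earlier in this thread; the names
parse_mdstat and only_write_mostly_left are hypothetical helpers, not
part of mdadm or the kernel. It flags arrays that are degraded and whose
only remaining members are write-mostly, which is the configuration
reported here as problematic on kernels after 6.8.0.

```python
import re

# Sample copied from the /proc/mdstat output quoted earlier in the thread.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb5[1](W)
      471727104 blocks super 1.2 [2/1] [_U]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
      3142656 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sdb4[3]
      2094080 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")
# Member entries look like "sdb5[1](W)": name, slot number, optional flags.
MEMBER_RE = re.compile(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)")
# "[2/1]" means 2 configured devices, 1 currently active.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]")


def parse_mdstat(text):
    """Map each array name to its members, write-mostly members, and degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            members = MEMBER_RE.findall(header.group(2))
            arrays[current] = {
                "members": [name for name, _ in members],
                "write_mostly": [n for n, flags in members if "(W)" in flags],
                "degraded": False,
            }
            continue
        status = STATUS_RE.search(line)
        if current and status and "blocks" in line:
            configured, active = int(status.group(1)), int(status.group(2))
            arrays[current]["degraded"] = active < configured
    return arrays


def only_write_mostly_left(arrays):
    """Names of degraded arrays in which every remaining member is write-mostly."""
    return sorted(
        name for name, info in arrays.items()
        if info["degraded"] and info["members"]
        and info["members"] == info["write_mostly"]
    )


if __name__ == "__main__":
    print(only_write_mostly_left(parse_mdstat(SAMPLE)))
```

On a live system, the text of /proc/mdstat could be passed in place of
SAMPLE; the sketch only reads state and does not modify any array.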

