* Possible issue with software RAID 1 in case of disks with different speed
@ 2025-12-07 14:41 Christian Focke-Kiss
2025-12-07 20:41 ` John Stoffel
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Christian Focke-Kiss @ 2025-12-07 14:41 UTC (permalink / raw)
To: linux-raid@vger.kernel.org; +Cc: yukuai3@huawei.com, song@kernel.org
Hello,
I have been using Debian stable for years (currently, release 13; before,
12, 11, 10, etc., starting with 3.0, if my long-time backups are complete).
For some years now, I have been using software RAID with two RAID 1 sets with
three 8TB disks each, one for my production data (/home, /var/lib,
/var/www, etc.) and the second one for a nightly backup of my
production data (three disks each because HDDs failed quite frequently
then, and recovery was slow).
Initially, I used external disks connected mainly via USB 3.x, and also
via USB 2.0, when I ran out of USB ports. Reason for USB was that I
couldn't/wouldn't afford a server with slots for accommodating six 3.5"
HDDs.
After migrating the failed 3.5" HDDs to 2.5" SSDs one-by-one over the
years, I finally migrated five disks into a rack-mount server and
connected the sixth disk via USB 3.x (I couldn't migrate all six disks
because one slot was still occupied by the boot disk).
I run nightly rsync jobs ('managed' by Back In Time) to back up
everything from /dev/md0 (boot) to /dev/md1 (production) and then to
/dev/md2 (backup).
Now, as long as the sixth disk was still connected via USB 3.x, the
rsync job and a kworker job were 'blocked' after some hours of rsync-
ing, and the console displayed some 'sync' errors, and I had to press
Ctrl+Alt+Del to reboot the system because login didn't work anymore.
I flagged the USB 3.x disk 'write-mostly' and 'nofailfast' but this
didn't resolve the issue.
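Whether the 'write-mostly' flag actually took effect can be checked in /proc/mdstat, where md marks such members with "(W)". The following is a minimal Python sketch that lists the write-mostly members per array. The sample text and all device names in it are made up for illustration; on a real system one would read /proc/mdstat instead.

```python
import re

# Hypothetical sample in the format of /proc/mdstat; device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sdf1[2](W) sde1[1] sdd1[0]
      7813894144 blocks super 1.2 [3/3] [UUU]

md1 : active raid1 sdc1[2] sdb1[1] sda1[0]
      7813894144 blocks super 1.2 [3/3] [UUU]

unused devices: <none>
"""

# A member looks like "sdf1[2]" optionally followed by flags such as "(W)".
MEMBER_RE = re.compile(r"(\S+?)\[(\d+)\]((?:\([A-Z]\))*)")


def write_mostly_members(mdstat_text):
    """Map each md array name to the list of members flagged (W)."""
    result = {}
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\S*) : (.*)$", line)
        if not m:
            continue  # not an array status line
        name, rest = m.groups()
        result[name] = [
            dev for dev, _slot, flags in MEMBER_RE.findall(rest) if "(W)" in flags
        ]
    return result


if __name__ == "__main__":
    for array, members in write_mostly_members(SAMPLE_MDSTAT).items():
        print(array, members)
```

To use it on a live system, pass the contents of /proc/mdstat to write_mostly_members() instead of the sample string.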
Only after I added two NVMe SSDs as boot disks and migrated the sixth
SSD into the sixth slot, everything runs fine.
Conclusion:
I suspect software RAID 1 has issues if one disk of a three disk RAID 1
set is significantly slower than the other two disks.
Everything works fine for me now, but in case ...
Kind regards, Christian
* Re: Possible issue with software RAID 1 in case of disks with different speed
From: John Stoffel @ 2025-12-07 20:41 UTC (permalink / raw)
To: Christian Focke-Kiss
Cc: linux-raid@vger.kernel.org, yukuai3@huawei.com, song@kernel.org
>>>>> "Christian" == Christian Focke-Kiss <christian.focke-kiss@mail.fernfh.ac.at> writes:
> I am using Debian stable for years (currently, release 13; before, 12,
> 11, 10, etc. starting with 3.0, if my long-time backups are complete).
Same here, though with different backup tools
> For some years now, I am using software RAID with two RAID 1 sets with
> three 8TB disks each, one for my production data (/home, /var/lib,
> /var/www, etc.) and the second one for a nightly backup of my
> production data (three disks each because HDDs failed quite frequently
> then, and recovery was slow).
I like this, three way mirrors are cheap insurance.
> Initially, I used external disks connected mainly via USB 3.x, and also
> via USB 2.0, when I ran out of USB ports. Reason for USB was that I
> couldn't/wouldn't afford a server with slots for accommodating six 3.5"
> HDDs.
There's the problem, USB connections are notoriously crappy. It would
be better to go with eSATA or some other more reliable transport.
> After migrating the failed 3.5" HDDs to 2.5" SSDs one-by-one over
> the years, I finally migrated five disks into a rack-mount server
> and connected the sixth disk via USB 3.x (I couldn't migrate all six
> disks because one slot was still occupied by the boot disk).
> I run nightly rsync jobs ('managed' by Back In Time) to backup
> everything from /dev/md0 (boot) to /dev/md1 (production) and then to
> /dev/md2 (backup).
> Now, as long as the sixth disk was still connected via USB 3.x, the
> rsync job and a kworker job were 'blocked' after some hours of
> rsync-ing, and the console displayed some 'sync' errors, and I had
> to press Ctrl+Alt+Del to reboot the system because login didn't work
> anymore.
What did you see in the logs? I.e., did you keep the logs, or maybe
send them to another system (if possible) or to the console, which
you could then switch the display to, so you might have a chance of
looking at things?
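As a rough aid for that kind of log review, here is a minimal Python sketch that pulls the kernel's hung-task warnings ("blocked for more than N seconds") out of saved log text. The sample lines, task names, and PIDs are made up for illustration. On a systemd machine with a persistent journal, the input could come from something like `journalctl -k -b -1` (kernel messages of the previous boot), which is an assumption about the setup and not something reported in this thread.

```python
import re

# Matches the kernel hung-task warning, e.g.
# "INFO: task rsync:4321 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")

# Hypothetical sample log text; task names, PIDs and timestamps are made up.
SAMPLE_LOG = """\
[ 7250.112233] INFO: task rsync:4321 blocked for more than 120 seconds.
[ 7250.112250] INFO: task kworker/u16:3:1234 blocked for more than 120 seconds.
[ 7251.000001] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
"""


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for every hung-task warning found."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return hits


if __name__ == "__main__":
    for name, pid, seconds in find_hung_tasks(SAMPLE_LOG):
        print(f"{name} (pid {pid}) blocked for more than {seconds}s")
```

The surrounding log lines (USB resets, I/O errors) are usually what explains the hang, so the matches are best treated as pointers into the log and not as a diagnosis.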
> I flagged the USB 3.x disk 'write-mostly' and 'nofailfast' but this
> didn't resolve the issue.
I suspect your external USB enclosure is crap. I've never found a
good one in my experience. It's too prone to losing connectivity, or
having a crappy controller chip which just doesn't handle long term
writes well, etc.
> Only after I added two NVMe SSDs as boot disks and migrated the
> sixth SSD into the sixth slot, everything runs fine.
It's the USB connection, not the disk or RAID.
> Conclusion:
> I suspect software RAID 1 has issues if one disk of a three disk RAID 1
> set is significantly slower than the other two disks.
Logs? Warnings?
Glad to hear it's up, but in the future, just use eSATA or regular
SATA in your case. I've got some old big cases with 8 drive bays for
3.5" disks and it works great. I've thought about a server case with
2.5" disks that could be hot-swapped, but in reality, for a home
system, I don't need five-nines uptime; I can handle downtime.
But thanks for the anecdote!
John
* Re: Possible issue with software RAID 1 in case of disks with different speed
From: Roger Heflin @ 2025-12-07 22:00 UTC (permalink / raw)
To: Christian Focke-Kiss
Cc: linux-raid@vger.kernel.org, yukuai3@huawei.com, song@kernel.org
On Sun, Dec 7, 2025 at 8:41 AM Christian Focke-Kiss
<christian.focke-kiss@mail.fernfh.ac.at> wrote:
>
> Hello,
>
>
> Now, as long as the sixth disk was still connected via USB 3.x, the
> rsync job and a kworker job were 'blocked' after some hours of rsync-
> ing, and the console displayed some 'sync' errors, and I had to press
> Ctrl+Alt+Del to reboot the system because login didn't work anymore.
>
> I flagged the USB 3.x disk 'write-mostly' and 'nofailfast' but this
> didn't resolve the issue.
>
> Only after I added two NVMe SSDs as boot disks and migrated the sixth
> SSD into the sixth slot, everything runs fine.
>
> Conclusion:
> I suspect software RAID 1 has issues if one disk of a three disk RAID 1
> set is significantly slower than the other two disks.
>
> Everything works fine for me now, but in case ...
>
> Kind regards, Christian
I have seen enterprise-grade arrays/controllers (SAN) that "defeated"
multipathd (older kernels, device-mapper, same layer used by mdraid).
The way the "enterprise" array defeated multipath was that the array
controller was programmed to answer a TUR (Test Unit Ready) whenever it
was alive and "believed" the "disk" it was managing was OK via its
paths, even when that "disk" was not really working.
The issue is that some of the lower layers (the SCSI layer used to, and
may still, use a TUR as a health check) send a TUR when a timeout
occurs, and if the TUR response comes back, they retry the I/O. Under
the right conditions this can get stuck in a loop and never fail the
I/O, even though I/O is not functioning, because the TUR always works.
I have seen multipath layers neither process I/O nor time out/fail I/O
because of this, for minutes to hours (often no timeout is ever
reported in the logs because the TUR keeps working). I had a
discussion with the enterprise RAID people, suggesting that during a
failover the failing controller should stop responding to TURs first
and then stop processing I/O, instead of only stopping TURs when the
controller was rebooted. It is quite possible that a USB controller
could be doing the exact same thing (short-circuiting and answering
the TUR itself, i.e. not a TUR response from the actual disk) and
defeating the timeout/failure code in MD RAID even though the disk
itself is not responding.
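The retry loop described above can be illustrated with a toy model in Python. This is only a sketch of the described behaviour, not the kernel's actual SCSI error handling; submit_io(), its parameters, and the budget cap are hypothetical names introduced for illustration.

```python
def submit_io(io_works, tur_works, max_retries=None, budget=1000):
    """Toy model: on an I/O timeout, send a TUR; if it answers, retry the I/O.

    io_works / tur_works are callables returning True or False.
    budget only caps the simulation so that it terminates; a result of
    "stuck" means the loop would have kept going.
    """
    attempts = 0
    while attempts < budget:
        attempts += 1
        if io_works():
            return "completed", attempts
        # The I/O timed out: use a TUR as the health check.
        if not tur_works():
            return "failed", attempts  # device declared dead, upper layers can react
        if max_retries is not None and attempts >= max_retries:
            return "failed", attempts  # give up despite the healthy-looking TUR
        # TUR succeeded, so the I/O is retried.
    return "stuck", attempts


if __name__ == "__main__":
    # Bridge answers the TUR itself while the disk behind it is dead.
    print(submit_io(lambda: False, lambda: True, budget=50))
    # TUR stops answering too: the failure is reported immediately.
    print(submit_io(lambda: False, lambda: False))
    # A retry limit breaks the loop even though the TUR keeps working.
    print(submit_io(lambda: False, lambda: True, max_retries=5))
```

In the first case nothing ever reports a failure, which matches the symptom of blocked tasks with no timeout in the logs. The other two cases show the two ways out: the TUR failing, or a retry limit that does not trust the TUR.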
* Re: Possible issue with software RAID 1 in case of disks with different speed
From: Andy Smith @ 2025-12-11 0:25 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
Hi,
On Sun, Dec 07, 2025 at 02:41:37PM +0000, Christian Focke-Kiss wrote:
> Conclusion:
> I suspect software RAID 1 has issues if one disk of a three disk RAID 1
> set is significantly slower than the other two disks.
As others have mentioned I suspect your problem is USB. At no point in
the last 25 years have I found it a reliable way to have permanently
attached storage.
Storage with wildly different latency isn't a great setup but I haven't
found MD RAID-1 to have a big problem with it, not to the point of
instability.
write-mostly hasn't tended to have a huge effect for me unless the two
devices are radically different in performance. MD RAID already sends
reads to the device with the lowest amount of pending IO so all else
being equal, if you pair an SSD with an HDD, the SSD will get more reads
because they will complete sooner. Though do note that with RAID-1 a
single sequential read will all come from one device. It is only when
there are multiple threads reading that balancing can take place.
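The effect of choosing the device with the least pending I/O can be sketched with a small deterministic simulation in Python. This is a toy model under stated assumptions, not MD's actual read-balancing code: the device names, the per-tick completion rates, and simulate_reads() itself are made up for illustration, and the single-sequential-reader case is not modelled.

```python
def simulate_reads(completions_per_tick, arrivals_per_tick, ticks):
    """Dispatch each new read to the device with the fewest pending reads.

    completions_per_tick maps a device name to how many reads it can
    finish per tick (a stand-in for device speed). Returns how many
    reads were dispatched to each device.
    """
    pending = {dev: 0 for dev in completions_per_tick}
    dispatched = {dev: 0 for dev in completions_per_tick}
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            # Least pending I/O wins; ties go to the first device listed.
            dev = min(pending, key=pending.get)
            pending[dev] += 1
            dispatched[dev] += 1
        for dev, rate in completions_per_tick.items():
            # Each device finishes up to `rate` reads per tick.
            pending[dev] = max(0, pending[dev] - rate)
    return dispatched


if __name__ == "__main__":
    # Hypothetical speeds: the SSD finishes four reads per tick, the HDD one.
    print(simulate_reads({"hdd": 1, "ssd": 4}, arrivals_per_tick=4, ticks=100))
```

With several reads arriving per tick (multiple readers), the faster device drains its queue sooner and therefore attracts most of the new reads, without any write-mostly flag being set.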
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting