public inbox for linux-raid@vger.kernel.org
* WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
@ 2025-11-25 14:42 Justin Piszcz
  2025-11-25 14:48 ` Dr. David Alan Gilbert
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Justin Piszcz @ 2025-11-25 14:42 UTC (permalink / raw)
  To: LKML, linux-nvme, linux-raid, Btrfs BTRFS

Hello,

Issue/Summary:
1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
drop out of the NAS array; after power cycling the device, it rebuilds
successfully.

Details:
0. I use an NVMe NAS (FS6712X) with WD Red SN700 4TB drives (WDS400T1R0C).
1. Ever since I installed the drives, a random drive has dropped
offline every month or so, almost always when the system is idle.
2. I have troubleshot this with Asustor and WD/SanDisk.
3. Asustor noted that other users with the same configuration have run
into this problem.
4. When troubleshooting with WD/SanDisk, I was told my main option is
to replace the drive, even though the issue occurs across nearly all
of the drives.
5. The drives' firmware is currently up to date according to the WD
Dashboard (checked by removing the drives and attaching them to
another system).
6. As for the device/filesystem, the FS6712X is configured as an
MD-RAID6 device with BTRFS on top of it.
7. The "workaround" is to power cycle the FS6712X; when it boots back
up, the MD-RAID6 array re-syncs to a healthy state.

I am using the latest Asustor ADM OS, which uses the 6.6.x kernel:
1. Linux FS6712X-EB92 6.6.x #1 SMP PREEMPT_DYNAMIC Tue Nov  4 00:53:39
CST 2025 x86_64 GNU/Linux

Questions:
1. Have others experienced this failure scenario?
2. Are there identified workarounds for this issue other than power
cycling the device when it happens?
3. Are there any debug options that could be enabled to help pinpoint
the root cause? (One idea is sketched after this list.)
4. Within the BIOS settings (shown starting at 2:18 in the video
below), there are some advanced options; could a power-saving feature
or other setting be modified to address this issue?
4a. https://www.youtube.com/watch?v=YytWFtgqVy0
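
Regarding question 3: one avenue I have not yet tried (an assumption
on my part, and it requires the Asustor kernel to have been built with
CONFIG_DYNAMIC_DEBUG) is enabling the NVMe driver's dynamic debug
output at runtime:

    # Mount debugfs if it is not already mounted.
    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    # Enable pr_debug() output from the NVMe core and PCIe transport drivers.
    echo 'module nvme_core +p' > /sys/kernel/debug/dynamic_debug/control
    echo 'module nvme +p' > /sys/kernel/debug/dynamic_debug/control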

[1] The last failures have been at random times on the following days:
1. August 27, 2025
2. September 19, 2025
3. September 29, 2025
4. October 28, 2025
5. November 24, 2025

Chipset being used:
1. ASMedia Technology Inc. ASM2806 4-Port PCIe x2 Gen3 Packet Switch

Details:

1. August 27, 2025
[1156824.598513] nvme nvme2: I/O 5 QID 0 timeout, reset controller
[1156896.035355] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1156906.057936] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1158185.737571] md/raid:md1: Disk failure on nvme2n1p4, disabling device.
[1158185.744188] md/raid:md1: Operation continuing on 11 devices.

2. September 19, 2025
[2001664.727044] nvme nvme9: I/O 26 QID 0 timeout, reset controller
[2001736.159123] nvme nvme9: Device not ready; aborting reset, CSTS=0x1
[2001746.180813] nvme nvme9: Device not ready; aborting reset, CSTS=0x1
[2002368.631788] md/raid:md1: Disk failure on nvme9n1p4, disabling device.
[2002368.638414] md/raid:md1: Operation continuing on 11 devices.
[2003213.517965] md/raid1:md0: Disk failure on nvme9n1p2, disabling device.
[2003213.517965] md/raid1:md0: Operation continuing on 11 devices.

3. September 29, 2025
[858305.408049] nvme nvme3: I/O 8 QID 0 timeout, reset controller
[858376.843140] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[858386.865240] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[858386.883053] md/raid:md1: Disk failure on nvme3n1p4, disabling device.
[858386.889586] md/raid:md1: Operation continuing on 11 devices.

4. October 28, 2025
[502963.821407] nvme nvme4: I/O 0 QID 0 timeout, reset controller
[503035.257391] nvme nvme4: Device not ready; aborting reset, CSTS=0x1
[503045.282923] nvme nvme4: Device not ready; aborting reset, CSTS=0x1
[503142.226962] md/raid:md1: Disk failure on nvme4n1p4, disabling device.
[503142.233496] md/raid:md1: Operation continuing on 11 devices.

5. November 24, 2025
[1658454.034633] nvme nvme2: I/O 24 QID 0 timeout, reset controller
[1658525.470287] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1658535.491803] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[1658535.517638] md/raid1:md0: Disk failure on nvme2n1p2, disabling device.
[1658535.517638] md/raid1:md0: Operation continuing on 11 devices.
[1659258.368386] md/raid:md1: Disk failure on nvme2n1p4, disabling device.
[1659258.375012] md/raid:md1: Operation continuing on 11 devices.


Justin


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:42 WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1) Justin Piszcz
@ 2025-11-25 14:48 ` Dr. David Alan Gilbert
  2025-12-01 14:09   ` Justin Piszcz
  2025-11-25 14:56 ` Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Dr. David Alan Gilbert @ 2025-11-25 14:48 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

* Justin Piszcz (jpiszcz@lucidpixels.com) wrote:
> Hello,
> 
> Issue/Summary:
> 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
> drop out of the NAS array; after power cycling the device, it rebuilds
> successfully.
> 
> Questions:

> 3. Are there any debug options that could be enabled to help pinpoint
> the root cause?

Have you tried using the 'nvme' command to see if the drives have
anything in their SMART or error logs?
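
For example (a sketch assuming nvme-cli is installed; "nvme2" is
illustrative, substitute whichever controller dropped out):

    nvme smart-log /dev/nvme2    # health / SMART attributes
    nvme error-log /dev/nvme2    # controller error-log entries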

(I don't know any more than to suggest looking at those logs to see
if the drive is showing errors, or what it says during the timeouts,
so I'll leave it to others to dig deeper.)

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:42 WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1) Justin Piszcz
  2025-11-25 14:48 ` Dr. David Alan Gilbert
@ 2025-11-25 14:56 ` Keith Busch
  2025-12-01 14:13   ` Justin Piszcz
  2025-11-25 15:19 ` Dragan Milivojević
  2025-11-25 18:25 ` Wol
  3 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2025-11-25 14:56 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 09:42:11AM -0500, Justin Piszcz wrote:
> I am using the latest Asus ADM/OS which uses the 6.6.x kernel:

It may be a long shot, but there is an update in 6.17 that attempts to
restart the device after a PCI function-level reset when we detect it's
stuck in an NVMe-level reset. For some devices, that's sufficient to get
it operational again, but it doesn't always work.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:42 WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1) Justin Piszcz
  2025-11-25 14:48 ` Dr. David Alan Gilbert
  2025-11-25 14:56 ` Keith Busch
@ 2025-11-25 15:19 ` Dragan Milivojević
  2025-11-25 16:57   ` Paul Rolland
                     ` (2 more replies)
  2025-11-25 18:25 ` Wol
  3 siblings, 3 replies; 16+ messages in thread
From: Dragan Milivojević @ 2025-11-25 15:19 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: LKML, linux-nvme, linux-raid

> Issue/Summary:
> 1. Usually once a month, a random WD Red SN700 4TB NVME drive will
> drop out of a NAS array, after power cycling the device, it rebuilds
> successfully.
>

Seen the same, although far less frequently, with a Samsung SSD 980 PRO
on a Dell PowerEdge R7525.
It's the nature of consumer-grade drives, I guess.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 15:19 ` Dragan Milivojević
@ 2025-11-25 16:57   ` Paul Rolland
  2025-12-08 13:37     ` Sinisa
  2025-11-25 22:37   ` Jani Partanen
  2025-12-01 14:14   ` Justin Piszcz
  2 siblings, 1 reply; 16+ messages in thread
From: Paul Rolland @ 2025-11-25 16:57 UTC (permalink / raw)
  To: Dragan Milivojević; +Cc: LKML, linux-nvme, linux-raid

Hello,

On Tue, 25 Nov 2025 16:19:27 +0100
Dragan Milivojević <galileo@pkm-inc.com> wrote:

> > Issue/Summary:
> > 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
> > drop out of the NAS array; after power cycling the device, it rebuilds
> > successfully.
> >
> 
> Seen the same, although far less frequently, with a Samsung SSD 980 PRO
> on a Dell PowerEdge R7525.
> It's the nature of consumer-grade drives, I guess.
> 

I ran into a similar issue a long time ago, and booted the kernel with:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

That fixed the issue with an SN700 2TB.
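
(For what it's worth: the first parameter effectively disables NVMe
APST power-state transitions, and the other two disable PCIe link and
port power management. To make them persistent on a GRUB-based system,
a sketch, assuming the machine uses GRUB at all:

    # /etc/default/grub: add the parameters to the kernel command line.
    GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"

then regenerate the config with update-grub, or
grub2-mkconfig -o /boot/grub2/grub.cfg, and reboot.)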

Regards,
Paul


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:42 WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1) Justin Piszcz
                   ` (2 preceding siblings ...)
  2025-11-25 15:19 ` Dragan Milivojević
@ 2025-11-25 18:25 ` Wol
  2025-11-26  0:15   ` David Sterba
  2025-12-01 14:16   ` Justin Piszcz
  3 siblings, 2 replies; 16+ messages in thread
From: Wol @ 2025-11-25 18:25 UTC (permalink / raw)
  To: Justin Piszcz, LKML, linux-nvme, linux-raid, Btrfs BTRFS

Probably not the problem, but how old are the drives? About 2020, WD 
started shingling the Red line (you had to move to Red Pro to get 
conventional drives). Shingled is bad news for Linux RAID, but the fact
that your drives tend to drop out when idling makes it unlikely this is
the problem.

Cheers,
Wol

On 25/11/2025 14:42, Justin Piszcz wrote:
> Hello,
> 
> Issue/Summary:
> 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
> drop out of the NAS array; after power cycling the device, it rebuilds
> successfully.
> 



* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 15:19 ` Dragan Milivojević
  2025-11-25 16:57   ` Paul Rolland
@ 2025-11-25 22:37   ` Jani Partanen
  2025-12-01 14:14   ` Justin Piszcz
  2 siblings, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2025-11-25 22:37 UTC (permalink / raw)
  To: Dragan Milivojević, Justin Piszcz; +Cc: LKML, linux-nvme, linux-raid

On 25/11/2025 17.19, Dragan Milivojević wrote:
>> Issue/Summary:
>> 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
>> drop out of the NAS array; after power cycling the device, it rebuilds
>> successfully.
>>
> Seen the same, although far less frequently, with a Samsung SSD 980 PRO
> on a Dell PowerEdge R7525.
> It's the nature of consumer-grade drives, I guess.

I don't know if this WD issue is the same one I had with a WD_BLACK
SN770 2TB drive, but I'll share what I know anyway.

That drive cannot handle the 4K LBA format. It works fine with
512-byte LBAs, but with 4K it will die sooner or later, usually when
you hit it with heavy I/O.
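
For anyone who wants to check which LBA format a drive is currently
using, a sketch assuming nvme-cli (the device name is illustrative,
and note that "nvme format" erases the namespace):

    # List the supported LBA formats; the active one is marked "(in use)".
    nvme id-ns /dev/nvme0n1 -H | grep 'LBA Format'

    # DESTRUCTIVE: switch to LBA format index 0 (often the 512-byte one).
    nvme format /dev/nvme0n1 --lbaf=0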

WD has known about this for years and hasn't done anything. So I
think it is safe to say it's a hardware issue that cannot be fixed
with firmware, other than by disabling the ability to switch to 4K
LBAs at all, but they haven't done that either, which in turn tells me
how much they care. The good thing is that there are other brands.

There is a lot of talk about that issue here:
https://github.com/openzfs/zfs/discussions/14793

I think some other WD drives also suffer from this same issue.

And this has nothing to do with Linux or ZFS; it's purely a WD drive
issue. I originally hit it on Windows, then moved the drive to Linux
to test it out, and the same thing happened there.


// Jani




* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 18:25 ` Wol
@ 2025-11-26  0:15   ` David Sterba
  2025-11-26 12:51     ` Roger Heflin
  2025-12-01 14:16   ` Justin Piszcz
  1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2025-11-26  0:15 UTC (permalink / raw)
  To: Wol; +Cc: Justin Piszcz, LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 06:25:41PM +0000, Wol wrote:
> Probably not the problem, but how old are the drives? About 2020, WD 
> started shingling the Red line (you had to move to Red Pro to get 
> conventional drives).

The WD SN700 is an NVMe drive, though one can imagine how they could
actually be shingled, with slightly tilted overlapping sockets.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-26  0:15   ` David Sterba
@ 2025-11-26 12:51     ` Roger Heflin
  0 siblings, 0 replies; 16+ messages in thread
From: Roger Heflin @ 2025-11-26 12:51 UTC (permalink / raw)
  To: dsterba; +Cc: Wol, Justin Piszcz, LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 6:15 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Nov 25, 2025 at 06:25:41PM +0000, Wol wrote:
> > Probably not the problem, but how old are the drives? About 2020, WD
> > started shingling the Red line (you had to move to Red Pro to get
> > conventional drives).
>
> The WD SN700 is an NVMe, though one can imagine how they could be
> actually shingled with slightly tilted overlapping sockets.
>

You have to get a Red Plus or Red Pro in spinning disks to get
conventional (CMR) recording.

On the NVMe: I would check for a firmware update. I have seen SSDs
from several manufacturers that had timed housekeeping processes where
something went badly wrong and locked the drive up. Some of those
issues lock the drive up until a reset, and then when the housekeeping
fires again (hours or days after the reset) it locks up again; some of
those firmware bugs brick the drive. If it is timed housekeeping, it
would likely fire at some set time after you powered the drive up.
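
To confirm the running firmware revision without pulling the drives, a
sketch assuming nvme-cli is available (device name illustrative):

    nvme fw-log /dev/nvme0    # firmware slot info, incl. the active revision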


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:48 ` Dr. David Alan Gilbert
@ 2025-12-01 14:09   ` Justin Piszcz
  0 siblings, 0 replies; 16+ messages in thread
From: Justin Piszcz @ 2025-12-01 14:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 9:48 AM Dr. David Alan Gilbert
<linux@treblig.org> wrote:
>
> * Justin Piszcz (jpiszcz@lucidpixels.com) wrote:
> > Hello,
> >
> > Issue/Summary:
> > 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
> > drop out of the NAS array; after power cycling the device, it rebuilds
> > successfully.
> >
> > Questions:
>
> > 3. Are there any debug options that could be enabled to help pinpoint
> > the root cause?
>
> Have you tried using the 'nvme' command to see if the drives have
> anything in their SMART or error logs?

Thanks, I did check on this and ran the usual tests; everything
passed. As mentioned elsewhere in this thread, some consumer NVMe
drives have these issues:
https://github.com/openzfs/zfs/discussions/14793

For those using multiple NVMe drives with BTRFS/ZFS, are there known
good NVMe drives that will work and not have this recurring timeout
issue where drives drop offline?


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 14:56 ` Keith Busch
@ 2025-12-01 14:13   ` Justin Piszcz
  2025-12-02  2:47     ` Keith Busch
  0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2025-12-01 14:13 UTC (permalink / raw)
  To: Keith Busch; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 9:56 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Nov 25, 2025 at 09:42:11AM -0500, Justin Piszcz wrote:
> > I am using the latest Asus ADM/OS which uses the 6.6.x kernel:
>
> It may be a long shot, but there is an update in 6.17 that attempts to
> restart the device after a PCI function-level reset when we detect it's
> stuck in an NVMe-level reset. For some devices, that's sufficient to get
> it operational again, but it doesn't always work.

Nice, I was not aware of this, thanks! As this issue appears to affect
different consumer-level NVMe drives, any effort to address the quirks
in various NVMe drives by restarting the device while keeping the
volume intact would be awesome, if it is possible to get to that point
in the future.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 15:19 ` Dragan Milivojević
  2025-11-25 16:57   ` Paul Rolland
  2025-11-25 22:37   ` Jani Partanen
@ 2025-12-01 14:14   ` Justin Piszcz
  2025-12-08  8:43     ` Brad Campbell
  2 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2025-12-01 14:14 UTC (permalink / raw)
  To: Dragan Milivojević; +Cc: LKML, linux-nvme, linux-raid

On Tue, Nov 25, 2025 at 10:19 AM Dragan Milivojević <galileo@pkm-inc.com> wrote:
>
> > Issue/Summary:
> > 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
> > drop out of the NAS array; after power cycling the device, it rebuilds
> > successfully.
> >
>
> Seen the same, although far less frequently, with a Samsung SSD 980 PRO
> on a Dell PowerEdge R7525.
> It's the nature of consumer-grade drives, I guess.

As this affects multiple NVMe manufacturers, are you aware of any
consumer-level NVMe drives that do not have this problem, or is moving
to U.2/U.3 server drives necessary to avoid drives dropping offline?


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 18:25 ` Wol
  2025-11-26  0:15   ` David Sterba
@ 2025-12-01 14:16   ` Justin Piszcz
  1 sibling, 0 replies; 16+ messages in thread
From: Justin Piszcz @ 2025-12-01 14:16 UTC (permalink / raw)
  To: Wol; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Tue, Nov 25, 2025 at 1:24 PM Wol <antlists@youngman.org.uk> wrote:
>
> Probably not the problem, but how old are the drives? About 2020, WD
> started shingling the Red line (you had to move to Red Pro to get
> conventional drives). Shingled is bad news for Linux RAID, but the fact
> that your drives tend to drop out when idling makes it unlikely this is
> the problem.

For awareness and tracking purposes: the drives are a little over a
year old, but this issue has been occurring since they were put into
production/use.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-12-01 14:13   ` Justin Piszcz
@ 2025-12-02  2:47     ` Keith Busch
  0 siblings, 0 replies; 16+ messages in thread
From: Keith Busch @ 2025-12-02  2:47 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: LKML, linux-nvme, linux-raid, Btrfs BTRFS

On Mon, Dec 01, 2025 at 09:13:01AM -0500, Justin Piszcz wrote:
> Nice, I was not aware of this, thanks! As this issue appears to affect
> different consumer-level NVMe drives, any effort to address the quirks
> in various NVMe drives by restarting the device while keeping the
> volume intact would be awesome, if it is possible to get to that point
> in the future.

Various consumer NVMe drives use third-party controllers, so maybe
that's the common denominator. There aren't very many to choose from.

Anyway, the suggestion won't fix the random I/O stalls when the
situation happens, but if it is successful, it should keep the volume
in an optimal state, albeit with exceptionally high latency during the
recovery window.


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-12-01 14:14   ` Justin Piszcz
@ 2025-12-08  8:43     ` Brad Campbell
  0 siblings, 0 replies; 16+ messages in thread
From: Brad Campbell @ 2025-12-08  8:43 UTC (permalink / raw)
  To: Justin Piszcz, Dragan Milivojević; +Cc: LKML, linux-nvme, linux-raid

On 1/12/25 22:14, Justin Piszcz wrote:
> On Tue, Nov 25, 2025 at 10:19 AM Dragan Milivojević <galileo@pkm-inc.com> wrote:
>>
>>> Issue/Summary:
>>> 1. Usually once a month, a random WD Red SN700 4TB NVMe drive will
>>> drop out of the NAS array; after power cycling the device, it rebuilds
>>> successfully.
>>>
>>
>> Seen the same, although far less frequently, with a Samsung SSD 980 PRO
>> on a Dell PowerEdge R7525.
>> It's the nature of consumer-grade drives, I guess.
> 
> As this affects multiple NVMe manufacturers, are you aware of any
> consumer-level NVMe drives that do not have this problem, or is moving
> to U.2/U.3 server drives necessary to avoid drives dropping offline?
> 
> 
> 

Late to the party, but I've got Samsung 960 Pro/970 Evo/980 Pro,
Crucial P2/T500, Kingston KC2000 and SK hynix P31 NVMe drives in 24/7
machines and have *never* had an issue like this. The 960 Pro suffers
from long-term slowdown, and the P2 is just an awful performer, but
none of them have ever suffered from connectivity issues. The
Samsungs, the Kingston and the T500 are all in arrays.

None of these machines get power cycled, and they might get rebooted
once a year (or thereabouts).


* Re: WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1)
  2025-11-25 16:57   ` Paul Rolland
@ 2025-12-08 13:37     ` Sinisa
  0 siblings, 0 replies; 16+ messages in thread
From: Sinisa @ 2025-12-08 13:37 UTC (permalink / raw)
  To: Paul Rolland, Dragan Milivojević; +Cc: LKML, linux-nvme, linux-raid

Hello Dragan (and others),

Just to add my 2¢: I have also had NVMe drives dropping out of an md
RAID10; after a reboot, SMART says they are perfectly fine and I am
able to re-add them to the RAID, only for the same thing to happen
again a few weeks/months later.

I have seen this on consumer-grade motherboards from ASUS, MSI and
Gigabyte, but also on Supermicro servers (actually only on one
Supermicro SYS-6029P-TR, but multiple times, as far as I can
remember).

Affected drives are Samsung 980 Pro and Samsung 990 Pro, but I think
there were also some Kingston ones (I have replaced them all in the
meantime).

Now, I try to always run the latest stable kernel on those
machines/servers, so all of them are now on 6.17, and I think I
haven't seen this problem since I upgraded.


Btw.

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

didn't seem to help; I had tried those parameters before, but the
problem would still appear after some time, although maybe less
frequently.


Btw2.
I don't know if that is related, but I have also had this happen with
rotating SATA disks, most recently yesterday on my home/office
"server" (MSI PRO B650-P WIFI (MS-7D78), 128GB RAM, kernel 6.17.9):
[Sun Dec  7 10:12:18 2025] [    T772] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Sun Dec  7 10:12:18 2025] [    T772] ata6.00: failed command: FLUSH CACHE EXT
[Sun Dec  7 10:12:18 2025] [    T772] ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 3
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Dec  7 10:12:18 2025] [    T772] ata6.00: status: { DRDY }
[Sun Dec  7 10:12:18 2025] [    T772] ata6: hard resetting link
[Sun Dec  7 10:12:24 2025] [    T772] ata6: link is slow to respond, please be patient (ready=0)
[Sun Dec  7 10:12:28 2025] [    T772] ata6: found unknown device (class 0)
[Sun Dec  7 10:12:28 2025] [    T772] ata6: softreset failed (device not ready)
... (repeat last 4 rows 4 more times)
[Sun Dec  7 10:13:19 2025] [    T772] ata6.00: disable device
[Sun Dec  7 10:13:19 2025] [    T772] ata6: EH complete
[Sun Dec  7 10:13:19 2025] [     C14] sd 5:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=123s
[Sun Dec  7 10:13:19 2025] [     C14] sd 5:0:0:0: [sdb] tag#5 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun Dec  7 10:13:19 2025] [     C14] I/O error, dev sdb, sector 2064 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[Sun Dec  7 10:13:19 2025] [     C14] md: super_written gets error=-5
[Sun Dec  7 10:13:19 2025] [     C14] md/raid10:md3: Disk failure on sdb1, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Sun Dec  7 10:13:19 2025] [     C14] sd 5:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
.... (many, many I/O errors)

So this morning I just ran (without a reboot):
     for I in /sys/class/scsi_host/host*/scan; do
       echo "- - -" > "$I"
     done
and the drive is back: no errors logged in SMART, re-added to the
RAID, currently re-syncing.
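
(The "- - -" written to the scan file is the wildcard
channel/target/LUN triple; it asks the SCSI layer to rescan everything
on that host adapter.)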


Srdačan pozdrav / Best regards / Freundliche Grüße / Cordialement / よろしくお願いします
Siniša Bandin





end of thread

Thread overview: 16+ messages
2025-11-25 14:42 WD Red SN700 4000GB, F/W: 11C120WD (Device not ready; aborting reset, CSTS=0x1) Justin Piszcz
2025-11-25 14:48 ` Dr. David Alan Gilbert
2025-12-01 14:09   ` Justin Piszcz
2025-11-25 14:56 ` Keith Busch
2025-12-01 14:13   ` Justin Piszcz
2025-12-02  2:47     ` Keith Busch
2025-11-25 15:19 ` Dragan Milivojević
2025-11-25 16:57   ` Paul Rolland
2025-12-08 13:37     ` Sinisa
2025-11-25 22:37   ` Jani Partanen
2025-12-01 14:14   ` Justin Piszcz
2025-12-08  8:43     ` Brad Campbell
2025-11-25 18:25 ` Wol
2025-11-26  0:15   ` David Sterba
2025-11-26 12:51     ` Roger Heflin
2025-12-01 14:16   ` Justin Piszcz
