[PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
@ 2025-06-06  5:27 Damien Le Moal
  2025-06-06  5:27 ` [PATCH 1/2] scsi: mpi3mr: Correctly handle ATA device errors Damien Le Moal
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Damien Le Moal @ 2025-06-06  5:27 UTC (permalink / raw)
  To: Martin K . Petersen, linux-scsi
  Cc: Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

Martin,

Two similar patches for the mpt3sas and mpi3mr drivers to improve the
handling of NCQ command terminated due to an NCQ command failure. These
so-called collateral aborts must be retried immediately but that must be
done without incrementing the command retry counter. Otherwise, these
collateral abort commands may endup being failed due to other NCQ
command errors.

This issue is especially easy to trigger with the mpi3mr driver with a
drive subject to a mixed workload of commands with a short CDL limit and
commands without limits. The failures due to the limit being exceeded,
which are normal, endup also failing commands without a limit, which is
incorrect.

Broadcom people,

I am working in the dark here, with zero information on how your HBA
handle ATA NCQ collateral aborts. I am patching against what I am
seeing, which may be only a partial picture of the problem. So please
check this !

Damien Le Moal (2):
  scsi: mpi3mr: Correctly handle ATA device errors
  scsi: mpt3sas: Correctly handle ATA device errors

 drivers/scsi/mpi3mr/mpi3mr_os.c      | 20 +++++++++++++++++++-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] scsi: mpi3mr: Correctly handle ATA device errors
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
@ 2025-06-06  5:27 ` Damien Le Moal
  2025-06-06  5:27 ` [PATCH 2/2] scsi: mpt3sas: " Damien Le Moal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Damien Le Moal @ 2025-06-06  5:27 UTC (permalink / raw)
  To: Martin K . Petersen, linux-scsi
  Cc: Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

With the ATA error model, an NCQ command failure always triggers an
abort (termination) of all NCQ commands queued on the device. In such
case, the SAT or the host must handle the failed command according to
the command sense data and immediately retry all other NCQ commands that
were aborted due to the failed NCQ command.

For SAS HBAs controlled by the mpi3mr driver, NCQ command aborts are not
handled by the HBA SAT and sent back to the host, with an ioc log
information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code
PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function
mpi3mr_process_op_reply_desc() always forces a retry of commands
terminated with the status MPI3_IOCSTATUS_SCSI_IOC_TERMINATED using the
SCSI result DID_SOFT_ERROR, regardless of the ioc_loginfo for the
command. This correctly forces the retry of collateral NCQ abort
commands, but with the retry counter for the command being incremented.
If a command to an ATA device is subject to too many retries due to
other NCQ commands failing (e.g. read commands trying to access
unreadable sectors), the collateral NCQ abort commands may be terminated
with an error as they run out of retries. This violates the SAT
specifications and causes hard-to-debug command errors.

Solve this issue by modifying the handling of the
MPI3_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for
an ATA device and if the command ioc_loginfo indicates an NCQ collateral
abort. If that is the case, force the command retry using the
SCSI result DID_IMM_RETRY to avoid incrementing the command retry count.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/scsi/mpi3mr/mpi3mr_os.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index ce444efd859e..87983ea4e06e 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -49,6 +49,13 @@ static void mpi3mr_send_event_ack(struct mpi3mr_ioc *mrioc, u8 event,

 #define MPI3_EVENT_WAIT_FOR_DEVICES_TO_REFRESH	(0xFFFE)

+/*
+ * SAS Log info code for a NCQ collateral abort after an NCQ error:
+ * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR
+ * See: drivers/message/fusion/lsi/mpi_log_sas.h
+ */
+#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR	0x31080000
+
 /**
  * mpi3mr_host_tag_for_scmd - Get host tag for a scmd
  * @mrioc: Adapter instance reference
@@ -3430,7 +3437,18 @@ void mpi3mr_process_op_reply_desc(struct mpi3mr_ioc *mrioc,
 		scmd->result = DID_NO_CONNECT << 16;
 		break;
 	case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED:
-		scmd->result = DID_SOFT_ERROR << 16;
+		if (ioc_loginfo == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) {
+			/*
+			 * This is a ATA NCQ command aborted due to another NCQ
+			 * command failure. We must retry this command
+			 * immediately but without incrementing its retry
+			 * counter.
+			 */
+			WARN_ON_ONCE(xfer_count != 0);
+			scmd->result = DID_IMM_RETRY << 16;
+		} else {
+			scmd->result = DID_SOFT_ERROR << 16;
+		}
 		break;
 	case MPI3_IOCSTATUS_SCSI_TASK_TERMINATED:
 	case MPI3_IOCSTATUS_SCSI_EXT_TERMINATED:
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] scsi: mpt3sas: Correctly handle ATA device errors
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
  2025-06-06  5:27 ` [PATCH 1/2] scsi: mpi3mr: Correctly handle ATA device errors Damien Le Moal
@ 2025-06-06  5:27 ` Damien Le Moal
  2025-06-09  5:50 ` [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Damien Le Moal @ 2025-06-06  5:27 UTC (permalink / raw)
  To: Martin K . Petersen, linux-scsi
  Cc: Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

With the ATA error model, an NCQ command failure always triggers an
abort (termination) of all NCQ commands queued on the device. In such
case, the SAT or the host must handle the failed command according to
the command sense data and immediately retry all other NCQ commands that
were aborted due to the failed NCQ command.

For SAS HBAs controlled by the mpt3sas driver, NCQ command aborts are
not handled by the HBA SAT and sent back to the host, with an ioc log
information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code
PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function
_scsih_io_done() always forces a retry of commands terminated with the
status MPI2_IOCSTATUS_SCSI_IOC_TERMINATED using the SCSI result
DID_SOFT_ERROR, regardless of the log_info for the command.
This correctly forces the retry of collateral NCQ abort commands, but
with the retry counter for the command being incremented. If a command
to an ATA device is subject to too many retries due to other NCQ
commands failing (e.g. read commands trying to access unreadable
sectors), the collateral NCQ abort commands may be terminated with an
error as they run out of retries. This violates the SAT specifications
and causes hard-to-debug command errors.

Solve this issue by modifying the handling of the
MPI2_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for
an ATA device and if the command loginfo indicates an NCQ collateral
abort. If that is the case, force the command retry using the
SCSI result DID_IMM_RETRY to avoid incrementing the command retry count.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 508861e88d9f..d7d8244dfedc 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -195,6 +195,14 @@ struct sense_info {
 #define MPT3SAS_PORT_ENABLE_COMPLETE (0xFFFD)
 #define MPT3SAS_ABRT_TASK_SET (0xFFFE)
 #define MPT3SAS_REMOVE_UNRESPONDING_DEVICES (0xFFFF)
+
+/*
+ * SAS Log info code for a NCQ collateral abort after an NCQ error:
+ * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR
+ * See: drivers/message/fusion/lsi/mpi_log_sas.h
+ */
+#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR	0x31080000
+
 /**
  * struct fw_event_work - firmware event struct
  * @list: link list framework
@@ -5814,6 +5822,17 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 			scmd->result = DID_TRANSPORT_DISRUPTED << 16;
 			goto out;
 		}
+		if (log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) {
+			/*
+			 * This is a ATA NCQ command aborted due to another NCQ
+			 * command failure. We must retry this command
+			 * immediately but without incrementing its retry
+			 * counter.
+			 */
+			WARN_ON_ONCE(xfer_cnt != 0);
+			scmd->result = DID_IMM_RETRY << 16;
+			break;
+		}
 		if (log_info == 0x31110630) {
 			if (scmd->retries > 2) {
 				scmd->result = DID_NO_CONNECT << 16;
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
  2025-06-06  5:27 ` [PATCH 1/2] scsi: mpi3mr: Correctly handle ATA device errors Damien Le Moal
  2025-06-06  5:27 ` [PATCH 2/2] scsi: mpt3sas: " Damien Le Moal
@ 2025-06-09  5:50 ` Christoph Hellwig
  2025-06-09  7:09   ` Yafang Shao
  2025-06-16 20:51 ` Martin K. Petersen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2025-06-09  5:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K . Petersen, linux-scsi, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl, laoar.shao

Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
I think promted this.

Yafang, can you check if this makes the writeback errors you're seeing
go away?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-09  5:50 ` [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Christoph Hellwig
@ 2025-06-09  7:09   ` Yafang Shao
  2025-06-09  7:17     ` Damien Le Moal
  2025-06-16  2:13     ` Yafang Shao
  0 siblings, 2 replies; 16+ messages in thread
From: Yafang Shao @ 2025-06-09  7:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Martin K . Petersen, linux-scsi, Sathya Prakash,
	Kashyap Desai, Sreekanth Reddy, Suganath Prabu Subramani,
	mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl

On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> I think promted this.

Thank you for the information and for addressing this so quickly!

>
> Yafang, can you check if this makes the writeback errors you're seeing
> go away?

I’m happy to test the fix and will share the results as soon as I have them.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-09  7:09   ` Yafang Shao
@ 2025-06-09  7:17     ` Damien Le Moal
  2025-06-11  3:27       ` Yafang Shao
  2025-06-16  2:13     ` Yafang Shao
  1 sibling, 1 reply; 16+ messages in thread
From: Damien Le Moal @ 2025-06-09  7:17 UTC (permalink / raw)
  To: Yafang Shao, Christoph Hellwig
  Cc: Martin K . Petersen, linux-scsi, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 6/9/25 16:09, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>> I think promted this.

Note that cdl-tools test suite has many test cases that do not pass without the
mpi3mr patch. CDL makes it easy to trigger the issue.

> 
> Thank you for the information and for addressing this so quickly!
> 
>>
>> Yafang, can you check if this makes the writeback errors you're seeing
>> go away?
> 
> I’m happy to test the fix and will share the results as soon as I have them.

Thanks. And my apologies for forgetting to CC you on these patches.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-09  7:17     ` Damien Le Moal
@ 2025-06-11  3:27       ` Yafang Shao
  2025-06-11  3:57         ` Damien Le Moal
  0 siblings, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-06-11  3:27 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/9/25 16:09, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >> I think promted this.
>
> Note that cdl-tools test suite has many test cases that do not pass without the
> mpi3mr patch. CDL makes it easy to trigger the issue.
>
> >
> > Thank you for the information and for addressing this so quickly!
> >
> >>
> >> Yafang, can you check if this makes the writeback errors you're seeing
> >> go away?
> >
> > I’m happy to test the fix and will share the results as soon as I have them.
>
> Thanks. And my apologies for forgetting to CC you on these patches.

We have developed and deployed a hotfix to hundreds of our production
servers equipped with this drive, and it has been functioning as
expected so far. We are currently rolling it out to the remaining
production servers, though the process may take some time to complete.

Additionally, we encountered a writeback error in the libata driver,
which appears to be related to DID_TIME_OUT. Do you have any insights
into this issue?

[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 CDB: Write(16) 8a
00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113355864 op
0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 CDB: Write(16) 8a
00 00 00 00 08 2c eb 45 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113354520 op
0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 CDB: Write(16) 8a
00 00 00 00 08 2c eb 42 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113353816 op
0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 CDB: Read(16) 88
00 00 00 00 03 d8 33 44 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512140312 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 CDB: Read(16) 88
00 00 00 00 03 d8 33 3e d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138968 op
0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 CDB: Read(16) 88
00 00 00 00 03 d8 33 3c 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138264 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 CDB: Write(16) 8a
00 00 00 00 08 2c eb 3d 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113352472 op
0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 CDB: Read(16) 88
00 00 00 00 03 2c e9 4e b8 00 00 00 98 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638389432 op
0x0:(READ) flags 0x80700 phys_seg 19 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 CDB: Read(16) 88
00 00 00 00 03 2c e9 4c b8 00 00 02 00 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638388920 op
0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 CDB: Read(16) 88
00 00 00 00 03 d8 33 36 d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512136920 op
0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] XFS (sdl): metadata I/O error in
"xfs_imap_to_bp+0x4f/0x70 [xfs]" at daddr 0x4804c4398 len 32 error 5
[Tue Jun 10 13:27:48 2025] sdl: writeback error on inode 32213499599,
offset 0, sector 35113332208

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-11  3:27       ` Yafang Shao
@ 2025-06-11  3:57         ` Damien Le Moal
  2025-06-11  5:42           ` Yafang Shao
  0 siblings, 1 reply; 16+ messages in thread
From: Damien Le Moal @ 2025-06-11  3:57 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Christoph Hellwig, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 6/11/25 12:27 PM, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 6/9/25 16:09, Yafang Shao wrote:
>>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>
>>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>>>> I think promted this.
>>
>> Note that cdl-tools test suite has many test cases that do not pass without the
>> mpi3mr patch. CDL makes it easy to trigger the issue.
>>
>>>
>>> Thank you for the information and for addressing this so quickly!
>>>
>>>>
>>>> Yafang, can you check if this makes the writeback errors you're seeing
>>>> go away?
>>>
>>> I’m happy to test the fix and will share the results as soon as I have them.
>>
>> Thanks. And my apologies for forgetting to CC you on these patches.
> 
> We have developed and deployed a hotfix to hundreds of our production
> servers equipped with this drive, and it has been functioning as
> expected so far. We are currently rolling it out to the remaining
> production servers, though the process may take some time to complete.
> 
> Additionally, we encountered a writeback error in the libata driver,
> which appears to be related to DID_TIME_OUT. Do you have any insights
> into this issue?
> 
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s

No idea. Timeout is hard to debug... A bus trace would help. It could be that
the drives are really bad and taking too long per command, thus causing time
out for commands that are at the end of the queue. E.g., if you hit many bad
sectors with read commands, the drive internal retry mechanism may cause such
read command to take several seconds. So if you are running the drives at high
queue depth, this can easily cause timeouts for the commands that are at the
end of the queue.

> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113355864 op
> 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 45 18 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113354520 op
> 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 42 58 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113353816 op
> 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 44 18 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512140312 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 3e d8 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138968 op
> 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 3c 18 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138264 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 3d 18 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113352472 op
> 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 CDB: Read(16) 88
> 00 00 00 00 03 2c e9 4e b8 00 00 00 98 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638389432 op
> 0x0:(READ) flags 0x80700 phys_seg 19 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 CDB: Read(16) 88
> 00 00 00 00 03 2c e9 4c b8 00 00 02 00 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638388920 op
> 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 36 d8 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512136920 op
> 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] XFS (sdl): metadata I/O error in
> "xfs_imap_to_bp+0x4f/0x70 [xfs]" at daddr 0x4804c4398 len 32 error 5
> [Tue Jun 10 13:27:48 2025] sdl: writeback error on inode 32213499599,
> offset 0, sector 35113332208
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-11  3:57         ` Damien Le Moal
@ 2025-06-11  5:42           ` Yafang Shao
  0 siblings, 0 replies; 16+ messages in thread
From: Yafang Shao @ 2025-06-11  5:42 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Wed, Jun 11, 2025 at 11:59 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/11/25 12:27 PM, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
> >>
> >> On 6/9/25 16:09, Yafang Shao wrote:
> >>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>>>
> >>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >>>> I think promted this.
> >>
> >> Note that cdl-tools test suite has many test cases that do not pass without the
> >> mpi3mr patch. CDL makes it easy to trigger the issue.
> >>
> >>>
> >>> Thank you for the information and for addressing this so quickly!
> >>>
> >>>>
> >>>> Yafang, can you check if this makes the writeback errors you're seeing
> >>>> go away?
> >>>
> >>> I’m happy to test the fix and will share the results as soon as I have them.
> >>
> >> Thanks. And my apologies for forgetting to CC you on these patches.
> >
> > We have developed and deployed a hotfix to hundreds of our production
> > servers equipped with this drive, and it has been functioning as
> > expected so far. We are currently rolling it out to the remaining
> > production servers, though the process may take some time to complete.
> >
> > Additionally, we encountered a writeback error in the libata driver,
> > which appears to be related to DID_TIME_OUT. Do you have any insights
> > into this issue?
> >
> > [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
> > hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
>
> No idea. Timeout is hard to debug... A bus trace would help. It could be that
> the drives are really bad and taking too long per command, thus causing time
> out for commands that are at the end of the queue. E.g., if you hit many bad
> sectors with read commands, the drive internal retry mechanism may cause such
> read command to take several seconds. So if you are running the drives at high
> queue depth, this can easily cause timeouts for the commands that are at the
> end of the queue.

Thank you for the detailed explanation. We will investigate this further.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-09  7:09   ` Yafang Shao
  2025-06-09  7:17     ` Damien Le Moal
@ 2025-06-16  2:13     ` Yafang Shao
  2025-06-16  2:28       ` Damien Le Moal
  1 sibling, 1 reply; 16+ messages in thread
From: Yafang Shao @ 2025-06-16  2:13 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: Martin K . Petersen, linux-scsi, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> > I think promted this.
>
> Thank you for the information and for addressing this so quickly!
>
> >
> > Yafang, can you check if this makes the writeback errors you're seeing
> > go away?
>
> I’m happy to test the fix and will share the results as soon as I have them.

We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
applying this patch series to our production servers. Since our
production servers only use mpt3sas drives, we can verify the fix
specifically for this driver:

Tested-by: Yafang Shao <laoar.shao@gmail.com>


Hello Damien,

We’ve encountered another instance of DID_SOFT_ERROR triggered by a
reset on an mpt3sas drive. Do you have any insights into what might be
causing this failure? Below are the error details:

[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
[Thu Jun 12 20:07:53 2025] mf:

[Thu Jun 12 20:07:53 2025] 00000013
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 9fcda615
[Thu Jun 12 20:07:53 2025] 00fc0000
[Thu Jun 12 20:07:53 2025] 00000018
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025]

[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000006
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 02000000
[Thu Jun 12 20:07:53 2025]

[Thu Jun 12 20:07:53 2025] 00000012
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000

[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
[Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op
0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88
00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88
00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88
00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88
00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op
0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88
00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88
00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88
00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88
00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88
00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op
0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88
00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in
"xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
[Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175,
offset 1285156864, sector 13979483112


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-16  2:13     ` Yafang Shao
@ 2025-06-16  2:28       ` Damien Le Moal
  2025-06-16 12:40         ` Yafang Shao
  0 siblings, 1 reply; 16+ messages in thread
From: Damien Le Moal @ 2025-06-16  2:28 UTC (permalink / raw)
  To: Yafang Shao, Christoph Hellwig
  Cc: Martin K . Petersen, linux-scsi, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On 6/16/25 11:13, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>>> I think promted this.
>>
>> Thank you for the information and for addressing this so quickly!
>>
>>>
>>> Yafang, can you check if this makes the writeback errors you're seeing
>>> go away?
>>
>> I’m happy to test the fix and will share the results as soon as I have them.
> 
> We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> applying this patch series to our production servers. Since our
> production servers only use mpt3sas drives, we can verify the fix
> specifically for this driver:
> 
> Tested-by: Yafang Shao <laoar.shao@gmail.com>

Thansk. I tested with the mpi3mr driver.


> We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> reset on an mpt3sas drive. Do you have any insights into what might be
> causing this failure? Below are the error details:
> 
> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)

This decodes to:

Value     	31110610h
Type:     	30000000h	SAS
Origin:   	01000000h	PL
Code:     	00110000h	PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 	00000600h	PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET

So it looks like a non-NCQ command failed. What were you doing when this happened ?

> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)
> [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred

And this command failure is trigerring a device reset (the HBA may be doing
that, or the drive did not like what you were doing and reset).

> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout

This looks like an ioctl() to the adapter diver, and it never got its reply
apparently. This is 3 hours after the above power-on-reset, so these 2 events
are likely not related...

> [Thu Jun 12 20:07:53 2025] mf:

This is dumping the mpi_request bytes. Maybe you can try to decode that to get
hints as to what action triggered this.

I would love to get feedback from the Broadcom folks on this kind of problems,
but apparently, debugging issues with their HBA does not seem to be high on
their to-do list.

Broadcom folks,

Could you please comment on these issues ? Not the first time I ask. A reply
would be welcome so that we all know that you at least care about issues with
your drivers/HBAs. Thank you.


> 
> [Thu Jun 12 20:07:53 2025] 00000013
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 9fcda615
> [Thu Jun 12 20:07:53 2025] 00fc0000
> [Thu Jun 12 20:07:53 2025] 00000018
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025]
> 
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000006
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 02000000
> [Thu Jun 12 20:07:53 2025]
> 
> [Thu Jun 12 20:07:53 2025] 00000012
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> 
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
> [Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op
> 0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op
> 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88
> 00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op
> 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in
> "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
> [Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
> [Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175,
> offset 1285156864, sector 13979483112
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-16  2:28       ` Damien Le Moal
@ 2025-06-16 12:40         ` Yafang Shao
  0 siblings, 0 replies; 16+ messages in thread
From: Yafang Shao @ 2025-06-16 12:40 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Mon, Jun 16, 2025 at 10:28 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/16/25 11:13, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>>
> >>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >>> I think promted this.
> >>
> >> Thank you for the information and for addressing this so quickly!
> >>
> >>>
> >>> Yafang, can you check if this makes the writeback errors you're seeing
> >>> go away?
> >>
> >> I’m happy to test the fix and will share the results as soon as I have them.
> >
> > We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> > applying this patch series to our production servers. Since our
> > production servers only use mpt3sas drives, we can verify the fix
> > specifically for this driver:
> >
> > Tested-by: Yafang Shao <laoar.shao@gmail.com>
>
> Thansk. I tested with the mpi3mr driver.
>
>
> > We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> > reset on an mpt3sas drive. Do you have any insights into what might be
> > causing this failure? Below are the error details:
> >
> > [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> > originator(PL), code(0x11), sub_code(0x0610)
>
> This decodes to:
>
> Value           31110610h
> Type:           30000000h       SAS
> Origin:         01000000h       PL
> Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
> Sub Code:       00000600h       PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET
>
> So it looks like a non-NCQ command failed. What were you doing when this happened ?

After reviewing the logs, we were unable to identify any relevant
information regarding the issue.
We appreciate your continued support in resolving this matter.

>
> > [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> > originator(PL), code(0x11), sub_code(0x0610)
> > [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
>
> And this command failure is trigerring a device reset (the HBA may be doing
> that, or the drive did not like what you were doing and reset).
>
> > [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> > [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
>
> This looks like an ioctl() to the adapter diver, and it never got its reply
> apparently. This is 3 hours after the above power-on-reset, so these 2 events
> are likely not related...
>
> > [Thu Jun 12 20:07:53 2025] mf:
>
> This is dumping the mpi_request bytes. Maybe you can try to decode that to get
> hints as to what action triggered this.
>
> I would love to get feedback from the Broadcom folks on this kind of problems,
> but apparently, debugging issues with their HBA does not seem to be high on
> their to-do list.
>
> Broadcom folks,
>
> Could you please comment on these issues ? Not the first time I ask. A reply
> would be welcome so that we all know that you at least care about issues with
> your drivers/HBAs. Thank you.

We have reached out to Broadcom's team directly and hope they will
respond both to our inquiry and to this email correspondence.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
                   ` (2 preceding siblings ...)
  2025-06-09  5:50 ` [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Christoph Hellwig
@ 2025-06-16 20:51 ` Martin K. Petersen
  2025-06-20  3:00 ` Martin K. Petersen
  2025-06-25  1:44 ` Martin K. Petersen
  5 siblings, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2025-06-16 20:51 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Shivasharan S, Ranjan Kumar, linux-scsi, Sathya Prakash,
	Kashyap Desai, Sreekanth Reddy, Suganath Prabu Subramani,
	mpi3mr-linuxdrv.pdl, MPT-FusionLinux.pdl


Damien,

> I am working in the dark here, with zero information on how your HBA
> handle ATA NCQ collateral aborts. I am patching against what I am
> seeing, which may be only a partial picture of the problem. So please
> check this !

Broadcom: Please review!

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
                   ` (3 preceding siblings ...)
  2025-06-16 20:51 ` Martin K. Petersen
@ 2025-06-20  3:00 ` Martin K. Petersen
  2025-06-20 17:28   ` Sathya Prakash Veerichetty
  2025-06-25  1:44 ` Martin K. Petersen
  5 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2025-06-20  3:00 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K . Petersen, linux-scsi, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl


Damien,

> Two similar patches for the mpt3sas and mpi3mr drivers to improve the
> handling of NCQ command terminated due to an NCQ command failure.
> These so-called collateral aborts must be retried immediately but that
> must be done without incrementing the command retry counter.
> Otherwise, these collateral abort commands may endup being failed due
> to other NCQ command errors.

Applied to 6.17/scsi-staging, thanks!

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-20  3:00 ` Martin K. Petersen
@ 2025-06-20 17:28   ` Sathya Prakash Veerichetty
  0 siblings, 0 replies; 16+ messages in thread
From: Sathya Prakash Veerichetty @ 2025-06-20 17:28 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, linux-scsi, Kashyap Desai, Sreekanth Reddy,
	Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

[-- Attachment #1: Type: text/plain, Size: 773 bytes --]

Martin,


On Thu, Jun 19, 2025 at 9:00 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Damien,
>
> > Two similar patches for the mpt3sas and mpi3mr drivers to improve the
> > handling of NCQ command terminated due to an NCQ command failure.
> > These so-called collateral aborts must be retried immediately but that
> > must be done without incrementing the command retry counter.
> > Otherwise, these collateral abort commands may endup being failed due
> > to other NCQ command errors.
>
> Applied to 6.17/scsi-staging, thanks!
>
Thanks for taking this patch. We are working with our internal
firmware team on the specifics, if there is any additional change
required, we will push a new set of patches.
> --
> Martin K. Petersen

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4214 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
  2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
                   ` (4 preceding siblings ...)
  2025-06-20  3:00 ` Martin K. Petersen
@ 2025-06-25  1:44 ` Martin K. Petersen
  5 siblings, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2025-06-25  1:44 UTC (permalink / raw)
  To: linux-scsi, Damien Le Moal
  Cc: Martin K . Petersen, Sathya Prakash, Kashyap Desai,
	Sreekanth Reddy, Suganath Prabu Subramani, mpi3mr-linuxdrv.pdl,
	MPT-FusionLinux.pdl

On Fri, 06 Jun 2025 14:27:45 +0900, Damien Le Moal wrote:

> Martin,
> 
> Two similar patches for the mpt3sas and mpi3mr drivers to improve the
> handling of NCQ command terminated due to an NCQ command failure. These
> so-called collateral aborts must be retried immediately but that must be
> done without incrementing the command retry counter. Otherwise, these
> collateral abort commands may endup being failed due to other NCQ
> command errors.
> 
> [...]

Applied to 6.17/scsi-queue, thanks!

[1/2] scsi: mpi3mr: Correctly handle ATA device errors
      https://git.kernel.org/mkp/scsi/c/04caad5a7ba8
[2/2] scsi: mpt3sas: Correctly handle ATA device errors
      https://git.kernel.org/mkp/scsi/c/15592a11d5a5

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-06-25  1:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-06  5:27 [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Damien Le Moal
2025-06-06  5:27 ` [PATCH 1/2] scsi: mpi3mr: Correctly handle ATA device errors Damien Le Moal
2025-06-06  5:27 ` [PATCH 2/2] scsi: mpt3sas: " Damien Le Moal
2025-06-09  5:50 ` [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr Christoph Hellwig
2025-06-09  7:09   ` Yafang Shao
2025-06-09  7:17     ` Damien Le Moal
2025-06-11  3:27       ` Yafang Shao
2025-06-11  3:57         ` Damien Le Moal
2025-06-11  5:42           ` Yafang Shao
2025-06-16  2:13     ` Yafang Shao
2025-06-16  2:28       ` Damien Le Moal
2025-06-16 12:40         ` Yafang Shao
2025-06-16 20:51 ` Martin K. Petersen
2025-06-20  3:00 ` Martin K. Petersen
2025-06-20 17:28   ` Sathya Prakash Veerichetty
2025-06-25  1:44 ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox