From: Damien Le Moal <dlemoal@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>,
Christoph Hellwig <hch@infradead.org>
Cc: "Martin K . Petersen" <martin.petersen@oracle.com>,
linux-scsi@vger.kernel.org,
Sathya Prakash <sathya.prakash@broadcom.com>,
Kashyap Desai <kashyap.desai@broadcom.com>,
Sreekanth Reddy <sreekanth.reddy@broadcom.com>,
Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>,
mpi3mr-linuxdrv.pdl@broadcom.com,
MPT-FusionLinux.pdl@broadcom.com
Subject: Re: [PATCH 0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr
Date: Mon, 16 Jun 2025 11:28:28 +0900
Message-ID: <e0ac9296-a688-4146-bb1b-e5ef7dc4b5e1@kernel.org>
In-Reply-To: <CALOAHbBkdjz+ujYnAKYvxaQsyd_juDKg38-G8Sk+cKCN_0HftQ@mail.gmail.com>
On 6/16/25 11:13, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>>> I think prompted this.
>>
>> Thank you for the information and for addressing this so quickly!
>>
>>>
>>> Yafang, can you check if this makes the writeback errors you're seeing
>>> go away?
>>
>> I’m happy to test the fix and will share the results as soon as I have them.
>
> We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> applying this patch series to our production servers. Since our
> production servers only use mpt3sas drives, we can verify the fix
> specifically for this driver:
>
> Tested-by: Yafang Shao <laoar.shao@gmail.com>
Thanks. I tested with the mpi3mr driver.
> We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> reset on an mpt3sas drive. Do you have any insights into what might be
> causing this failure? Below are the error details:
>
> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)
This decodes to:

  Value:    31110610h
  Type:     30000000h  SAS
  Origin:   01000000h  PL
  Code:     00110000h  PL_LOGINFO_CODE_RESET (see PL_LOGINFO_SUB_CODE for sub-codes)
  Sub Code: 00000600h  PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET

So it looks like a non-NCQ command failed. What were you doing when this happened?
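
For reference, the field split used in the decode above can be reproduced with a few shifts and masks. This is only a sketch: the field widths are inferred from the hex values in the table above, the "SAS" and "PL" labels are the only ones taken from this message, and the function and dictionary names are made up for illustration.

```python
# Sketch: split an mpt3sas log_info value into its four fields.
# Field widths are inferred from the decode table in this message.
# BUS_TYPES, ORIGINATORS and decode_log_info are illustrative names,
# not driver symbols; only the labels seen in this thread are listed.

BUS_TYPES = {0x3: "SAS"}
ORIGINATORS = {0x1: "PL"}


def decode_log_info(value):
    """Return the type, originator, code and sub_code of a log_info word."""
    bus_type = (value >> 28) & 0xF
    originator = (value >> 24) & 0xF
    return {
        "type": BUS_TYPES.get(bus_type, hex(bus_type)),
        "originator": ORIGINATORS.get(originator, hex(originator)),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


fields = decode_log_info(0x31110610)
print("type=%s originator=%s code=0x%02x sub_code=0x%04x" % (
    fields["type"], fields["originator"],
    fields["code"], fields["sub_code"]))
```

For 0x31110610 this gives code 0x11 and sub_code 0x0610, the same values the driver prints in the quoted log line.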
> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)
> [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
And this command failure is triggering a device reset (the HBA may be doing
that, or the drive did not like what you were doing and reset itself).
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
This looks like an ioctl() to the adapter driver that apparently never got its
reply. This is a little over 2 hours after the above power-on reset, so these 2
events are likely not related...
> [Thu Jun 12 20:07:53 2025] mf:
This is dumping the mpi_request bytes. Maybe you can try to decode that to get
hints as to what action triggered this.
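
As a starting point for that decoding, here is a rough sketch. It assumes the dump prints the request as little-endian 32-bit words, that the first word follows the generic MPI2 request header layout (16-bit function-dependent field, then ChainOffset, then Function), and that for a SCSI IO request the CDB starts at byte offset 0x40. All three are assumptions to check against the driver's MPI headers. The helper names are hypothetical.

```python
import struct

# The 24 words printed after "mf:" in the quoted log, in order.
MF_WORDS = [
    0x00000013, 0x00000000, 0x00000000, 0x9fcda615,
    0x00fc0000, 0x00000018, 0x00000000, 0x000000ff,
    0x00000000, 0x00000006, 0x00000000, 0x00000000,
    0x00000000, 0x00000000, 0x00000000, 0x02000000,
    0x00000012, 0x000000ff, 0x00000000, 0x00000000,
    0x00000000, 0x00000000, 0x00000000, 0x00000000,
]

CDB_OFFSET = 0x40  # assumed offset of the CDB in a SCSI IO request


def mf_to_bytes(words):
    """Rebuild the request bytes, assuming little-endian 32-bit words."""
    return struct.pack("<%dI" % len(words), *words)


def parse_first_word(raw):
    """Split the first 4 bytes per the assumed generic header layout."""
    dependent, chain_offset, function = struct.unpack_from("<HBB", raw, 0)
    return {
        "function_dependent1": dependent,  # device handle for SCSI IO
        "chain_offset": chain_offset,
        "function": function,
    }


raw = mf_to_bytes(MF_WORDS)
header = parse_first_word(raw)
cdb = raw[CDB_OFFSET:CDB_OFFSET + 16]

print("function=0x%02x handle=0x%04x" % (
    header["function"], header["function_dependent1"]))
print("cdb bytes:", cdb.hex(" "))
```

If those assumptions hold, the first word gives function 0x00 and handle 0x0013, the same handle that appears in the "issue target reset" line of the log. The bytes at 0x40 start with 12 00 00 00 ff 00, which would read as an INQUIRY CDB. Treat this as a hint to verify, not as a confirmed decode.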
I would love to get feedback from the Broadcom folks on this kind of problem,
but apparently, debugging issues with their HBAs does not seem to be high on
their to-do list.
Broadcom folks,
Could you please comment on these issues? This is not the first time I have
asked. A reply would be welcome so that we all know that you at least care about
issues with your drivers/HBAs. Thank you.
>
> [Thu Jun 12 20:07:53 2025] 00000013
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 9fcda615
> [Thu Jun 12 20:07:53 2025] 00fc0000
> [Thu Jun 12 20:07:53 2025] 00000018
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025]
>
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000006
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 02000000
> [Thu Jun 12 20:07:53 2025]
>
> [Thu Jun 12 20:07:53 2025] 00000012
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
>
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
> [Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op
> 0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op
> 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88
> 00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op
> 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in
> "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
> [Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
> [Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175,
> offset 1285156864, sector 13979483112
>
>
--
Damien Le Moal
Western Digital Research