From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Damien Le Moal <dlemoal@kernel.org>,
Yafang Shao <laoar.shao@gmail.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
Sasha Levin <sashal@kernel.org>,
sathya.prakash@broadcom.com, kashyap.desai@broadcom.com,
sumit.saxena@broadcom.com, sreekanth.reddy@broadcom.com,
mpi3mr-linuxdrv.pdl@broadcom.com, linux-scsi@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-6.1] scsi: mpi3mr: Correctly handle ATA device errors
Date: Tue, 5 Aug 2025 09:09:15 -0400 [thread overview]
Message-ID: <20250805130945.471732-40-sashal@kernel.org> (raw)
In-Reply-To: <20250805130945.471732-1-sashal@kernel.org>
From: Damien Le Moal <dlemoal@kernel.org>
[ Upstream commit 04caad5a7ba86e830d04750417a15bad8ac2613c ]
With the ATA error model, an NCQ command failure always triggers an abort
(termination) of all NCQ commands queued on the device. In such case, the
SAT or the host must handle the failed command according to the command
sense data and immediately retry all other NCQ commands that were aborted
due to the failed NCQ command.
For SAS HBAs controlled by the mpi3mr driver, NCQ command aborts are not
handled by the HBA SAT and sent back to the host, with an ioc log
information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code
PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function
mpi3mr_process_op_reply_desc() always forces a retry of commands
terminated with the status MPI3_IOCSTATUS_SCSI_IOC_TERMINATED using the
SCSI result DID_SOFT_ERROR, regardless of the ioc_loginfo for the
command. This correctly forces the retry of collateral NCQ abort
commands, but with the retry counter for the command being incremented.
If a command to an ATA device is subject to too many retries due to other
NCQ commands failing (e.g. read commands trying to access unreadable
sectors), the collateral NCQ abort commands may be terminated with an
error as they run out of retries. This violates the SAT specification and
causes hard-to-debug command errors.
Solve this issue by modifying the handling of the
MPI3_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for an
ATA device and if the command ioc_loginfo indicates an NCQ collateral
abort. If that is the case, force the command retry using the SCSI result
DID_IMM_RETRY to avoid incrementing the command retry count.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250606052747.742998-2-dlemoal@kernel.org
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Critical Bug Fix for ATA NCQ Command Handling
This commit fixes a **data integrity and reliability issue** affecting
ATA devices connected through mpi3mr SAS HBAs when NCQ (Native Command
Queuing) commands fail.
### The Bug Being Fixed
The code change addresses a specific failure mode where:
1. **NCQ Command Collateral Aborts**: When one NCQ command fails on an
ATA device, the SATA specification requires ALL other NCQ commands to
be aborted and retried
2. **Current Broken Behavior**: The driver was treating these collateral
aborts with `DID_SOFT_ERROR`, which increments retry counters
3. **Cascading Failures**: Innocent NCQ commands that were collaterally
aborted could exhaust their retry limits and fail permanently, even
though they had no actual errors
### Code Analysis
The key change in `mpi3mr_process_op_reply_desc()` at line 3432-3433:
**Before:**
```c
case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED:
scmd->result = DID_SOFT_ERROR << 16;
break;
```
**After:**
```c
case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED:
if (ioc_loginfo == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) {
WARN_ON_ONCE(xfer_count != 0);
scmd->result = DID_IMM_RETRY << 16;
} else {
scmd->result = DID_SOFT_ERROR << 16;
}
break;
```
The new code:
- Checks for the specific log info code `0x31080000` that indicates an
NCQ collateral abort
- Uses `DID_IMM_RETRY` instead of `DID_SOFT_ERROR` for these specific
cases
- `DID_IMM_RETRY` retries the command immediately WITHOUT incrementing
the retry counter
- Adds a `WARN_ON_ONCE` sanity check that no data was transferred
### Why This Qualifies for Stable
1. **Fixes Real User-Impacting Bug**: This fixes command failures that
shouldn't happen - healthy I/O operations failing due to retry
exhaustion when they were only collateral damage
2. **Data Integrity Risk**: Commands that should succeed but fail due to
retry exhaustion can lead to:
- Application errors
- Filesystem corruption risks
- Database consistency issues
3. **Minimal and Contained Fix**:
- Only 11 lines of actual code change
- Confined to one specific error path
- Only affects ATA devices with specific error condition
- Maintains existing behavior for all other cases
4. **Follows Stable Rules**:
- Fixes a bug that users can hit
- Small, targeted change
- No new features added
- No architectural changes
- Low regression risk
5. **Clear Technical Merit**:
- The commit message explicitly references SAT specification
compliance
- Has a `Tested-by` tag from another developer
- The logic is sound - NCQ collateral aborts are not real failures
6. **Hardware Compatibility**: This affects common SATA devices
connected to enterprise SAS controllers, a configuration used in many
production systems
The change is conservative, well-tested, and fixes a specification
compliance issue that can cause spurious command failures in production
environments.
drivers/scsi/mpi3mr/mpi3mr_os.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index ce444efd859e..87983ea4e06e 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -49,6 +49,13 @@ static void mpi3mr_send_event_ack(struct mpi3mr_ioc *mrioc, u8 event,
#define MPI3_EVENT_WAIT_FOR_DEVICES_TO_REFRESH (0xFFFE)
+/*
+ * SAS Log info code for a NCQ collateral abort after an NCQ error:
+ * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR
+ * See: drivers/message/fusion/lsi/mpi_log_sas.h
+ */
+#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000
+
/**
* mpi3mr_host_tag_for_scmd - Get host tag for a scmd
* @mrioc: Adapter instance reference
@@ -3430,7 +3437,18 @@ void mpi3mr_process_op_reply_desc(struct mpi3mr_ioc *mrioc,
scmd->result = DID_NO_CONNECT << 16;
break;
case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED:
- scmd->result = DID_SOFT_ERROR << 16;
+ if (ioc_loginfo == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) {
+ /*
+ * This is a ATA NCQ command aborted due to another NCQ
+ * command failure. We must retry this command
+ * immediately but without incrementing its retry
+ * counter.
+ */
+ WARN_ON_ONCE(xfer_count != 0);
+ scmd->result = DID_IMM_RETRY << 16;
+ } else {
+ scmd->result = DID_SOFT_ERROR << 16;
+ }
break;
case MPI3_IOCSTATUS_SCSI_TASK_TERMINATED:
case MPI3_IOCSTATUS_SCSI_EXT_TERMINATED:
--
2.39.5
next prev parent reply other threads:[~2025-08-05 13:11 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20250805130945.471732-1-sashal@kernel.org>
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-5.4] scsi: bfa: Double-free fix Sasha Levin
2025-08-05 13:08 ` [PATCH AUTOSEL 6.16-6.12] scsi: lpfc: Ensure HBA_SETUP flag is used only for SLI4 in dev_loss_tmo_callbk Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: lpfc: Check for hdwq null ptr when cleaning up lpfc_vport structure Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-6.15] scsi: pm80xx: Free allocated tags after failure Sasha Levin
2025-08-05 13:09 ` Sasha Levin [this message]
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: mpt3sas: Correctly handle ATA device errors Sasha Levin
2025-08-05 13:09 ` [PATCH AUTOSEL 6.16-5.4] scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250805130945.471732-40-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=dlemoal@kernel.org \
--cc=kashyap.desai@broadcom.com \
--cc=laoar.shao@gmail.com \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=mpi3mr-linuxdrv.pdl@broadcom.com \
--cc=patches@lists.linux.dev \
--cc=sathya.prakash@broadcom.com \
--cc=sreekanth.reddy@broadcom.com \
--cc=stable@vger.kernel.org \
--cc=sumit.saxena@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox