From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 69B2D2737F8; Tue, 5 Aug 2025 13:11:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754399474; cv=none; b=EsEzhES896Yt3dR7NTW0+EWZRVm3C9CHwSctHZ3ODcOxeGQLfJ1cuTjQ+qzAxdegHA2F11LuurgjRKih/pHYUGQQv4f/dT9y+1cxFBBA9cC1a+6xwrHDTkCZucVfLxXW3mtMhfphC5yjbwtPlT501odZaluwL5av5z3PB9XLPws= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754399474; c=relaxed/simple; bh=lbmu52mQsEOM7YkZBqzjp0Bsh/rk+t6yvibjKNf9z+U=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=JMhsrd5d97mUTYW+tPyZ0osNDeBHhYEUoAYcLbb4PiE8SuTRYQ5eyAhcCImDnLrVaFdaIYgxd4hQ1NDIL82pCv2utWg/YESBu6emZSfR5a/q0+UHWOa/Xxn6QbXa2bGE/gZiLI5OGvoI+YI6iVpVqhaksYPJgOGtyU7Ax52jspk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=s/y9i0ic; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="s/y9i0ic" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 75811C4CEF4; Tue, 5 Aug 2025 13:11:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1754399474; bh=lbmu52mQsEOM7YkZBqzjp0Bsh/rk+t6yvibjKNf9z+U=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=s/y9i0icGi5aOrKq6adKO3YmQ4MMyxXjfdxbUL/tT9MNT7/fxLgnHY1X1A7JKIz75 fXIVH/JQFMbnsYSfabnTPLk69iopFD7ZWTT1Qvup/E+jF3nF5OW+h1uLIMEx6sNspI g6dQwbpQ1fXeKDBfb4dKtVMkb6xpm9beOddWv7pz8wpVdDHLdWNB2apaLONl1vpVen M3KB/V3WmbmdCd1dED49C63lTHS41leH+CjWMJErFfrE0N4hLL3ELsOfQzqJ/S1Hl3 LlBGNRDh0fVu/evl8evtRYhbuf+fLs02gHCkq9uO+4SJth0bsIxuXPIL1zy0ZoZdRo t8vyFL99FC85g== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Damien Le Moal , Yafang Shao , "Martin K . Petersen" , Sasha Levin , sathya.prakash@broadcom.com, kashyap.desai@broadcom.com, sumit.saxena@broadcom.com, sreekanth.reddy@broadcom.com, mpi3mr-linuxdrv.pdl@broadcom.com, linux-scsi@vger.kernel.org Subject: [PATCH AUTOSEL 6.16-6.1] scsi: mpi3mr: Correctly handle ATA device errors Date: Tue, 5 Aug 2025 09:09:15 -0400 Message-Id: <20250805130945.471732-40-sashal@kernel.org> X-Mailer: git-send-email 2.39.5 In-Reply-To: <20250805130945.471732-1-sashal@kernel.org> References: <20250805130945.471732-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: linux-scsi@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.16 Content-Transfer-Encoding: 8bit From: Damien Le Moal [ Upstream commit 04caad5a7ba86e830d04750417a15bad8ac2613c ] With the ATA error model, an NCQ command failure always triggers an abort (termination) of all NCQ commands queued on the device. In such case, the SAT or the host must handle the failed command according to the command sense data and immediately retry all other NCQ commands that were aborted due to the failed NCQ command. For SAS HBAs controlled by the mpi3mr driver, NCQ command aborts are not handled by the HBA SAT and sent back to the host, with an ioc log information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function mpi3mr_process_op_reply_desc() always forces a retry of commands terminated with the status MPI3_IOCSTATUS_SCSI_IOC_TERMINATED using the SCSI result DID_SOFT_ERROR, regardless of the ioc_loginfo for the command. This correctly forces the retry of collateral NCQ abort commands, but with the retry counter for the command being incremented. If a command to an ATA device is subject to too many retries due to other NCQ commands failing (e.g. read commands trying to access unreadable sectors), the collateral NCQ abort commands may be terminated with an error as they run out of retries. This violates the SAT specification and causes hard-to-debug command errors. Solve this issue by modifying the handling of the MPI3_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for an ATA device and if the command ioc_loginfo indicates an NCQ collateral abort. If that is the case, force the command retry using the SCSI result DID_IMM_RETRY to avoid incrementing the command retry count. Signed-off-by: Damien Le Moal Link: https://lore.kernel.org/r/20250606052747.742998-2-dlemoal@kernel.org Tested-by: Yafang Shao Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: **Backport Status: YES** This commit should be backported to stable kernel trees. Here's my extensive analysis: ## Critical Bug Fix for ATA NCQ Command Handling This commit fixes a **data integrity and reliability issue** affecting ATA devices connected through mpi3mr SAS HBAs when NCQ (Native Command Queuing) commands fail. ### The Bug Being Fixed The code change addresses a specific failure mode where: 1. **NCQ Command Collateral Aborts**: When one NCQ command fails on an ATA device, the SATA specification requires ALL other NCQ commands to be aborted and retried 2. **Current Broken Behavior**: The driver was treating these collateral aborts with `DID_SOFT_ERROR`, which increments retry counters 3. **Cascading Failures**: Innocent NCQ commands that were collaterally aborted could exhaust their retry limits and fail permanently, even though they had no actual errors ### Code Analysis The key change in `mpi3mr_process_op_reply_desc()` at line 3432-3433: **Before:** ```c case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED: scmd->result = DID_SOFT_ERROR << 16; break; ``` **After:** ```c case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED: if (ioc_loginfo == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) { WARN_ON_ONCE(xfer_count != 0); scmd->result = DID_IMM_RETRY << 16; } else { scmd->result = DID_SOFT_ERROR << 16; } break; ``` The new code: - Checks for the specific log info code `0x31080000` that indicates an NCQ collateral abort - Uses `DID_IMM_RETRY` instead of `DID_SOFT_ERROR` for these specific cases - `DID_IMM_RETRY` retries the command immediately WITHOUT incrementing the retry counter - Adds a `WARN_ON_ONCE` sanity check that no data was transferred ### Why This Qualifies for Stable 1. **Fixes Real User-Impacting Bug**: This fixes command failures that shouldn't happen - healthy I/O operations failing due to retry exhaustion when they were only collateral damage 2. **Data Integrity Risk**: Commands that should succeed but fail due to retry exhaustion can lead to: - Application errors - Filesystem corruption risks - Database consistency issues 3. **Minimal and Contained Fix**: - Only 11 lines of actual code change - Confined to one specific error path - Only affects ATA devices with specific error condition - Maintains existing behavior for all other cases 4. **Follows Stable Rules**: - Fixes a bug that users can hit - Small, targeted change - No new features added - No architectural changes - Low regression risk 5. **Clear Technical Merit**: - The commit message explicitly references SAT specification compliance - Has a `Tested-by` tag from another developer - The logic is sound - NCQ collateral aborts are not real failures 6. **Hardware Compatibility**: This affects common SATA devices connected to enterprise SAS controllers, a configuration used in many production systems The change is conservative, well-tested, and fixes a specification compliance issue that can cause spurious command failures in production environments. drivers/scsi/mpi3mr/mpi3mr_os.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c index ce444efd859e..87983ea4e06e 100644 --- a/drivers/scsi/mpi3mr/mpi3mr_os.c +++ b/drivers/scsi/mpi3mr/mpi3mr_os.c @@ -49,6 +49,13 @@ static void mpi3mr_send_event_ack(struct mpi3mr_ioc *mrioc, u8 event, #define MPI3_EVENT_WAIT_FOR_DEVICES_TO_REFRESH (0xFFFE) +/* + * SAS Log info code for a NCQ collateral abort after an NCQ error: + * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR + * See: drivers/message/fusion/lsi/mpi_log_sas.h + */ +#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000 + /** * mpi3mr_host_tag_for_scmd - Get host tag for a scmd * @mrioc: Adapter instance reference @@ -3430,7 +3437,18 @@ void mpi3mr_process_op_reply_desc(struct mpi3mr_ioc *mrioc, scmd->result = DID_NO_CONNECT << 16; break; case MPI3_IOCSTATUS_SCSI_IOC_TERMINATED: - scmd->result = DID_SOFT_ERROR << 16; + if (ioc_loginfo == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) { + /* + * This is a ATA NCQ command aborted due to another NCQ + * command failure. We must retry this command + * immediately but without incrementing its retry + * counter. + */ + WARN_ON_ONCE(xfer_count != 0); + scmd->result = DID_IMM_RETRY << 16; + } else { + scmd->result = DID_SOFT_ERROR << 16; + } break; case MPI3_IOCSTATUS_SCSI_TASK_TERMINATED: case MPI3_IOCSTATUS_SCSI_EXT_TERMINATED: -- 2.39.5