From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DAA7C2E8B86; Sat, 25 Oct 2025 16:26:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761409579; cv=none; b=cScDA02LhWkU03cnHMbVgUrbIlXjR4+1NNLpV4RIz+da1uPmDdfAF8j+hVKBw1L+mqt9K8hRp+/SkMPlzxvSGj0gUFfXiUoqI9/rb+Brq8lTLoPW1FaAlpXIAE95dC3Y9QenOynTttwFNe7GsjVzvfPbNAM7djFrl8X+a/IDXiw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761409579; c=relaxed/simple; bh=TTPFQPjhgfnF0VqXDar2dd7qhmfsUyK41ErUY6wpt7A=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=KmSUsmh7E1gHds+9V85iwfYK2V3G8lY52xPM5Mj80AtRf2rkAXMo8cVFYxsIuBPA8IL2Nt45W5sXWK5Hf+gtnrMJLIzWogm4R4pxlbyNIjMUilg76tWD+Ug2F7BmcKS3IAc09NPKoaszVoS6ci92K0PxFPZSGVT9bYk5ChKH+6g= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=aL/tYGY9; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="aL/tYGY9" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8D934C4CEF5; Sat, 25 Oct 2025 16:26:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1761409578; bh=TTPFQPjhgfnF0VqXDar2dd7qhmfsUyK41ErUY6wpt7A=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=aL/tYGY9FxN6fLwOViMzPBQfeO52uqYOYr2Af3GX6JbonUbUJNSDG188Ck2Tw130Q X7exw+XgE0xHjXTDChduQF0Q6hGY6fhGgY6m27bwsLZvQTdYeyb68+CLAaQTqng9+N SpOxcdwzWKvZwc+0mKLb5/oW7+m4wOz9T39KCi80pJ3WqpCHDqhZIz0pCzgNKILOfU TxGFxq6hKxrrwD5EVUSOJSiAuBbNqcpeMsvTXpP4ikhOCyhjJv1ocFRGkMXdlNufUa NRxz0mDHko2inyOBhOEO6D1CIULBiCliRF2xeM9m5QNzu6F4qHyOIgItDsgs1xDpS8 iDQMlhc9yRh6Q== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Chandrakanth Patil , "Martin K. Petersen" , Sasha Levin , sathya.prakash@broadcom.com, kashyap.desai@broadcom.com, sumit.saxena@broadcom.com, sreekanth.reddy@broadcom.com, mpi3mr-linuxdrv.pdl@broadcom.com, linux-scsi@vger.kernel.org Subject: [PATCH AUTOSEL 6.17] scsi: mpi3mr: Fix device loss during enclosure reboot due to zero link speed Date: Sat, 25 Oct 2025 12:00:05 -0400 Message-ID: <20251025160905.3857885-374-sashal@kernel.org> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org> References: <20251025160905.3857885-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.17.5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Chandrakanth Patil [ Upstream commit d6c8e8b7c98c3cb326515ef4bc5c57e16ac5ae4e ] During enclosure reboot or expander reset, firmware may report a link speed of 0 in "Device Add" events while the link is still coming up. The driver drops such devices, leaving them missing even after the link recovers. Fix this by treating link speed 0 as 1.5 Gbps during device addition so the device is exposed to the OS. The actual link speed will be updated later when link-up events arrive. Signed-off-by: Chandrakanth Patil Link: https://lore.kernel.org/r/20250820084138.228471-2-chandrakanth.patil@broadcom.com Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: Based on my comprehensive analysis of this commit, I can now provide my final recommendation. ## **BACKPORT RECOMMENDATION: YES** --- ## **DETAILED ANALYSIS** ### **1. PROBLEM DESCRIPTION** The commit fixes a critical device loss issue in the Broadcom mpi3mr SAS/SATA HBA driver. During enclosure reboot or expander reset operations, the firmware reports a link speed of 0 (MPI3_SAS_NEG_LINK_RATE_UNKNOWN_LINK_RATE = 0x00) in "Device Add" events while the physical link is still initializing. The driver was incorrectly dropping these devices, and they remained missing even after the link fully recovered to operational speed. ### **2. CODE CHANGES ANALYSIS** The fix consists of four distinct changes across two files: #### **Change 1: mpi3mr_expander_add() (mpi3mr_transport.c:2084-2085)** ```c +if (link_rate < MPI3_SAS_NEG_LINK_RATE_1_5) + link_rate = MPI3_SAS_NEG_LINK_RATE_1_5; ``` **Impact**: During expander device addition, treats link speeds below 1.5 Gbps (including 0) as 1.5 Gbps, allowing the device to be exposed to the OS. #### **Change 2: mpi3mr_report_tgtdev_to_sas_transport() (mpi3mr_transport.c:2395-2396)** ```c +if (link_rate < MPI3_SAS_NEG_LINK_RATE_1_5) + link_rate = MPI3_SAS_NEG_LINK_RATE_1_5; ``` **Impact**: Same treatment for target device reporting to SAS transport layer. #### **Change 3: mpi3mr_remove_device_by_sas_address() (mpi3mr_transport.c:417-420)** ```c -list_del_init(&tgtdev->list); was_on_tgtdev_list = 1; -mpi3mr_tgtdev_put(tgtdev); +if (tgtdev->state == MPI3MR_DEV_REMOVE_HS_STARTED) { + list_del_init(&tgtdev->list); + mpi3mr_tgtdev_put(tgtdev); +} ``` **Impact**: Prevents premature device list deletion by checking the device state. Only removes devices from the list if they're in the MPI3MR_DEV_REMOVE_HS_STARTED state, avoiding race conditions during device state transitions. #### **Change 4: Debug logging improvements (mpi3mr_os.c:2058, 3078)** **Impact**: Adds event context (0x%08x) to debug messages for better diagnostics. Purely cosmetic, aids debugging. ### **3. HISTORICAL CONTEXT & PATTERN CONSISTENCY** My research reveals this fix **extends an existing pattern** already established in the codebase: - **Commit 42fc9fee116fc6** (August 2022, v6.1): Introduced similar link rate handling in `mpi3mr_sas_host_refresh()` at line 1174: ```c if (attached_handle && link_rate < MPI3_SAS_NEG_LINK_RATE_1_5) link_rate = MPI3_SAS_NEG_LINK_RATE_1_5; ``` - **Commit 3f1254ed01d086** (March 2023, v6.4): Added the `mpi3mr_dev_state` enum to fix "Successive VD delete and add causes FW fault" This commit applies the same defensive link rate handling to two additional code paths that were missing it. ### **4. DEPENDENCY ANALYSIS** **Required for v6.4+:** - ✅ MPI3_SAS_NEG_LINK_RATE constants (present since driver introduction) - ✅ mpi3mr_update_links() function (added v6.1) - ✅ `enum mpi3mr_dev_state` with MPI3MR_DEV_REMOVE_HS_STARTED (added v6.4) **Backporting to < v6.4:** Would require either: 1. Backporting commit 3f1254ed01d086 first, OR 2. Omitting the device state check portion (changes 1-2 would still provide value) ### **5. RISK ASSESSMENT** **RISK LEVEL: LOW** ✅ **Positive factors:** - Small, surgical changes (13 insertions, 6 deletions) - Follows established code pattern (line 1174) - No API changes or function signature modifications - Confined to single driver subsystem (mpi3mr) - No new functionality - purely defensive fix - All code paths already exist, just adding validation ⚠️ **Considerations:** - Device state check requires v6.4+ (manageable dependency) - Affects device lifecycle management (but improves correctness) ### **6. USER IMPACT** **SEVERITY: HIGH - User-Visible Bug** **Affected scenario:** Common operational maintenance activities: - Enclosure reboots/power cycles - Expander resets - Fabric reconfiguration - Hot-plug operations **User experience without fix:** 1. Enclosure/expander is rebooted for maintenance 2. Firmware sends Device Add events with link_rate=0 during link initialization 3. Driver drops the device 4. Link comes up to full speed (1.5/3.0/6.0/12.0 Gbps) 5. **Device remains missing permanently** - requires driver reload or system reboot **User experience with fix:** 1. Enclosure/expander is rebooted 2. Device is added with temporary 1.5 Gbps link rate 3. Link completes initialization 4. Actual link speed is updated via subsequent link-up events 5. **Device remains accessible throughout** **No workaround available** - users cannot prevent firmware from reporting link_rate=0 during initialization. ### **7. TESTING & VALIDATION** Based on git history analysis: - ✅ No reverts found for this commit - ✅ No follow-up regression fixes found - ✅ Similar pattern at line 1174 has been stable since v6.1 (2+ years) - ✅ One stable backport already exists (837115eeb55b3), indicating it's being considered for stable trees ### **8. STABLE TREE CRITERIA COMPLIANCE** Checking against stable kernel rules: | Criterion | Status | Evidence | |-----------|--------|----------| | **Fixes important bug** | ✅ YES | Devices lost during enclosure operations | | **User-visible impact** | ✅ YES | Missing storage devices, requires reboot | | **Small and contained** | ✅ YES | 13 lines added, 6 removed, 2 files | | **No new features** | ✅ YES | Pure bug fix | | **Minimal regression risk** | ✅ YES | Follows existing pattern, low complexity | | **Confined to subsystem** | ✅ YES | Only affects mpi3mr driver | | **Real-world issue** | ✅ YES | Common during enclosure maintenance | **Note:** Commit does NOT have explicit Cc: stable tag, but meets all technical criteria. ### **9. ARCHITECTURAL CONSIDERATIONS** The fix demonstrates good defensive programming: 1. **Graceful degradation**: Treats unknown/zero link speed as minimum viable speed rather than failing 2. **Eventual consistency**: Allows initial device exposure with conservative speed, updates later 3. **State machine correctness**: Device removal now properly respects state transitions 4. **Consistency**: Applies same defensive pattern across all device addition code paths ### **10. FINAL RECOMMENDATION** ## **YES - STRONGLY RECOMMEND BACKPORTING to v6.4+** **Justification:** 1. ✅ **Critical user-facing bug** - Devices permanently lost during routine maintenance 2. ✅ **Low risk** - Small, well-contained changes following established patterns 3. ✅ **High confidence** - Similar code has been stable since v6.1 4. ✅ **Clean dependencies** - All prerequisites present in v6.4+ 5. ✅ **Stable tree appropriate** - Meets all criteria for important bugfixes 6. ✅ **Common scenario** - Affects standard operational procedures 7. ✅ **No workarounds** - Users cannot avoid this firmware behavior **Target stable versions:** 6.4+, 6.6 LTS, 6.12 LTS **Adaptation note for < v6.4:** If backporting to kernels older than v6.4, the device state check in `mpi3mr_remove_device_by_sas_address()` would need to be omitted or commit 3f1254ed01d086 backported first. However, the link rate fixes (changes 1-2) alone would still provide significant value. drivers/scsi/mpi3mr/mpi3mr_os.c | 8 ++++---- drivers/scsi/mpi3mr/mpi3mr_transport.c | 11 +++++++++-- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c index e467b56949e98..1582cdbc66302 100644 --- a/drivers/scsi/mpi3mr/mpi3mr_os.c +++ b/drivers/scsi/mpi3mr/mpi3mr_os.c @@ -2049,8 +2049,8 @@ static void mpi3mr_fwevt_bh(struct mpi3mr_ioc *mrioc, if (!fwevt->process_evt) goto evt_ack; - dprint_event_bh(mrioc, "processing event(0x%02x) in the bottom half handler\n", - fwevt->event_id); + dprint_event_bh(mrioc, "processing event(0x%02x) -(0x%08x) in the bottom half handler\n", + fwevt->event_id, fwevt->evt_ctx); switch (fwevt->event_id) { case MPI3_EVENT_DEVICE_ADDED: @@ -3076,8 +3076,8 @@ void mpi3mr_os_handle_events(struct mpi3mr_ioc *mrioc, } if (process_evt_bh || ack_req) { dprint_event_th(mrioc, - "scheduling bottom half handler for event(0x%02x),ack_required=%d\n", - evt_type, ack_req); + "scheduling bottom half handler for event(0x%02x) - (0x%08x), ack_required=%d\n", + evt_type, le32_to_cpu(event_reply->event_context), ack_req); sz = event_reply->event_data_length * 4; fwevt = mpi3mr_alloc_fwevt(sz); if (!fwevt) { diff --git a/drivers/scsi/mpi3mr/mpi3mr_transport.c b/drivers/scsi/mpi3mr/mpi3mr_transport.c index c8d6ced5640e9..d70f002d6487d 100644 --- a/drivers/scsi/mpi3mr/mpi3mr_transport.c +++ b/drivers/scsi/mpi3mr/mpi3mr_transport.c @@ -413,9 +413,11 @@ static void mpi3mr_remove_device_by_sas_address(struct mpi3mr_ioc *mrioc, sas_address, hba_port); if (tgtdev) { if (!list_empty(&tgtdev->list)) { - list_del_init(&tgtdev->list); was_on_tgtdev_list = 1; - mpi3mr_tgtdev_put(tgtdev); + if (tgtdev->state == MPI3MR_DEV_REMOVE_HS_STARTED) { + list_del_init(&tgtdev->list); + mpi3mr_tgtdev_put(tgtdev); + } } } spin_unlock_irqrestore(&mrioc->tgtdev_lock, flags); @@ -2079,6 +2081,8 @@ int mpi3mr_expander_add(struct mpi3mr_ioc *mrioc, u16 handle) link_rate = (expander_pg1.negotiated_link_rate & MPI3_SAS_NEG_LINK_RATE_LOGICAL_MASK) >> MPI3_SAS_NEG_LINK_RATE_LOGICAL_SHIFT; + if (link_rate < MPI3_SAS_NEG_LINK_RATE_1_5) + link_rate = MPI3_SAS_NEG_LINK_RATE_1_5; mpi3mr_update_links(mrioc, sas_address_parent, handle, i, link_rate, hba_port); } @@ -2388,6 +2392,9 @@ int mpi3mr_report_tgtdev_to_sas_transport(struct mpi3mr_ioc *mrioc, link_rate = mpi3mr_get_sas_negotiated_logical_linkrate(mrioc, tgtdev); + if (link_rate < MPI3_SAS_NEG_LINK_RATE_1_5) + link_rate = MPI3_SAS_NEG_LINK_RATE_1_5; + mpi3mr_update_links(mrioc, sas_address_parent, tgtdev->dev_handle, parent_phy_number, link_rate, hba_port); -- 2.51.0