[PATCH AUTOSEL 6.18-6.12] scsi: ufs: host: mediatek: Fix shutdown/suspend race condition

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.18-6.12] scsi: ufs: host: mediatek: Fix shutdown/suspend race condition
       [not found] <20251210034915.2268617-1-sashal@kernel.org>
@ 2025-12-10  3:48 ` Sasha Levin
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] clk: mvebu: cp110 add CLK_IGNORE_UNUSED to pcie_x10, pcie_x11 & pcie_x4 Sasha Levin
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] firmware: imx: scu-irq: Init workqueue before request mbox channel Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-12-10  3:48 UTC (permalink / raw)
  To: patches, stable
  Cc: Peter Wang, Chun-Hung Wu, Martin K. Petersen, Sasha Levin,
	chaotian.jing, matthias.bgg, angelogioacchino.delregno,
	linux-scsi, linux-mediatek, linux-kernel, linux-arm-kernel

From: Peter Wang <peter.wang@mediatek.com>

[ Upstream commit 014de20bb36ba03e0e0b0a7e0a1406ab900c9fda ]

Address a race condition between shutdown and suspend operations in the
UFS Mediatek driver. Before entering suspend, check if a shutdown is in
progress to prevent conflicts and ensure system stability.

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Acked-by: Chun-Hung Wu <chun-hung.wu@mediatek.com>
Link: https://patch.msgid.link/20250924094527.2992256-6-peter.wang@mediatek.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis Summary

### 1. Commit Message Analysis
- **Subject**: "Fix shutdown/suspend race condition" - clearly indicates
  a bug fix
- **No Cc: stable tag** or **Fixes: tag** present
- The message describes preventing conflicts between shutdown and
  suspend operations "to ensure system stability"

### 2. Code Change Analysis
The fix is extremely small (4 lines):
```c
if (hba->shutting_down) {
    ret = -EBUSY;
    goto out;
}
```

**What it does**: At the start of `ufs_mtk_system_suspend()`, before
calling `ufshcd_system_suspend()` and performing Mediatek-specific
operations (voltage regulator LPM, MTCMOS control), it checks if
`hba->shutting_down` is true.

**Why the race is problematic**: The Mediatek driver performs hardware-
specific operations after calling the core suspend:
- `ufs_mtk_dev_vreg_set_lpm()` - controls voltage regulators
- `ufs_mtk_mtcmos_ctrl()` - controls power domains

If shutdown is in progress (`ufshcd_wl_shutdown()` sets
`hba->shutting_down = true`), these operations could conflict with the
shutdown sequence that also manipulates hardware state, causing
instability.

### 3. Classification
- **Bug fix**: Yes - fixes a real race condition
- **Not a feature**: No new functionality, APIs, or capabilities added
- Uses existing infrastructure (`shutting_down` flag already exists in
  `struct ufs_hba`)

### 4. Scope and Risk Assessment
- **Lines changed**: 4 lines (minimal)
- **Files touched**: 1 (driver-specific)
- **Risk**: Very low - adds a defensive early return that makes the
  driver more conservative
- **Regression potential**: Minimal - worst case is a suspend returning
  `-EBUSY` when it shouldn't, but this is very unlikely given the
  shutdown context

### 5. User Impact
- **Affected hardware**: Mediatek UFS storage (common in Android devices
  and embedded systems)
- **Severity**: Race conditions between suspend/shutdown can cause
  system hangs, crashes, or data corruption
- **Real-world scenario**: User initiates shutdown while system is
  suspending (e.g., closing laptop lid during shutdown)

### 6. Stability Indicators
- Signed-off-by Mediatek developer (Peter Wang)
- Acked-by another Mediatek engineer
- Merged by SCSI maintainer Martin K. Petersen

### 7. Concerns
- No explicit `Cc: stable` tag
- No `Fixes:` tag indicating when the bug was introduced
- No bug report or `Reported-by` tag showing users encountered this
  issue

### Decision Rationale

**In favor of backporting:**
- Fixes a real race condition that can cause system instability
- The fix is trivially correct and uses existing infrastructure
- Very small change with minimal regression risk
- Mediatek UFS is widely deployed in consumer devices
- Race conditions between power management paths (suspend/shutdown) are
  serious issues

**Against backporting:**
- No explicit stable tag from the maintainer
- No evidence of users hitting this bug in practice
- Other UFS drivers don't have similar protection (though they may not
  have the same post-suspend hardware operations)

The fix is clearly correct, surgical, and addresses a real class of bug
(PM race conditions). While lacking explicit stable markers, the fix
meets the stable kernel criteria: it's obviously correct, fixes a real
bug affecting system stability, and is small and contained.

**YES**

 drivers/ufs/host/ufs-mediatek.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/ufs/host/ufs-mediatek.c b/drivers/ufs/host/ufs-mediatek.c
index 758a393a9de1a..d0cbd96ad29dc 100644
--- a/drivers/ufs/host/ufs-mediatek.c
+++ b/drivers/ufs/host/ufs-mediatek.c
@@ -2373,6 +2373,11 @@ static int ufs_mtk_system_suspend(struct device *dev)
 	struct arm_smccc_res res;
 	int ret;
 
+	if (hba->shutting_down) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	ret = ufshcd_system_suspend(dev);
 	if (ret)
 		goto out;
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.18-5.10] clk: mvebu: cp110 add CLK_IGNORE_UNUSED to pcie_x10, pcie_x11 & pcie_x4
       [not found] <20251210034915.2268617-1-sashal@kernel.org>
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-6.12] scsi: ufs: host: mediatek: Fix shutdown/suspend race condition Sasha Levin
@ 2025-12-10  3:48 ` Sasha Levin
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] firmware: imx: scu-irq: Init workqueue before request mbox channel Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-12-10  3:48 UTC (permalink / raw)
  To: patches, stable
  Cc: Josua Mayer, Andrew Lunn, Gregory CLEMENT, Sasha Levin,
	sebastian.hesselbarth, linux-arm-kernel

From: Josua Mayer <josua@solid-run.com>

[ Upstream commit f0e6bc0c3ef4b4afb299bd6912586cafd5d864e9 ]

CP110 based platforms rely on the bootloader for pci port
initialization.
TF-A actively prevents non-uboot re-configuration of pci lanes, and many
boards do not have software control over the pci card reset.

If a pci port had link at boot-time and the clock is stopped at a later
point, the link fails and can not be recovered.

PCI controller driver probe - and by extension ownership of a driver for
the pci clocks - may be delayed especially on large modular kernels,
causing the clock core to start disabling unused clocks.

Add the CLK_IGNORE_UNUSED flag to the three pci port's clocks to ensure
they are not stopped before the pci controller driver has taken
ownership and tested for an existing link.

This fixes failed pci link detection when controller driver probes late,
e.g. with arm64 defconfig and CONFIG_PHY_MVEBU_CP110_COMPHY=m.

Closes: https://lore.kernel.org/r/b71596c7-461b-44b6-89ab-3cfbd492639f@solid-run.com
Signed-off-by: Josua Mayer <josua@solid-run.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Commit Analysis: clk: mvebu: cp110 add CLK_IGNORE_UNUSED to pcie_x10,
pcie_x11 & pcie_x4

### 1. COMMIT MESSAGE ANALYSIS

**Subject:** Adding CLK_IGNORE_UNUSED flag to three PCIe clock gates on
Marvell CP110 platforms.

**Key problem described:**
- CP110 platforms rely on bootloader for PCIe port initialization
- TF-A (Trusted Firmware-A) prevents non-U-Boot reconfiguration of PCIe
  lanes
- Many boards lack software control over PCIe card reset
- If a PCIe port had an active link at boot and the clock is stopped
  later, the link fails permanently and cannot be recovered
- PCIe controller driver probe may be delayed on large modular kernels,
  causing the clock framework to disable "unused" clocks before the
  driver takes ownership

**Important tags:**
- **No "Cc: stable@vger.kernel.org"** - maintainer didn't explicitly
  request backport
- **No "Fixes:" tag** - unclear when the issue was introduced
- **Reviewed-by: Andrew Lunn** - reputable ARM/networking kernel
  developer
- **Closes:** link to lore.kernel.org bug report - confirms real users
  hit this

### 2. CODE CHANGE ANALYSIS

The change adds a new `gate_flags()` helper function that returns
`CLK_IGNORE_UNUSED` for three specific clock gates:
- `CP110_GATE_PCIE_X1_0` (pcie_x10)
- `CP110_GATE_PCIE_X1_1` (pcie_x11)
- `CP110_GATE_PCIE_X4` (pcie_x4)

This flag is then passed to `init.flags` when registering gate clocks.

**Technical mechanism of the bug:**
1. Boot proceeds with PCIe link established by bootloader
2. Clock framework marks these PCIe clocks as "unused" (no driver
   claimed them yet)
3. Late in boot, clock framework garbage-collects unused clocks
4. PCIe clocks are disabled, breaking the active link
5. When PCIe driver finally probes (especially in modular configs), link
   is irrecoverably failed

**What CLK_IGNORE_UNUSED does:**
Tells the clock framework to never disable these clocks just because
they appear unclaimed. This is the standard mechanism for clocks that
must remain active until a driver explicitly takes ownership.

### 3. CLASSIFICATION

**Type:** Hardware workaround / quirk for platform-specific behavior

This falls into the "quirks and workarounds" exception category for
stable kernels. It's a workaround for the specific constraints of the
Marvell CP110 platform where:
- TF-A manages PCIe lane configuration
- Clock disable breaks PCIe links irreversibly
- Driver load timing varies across kernel configurations

### 4. SCOPE AND RISK ASSESSMENT

**Size:** ~21 lines added, 1 file changed, self-contained

**Risk level:** LOW
- Only affects Marvell CP110 platforms
- No changes to core clock framework
- Worst case: slightly higher power consumption (clocks stay on when
  could be off)
- No chance of breaking other subsystems

**Subsystem:** mvebu clock driver - mature platform-specific driver

### 5. USER IMPACT

**Affected users:** Marvell CP110-based platforms (Armada 7K/8K,
SolidRun products, etc.)

**Severity:** HIGH for affected users - PCIe devices completely fail to
be detected

**Real-world evidence:**
- Bug report on lore.kernel.org linked in commit
- Reproducible with "arm64 defconfig and
  CONFIG_PHY_MVEBU_CP110_COMPHY=m"

### 6. STABILITY INDICATORS

- **Reviewed-by:** Andrew Lunn (highly respected maintainer for this
  subsystem)
- The CP110 clock driver has existed for years - this isn't new code
- Change is isolated and uses standard clock framework mechanisms

### 7. DEPENDENCY CHECK

- Self-contained change, no dependencies on other commits
- The CP110 clock driver exists in stable trees
- No required prerequisite patches

### CONCERNS

1. **No explicit stable tag** - maintainer didn't mark for backport
2. **No Fixes: tag** - we don't know how far back this issue goes
3. **Workaround approach** - CLK_IGNORE_UNUSED is somewhat heavy-handed
   but appropriate here

### DECISION ANALYSIS

**For backporting:**
- Fixes real hardware failure (PCIe not working) on production hardware
- Small, surgical, self-contained fix
- Very low regression risk (only affects specific ARM platform)
- Follows the "hardware quirk" exception pattern for stable
- Uses standard clock framework mechanisms
- Has review from reputable subsystem maintainer

**Against backporting:**
- No Cc: stable from maintainer
- Affects relatively niche ARM SoC platform
- No indication of when bug was introduced

### Conclusion

This commit fixes a real, user-impacting bug where PCIe devices fail to
be detected on Marvell CP110-based platforms. The fix is a standard
hardware quirk/workaround pattern - adding CLK_IGNORE_UNUSED to prevent
premature clock disabling that breaks active PCIe links. The change is
small (~21 lines), isolated to a single platform-specific driver,
carries minimal regression risk, and has been reviewed by a reputable
kernel maintainer.

While there's no explicit stable tag, this fits the stable criteria for
hardware workarounds that fix real user-facing bugs. The worst case
(clocks stay on unnecessarily) is far preferable to the current bug
(PCIe completely broken).

**YES**

 drivers/clk/mvebu/cp110-system-controller.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/clk/mvebu/cp110-system-controller.c b/drivers/clk/mvebu/cp110-system-controller.c
index 03c59bf221060..b47c869060466 100644
--- a/drivers/clk/mvebu/cp110-system-controller.c
+++ b/drivers/clk/mvebu/cp110-system-controller.c
@@ -110,6 +110,25 @@ static const char * const gate_base_names[] = {
 	[CP110_GATE_EIP197]	= "eip197"
 };

+static unsigned long gate_flags(const u8 bit_idx)
+{
+	switch (bit_idx) {
+	case CP110_GATE_PCIE_X1_0:
+	case CP110_GATE_PCIE_X1_1:
+	case CP110_GATE_PCIE_X4:
+		/*
+		 * If a port had an active link at boot time, stopping
+		 * the clock creates a failed state from which controller
+		 * driver can not recover.
+		 * Prevent stopping this clock till after a driver has taken
+		 * ownership.
+		 */
+		return CLK_IGNORE_UNUSED;
+	default:
+		return 0;
+	}
+};
+
 struct cp110_gate_clk {
 	struct clk_hw hw;
 	struct regmap *regmap;
@@ -171,6 +190,7 @@ static struct clk_hw *cp110_register_gate(const char *name,
 	init.ops = &cp110_gate_ops;
 	init.parent_names = &parent_name;
 	init.num_parents = 1;
+	init.flags = gate_flags(bit_idx);

 	gate->regmap = regmap;
 	gate->bit_idx = bit_idx;
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.18-5.10] firmware: imx: scu-irq: Init workqueue before request mbox channel
       [not found] <20251210034915.2268617-1-sashal@kernel.org>
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-6.12] scsi: ufs: host: mediatek: Fix shutdown/suspend race condition Sasha Levin
  2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] clk: mvebu: cp110 add CLK_IGNORE_UNUSED to pcie_x10, pcie_x11 & pcie_x4 Sasha Levin
@ 2025-12-10  3:48 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-12-10  3:48 UTC (permalink / raw)
  To: patches, stable
  Cc: Peng Fan, Frank Li, Shawn Guo, Sasha Levin, imx, linux-arm-kernel

From: Peng Fan <peng.fan@nxp.com>

[ Upstream commit 81fb53feb66a3aefbf6fcab73bb8d06f5b0c54ad ]

With mailbox channel requested, there is possibility that interrupts may
come in, so need to make sure the workqueue is initialized before
the queue is scheduled by mailbox rx callback.

Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis Summary

### What the Bug Is

This commit fixes a classic **initialization race condition** in the
i.MX SCU IRQ driver. The problem is:

1. `mbox_request_channel_byname(cl, "gip3")` is called, which sets up a
   mailbox channel with `imx_scu_irq_callback` as the receive callback
2. Once the channel is established, interrupts from the System
   Controller Unit (SCU) can trigger the callback **at any time**
3. The callback (`imx_scu_irq_callback` at line 175-178) calls
   `schedule_work(&imx_sc_irq_work)`
4. **But** `INIT_WORK(&imx_sc_irq_work, ...)` was being called **after**
   the mailbox channel was requested

If an interrupt arrives in the window between
`mbox_request_channel_byname()` and `INIT_WORK()`, it would schedule an
uninitialized work struct, leading to undefined behavior, crashes, or
memory corruption in the workqueue subsystem.

### The Fix

The fix is trivially correct: move `INIT_WORK()` to **before**
`mbox_request_channel_byname()`. This ensures the work struct is
properly initialized before any callback can possibly use it.

### Bug Origin

The bug was introduced in commit `851826c7566e9` ("firmware: imx: enable
imx scu general irq function") in kernel v5.2-rc1 when this driver was
first created. The initialization order was wrong from the very
beginning.

### Stable Kernel Criteria Assessment

| Criterion | Assessment |
|-----------|------------|
| Obviously correct | ✅ Yes - classic "initialize before use" pattern |
| Fixes real bug | ✅ Yes - race condition causing potential crashes |
| Small and contained | ✅ Yes - moves one line of code |
| No new features | ✅ Correct - purely a fix |
| Tested | ✅ Has Reviewed-by from NXP engineer |
| Low risk | ✅ Cannot introduce regressions |

### Risk vs. Benefit

- **Risk**: Extremely low - the work struct must be initialized before
  use regardless of when the first interrupt arrives; moving
  initialization earlier cannot break anything
- **Benefit**: Prevents crashes on i.MX SoC platforms (used in embedded
  systems, automotive, IoT devices) where early interrupts could trigger
  the race

### Dependencies

None - this is a self-contained single-line reordering fix that should
apply cleanly to any kernel with this driver (5.2+).

### Conclusion

This is an ideal stable backport candidate. It fixes a real
initialization race condition that can cause crashes, the fix is
obviously correct (a single line moved earlier in the initialization
sequence), it's minimal and surgical, and it affects real users of i.MX
SoC platforms. The only missing element is an explicit `Cc:
stable@vger.kernel.org` tag, but the fix clearly meets all stable kernel
rules.

**YES**

 drivers/firmware/imx/imx-scu-irq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/imx/imx-scu-irq.c b/drivers/firmware/imx/imx-scu-irq.c
index 6125cccc9ba79..53bde775a1bf6 100644
--- a/drivers/firmware/imx/imx-scu-irq.c
+++ b/drivers/firmware/imx/imx-scu-irq.c
@@ -214,6 +214,8 @@ int imx_scu_enable_general_irq_channel(struct device *dev)
 	cl->dev = dev;
 	cl->rx_callback = imx_scu_irq_callback;

+	INIT_WORK(&imx_sc_irq_work, imx_scu_irq_work_handler);
+
 	/* SCU general IRQ uses general interrupt channel 3 */
 	ch = mbox_request_channel_byname(cl, "gip3");
 	if (IS_ERR(ch)) {
@@ -223,8 +225,6 @@ int imx_scu_enable_general_irq_channel(struct device *dev)
 		return ret;
 	}

-	INIT_WORK(&imx_sc_irq_work, imx_scu_irq_work_handler);
-
 	if (!of_parse_phandle_with_args(dev->of_node, "mboxes",
 				       "#mbox-cells", 0, &spec))
 		i = of_alias_get_id(spec.np, "mu");
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-12-10  3:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20251210034915.2268617-1-sashal@kernel.org>
2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-6.12] scsi: ufs: host: mediatek: Fix shutdown/suspend race condition Sasha Levin
2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] clk: mvebu: cp110 add CLK_IGNORE_UNUSED to pcie_x10, pcie_x11 & pcie_x4 Sasha Levin
2025-12-10  3:48 ` [PATCH AUTOSEL 6.18-5.10] firmware: imx: scu-irq: Init workqueue before request mbox channel Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).