* [PATCH AUTOSEL 6.17-5.4] rds: Fix endianness annotation for RDS_MPATH_HASH
[not found] <20251025160905.3857885-1-sashal@kernel.org>
@ 2025-10-25 15:57 ` Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] net/mlx5e: Prevent entering switchdev mode with inconsistent netns Sasha Levin
` (3 subsequent siblings)
4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2025-10-25 15:57 UTC (permalink / raw)
To: patches, stable
Cc: Ujwal Kundur, Allison Henderson, Jakub Kicinski, Sasha Levin,
netdev, linux-rdma, rds-devel
From: Ujwal Kundur <ujwal.kundur@gmail.com>
[ Upstream commit 77907a068717fbefb25faf01fecca553aca6ccaa ]
jhash_1word accepts host endian inputs while rs_bound_port is a be16
value (sockaddr_in6.sin6_port). Use ntohs() for consistency.
Flagged by Sparse.
Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250820175550.498-4-ujwal.kundur@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## BACKPORT RECOMMENDATION: YES (Low Priority)
## Executive Summary
This commit fixes a **real but subtle endianness bug** in the RDS
(Reliable Datagram Sockets) multipath hashing mechanism that has existed
since multipath support was introduced in Linux 4.10 (July 2016). The
fix adds a single `ntohs()` call to properly convert network byte order
to host byte order before hashing, ensuring correct behavior across all
architectures.
## Detailed Technical Analysis
### The Bug (net/rds/rds.h:96)
**Before (incorrect):**
```c
#define RDS_MPATH_HASH(rs, n) (jhash_1word((rs)->rs_bound_port, \
				(rs)->rs_hash_initval) & ((n) - 1))
```
**After (correct):**
```c
#define RDS_MPATH_HASH(rs, n) (jhash_1word(ntohs((rs)->rs_bound_port), \
				(rs)->rs_hash_initval) & ((n) - 1))
```
### Root Cause Analysis
Using semcode tools, I verified that:
1. **`rs_bound_port` is `__be16`** (net/rds/rds.h:600):
- Defined as `rs_bound_sin6.sin6_port` from `struct sockaddr_in6`
- Stored in network byte order (big-endian) as confirmed in
net/rds/bind.c:126: `rs->rs_bound_port = cpu_to_be16(rover);`
2. **`jhash_1word()` expects `u32` in host byte order**
(tools/include/linux/jhash.h:170):
```c
static inline u32 jhash_1word(u32 a, u32 initval)
```
3. **The macro violates type safety** by passing `__be16` where `u32`
(host endian) is expected
### Functional Impact
**On Little-Endian Systems (x86, x86_64, ARM-LE):**
- Port 80 (0x0050 in network order) → hashed as 0x5000 (20480) ❌
- Port 443 (0x01BB in network order) → hashed as 0xBB01 (47873) ❌
- Results in **incorrect hash values** and **wrong multipath selection**
**On Big-Endian Systems (SPARC, PowerPC in BE mode):**
- Port 80 → hashed correctly as 80 ✓
- Port 443 → hashed correctly as 443 ✓
**Cross-Architecture Implications:**
- Heterogeneous clusters (mixing LE and BE systems) would compute
different hashes for the same port
- This violates the fundamental assumption that the same port should
select the same path consistently
### Code Location and Usage
The `RDS_MPATH_HASH` macro is used in **net/rds/send.c:1050-1052**:
```c
static int rds_send_mprds_hash(struct rds_sock *rs,
			       struct rds_connection *conn, int nonblock)
{
	int hash;

	if (conn->c_npaths == 0)
		hash = RDS_MPATH_HASH(rs, RDS_MPATH_WORKERS);
	else
		hash = RDS_MPATH_HASH(rs, conn->c_npaths);
	/* ... path selection logic ... */
}
```
This function is called from `rds_sendmsg()` to determine which
connection path to use for multipath RDS, affecting all RDS multipath
traffic.
### Historical Context
- **Introduced:** July 14, 2016 in commit 5916e2c1554f3 ("RDS: TCP:
Enable multipath RDS for TCP")
- **Bug duration:** ~9 years (2016-2025)
- **Affected kernels:** All versions from v4.10 onwards
- **Discovery method:** Sparse static analysis tool
- **No Fixes: tag:** indicating the maintainer did not consider it critical
- **No Cc: stable tag:** not marked for automatic stable backporting
### Why This Bug Went Unnoticed
1. **Limited Deployment Scope:**
- RDS is primarily used in Oracle RAC (Real Application Clusters)
- Niche protocol with specialized use cases
- Not commonly deployed in general-purpose environments
2. **Homogeneous Architectures:**
- Most RDS deployments use identical hardware (typically x86_64)
- Within a single architecture, the bug is **consistent** (always
wrong, but deterministically wrong)
- Same port always selects the same path (even if it's the "wrong"
path)
3. **Subtle Impact:**
- Doesn't cause crashes or data corruption
- Only affects multipath load distribution
- Performance impact may be attributed to other factors
### Comparison with Correct Usage
Looking at similar kernel code in **include/net/ip.h:714**, I found the
correct pattern:
```c
static inline u32 ipv4_portaddr_hash(const struct net *net,
				     __be32 saddr,
				     unsigned int port)
{
	return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
}
```
Note the explicit `(__force u32)` cast: it does not byte-swap, it merely
tells Sparse the reinterpretation is intentional. That is sufficient here
because the result only feeds a host-local hash table, where per-host
consistency is all that matters; the RDS fix goes further and converts
with `ntohs()` so the hash input is identical across architectures.
## Backporting Assessment
### Criteria Evaluation
| Criterion | Assessment | Details |
|-----------|-----------|---------|
| **Fixes a real bug** | ✅ YES | Endianness type mismatch causing incorrect hash on LE systems |
| **Affects users** | ⚠️ LIMITED | RDS is niche; most deployments homogeneous |
| **Small change** | ✅ YES | Single line, one function call added |
| **Obviously correct** | ✅ YES | Standard byte order conversion; matches kernel patterns |
| **No side effects** | ⚠️ MINOR | Hash values change on LE systems; path selection may differ |
| **Architectural change** | ✅ NO | Correctness fix only |
| **Risk of regression** | 🟡 LOW | Minimal; changes observable behavior but fixes incorrect behavior |
### Benefits of Backporting
1. **Correctness:** Fixes architecturally incorrect code that violates
API contracts
2. **Sparse-clean:** Brings code in line with kernel coding standards
3. **Cross-architecture consistency:** Ensures LE and BE systems hash
identically
4. **Future-proofing:** Prevents potential issues in heterogeneous
deployments
5. **Long-term stability:** Eliminates subtle load-balancing issues
### Risks of Backporting
1. **Behavior Change on LE Systems:**
- Hash values will change for all ports
- Existing multipath connections may select different paths after
upgrade
- Could cause brief connection disruption during kernel update
2. **Limited Testing:**
- RDS multipath is not widely deployed
- Difficult to predict impact on production systems
- No specific bug reports to validate the fix against
3. **Low Severity:**
- No CVE assigned
- No security implications
- Hasn't caused reported user-facing issues in 9 years
## Related Commits
This is part of a series of endianness annotation fixes by Ujwal Kundur:
- **92b925297a2f** "rds: Fix endianness annotation of jhash wrappers"
(companion fix)
- **5e9e8e376ae19** "rds: Fix endianness annotations for RDS extension
headers"
These related commits further support the importance of proper
endianness handling in the RDS subsystem.
## Recommendation Rationale
**YES, this commit should be backported**, but as a **low-priority
correctness fix** rather than a critical bugfix:
1. **It's the right thing to do technically:** The code is objectively
incorrect and violates the jhash_1word() API contract
2. **Minimal risk:** One-line change with obvious semantics
3. **Long-term benefit:** Prevents future issues and aligns with kernel
standards
4. **Completeness:** Part of a broader effort to fix RDS endianness
issues
5. **Stable kernel compatibility:** Follows stable tree rules (small,
obvious, no ABI change)
**However, prioritization should be LOW because:**
- No reported user issues in 9 years
- Limited real-world impact (niche protocol, homogeneous deployments)
- Maintainer didn't mark as Cc: stable
- Behavior change (hash values) could surprise users
## Suggested Backport Scope
Backport to all **actively maintained stable kernels** where RDS
multipath exists:
- v6.x series (current)
- v5.15+ LTS series
- v5.10 LTS (if still maintained)
**Do NOT backport to:**
- EOL kernels (maintenance overhead not justified)
- Kernels older than v4.10 (RDS_MPATH_HASH doesn't exist)
## Final Verdict
**Backport Status: YES**
This is a **valid correctness fix** for a real architectural bug that
should be backported to stable trees. While the practical impact is
limited due to RDS's specialized usage, the fix is minimal, obviously
correct, and brings the code in line with kernel standards. The small
risk of path selection changes on little-endian systems is outweighed by
the long-term benefits of having correct, Sparse-clean code that behaves
consistently across all architectures.
net/rds/rds.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index dc360252c5157..5b1c072e2e7ff 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -93,7 +93,7 @@ enum {
/* Max number of multipaths per RDS connection. Must be a power of 2 */
#define RDS_MPATH_WORKERS 8
-#define RDS_MPATH_HASH(rs, n) (jhash_1word((rs)->rs_bound_port, \
+#define RDS_MPATH_HASH(rs, n) (jhash_1word(ntohs((rs)->rs_bound_port), \
(rs)->rs_hash_initval) & ((n) - 1))
#define IS_CANONICAL(laddr, faddr) (htonl(laddr) < htonl(faddr))
--
2.51.0
* [PATCH AUTOSEL 6.17] net/mlx5e: Prevent entering switchdev mode with inconsistent netns
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-5.4] rds: Fix endianness annotation for RDS_MPATH_HASH Sasha Levin
@ 2025-10-25 15:58 ` Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.1] net/mlx5e: Don't query FEC statistics when FEC is disabled Sasha Levin
` (2 subsequent siblings)
4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2025-10-25 15:58 UTC (permalink / raw)
To: patches, stable
Cc: Jianbo Liu, Cosmin Ratiu, Jiri Pirko, Dragos Tatulea,
Tariq Toukan, Jakub Kicinski, Sasha Levin, saeedm, mbloch, netdev,
linux-rdma
From: Jianbo Liu <jianbol@nvidia.com>
[ Upstream commit 06fdc45f16c392dc3394c67e7c17ae63935715d3 ]
When a PF enters switchdev mode, its netdevice becomes the uplink
representor but remains in its current network namespace. All other
representors (VFs, SFs) are created in the netns of the devlink
instance.
If the PF's netns has been moved and differs from the devlink's netns,
enabling switchdev mode would create a state where the OVS control
plane (ovs-vsctl) cannot manage the switch because the PF uplink
representor and the other representors are split across different
namespaces.
To prevent this inconsistent configuration, block the request to enter
switchdev mode if the PF netdevice's netns does not match the netns of
its devlink instance.
As part of this change, the PF's netns is first marked as immutable.
This prevents race conditions where the netns could be changed after
the check is performed but before the mode transition is complete, and
it aligns the PF's behavior with that of the final uplink representor.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1759094723-843774-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
**Why Backport**
- Rejects switchdev activation when the PF netdev sits in a different
netns than the devlink instance, avoiding the broken state where OVS
loses control of the split representors
(`drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3842-3847`).
- New helper grabs the uplink netdev safely via the existing ref-counted
accessor and sets `netns_immutable` under RTNL so the PF behaves like
the eventual uplink representor, while immediately detecting namespace
divergence
(`drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3777-3797`;
`drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h:48-64`).
- If the mode change later fails, the helper rolls the flag back to keep
legacy behavior untouched; successful transitions keep the flag set,
matching switchdev guidance to freeze port namespaces
(`drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:3867-3869`;
`Documentation/networking/switchdev.rst:130-143`).
- Locking the namespace leverages the core check that rejects moves of
immutable interfaces (`net/core/dev.c:12352-12355`), eliminating the
race window the commit message highlights without touching data-path
code.
- The change is tightly scoped to the mode-set path, has no dependencies
on new infrastructure, and fixes a long-standing, user-visible bug
with minimal regression risk—strong fit for stable kernels that ship
mlx5 switchdev support.
.../mellanox/mlx5/core/eswitch_offloads.c | 33 +++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index f358e8fe432cf..59a1a3a5fc8b5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3739,6 +3739,29 @@ void mlx5_eswitch_unblock_mode(struct mlx5_core_dev *dev)
up_write(&esw->mode_lock);
}
+/* Returns false only when uplink netdev exists and its netns is different from
+ * devlink's netns. True for all others so entering switchdev mode is allowed.
+ */
+static bool mlx5_devlink_netdev_netns_immutable_set(struct devlink *devlink,
+ bool immutable)
+{
+ struct mlx5_core_dev *mdev = devlink_priv(devlink);
+ struct net_device *netdev;
+ bool ret;
+
+ netdev = mlx5_uplink_netdev_get(mdev);
+ if (!netdev)
+ return true;
+
+ rtnl_lock();
+ netdev->netns_immutable = immutable;
+ ret = net_eq(dev_net(netdev), devlink_net(devlink));
+ rtnl_unlock();
+
+ mlx5_uplink_netdev_put(mdev, netdev);
+ return ret;
+}
+
int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
struct netlink_ext_ack *extack)
{
@@ -3781,6 +3804,14 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
esw->eswitch_operation_in_progress = true;
up_write(&esw->mode_lock);
+ if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV &&
+ !mlx5_devlink_netdev_netns_immutable_set(devlink, true)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Can't change E-Switch mode to switchdev when netdev net namespace has diverged from the devlink's.");
+ err = -EINVAL;
+ goto skip;
+ }
+
if (mode == DEVLINK_ESWITCH_MODE_LEGACY)
esw->dev->priv.flags |= MLX5_PRIV_FLAGS_SWITCH_LEGACY;
mlx5_eswitch_disable_locked(esw);
@@ -3799,6 +3830,8 @@ int mlx5_devlink_eswitch_mode_set(struct devlink *devlink, u16 mode,
}
skip:
+ if (mode == DEVLINK_ESWITCH_MODE_SWITCHDEV && err)
+ mlx5_devlink_netdev_netns_immutable_set(devlink, false);
down_write(&esw->mode_lock);
esw->eswitch_operation_in_progress = false;
unlock:
--
2.51.0
* [PATCH AUTOSEL 6.17-6.1] net/mlx5e: Don't query FEC statistics when FEC is disabled
[not found] <20251025160905.3857885-1-sashal@kernel.org>
2025-10-25 15:57 ` [PATCH AUTOSEL 6.17-5.4] rds: Fix endianness annotation for RDS_MPATH_HASH Sasha Levin
2025-10-25 15:58 ` [PATCH AUTOSEL 6.17] net/mlx5e: Prevent entering switchdev mode with inconsistent netns Sasha Levin
@ 2025-10-25 15:59 ` Sasha Levin
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] RDMA/mana_ib: Drain send wrs of GSI QP Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17-5.15] RDMA/irdma: Update Kconfig Sasha Levin
4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2025-10-25 15:59 UTC (permalink / raw)
To: patches, stable
Cc: Carolina Jubran, Dragos Tatulea, Yael Chemla, Vadim Fedorenko,
Aleksandr Loktionov, Jakub Kicinski, Sasha Levin, saeedm, tariqt,
mbloch, netdev, linux-rdma
From: Carolina Jubran <cjubran@nvidia.com>
[ Upstream commit 6b81b8a0b1978284e007566d7a1607b47f92209f ]
Update mlx5e_stats_fec_get() to check the active FEC mode and skip
statistics collection when FEC is disabled.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20250924124037.1508846-3-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- `drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:1611` now checks
`mode == MLX5E_FEC_NOFEC` before touching the PPCNT register groups,
so the driver stops trying to read FEC statistics when hardware
reports that FEC is disabled. Previously `mlx5e_stats_fec_get()` still
called `fec_set_corrected_bits_total()` even in that state, so every
FEC stats query attempted an unsupported PPCNT access.
- Those reads go through `mlx5_core_access_reg()` with `verbose=true`
(`drivers/net/ethernet/mellanox/mlx5/core/port.c:36-83`), which means
firmware failures get logged and waste command bandwidth. Admins hit
this whenever tools poll FEC stats on links running without FEC, so it
is a user-visible bug.
- Passing the already computed `mode` into `fec_set_block_stats()`
(`drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:1448-1471` and
`:1616`) keeps the existing per-mode handling while avoiding redundant
`fec_active_mode()` calls; no other callers are affected, so the
change stays self-contained.
- The patch introduces no new features or interfaces—it simply avoids
querying counters that do not exist in the “no FEC” configuration—so
it satisfies stable rules (clear bug fix, minimal risk, contained to
the mlx5e stats code) and should be backported.
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index c6185ddba04b8..9c45c6e670ebf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -1446,16 +1446,13 @@ static void fec_set_rs_stats(struct ethtool_fec_stats *fec_stats, u32 *ppcnt)
}
static void fec_set_block_stats(struct mlx5e_priv *priv,
+ int mode,
struct ethtool_fec_stats *fec_stats)
{
struct mlx5_core_dev *mdev = priv->mdev;
u32 out[MLX5_ST_SZ_DW(ppcnt_reg)] = {};
u32 in[MLX5_ST_SZ_DW(ppcnt_reg)] = {};
int sz = MLX5_ST_SZ_BYTES(ppcnt_reg);
- int mode = fec_active_mode(mdev);
-
- if (mode == MLX5E_FEC_NOFEC)
- return;
MLX5_SET(ppcnt_reg, in, local_port, 1);
MLX5_SET(ppcnt_reg, in, grp, MLX5_PHYSICAL_LAYER_COUNTERS_GROUP);
@@ -1497,11 +1494,14 @@ static void fec_set_corrected_bits_total(struct mlx5e_priv *priv,
void mlx5e_stats_fec_get(struct mlx5e_priv *priv,
struct ethtool_fec_stats *fec_stats)
{
- if (!MLX5_CAP_PCAM_FEATURE(priv->mdev, ppcnt_statistical_group))
+ int mode = fec_active_mode(priv->mdev);
+
+ if (mode == MLX5E_FEC_NOFEC ||
+ !MLX5_CAP_PCAM_FEATURE(priv->mdev, ppcnt_statistical_group))
return;
fec_set_corrected_bits_total(priv, fec_stats);
- fec_set_block_stats(priv, fec_stats);
+ fec_set_block_stats(priv, mode, fec_stats);
}
#define PPORT_ETH_EXT_OFF(c) \
--
2.51.0
* [PATCH AUTOSEL 6.17] RDMA/mana_ib: Drain send wrs of GSI QP
[not found] <20251025160905.3857885-1-sashal@kernel.org>
` (2 preceding siblings ...)
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17-6.1] net/mlx5e: Don't query FEC statistics when FEC is disabled Sasha Levin
@ 2025-10-25 15:59 ` Sasha Levin
2025-10-25 16:00 ` [PATCH AUTOSEL 6.17-5.15] RDMA/irdma: Update Kconfig Sasha Levin
4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2025-10-25 15:59 UTC (permalink / raw)
To: patches, stable
Cc: Konstantin Taranov, Leon Romanovsky, Sasha Levin, longli,
linux-rdma
From: Konstantin Taranov <kotaranov@microsoft.com>
[ Upstream commit 44d69d3cf2e8047c279cbb9708f05e2c43e33234 ]
Drain send WRs of the GSI QP on device removal.
In rare servicing scenarios, the hardware may delete the
state of the GSI QP, preventing it from generating CQEs
for pending send WRs. Since WRs submitted to the GSI QP
hold CM resources, the device cannot be removed until
those WRs are completed. This patch marks all pending
send WRs as failed, allowing the GSI QP to release the CM
resources and enabling safe device removal.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1753779618-23629-1-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
- What it fixes
- Addresses a real hang during device removal when hardware loses GSI
QP state and stops generating send CQEs; pending GSI send WRs hold
CM resources and block teardown. The patch proactively fails those
WRs so CM can release resources and removal can proceed.
- Change details and rationale
- drivers/infiniband/hw/mana/cq.c: adds mana_drain_gsi_sqs(struct
mana_ib_dev *mdev)
- Obtains the GSI QP by number via mana_get_qp_ref(mdev,
MANA_GSI_QPN, false).
- Locates its send CQ: container_of(qp->ibqp.send_cq, struct
mana_ib_cq, ibcq).
- Under cq->cq_lock, iterates pending UD SQ shadow entries via
shadow_queue_get_next_to_complete(&qp->shadow_sq), marks each with
IB_WC_GENERAL_ERR, and advances with
shadow_queue_advance_next_to_complete(&qp->shadow_sq).
- After unlocking, invokes cq->ibcq.comp_handler if present to wake
pollers, then mana_put_qp_ref(qp).
- This directly resolves the “no CQE emitted” case by synthesizing
error completion for all outstanding GSI send WRs.
- drivers/infiniband/hw/mana/device.c: in mana_ib_remove, before
unregistering the device, calls mana_drain_gsi_sqs(dev) when running
as RNIC. This ensures all problematic GSI send WRs are drained
before teardown.
- drivers/infiniband/hw/mana/mana_ib.h: defines MANA_GSI_QPN (1) and
declares mana_drain_gsi_sqs(), making explicit that QP1 is the GSI
QP and exposing the drain helper.
- Scope, risk, and stable rules compliance
- Small, contained change:
- Touches only the MANA RDMA (mana_ib) driver and only the teardown
path and CQ-side helper.
- No UAPI/ABI change, no behavior outside removal-time GSI send WR
cleanups.
- No architectural changes:
- Adds a helper and a single removal-path call site; does not
refactor subsystem code.
- Low regression risk:
- Operates under existing cq->cq_lock to serialize with CQ
processing.
- Uses existing shadow queue primitives and error code
IB_WC_GENERAL_ERR.
- Only runs on device removal for RNIC devices; normal datapath
unaffected.
- User-visible impact is a fix for a hard-to-debug removal hang;
aligns with stable policy to backport important bug fixes.
- Security and reliability considerations
- Prevents a device-removal stall that can be a
reliability/availability issue (potential DoS via stuck teardown).
The fix reduces hang risk without exposing new attack surface.
- Dependencies and backport considerations
- The patch relies on:
- Presence of GSI/UD QP support and the shadow SQ/RQ infrastructure
(e.g., qp->shadow_sq, shadow_queue_get_next_to_complete(), and
cq->cq_lock).
- The GSI QP number being 1 (MANA_GSI_QPN).
- A mana_get_qp_ref() with a boolean third parameter in the target
tree (some branches have a 2-argument variant; trivial adaptation
may be required).
- Conclusion: Backport to stable series that already include MANA’s
GSI/UD QP and shadow-queue CQ processing. It’s not applicable to
older trees that lack UD/GSI support in this driver.
Why this meets stable criteria
- Fixes an important, user-affecting bug (device removal hang).
- Minimal, well-scoped change in a single driver.
- No new features or interface changes.
- Low risk of regression; guarded by existing locking and only active
during removal.
- RDMA driver-local; no core subsystem impact.
Given the above, this is a strong candidate for backporting to relevant
stable kernels that contain the corresponding GSI/UD code paths.
drivers/infiniband/hw/mana/cq.c | 26 ++++++++++++++++++++++++++
drivers/infiniband/hw/mana/device.c | 3 +++
drivers/infiniband/hw/mana/mana_ib.h | 3 +++
3 files changed, 32 insertions(+)
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index 28e154bbb50f8..1becc87791235 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -291,6 +291,32 @@ static int mana_process_completions(struct mana_ib_cq *cq, int nwc, struct ib_wc
return wc_index;
}
+void mana_drain_gsi_sqs(struct mana_ib_dev *mdev)
+{
+ struct mana_ib_qp *qp = mana_get_qp_ref(mdev, MANA_GSI_QPN, false);
+ struct ud_sq_shadow_wqe *shadow_wqe;
+ struct mana_ib_cq *cq;
+ unsigned long flags;
+
+ if (!qp)
+ return;
+
+ cq = container_of(qp->ibqp.send_cq, struct mana_ib_cq, ibcq);
+
+ spin_lock_irqsave(&cq->cq_lock, flags);
+ while ((shadow_wqe = shadow_queue_get_next_to_complete(&qp->shadow_sq))
+ != NULL) {
+ shadow_wqe->header.error_code = IB_WC_GENERAL_ERR;
+ shadow_queue_advance_next_to_complete(&qp->shadow_sq);
+ }
+ spin_unlock_irqrestore(&cq->cq_lock, flags);
+
+ if (cq->ibcq.comp_handler)
+ cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+
+ mana_put_qp_ref(qp);
+}
+
int mana_ib_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
{
struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
diff --git a/drivers/infiniband/hw/mana/device.c b/drivers/infiniband/hw/mana/device.c
index fa60872f169f4..bdeddb642b877 100644
--- a/drivers/infiniband/hw/mana/device.c
+++ b/drivers/infiniband/hw/mana/device.c
@@ -230,6 +230,9 @@ static void mana_ib_remove(struct auxiliary_device *adev)
{
struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev);
+ if (mana_ib_is_rnic(dev))
+ mana_drain_gsi_sqs(dev);
+
ib_unregister_device(&dev->ib_dev);
dma_pool_destroy(dev->av_pool);
if (mana_ib_is_rnic(dev)) {
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 5d31034ac7fb3..af09a3e6ccb78 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -43,6 +43,8 @@
*/
#define MANA_AV_BUFFER_SIZE 64
+#define MANA_GSI_QPN (1)
+
struct mana_ib_adapter_caps {
u32 max_sq_id;
u32 max_rq_id;
@@ -718,6 +720,7 @@ int mana_ib_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
int mana_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
const struct ib_send_wr **bad_wr);
+void mana_drain_gsi_sqs(struct mana_ib_dev *mdev);
int mana_ib_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
int mana_ib_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
--
2.51.0
* [PATCH AUTOSEL 6.17-5.15] RDMA/irdma: Update Kconfig
[not found] <20251025160905.3857885-1-sashal@kernel.org>
` (3 preceding siblings ...)
2025-10-25 15:59 ` [PATCH AUTOSEL 6.17] RDMA/mana_ib: Drain send wrs of GSI QP Sasha Levin
@ 2025-10-25 16:00 ` Sasha Levin
4 siblings, 0 replies; 5+ messages in thread
From: Sasha Levin @ 2025-10-25 16:00 UTC (permalink / raw)
To: patches, stable
Cc: Tatyana Nikolova, Jacob Moroni, Leon Romanovsky, Sasha Levin,
linux-rdma
From: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
[ Upstream commit 060842fed53f77a73824c9147f51dc6746c1267a ]
Update Kconfig to add dependency on idpf module and
add IPU E2000 to the list of supported devices.
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20250827152545.2056-17-tatyana.e.nikolova@intel.com
Tested-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES – the new dependency avoids a real build break that appears as soon
as the GEN3/IDPF support series lands while keeping risk negligible.
- `drivers/infiniband/hw/irdma/ig3rdma_if.c` calls several IDPF-exported
helpers (`idpf_idc_rdma_vc_send_sync`, `idpf_idc_request_reset`,
`idpf_idc_vport_dev_ctrl`) unconditionally at lines 25, 73, and 188
(`drivers/infiniband/hw/irdma/ig3rdma_if.c:25`, `:73`, `:188`). If
`CONFIG_INFINIBAND_IRDMA` is enabled without `CONFIG_IDPF`, modpost
reports unresolved symbols and the build fails.
- The patch adds the missing `depends on IDPF` requirement to the
Kconfig entry (`drivers/infiniband/hw/irdma/Kconfig:6`), so broken
configurations are filtered out at menuconfig time instead of failing
late in the build.
- The help text tweak (`drivers/infiniband/hw/irdma/Kconfig:10-11`) is
purely informational and carries no risk.
- No functional behavior changes or architectural upheaval are involved;
it is a small, self-contained dependency fix squarely in stable’s
remit.
drivers/infiniband/hw/irdma/Kconfig | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/irdma/Kconfig b/drivers/infiniband/hw/irdma/Kconfig
index 5f49a58590ed7..0bd7e3fca1fbb 100644
--- a/drivers/infiniband/hw/irdma/Kconfig
+++ b/drivers/infiniband/hw/irdma/Kconfig
@@ -4,10 +4,11 @@ config INFINIBAND_IRDMA
depends on INET
depends on IPV6 || !IPV6
depends on PCI
- depends on ICE && I40E
+ depends on IDPF && ICE && I40E
select GENERIC_ALLOCATOR
select AUXILIARY_BUS
select CRC32
help
- This is an Intel(R) Ethernet Protocol Driver for RDMA driver
- that support E810 (iWARP/RoCE) and X722 (iWARP) network devices.
+ This is an Intel(R) Ethernet Protocol Driver for RDMA that
+ supports IPU E2000 (RoCEv2), E810 (iWARP/RoCEv2) and X722 (iWARP)
+ network devices.
--
2.51.0