Linux RDMA and InfiniBand development
* [PATCH] IB/mad: cap RMPP reassembly window size to bound find_seg_location walk
@ 2026-05-18 21:23 Michael Bommarito
  2026-05-19 14:46 ` Leon Romanovsky
From: Michael Bommarito @ 2026-05-18 21:23 UTC
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma, linux-kernel, stable, Vlad Dumitrescu, Or Har-Toov,
	Bob Pearson, Sean Hefty, Kees Cook

A peer on the same InfiniBand subnet or RoCEv2 L2 segment (or, for
RoCEv2 ports exposed to the internet, any peer that can reach UDP port
4791) can keep a target port's IB MAD kworker busy for milliseconds
per low-bandwidth burst. It does this by sending an RMPP management
transaction whose segments arrive in descending segment-number order.
The IBTA specification does not authenticate QP1 GMP traffic, so no
credentials are required. The problem is on the IB management path
(QP1 GMP RMPP reassembly), not the RDMA data plane, so verbs
throughput is unaffected. Deployments that raise recv_queue_size to
tune management throughput are hit hardest. The per-burst cost grows
as O(F^2), where F is the RMPP window size, and F scales with
recv_queue_size.

find_seg_location() in drivers/infiniband/core/mad_rmpp.c walks
rmpp_recv->rmpp_wc->rmpp_list backwards on every inbound RMPP DATA
segment to find where to insert it by segment number. Each walk is
O(N) in the number of queued segments. It runs under
spin_lock_irqsave(&rmpp_recv->lock) in kworker context, so F segments
delivered in descending order cost O(F^2) in total.

window_size() returns max(recv_queue.max_active >> 3, 1). With the
default recv_queue_size of 512, the window is 64 and a burst costs
time in the microsecond range. Tuned configurations with
recv_queue_size=8192 grow the window to 1024, and a single
low-bandwidth burst then keeps the per-port MAD kworker busy for
several milliseconds.

Cap the window at IB_MAD_RMPP_MAX_WINDOW (64) in window_size(), so
raising recv_queue_size for higher RX throughput no longer lengthens
the worst-case walk. 64 is the window the default configuration
already uses, and common RMPP users such as SA queries and
performance counter reads work well with it. A more complete fix
would convert rmpp_recv->rmpp_wc->rmpp_list to an rb_tree keyed by
seg_num and remove the cap. That would be similar to the rb_tree TCP
uses for its out-of-order queue (see tcp_data_queue_ofo()), whose
handling was hardened after CVE-2018-5390. Until then, the cap bounds
the walk.

Fixes: fa619a77046b ("[PATCH] IB: Add RMPP implementation")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
I reproduced this on v7.1-rc2 under x86_64 QEMU/KVM (4 vCPUs) with
CONFIG_RDMA_RXE and CONFIG_INFINIBAND_USER_MAD. The setup used a veth
pair carrying two rdma_rxe links and raw RoCEv2 (UDP/4791) packet
injection with descending seg_num while holding segment 1.

Without the cap, an F=1024 burst produced 1022 paired continue_rmpp()
invocations. Per-call walker time grew from ~1 us (early, near-empty
list) to ~5 us (late, ~1000-deep list), roughly a 5x per-call
increase as the list deepened. Aggregate walker time per burst was at
least 1.5 ms (a lower bound, given ftrace's 1 us granularity).

With the cap, the same F=1024 burst dropped to ~0.28 ms aggregate (a
5.4x reduction). A legitimate in-window F=32 RMPP transfer still
completed normally (30 walker calls, avg 1.5 us, max 3 us).

There is no RMPP-specific selftest under
tools/testing/selftests/drivers/net/rdma/ in v7.1-rc2, and the
rdma_rxe selftests do not exercise QP1 GMP RMPP reassembly, so there
are no in-tree selftest results to report.

 drivers/infiniband/core/mad_rmpp.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
index 17c4c52a19e4c..4d55b133c689c 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -391,9 +391,25 @@ static inline struct ib_mad_recv_buf *get_next_seg(struct list_head *rmpp_list,
 	return container_of(seg->list.next, struct ib_mad_recv_buf, list);
 }
 
+/*
+ * Cap the per-transaction RMPP receive window. find_seg_location()
+ * walks the reassembly list in reverse to find each insertion point,
+ * so a window of N segments delivered in descending order costs
+ * O(N^2), all under spin_lock_irqsave(&rmpp_recv->lock) in kworker
+ * context. The default recv_queue_size of 512 gives a window of 64,
+ * which keeps that cost small; recv_queue_size can be raised to 8192,
+ * which gives a window of 1024 and lets an unauthenticated peer on
+ * QP1 GMP stall the per-port MAD kworker with a low-bandwidth burst.
+ * Clamp the window to IB_MAD_RMPP_MAX_WINDOW so the worst-case walk
+ * stays bounded no matter how recv_queue_size is tuned.
+ */
+#define IB_MAD_RMPP_MAX_WINDOW 64
+
 static inline int window_size(struct ib_mad_agent_private *agent)
 {
-	return max(agent->qp_info->recv_queue.max_active >> 3, 1);
+	int wsize = agent->qp_info->recv_queue.max_active >> 3;
+
+	return clamp(wsize, 1, IB_MAD_RMPP_MAX_WINDOW);
 }
 
 static struct ib_mad_recv_buf *find_seg_location(struct list_head *rmpp_list,
-- 
2.53.0



* Re: [PATCH] IB/mad: cap RMPP reassembly window size to bound find_seg_location walk
@ 2026-05-19 14:46 ` Leon Romanovsky
From: Leon Romanovsky @ 2026-05-19 14:46 UTC
  To: Michael Bommarito
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel, stable,
	Vlad Dumitrescu, Or Har-Toov, Bob Pearson, Sean Hefty, Kees Cook

On Mon, May 18, 2026 at 05:23:36PM -0400, Michael Bommarito wrote:
> [...]

Please rewrite this patch in human language without AI slop.
While working on this, please ensure that your commit message clearly
explains why the change is needed and what issue it actually resolves.

Thanks

