From: Michael Bommarito <michael.bommarito@gmail.com>
To: Zhu Yanjun <yanjun.zhu@linux.dev>,
Zhu Yanjun <zyjzyj2000@gmail.com>, Jason Gunthorpe <jgg@ziepe.ca>,
Leon Romanovsky <leon@kernel.org>
Cc: Bob Pearson <rpearsonhpe@gmail.com>,
linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] RDMA/rxe: insert mcg into mcg_tree only after rxe_mcast_add() succeeds
Date: Tue, 16 Jun 2026 22:27:28 -0400 [thread overview]
Message-ID: <20260617022728.2770116-1-michael.bommarito@gmail.com> (raw)
In-Reply-To: <1158f1de-4469-463c-91de-c5e24d2add4f@linux.dev>
rxe_get_mcg() publishes a newly allocated multicast group in
rxe->mcg_tree before programming the backing Ethernet multicast address
with rxe_mcast_add(), which runs outside mcg_lock. A local userspace
RDMA client reaches this path with ATTACH_MCAST on a UD QP; if
rxe_mcast_add() then returns an error (for example -ENODEV when the
backing netdev has been removed, or a propagated dev_mc_add() error),
the unwind frees the published group without removing it from the tree.
A later lookup of the same MGID dereferences the freed struct rxe_mcg
from __rxe_lookup_mcg().
Fix this by keeping the new mcg private until rxe_mcast_add() succeeds.
Split the tree publication into __rxe_publish_mcg(), call rxe_mcast_add()
before taking the tree reference, and free the still-private mcg on
failure. Because the group is never visible in mcg_tree until the
multicast address is programmed, no concurrent caller can look it up or
attach a QP to a group that is about to be torn down, so the error path
needs no conditional unwind. If another caller publishes the same MGID
while the address is being programmed, the post-add re-check under
mcg_lock finds the winner; this caller then drops its private object and
balances its own rxe_mcast_add() with rxe_mcast_del() before returning
the winner.
Reproduced by forcing the rxe_mcast_add() error return under KASAN:
without the change the next attach to the same MGID reports a
slab-use-after-free in __rxe_lookup_mcg(); with it the forced failure
returns cleanly. A no-injection attach/detach regression, including a
two-QP shared join/leave and re-attach, stays KASAN- and leak-clean.
Fixes: a926a903b7dc ("RDMA/rxe: Do not call dev_mc_add/del() under a spinlock")
Cc: stable@vger.kernel.org # v5.18+
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
v2: switch approach in response to Zhu Yanjun's review of v1
(https://lore.kernel.org/all/1158f1de-4469-463c-91de-c5e24d2add4f@linux.dev/). v1 unwound the
already-published mcg on the failure path; Zhu noted that because
rxe_mcast_add() runs outside mcg_lock, a concurrent caller can find
the published mcg and attach a QP in that window, so an unconditional
teardown would orphan those QPs and leak the mcg. Rather than make the
teardown conditional, v2 does not publish the mcg into mcg_tree until
after rxe_mcast_add() has succeeded, which removes the window entirely
(the safety of the two-CPU same-MGID race and the loser-side
rxe_mcast_del() balance is analysed in the commit log above). No
Fixes/stable change; this is still the same UAF.
v1: https://lore.kernel.org/all/20260614130443.2517578-1-michael.bommarito@gmail.com/
Reproduction
============
Tested on v7.1-rc4 (5200f5f493f7) on x86_64 QEMU/KVM with KASAN,
rxe, rdma_rxe, and userspace verbs against a Soft-RoCE device.
Conditions: the caller needs access to the rxe uverbs device and a UD
QP. The demonstrated error path requires rxe_mcast_add() to fail after
the mcg allocation; natural failures include netdev removal returning
-ENODEV and dev_mc_add() errors such as -ENOMEM. The test made that
return deterministic by forcing rxe_mcast_add() to return -ENODEV.
The private harness creates a UD QP, attaches an MGID, forces the
rxe_mcast_add() failure, and repeats attach on the same MGID so the
lookup walks the stale rb-node.
Stock: KASAN reports a slab-use-after-free in __rxe_lookup_mcg() while
comparing mcg->mgid on the next attach.
Patched: the forced-failure path returns without a KASAN report.
Regression: no-injection attach/detach, two-QP shared join/leave, and
re-attach complete without KASAN, WARN, or leak reports.
Mitigations: restrict access to the rxe uverbs device or avoid loading
the rxe driver where untrusted local users can create RDMA objects.
The harness is available off-list on request. The RDMA selftest gate was
checked; tools/testing/selftests/rdma does not contain a matching
rxe_mcast or ATTACH_MCAST coverage test.
drivers/infiniband/sw/rxe/rxe_mcast.c | 50 +++++++++++++++++++--------
1 file changed, 36 insertions(+), 14 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 5cad72073eca1..eaa259cc39ea9 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -175,7 +175,9 @@ struct rxe_mcg *rxe_lookup_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
* @mgid: multicast address as a gid
* @mcg: new mcg object
*
- * Context: caller should hold rxe->mcg lock
+ * Initializes the mcg fields. The mcg is private and not yet visible in
+ * mcg_tree, so this may run without rxe->mcg_lock; __rxe_publish_mcg()
+ * makes it visible under the lock once it is ready.
*/
static void __rxe_init_mcg(struct rxe_dev *rxe, union ib_gid *mgid,
struct rxe_mcg *mcg)
@@ -184,13 +186,22 @@ static void __rxe_init_mcg(struct rxe_dev *rxe, union ib_gid *mgid,
memcpy(&mcg->mgid, mgid, sizeof(mcg->mgid));
INIT_LIST_HEAD(&mcg->qp_list);
mcg->rxe = rxe;
+}
- /* caller holds a ref on mcg but that will be
- * dropped when mcg goes out of scope. We need to take a ref
- * on the pointer that will be saved in the red-black tree
- * by __rxe_insert_mcg and used to lookup mcg from mgid later.
- * Inserting mcg makes it visible to outside so this should
- * be done last after the object is ready.
+/**
+ * __rxe_publish_mcg - make a fully initialized mcg visible in mcg_tree
+ * @mcg: the mcg object
+ *
+ * Context: caller must hold rxe->mcg_lock and a reference on mcg
+ */
+static void __rxe_publish_mcg(struct rxe_mcg *mcg)
+{
+ /* caller holds a ref on mcg but that will be dropped when mcg goes
+ * out of scope. We need to take a ref on the pointer that will be
+ * saved in the red-black tree by __rxe_insert_mcg and used to lookup
+ * mcg from mgid later. Inserting mcg makes it visible to outside so
+ * this is done last after the object is ready and the multicast
+ * address has been programmed.
*/
kref_get(&mcg->ref_cnt);
__rxe_insert_mcg(mcg);
@@ -228,26 +239,37 @@ static struct rxe_mcg *rxe_get_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
err = -ENOMEM;
goto err_dec;
}
+ __rxe_init_mcg(rxe, mgid, mcg);
+
+ /* program the multicast address while mcg is still private, before
+ * it is inserted into mcg_tree. dev_mc_add() may sleep so this must
+ * run outside mcg_lock. On failure mcg was never published, so a
+ * plain free is correct and the tree is untouched.
+ */
+ err = rxe_mcast_add(rxe, mgid);
+ if (err) {
+ kfree(mcg);
+ goto err_dec;
+ }
spin_lock_bh(&rxe->mcg_lock);
- /* re-check to see if someone else just added it */
+ /* re-check to see if someone else just added it while we were adding
+ * the multicast address; if so use theirs and drop ours
+ */
tmp = __rxe_lookup_mcg(rxe, mgid);
if (tmp) {
spin_unlock_bh(&rxe->mcg_lock);
+ rxe_mcast_del(rxe, mgid);
atomic_dec(&rxe->mcg_num);
kfree(mcg);
return tmp;
}
- __rxe_init_mcg(rxe, mgid, mcg);
+ __rxe_publish_mcg(mcg);
spin_unlock_bh(&rxe->mcg_lock);
- /* add mcast address outside of lock */
- err = rxe_mcast_add(rxe, mgid);
- if (!err)
- return mcg;
+ return mcg;
- kfree(mcg);
err_dec:
atomic_dec(&rxe->mcg_num);
return ERR_PTR(err);
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
--
2.53.0
next prev parent reply other threads:[~2026-06-17 2:27 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-14 13:04 [PATCH] RDMA/rxe: destroy the mcg when rxe_mcast_add() fails in rxe_get_mcg() Michael Bommarito
2026-06-15 1:28 ` Zhu Yanjun
2026-06-17 2:27 ` Michael Bommarito [this message]
2026-06-17 15:50 ` [PATCH v2] RDMA/rxe: insert mcg into mcg_tree only after rxe_mcast_add() succeeds Zhu Yanjun
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260617022728.2770116-1-michael.bommarito@gmail.com \
--to=michael.bommarito@gmail.com \
--cc=jgg@ziepe.ca \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=rpearsonhpe@gmail.com \
--cc=yanjun.zhu@linux.dev \
--cc=zyjzyj2000@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox