* [PATCH net 0/2] net: mana: Fix NULL dereferences during teardown after attach failure.
@ 2026-05-18 19:43 Dipayaan Roy
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
2026-05-18 19:43 ` [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached Dipayaan Roy
0 siblings, 2 replies; 10+ messages in thread
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
When mana_attach() fails (e.g. during queue allocation), the error
cleanup frees apc->tx_qp and apc->rxqs and sets them to NULL. Multiple
subsequent teardown paths can then dereference these NULL pointers,
causing kernel panics.
Patch 1 adds NULL guards in the low-level teardown functions
(mana_fence_rqs, mana_destroy_vport, mana_dealloc_queues) so they are
safe to call regardless of queue initialization state. This covers all
callers: mana_remove(), mana_change_mtu() recovery, and internal error
paths in mana_alloc_queues().
Patch 2 addresses the queue reset work handler specifically, where an
unconditional mana_detach() on an already-detached port caused
redundant teardown. It checks netif_device_present() to skip the detach
and directly retry mana_attach().
Dipayaan Roy (2):
net: mana: Add NULL guards in teardown path to prevent panic on attach
failure
net: mana: Skip redundant detach in queue reset handler if already
detached
drivers/net/ethernet/microsoft/mana/mana_en.c | 77 ++++++++++++-------
1 file changed, 48 insertions(+), 29 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
2026-05-18 19:43 [PATCH net 0/2] net: mana: Fix NULL dereferences during teardown after attach failure Dipayaan Roy
@ 2026-05-18 19:43 ` Dipayaan Roy
2026-05-18 21:49 ` Harshitha Ramamurthy
` (2 more replies)
2026-05-18 19:43 ` [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached Dipayaan Roy
1 sibling, 3 replies; 10+ messages in thread
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. Multiple teardown paths such as
mana_remove(), mana_change_mtu() recovery, and internal error handling
in mana_alloc_queues() can subsequently call into functions that
dereference these pointers without NULL checks:
- mana_chn_setxdp() dereferences apc->rxqs[0], causing a NULL pointer
dereference panic (CR2: 0000000000000000 at mana_chn_setxdp+0x26).
- mana_destroy_vport() iterates apc->rxqs without a NULL check.
- mana_fence_rqs() iterates apc->rxqs without a NULL check.
- mana_dealloc_queues() iterates apc->tx_qp without a NULL check.
Add NULL guards for apc->rxqs in mana_fence_rqs(),
mana_destroy_vport(), and before the mana_chn_setxdp() call. Add a
NULL guard for apc->tx_qp in mana_dealloc_queues() to skip TX queue
draining when TX queues were never allocated or already freed.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 70 +++++++++++--------
1 file changed, 41 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9afc786b297a..0582803907a8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1727,6 +1727,9 @@ static void mana_fence_rqs(struct mana_port_context *apc)
struct mana_rxq *rxq;
int err;
+ if (!apc->rxqs)
+ return;
+
for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
rxq = apc->rxqs[rxq_idx];
err = mana_fence_rq(apc, rxq);
@@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
struct mana_rxq *rxq;
u32 rxq_idx;
- for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
- rxq = apc->rxqs[rxq_idx];
- if (!rxq)
- continue;
+ if (apc->rxqs) {
- mana_destroy_rxq(apc, rxq, true);
- apc->rxqs[rxq_idx] = NULL;
+ for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
+ rxq = apc->rxqs[rxq_idx];
+ if (!rxq)
+ continue;
+
+ mana_destroy_rxq(apc, rxq, true);
+ apc->rxqs[rxq_idx] = NULL;
+ }
}
mana_destroy_txq(apc);
@@ -3269,7 +3275,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
if (apc->port_is_up)
return -EINVAL;
- mana_chn_setxdp(apc, NULL);
+ if (apc->rxqs)
+ mana_chn_setxdp(apc, NULL);
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
mana_pf_deregister_filter(apc);
@@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
* number of queues.
*/
- for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
- tsleep = 1000;
- while (atomic_read(&txq->pending_sends) > 0 &&
- time_before(jiffies, timeout)) {
- usleep_range(tsleep, tsleep + 1000);
- tsleep <<= 1;
- }
- if (atomic_read(&txq->pending_sends)) {
- err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
- if (err) {
- netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
- err, atomic_read(&txq->pending_sends),
- txq->gdma_txq_id);
+ if (apc->tx_qp) {
+ for (i = 0; i < apc->num_queues; i++) {
+ txq = &apc->tx_qp[i].txq;
+ tsleep = 1000;
+ while (atomic_read(&txq->pending_sends) > 0 &&
+ time_before(jiffies, timeout)) {
+ usleep_range(tsleep, tsleep + 1000);
+ tsleep <<= 1;
+ }
+ if (atomic_read(&txq->pending_sends)) {
+ err =
+ pcie_flr(to_pci_dev(gd->gdma_context->dev));
+ if (err) {
+ netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
+ err,
+ atomic_read(&txq->pending_sends),
+ txq->gdma_txq_id);
+ }
+ break;
}
- break;
}
- }
- for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
- while ((skb = skb_dequeue(&txq->pending_skbs))) {
- mana_unmap_skb(skb, apc);
- dev_kfree_skb_any(skb);
+ for (i = 0; i < apc->num_queues; i++) {
+ txq = &apc->tx_qp[i].txq;
+ while ((skb = skb_dequeue(&txq->pending_skbs))) {
+ mana_unmap_skb(skb, apc);
+ dev_kfree_skb_any(skb);
+ }
+ atomic_set(&txq->pending_sends, 0);
}
- atomic_set(&txq->pending_sends, 0);
}
+
/* We're 100% sure the queues can no longer be woken up, because
* we're sure now mana_poll_tx_cq() can't be running.
*/
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
2026-05-18 19:43 [PATCH net 0/2] net: mana: Fix NULL dereferences during teardown after attach failure Dipayaan Roy
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
@ 2026-05-18 19:43 ` Dipayaan Roy
2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
1 sibling, 2 replies; 10+ messages in thread
From: Dipayaan Roy @ 2026-05-18 19:43 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
When mana_per_port_queue_reset_work_handler() runs, it unconditionally
calls mana_detach() which attempts to tear down queues that are already
freed, leading to NULL pointer dereferences on apc->tx_qp and
apc->rxqs.
Check netif_device_present() in the reset handler and skip
mana_detach() when the port is already in detached state. This avoids
the redundant teardown and goes directly to mana_attach() to retry
bringing the port back up.
Fixes: 3b194343c250 ("net: mana: Implement ndo_tx_timeout and serialize queue resets per port.")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 0582803907a8..907efadf6fd6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
goto out;
}
+ /* If already detached (indicates detach succeeded but attach failed
+ * previously). Now skip mana detach and just retry mana_attach.
+ */
+ if (!netif_device_present(ndev))
+ goto attach;
+
err = mana_detach(ndev, false);
if (err) {
netdev_err(ndev, "mana_detach failed: %d\n", err);
goto dealloc_pre_rxbufs;
}
+attach:
err = mana_attach(ndev);
if (err)
netdev_err(ndev, "mana_attach failed: %d\n", err);
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
@ 2026-05-18 21:49 ` Harshitha Ramamurthy
2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
2 siblings, 0 replies; 10+ messages in thread
From: Harshitha Ramamurthy @ 2026-05-18 21:49 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
On Mon, May 18, 2026 at 12:52 PM Dipayaan Roy
<dipayanroy@linux.microsoft.com> wrote:
>
> When queue allocation fails partway through, the error cleanup frees
> and NULLs apc->tx_qp and apc->rxqs. Multiple teardown paths such as
> mana_remove(), mana_change_mtu() recovery, and internal error handling
> in mana_alloc_queues() can subsequently call into functions that
> dereference these pointers without NULL checks:
>
> - mana_chn_setxdp() dereferences apc->rxqs[0], causing a NULL pointer
> dereference panic (CR2: 0000000000000000 at mana_chn_setxdp+0x26).
> - mana_destroy_vport() iterates apc->rxqs without a NULL check.
> - mana_fence_rqs() iterates apc->rxqs without a NULL check.
> - mana_dealloc_queues() iterates apc->tx_qp without a NULL check.
>
> Add NULL guards for apc->rxqs in mana_fence_rqs(),
> mana_destroy_vport(), and before the mana_chn_setxdp() call. Add a
> NULL guard for apc->tx_qp in mana_dealloc_queues() to skip TX queue
> draining when TX queues were never allocated or already freed.
>
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 70 +++++++++++--------
> 1 file changed, 41 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a..0582803907a8 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1727,6 +1727,9 @@ static void mana_fence_rqs(struct mana_port_context *apc)
> struct mana_rxq *rxq;
> int err;
>
> + if (!apc->rxqs)
> + return;
> +
> for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> rxq = apc->rxqs[rxq_idx];
> err = mana_fence_rq(apc, rxq);
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> struct mana_rxq *rxq;
> u32 rxq_idx;
>
> - for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> - rxq = apc->rxqs[rxq_idx];
> - if (!rxq)
> - continue;
> + if (apc->rxqs) {
>
> - mana_destroy_rxq(apc, rxq, true);
> - apc->rxqs[rxq_idx] = NULL;
> + for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> + rxq = apc->rxqs[rxq_idx];
> + if (!rxq)
> + continue;
> +
> + mana_destroy_rxq(apc, rxq, true);
> + apc->rxqs[rxq_idx] = NULL;
> + }
> }
>
> mana_destroy_txq(apc);
> @@ -3269,7 +3275,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
> if (apc->port_is_up)
> return -EINVAL;
>
> - mana_chn_setxdp(apc, NULL);
> + if (apc->rxqs)
> + mana_chn_setxdp(apc, NULL);
Is this check required? mana_dealloc_queues() is only called by
mana_detach(). And mana_detach() calls `mana_dealloc_queues()` only if
the port state was previously up. apc->port_is_up seems to be set to
true only on successful queue allocation. Can apc->rxqs be NULL here?
>
> if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
> mana_pf_deregister_filter(apc);
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> * number of queues.
> */
>
> - for (i = 0; i < apc->num_queues; i++) {
> - txq = &apc->tx_qp[i].txq;
> - tsleep = 1000;
> - while (atomic_read(&txq->pending_sends) > 0 &&
> - time_before(jiffies, timeout)) {
> - usleep_range(tsleep, tsleep + 1000);
> - tsleep <<= 1;
> - }
> - if (atomic_read(&txq->pending_sends)) {
> - err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> - if (err) {
> - netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> - err, atomic_read(&txq->pending_sends),
> - txq->gdma_txq_id);
> + if (apc->tx_qp) {
And same as above, is it possible for apc->tx_qp to be NULL here?
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + tsleep = 1000;
> + while (atomic_read(&txq->pending_sends) > 0 &&
> + time_before(jiffies, timeout)) {
> + usleep_range(tsleep, tsleep + 1000);
> + tsleep <<= 1;
> + }
> + if (atomic_read(&txq->pending_sends)) {
> + err =
> + pcie_flr(to_pci_dev(gd->gdma_context->dev));
> + if (err) {
> + netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> + err,
> + atomic_read(&txq->pending_sends),
> + txq->gdma_txq_id);
> + }
> + break;
> }
> - break;
> }
> - }
>
> - for (i = 0; i < apc->num_queues; i++) {
> - txq = &apc->tx_qp[i].txq;
> - while ((skb = skb_dequeue(&txq->pending_skbs))) {
> - mana_unmap_skb(skb, apc);
> - dev_kfree_skb_any(skb);
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + while ((skb = skb_dequeue(&txq->pending_skbs))) {
> + mana_unmap_skb(skb, apc);
> + dev_kfree_skb_any(skb);
> + }
> + atomic_set(&txq->pending_sends, 0);
> }
> - atomic_set(&txq->pending_sends, 0);
> }
> +
> /* We're 100% sure the queues can no longer be woken up, because
> * we're sure now mana_poll_tx_cq() can't be running.
> */
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
2026-05-18 21:49 ` Harshitha Ramamurthy
@ 2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
2 siblings, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
To: Dipayaan Roy; +Cc: linux-hyperv, bpf
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Missing re-allocation of `apc->rxqs` during interface bring-up causes deterministic NULL pointer dereference
- [High] Uninitialized queue handles (0 and INVALID_MANA_HANDLE) are sent to the firmware during error teardown, corrupting hardware state
- [High] GDMA admin commands are issued immediately after a PCI Function Level Reset (FLR), causing severe cascading timeouts and hardware state violations
--
commit 2129283d37aabfe724d7509e34346eab93d302a2
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>
net: mana: Add NULL guards in teardown path to prevent panic on attach failure
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> struct mana_rxq *rxq;
> u32 rxq_idx;
>
> - for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> - rxq = apc->rxqs[rxq_idx];
> - if (!rxq)
> - continue;
> + if (apc->rxqs) {
>
> - mana_destroy_rxq(apc, rxq, true);
> - apc->rxqs[rxq_idx] = NULL;
> + for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> + rxq = apc->rxqs[rxq_idx];
> + if (!rxq)
> + continue;
> +
> + mana_destroy_rxq(apc, rxq, true);
> + apc->rxqs[rxq_idx] = NULL;
> + }
> }
If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?
When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:
apc->rxqs[i] = rxq;
If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?
[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> * number of queues.
> */
>
> - for (i = 0; i < apc->num_queues; i++) {
> - txq = &apc->tx_qp[i].txq;
> - tsleep = 1000;
> - while (atomic_read(&txq->pending_sends) > 0 &&
> - time_before(jiffies, timeout)) {
> - usleep_range(tsleep, tsleep + 1000);
> - tsleep <<= 1;
> - }
> - if (atomic_read(&txq->pending_sends)) {
> - err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> - if (err) {
> - netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> - err, atomic_read(&txq->pending_sends),
> - txq->gdma_txq_id);
> + if (apc->tx_qp) {
In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?
If mana_destroy_txq() unconditionally iterates over all apc->num_queues:
for (i = 0; i < apc->num_queues; i++) {
...
mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
...
Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?
Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + tsleep = 1000;
> + while (atomic_read(&txq->pending_sends) > 0 &&
> + time_before(jiffies, timeout)) {
> + usleep_range(tsleep, tsleep + 1000);
> + tsleep <<= 1;
> + }
> + if (atomic_read(&txq->pending_sends)) {
> + err =
> + pcie_flr(to_pci_dev(gd->gdma_context->dev));
If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.
Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?
Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?
> + if (err) {
> + netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> + err,
> + atomic_read(&txq->pending_sends),
> + txq->gdma_txq_id);
> + }
> + break;
> }
> - break;
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=1
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
2026-05-18 19:43 ` [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached Dipayaan Roy
@ 2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
1 sibling, 0 replies; 10+ messages in thread
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
To: Dipayaan Roy; +Cc: bpf, linux-hyperv
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The patch correctly prevents `mana_detach()` from running on an already-detached device in the reset handler, but identical bugs remain in `mana_change_mtu()`, `mana_set_channels()`, `mana_set_ringparam()`, and `mana_bpf()`.
--
commit bac73a4812067e75feeec962b27ab9647e30ed9c
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>
net: mana: Skip redundant detach in queue reset handler if already detached
This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> + /* If already detached (indicates detach succeeded but attach failed
> + * previously). Now skip mana detach and just retry mana_attach.
> + */
> + if (!netif_device_present(ndev))
> + goto attach;
> +
> err = mana_detach(ndev, false);
The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?
If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).
If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.
Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.
The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).
Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=2
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
2026-05-18 21:49 ` Harshitha Ramamurthy
2026-05-19 19:47 ` sashiko-bot
@ 2026-05-19 22:55 ` Jakub Kicinski
2026-05-20 18:11 ` Dipayaan Roy
2 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
To: dipayanroy
Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
davem, edumazet, pabeni, leon, longli, kotaranov, horms,
shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
yury.norov
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Add NULL guards in teardown path to prevent panic on attach failure
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> struct mana_rxq *rxq;
> u32 rxq_idx;
>
> - for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> - rxq = apc->rxqs[rxq_idx];
> - if (!rxq)
> - continue;
> + if (apc->rxqs) {
>
> - mana_destroy_rxq(apc, rxq, true);
> - apc->rxqs[rxq_idx] = NULL;
> + for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> + rxq = apc->rxqs[rxq_idx];
> + if (!rxq)
> + continue;
> +
> + mana_destroy_rxq(apc, rxq, true);
> + apc->rxqs[rxq_idx] = NULL;
> + }
> }
If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?
When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:
apc->rxqs[i] = rxq;
If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?
[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> * number of queues.
> */
>
> - for (i = 0; i < apc->num_queues; i++) {
> - txq = &apc->tx_qp[i].txq;
> - tsleep = 1000;
> - while (atomic_read(&txq->pending_sends) > 0 &&
> - time_before(jiffies, timeout)) {
> - usleep_range(tsleep, tsleep + 1000);
> - tsleep <<= 1;
> - }
> - if (atomic_read(&txq->pending_sends)) {
> - err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> - if (err) {
> - netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> - err, atomic_read(&txq->pending_sends),
> - txq->gdma_txq_id);
> + if (apc->tx_qp) {
In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?
If mana_destroy_txq() unconditionally iterates over all apc->num_queues:
for (i = 0; i < apc->num_queues; i++) {
...
mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
...
Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?
Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + tsleep = 1000;
> + while (atomic_read(&txq->pending_sends) > 0 &&
> + time_before(jiffies, timeout)) {
> + usleep_range(tsleep, tsleep + 1000);
> + tsleep <<= 1;
> + }
> + if (atomic_read(&txq->pending_sends)) {
> + err =
> + pcie_flr(to_pci_dev(gd->gdma_context->dev));
If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.
Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?
Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?
> + if (err) {
> + netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> + err,
> + atomic_read(&txq->pending_sends),
> + txq->gdma_txq_id);
> + }
> + break;
> }
> - break;
> }
--
pw-bot: cr
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
2026-05-18 19:43 ` [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached Dipayaan Roy
2026-05-19 19:47 ` sashiko-bot
@ 2026-05-19 22:55 ` Jakub Kicinski
2026-05-20 18:23 ` Dipayaan Roy
1 sibling, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
To: dipayanroy
Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
davem, edumazet, pabeni, leon, longli, kotaranov, horms,
shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
yury.norov
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Skip redundant detach in queue reset handler if already detached
This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> + /* If already detached (indicates detach succeeded but attach failed
> + * previously). Now skip mana detach and just retry mana_attach.
> + */
> + if (!netif_device_present(ndev))
> + goto attach;
> +
> err = mana_detach(ndev, false);
The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?
If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).
If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.
Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.
The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).
Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
2026-05-19 22:55 ` Jakub Kicinski
@ 2026-05-20 18:11 ` Dipayaan Roy
0 siblings, 0 replies; 10+ messages in thread
From: Dipayaan Roy @ 2026-05-20 18:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
On Tue, May 19, 2026 at 03:55:56PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Add NULL guards in teardown path to prevent panic on attach failure
>
> When queue allocation fails partway through, the error cleanup frees
> and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
> pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
> and mana_dealloc_queues() to prevent NULL pointer dereferences.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9afc786b297a8..9e7e4bf526bf6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [ ... ]
> > @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
> > struct mana_rxq *rxq;
> > u32 rxq_idx;
> >
> > - for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > - rxq = apc->rxqs[rxq_idx];
> > - if (!rxq)
> > - continue;
> > + if (apc->rxqs) {
> >
> > - mana_destroy_rxq(apc, rxq, true);
> > - apc->rxqs[rxq_idx] = NULL;
> > + for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> > + rxq = apc->rxqs[rxq_idx];
> > + if (!rxq)
> > + continue;
> > +
> > + mana_destroy_rxq(apc, rxq, true);
> > + apc->rxqs[rxq_idx] = NULL;
> > + }
> > }
>
> If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
> how does the interface recover if the user subsequently brings it back up?
>
> When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
> mana_alloc_queues() -> mana_add_rx_queues(), the code does:
>
> apc->rxqs[i] = rxq;
>
> If apc->rxqs was left as NULL during the teardown, will this cause a NULL
> pointer dereference on the next interface bring-up?
>
> [ ... ]
Hi Jakub,
The only path that recovers from this state is mana_attach(), which
calls mana_init_port() -> mana_init_port_context() to re-allocate
apc->rxqs before calling mana_alloc_queues().
When mana_open is invoked prior to it the rxqs would be already setup by
mana_probe_port -> mana_init_port -> mana_init_port_context.
> > @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
> > * number of queues.
> > */
> >
> > - for (i = 0; i < apc->num_queues; i++) {
> > - txq = &apc->tx_qp[i].txq;
> > - tsleep = 1000;
> > - while (atomic_read(&txq->pending_sends) > 0 &&
> > - time_before(jiffies, timeout)) {
> > - usleep_range(tsleep, tsleep + 1000);
> > - tsleep <<= 1;
> > - }
> > - if (atomic_read(&txq->pending_sends)) {
> > - err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> > - if (err) {
> > - netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > - err, atomic_read(&txq->pending_sends),
> > - txq->gdma_txq_id);
> > + if (apc->tx_qp) {
>
> In the error path where queue creation fails partway through, will
> mana_destroy_txq() be called?
>
> If mana_destroy_txq() unconditionally iterates over all apc->num_queues:
>
> for (i = 0; i < apc->num_queues; i++) {
> ...
> mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
> ...
>
> Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
> that were never fully initialized?
>
> Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
> with these uninitialized handles, and could handle 0 corrupt the firmware
> state?
>
This is a known exisiting issue in the cleaup path for partial tx
setups and was also mentioned in the a recent patch where the rx partial
init clean-up path was fixed. My colleague Aditya is already working on a
patch to fix this and all other issues in the tx cleanup path.
> > + for (i = 0; i < apc->num_queues; i++) {
> > + txq = &apc->tx_qp[i].txq;
> > + tsleep = 1000;
> > + while (atomic_read(&txq->pending_sends) > 0 &&
> > + time_before(jiffies, timeout)) {
> > + usleep_range(tsleep, tsleep + 1000);
> > + tsleep <<= 1;
> > + }
> > + if (atomic_read(&txq->pending_sends)) {
> > + err =
> > + pcie_flr(to_pci_dev(gd->gdma_context->dev));
>
> If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
> internal GDMA state.
>
> Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
> and mana_destroy_vport() are called. Will these functions attempt to send
> commands to the now-destroyed GDMA admin queue via mana_send_request() and
> wait for completions?
>
> Could this lead to cascading timeouts since the hardware was just reset and
> cannot process admin commands?
>
This is pre-existing behavior/issue the pcie_flr() path and the
subsequent calls to mana_disable_vport_rx()/mana_fence_rqs()/mana_destroy_vport()
are unchanged by this patch. My patch only adds a NULL guard for
apc->tx_qp to skip this entire TX drain block when tx_qp is already
freed. But this point seems to be valid issue, I will investigate this
further and I will send a different patch set for this as follow up work.
> > + if (err) {
> > + netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> > + err,
> > + atomic_read(&txq->pending_sends),
> > + txq->gdma_txq_id);
> > + }
> > + break;
> > }
> > - break;
> > }
> --
> pw-bot: cr
Regards
Dipayaan Roy
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
2026-05-19 22:55 ` Jakub Kicinski
@ 2026-05-20 18:23 ` Dipayaan Roy
0 siblings, 0 replies; 10+ messages in thread
From: Dipayaan Roy @ 2026-05-20 18:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
On Tue, May 19, 2026 at 03:55:59PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: Skip redundant detach in queue reset handler if already detached
>
> This commit adds a check for netif_device_present() in the queue reset
> handler. If the device is already detached, it skips the redundant
> mana_detach() call and proceeds directly to mana_attach().
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9e7e4bf526bf6..740f3c34db451 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
> [ ... ]
> > + /* If already detached (indicates detach succeeded but attach failed
> > + * previously). Now skip mana detach and just retry mana_attach.
> > + */
> > + if (!netif_device_present(ndev))
> > + goto attach;
> > +
> > err = mana_detach(ndev, false);
>
> The fix here correctly prevents mana_detach() from running on an already
> detached device. However, do identical issues remain in mana_change_mtu(),
> mana_set_channels(), mana_set_ringparam(), and mana_bpf()?
>
> If mana_attach() previously failed (for example, due to out of memory),
> the port is left with apc->port_is_up = false, but it preserves
> apc->port_st_save = true, and calls netif_device_detach(ndev).
>
> If an admin tries to recover the port by retrying a configuration command
> like changing the MTU or attaching XDP, these functions appear to
> unconditionally call mana_detach(ndev, false) without the
> !netif_device_present(ndev) check added here.
>
> Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
> port_is_up is already false, port_st_save is overwritten to false,
> permanently erasing the intent that the port should be up.
>
> The subsequent call to mana_attach() would then see port_st_save == false,
> skip mana_alloc_queues(), set port_is_up = false, and successfully call
> netif_device_attach(ndev).
>
> Would this result in the configuration command returning 0 to userspace
> while the port queues were never allocated, leaving the interface silently
> broken until manually toggled down and up?
Hi Jakub,
I agree that mana_change_mtu() and mana_xdp_set() would also need the same
netif_device_present() guard (or alternatively, an early return
when the port is already detached). I'll investigate and address those in a
follow-up patch to keep this fix minimal, as this patch
specifically fixes the reset handler that runs without user intervention,
unlike the ohers cases mentioned above.
Regards
Dipayaan Roy
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-05-20 18:23 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-18 19:43 [PATCH net 0/2] net: mana: Fix NULL dereferences during teardown after attach failure Dipayaan Roy
2026-05-18 19:43 ` [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on " Dipayaan Roy
2026-05-18 21:49 ` Harshitha Ramamurthy
2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
2026-05-20 18:11 ` Dipayaan Roy
2026-05-18 19:43 ` [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached Dipayaan Roy
2026-05-19 19:47 ` sashiko-bot
2026-05-19 22:55 ` Jakub Kicinski
2026-05-20 18:23 ` Dipayaan Roy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox