* [PATCH net v2 0/2] Fix MANA RX with bounce buffering
@ 2026-06-24 22:26 Dexuan Cui
2026-06-24 22:26 ` [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU Dexuan Cui
2026-06-24 22:26 ` [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC Dexuan Cui
0 siblings, 2 replies; 4+ messages in thread
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.")
Dipayaan tried to fix this by avoiding page pool frags when bounce
buffering is in use [1][2]. However, that is not a clean solution: no
other NIC drivers need to explicitly check whether bounce buffering is
in use. It is also not good for throughput, since
dma_map_single()/dma_unmap_single() are then called for each incoming
packet.
In fact, page pool frags can still be used with the standard MTU of
1500: all we need is to add page_pool_dma_sync_for_cpu() before the CPU
reads the incoming packet, so I implemented that in v1 [3].
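To illustrate why the explicit sync matters with bounce buffering, here is a minimal userspace sketch. It is not driver code: struct rx_model, model_device_write(), model_sync_for_cpu() and demo_rx() are hypothetical names made up for this illustration, and the memcpy() only stands in for the copy-back that a sync-for-CPU performs when swiotlb is in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048u

/* Hypothetical model of one bounce-buffered RX buffer. */
struct rx_model {
	uint8_t orig[BUF_SIZE];   /* memory the CPU (driver, stack) reads */
	uint8_t bounce[BUF_SIZE]; /* memory the device actually writes */
};

/* Device side: with bounce buffering the NIC only writes the bounce copy. */
static void model_device_write(struct rx_model *m, const void *pkt, size_t len)
{
	assert(len <= BUF_SIZE);
	memcpy(m->bounce, pkt, len);
}

/* Stand-in for the sync-for-CPU step: copy the received bytes back. */
static void model_sync_for_cpu(struct rx_model *m, size_t len)
{
	assert(len <= BUF_SIZE);
	memcpy(m->orig, m->bounce, len);
}

/* Returns 1 if the CPU-visible buffer holds the packet, 0 otherwise. */
static int model_cpu_sees(const struct rx_model *m, const void *pkt, size_t len)
{
	return memcmp(m->orig, pkt, len) == 0;
}

/* Receive one packet, optionally syncing before the CPU looks at it. */
int demo_rx(int do_sync)
{
	static const char pkt[] = "hello";
	struct rx_model m;

	memset(&m, 0, sizeof(m));
	model_device_write(&m, pkt, sizeof(pkt));
	if (do_sync)
		model_sync_for_cpu(&m, sizeof(pkt));
	return model_cpu_sees(&m, pkt, sizeof(pkt));
}
```

In this model, demo_rx(0) leaves the CPU-visible buffer stale, while demo_rx(1) makes the packet visible. That mirrors the case where dma_unmap_single() is no longer called and nothing else syncs the buffer for the CPU.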
As Simon suggested [4], this version splits v1 into two patches:
Patch 1 adds page_pool_dma_sync_for_cpu().
Patch 2 validates the packet length reported by the NIC.
There is no functional difference between v1 and v2, so I am keeping
Haiyang's Reviewed-by tag in v2.
Please review. Thanks!
Note that, with jumbo MTU and XDP, page pool frags are not used, and
dma_map_single()/dma_unmap_single() are still called for each incoming
packet, causing poor throughput with swiotlb=force; see
mana_get_rxbuf_cfg() and mana_refill_rx_oob() -> mana_get_rxfrag().
The jumbo MTU/XDP issue will be addressed later since that needs more
consideration if we want to use page pool with PP_FLAG_DMA_MAP there:
e.g., for XDP, the received packet can be transmitted in place, i.e. the
same RX buffer can be used as a TX buffer:
mana_rx_skb() -> mana_xdp_tx() -> mana_start_xmit() -> mana_map_skb().
In mana_create_page_pool(), we may have to set pprm.dma_dir to
DMA_BIDIRECTIONAL if XDP is in use:
pprm.dma_dir = mana_xdp_get(mpc) ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
In the case of XDP, the next issue is that mana_rx_skb() -> ... ->
mana_map_skb() appears to call dma_map_single() on an RX buffer allocated
from a page pool created with PP_FLAG_DMA_MAP, which seems incorrect.
Any thoughts?
[1] https://lore.kernel.org/all/ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/#r
[2] https://lore.kernel.org/all/ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
[3] https://lore.kernel.org/all/20260618035029.249361-1-decui@microsoft.com/
[4] https://lore.kernel.org/all/20260619090514.GT827683@horms.kernel.org/
Dexuan Cui (2):
net: mana: Sync page pool RX frags for CPU
net: mana: Validate the packet length reported by the NIC
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
include/net/mana/mana.h | 8 +++
2 files changed, 58 insertions(+), 11 deletions(-)
--
2.34.1
* [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
2026-06-24 22:26 [PATCH net v2 0/2] Fix MANA RX with bounce buffering Dexuan Cui
@ 2026-06-24 22:26 ` Dexuan Cui
2026-06-26 14:50 ` Simon Horman
2026-06-24 22:26 ` [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC Dexuan Cui
1 sibling, 1 reply; 4+ messages in thread
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.
This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.
Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1:
v1 is split into two patches in v2.
Add Haiyang's Reviewed-by.
drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
include/net/mana/mana.h | 8 ++++
2 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..1875bffd82b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
- dma_addr_t *da, bool *from_pool)
+ dma_addr_t *da, bool *from_pool,
+ struct page **pp_page, u32 *dma_sync_offset)
{
struct page *page;
u32 offset;
void *va;
+
*from_pool = false;
+ *pp_page = NULL;
+ *dma_sync_offset = 0;
/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
* per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
va = page_to_virt(page) + offset;
*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
*from_pool = true;
+ *pp_page = page;
+ *dma_sync_offset = offset + rxq->headroom;
return va;
}
/* Allocate frag for rx buffer, and save the old buf */
static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
- struct mana_recv_buf_oob *rxoob, void **old_buf,
- bool *old_fp)
+ struct mana_recv_buf_oob *rxoob, u32 pktlen,
+ void **old_buf, bool *old_fp)
{
+ struct page *pp_page;
+ u32 dma_sync_offset;
bool from_pool;
dma_addr_t da;
void *va;
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return;
- if (!rxoob->from_pool || rxq->frag_count == 1)
+ if (!rxoob->from_pool || rxq->frag_count == 1) {
dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
DMA_FROM_DEVICE);
+ } else {
+ /* The page pool maps the whole page and only syncs for device
+ * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+ * for the CPU before they are read: this is required if DMA
+ * is incoherent or bounce buffers are used.
+ */
+ page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+ rxoob->dma_sync_offset, pktlen);
+ }
*old_buf = rxoob->buf_va;
*old_fp = rxoob->from_pool;
rxoob->buf_va = va;
rxoob->sgl[0].address = da;
rxoob->from_pool = from_pool;
+ rxoob->pp_page = pp_page;
+ rxoob->dma_sync_offset = dma_sync_offset;
}
static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,7 +2190,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
/* Unsuccessful refill will have old_buf == NULL.
* In this case, mana_rx_skb() will drop the packet.
@@ -2566,6 +2586,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
struct mana_rxq *rxq, struct device *dev)
{
struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+ struct page *pp_page = NULL;
+ u32 dma_sync_offset = 0;
bool from_pool = false;
dma_addr_t da;
void *va;
@@ -2573,13 +2595,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
if (mpc->rxbufs_pre)
va = mana_get_rxbuf_pre(rxq, &da);
else
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return -ENOMEM;
rx_oob->buf_va = va;
rx_oob->from_pool = from_pool;
+ rx_oob->pp_page = pp_page;
+ rx_oob->dma_sync_offset = dma_sync_offset;
rx_oob->sgl[0].address = da;
rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
void *buf_va;
bool from_pool; /* allocated from a page pool */
+ /* head page of the page_pool fragment; valid only when
+ * from_pool && frag_count > 1.
+ */
+ struct page *pp_page;
+ /* Fragment offset plus rxq->headroom, passed to
+ * page_pool_dma_sync_for_cpu().
+ */
+ u32 dma_sync_offset;
/* SGL of the buffer going to be sent as part of the work request. */
u32 num_sge;
--
2.34.1
* [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC
2026-06-24 22:26 [PATCH net v2 0/2] Fix MANA RX with bounce buffering Dexuan Cui
2026-06-24 22:26 ` [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU Dexuan Cui
@ 2026-06-24 22:26 ` Dexuan Cui
1 sibling, 0 replies; 4+ messages in thread
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
Validate the packet length reported in the RX CQE before using it as a DMA
sync length or passing it to skb processing. The CQE is supplied by the
NIC device and should not be blindly trusted.
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1:
v1 is split into two patches in v2.
Add Haiyang's Reviewed-by.
drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++++----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1875bffd82b7..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2190,12 +2190,26 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+ if (unlikely(pktlen > rxq->datasize)) {
+ /* Increase it even if mana_rx_skb() isn't called. */
+ rxq->rx_cq.work_done++;
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ ++ndev->stats.rx_dropped;
+ netdev_warn_once(ndev,
+ "Dropped oversized RX packet: len=%u, datasize=%u\n",
+ pktlen, rxq->datasize);
+
+ /* Reuse the RX buffer since rxbuf_oob is unchanged. */
+ } else {
+
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+ &old_buf, &old_fp);
+
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ }
mana_move_wq_tail(rxq->gdma_rq,
rxbuf_oob->wqe_inf.wqe_size_in_bu);
--
2.34.1
* Re: [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
2026-06-24 22:26 ` [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU Dexuan Cui
@ 2026-06-26 14:50 ` Simon Horman
0 siblings, 0 replies; 4+ messages in thread
From: Simon Horman @ 2026-06-26 14:50 UTC (permalink / raw)
To: Dexuan Cui
Cc: kys, haiyangz, wei.liu, longli, andrew+netdev, davem, edumazet,
kuba, pabeni, kotaranov, ernis, dipayanroy, kees, jacob.e.keller,
ssengar, linux-hyperv, netdev, linux-kernel, linux-rdma, stable
On Wed, Jun 24, 2026 at 03:26:04PM -0700, Dexuan Cui wrote:
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
>
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
>
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
>
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>
> Changes since v1:
> v1 is split into two patches in the v2.
> Add Haiyang's Reviewed-by.
>
> drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
> include/net/mana/mana.h | 8 ++++
> 2 files changed, 40 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index c9b1df1ed109..1875bffd82b7 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
> }
>
> static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
> - dma_addr_t *da, bool *from_pool)
> + dma_addr_t *da, bool *from_pool,
> + struct page **pp_page, u32 *dma_sync_offset)
> {
> struct page *page;
> u32 offset;
> void *va;
> +
> *from_pool = false;
> + *pp_page = NULL;
> + *dma_sync_offset = 0;
>
> /* Don't use fragments for jumbo frames or XDP where it's 1 fragment
> * per page.
> @@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
> va = page_to_virt(page) + offset;
> *da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
> *from_pool = true;
> + *pp_page = page;
> + *dma_sync_offset = offset + rxq->headroom;
>
> return va;
> }
>
> /* Allocate frag for rx buffer, and save the old buf */
> static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
> - struct mana_recv_buf_oob *rxoob, void **old_buf,
> - bool *old_fp)
> + struct mana_recv_buf_oob *rxoob, u32 pktlen,
> + void **old_buf, bool *old_fp)
> {
> + struct page *pp_page;
> + u32 dma_sync_offset;
> bool from_pool;
> dma_addr_t da;
> void *va;
>
> - va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
> + va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
> + &dma_sync_offset);
> if (!va)
> return;
> - if (!rxoob->from_pool || rxq->frag_count == 1)
> + if (!rxoob->from_pool || rxq->frag_count == 1) {
> dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
> DMA_FROM_DEVICE);
> + } else {
> + /* The page pool maps the whole page and only syncs for device
> + * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
> + * for the CPU before they are read: this is required if DMA
> + * is incoherent or bounce buffers are used.
> + */
> + page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
> + rxoob->dma_sync_offset, pktlen);
> + }
Hi,
I'm sorry to be bothersome but I think that the order of the two patches
that comprise this series should be reversed. Or if that is not possible,
go back to a single patch.
The reason is the following issue, flagged by https://netdev-ai.bots.linux.dev/sashiko/:
Is pktlen here bounded before it reaches page_pool_dma_sync_for_cpu()?
The value originates from oob->ppi[i].pkt_len in mana_process_rx_cqe()
and is forwarded straight into this call with no comparison against
rxq->datasize or (rxq->alloc_size - rxoob->dma_sync_offset).
When SWIOTLB is in use (the swiotlb=force case explicitly called out in
the commit message), page_pool_dma_sync_for_cpu() reaches
dma_sync_single_range_for_cpu() and copies dma_sync_size bytes from the
bounce buffer back into the original page.
Since alloc_size can be smaller than PAGE_SIZE and multiple fragments
share a single page_pool page, can a pktlen larger than the fragment
extent here cause the copy-back to spill past this fragment into
neighbouring fragments that belong to other rxoobs still in flight?
If so, those neighbours may already have been or may shortly be passed
up via napi_gro_receive() in mana_rx_skb(), so the over-sync would
silently overwrite their payloads before the eventual skb_put() in
mana_build_skb() trips skb_over_panic() on this oversized packet.
Would it make sense to validate pktlen against rxq->datasize before
calling mana_refill_rx_oob()? The follow-up patch in this series,
"net: mana: Validate the packet length reported by the NIC" (commit
6c707fe658d6), adds exactly that check:
if (unlikely(pktlen > rxq->datasize))
...
Could that validation be folded into this patch so that the sync-for-CPU
introduced here cannot be steered with an attacker-controlled length,
particularly given that the motivating scenario (swiotlb=force) is the
Confidential VM case where the hypervisor-supplied CQE is untrusted?
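The spill described above can be sketched in userspace as follows. This is only an illustrative model under stated assumptions, not driver code: struct page_model, model_sync_for_cpu(), pktlen_valid() and neighbour_clobbered() are hypothetical names, the 2048-byte fragment size and the 1536-byte datasize used in the note after the code are made-up values, and headroom is ignored so that fragment 0 starts at offset 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_FRAG_SIZE 2048u /* hypothetical alloc_size: two frags per page */

struct page_model {
	uint8_t page[MODEL_PAGE_SIZE];   /* CPU-visible page shared by frags */
	uint8_t bounce[MODEL_PAGE_SIZE]; /* bounce copy the device writes to */
};

/* Stand-in for the sync: copy len bytes at offset back from the bounce copy. */
static void model_sync_for_cpu(struct page_model *m, uint32_t offset,
			       uint32_t len)
{
	assert(offset <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - offset);
	memcpy(m->page + offset, m->bounce + offset, len);
}

/* The bound from patch 2, reduced to a predicate. */
static int pktlen_valid(uint32_t pktlen, uint32_t datasize)
{
	return pktlen <= datasize;
}

/*
 * Sync fragment 0 with the reported length. Returns 1 if fragment 1, which
 * already holds a previously synced packet, was modified as a side effect.
 */
int neighbour_clobbered(uint32_t pktlen, uint32_t datasize, int validate)
{
	struct page_model m;

	memset(m.page, 0, sizeof(m.page));
	memset(m.bounce, 0xEE, sizeof(m.bounce)); /* stale bounce contents */
	/* Fragment 1 was received and synced earlier. */
	memset(m.page + MODEL_FRAG_SIZE, 0xAA, MODEL_FRAG_SIZE);

	if (validate && !pktlen_valid(pktlen, datasize))
		return 0; /* packet dropped, no sync performed */

	model_sync_for_cpu(&m, 0, pktlen);
	return m.page[MODEL_FRAG_SIZE] != 0xAA;
}
```

With these assumed sizes, a reported length of 1500 stays inside fragment 0, a reported length of 3000 without the check overwrites the start of fragment 1, and the same length with the check enabled is dropped before any sync happens.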