* [PATCH net-next 0/3] fbnic: Support larger io_uring zcrx buffers
@ 2026-05-22 11:32 Björn Töpel
2026-05-22 11:32 ` [PATCH net-next 1/3] fbnic: Track BDQ fragment geometry per ring Björn Töpel
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Björn Töpel @ 2026-05-22 11:32 UTC (permalink / raw)
To: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Shuah Khan, netdev
Cc: Björn Töpel, Jacob Keller, Mohsin Bashir,
Mike Marciniszyn (Meta), Pavel Begunkov, linux-kernel,
linux-kselftest
Hi!
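Fbnic programs receive buffers through buffer descriptor queues (BDQs):
the header page queue (HPQ) and the payload page queue (PPQ). The
hardware consumes BDQs as 4 KiB fragments, and receive completions
report the consumed buffer by returning the BDQ buffer ID in the
receive completion descriptor (RCD).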
The driver currently derives the BDQ fragment layout from PAGE_SIZE.
That works while HPQ and PPQ use the same allocation size, but
io_uring zcrx can provide larger receive buffers through rx_buf_len.
For zcrx, the PPQ page pool allocation size and the PPQ BDQ fragment
geometry need to match the requested buffer size, without changing
HPQ.
Make the BDQ fragment geometry per ring, then use the rendered RX
queue rx_page_size when creating the PPQ page pool. The NIC still
consumes the PPQ as 4 KiB fragments; a larger zcrx buffer is
represented as multiple BDQ fragments belonging to one net_iov.
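As a rough illustration of the per-ring geometry, the sketch below
splits a completion's buffer ID into a software buffer index and a byte
offset inside that buffer. It is a standalone userspace approximation,
not the driver code: struct bdq_geom, buf_index() and frag_offset() are
hypothetical names, and only the 4 KiB fragment size and the "low n
bits are the fragment ID" layout are taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_SIZE 4096u	/* hardware BDQ fragment size */

/* Hypothetical per-ring state: log2(fragments per software buffer). */
struct bdq_geom {
	uint8_t frag_shift;
};

/* Software buffer index: the buffer ID with the fragment bits dropped. */
static inline uint16_t buf_index(const struct bdq_geom *g, uint16_t buf_id)
{
	return buf_id >> g->frag_shift;
}

/* Byte offset of the reported fragment inside its software buffer. */
static inline uint32_t frag_offset(const struct bdq_geom *g, uint16_t buf_id)
{
	uint32_t frag_mask = (1u << g->frag_shift) - 1;

	return (buf_id & frag_mask) * FRAG_SIZE;
}
```

With frag_shift = 1 (an 8 KiB buffer), buffer ID 5 maps to software
buffer 2 at offset 4096, i.e. the second hardware fragment of that
buffer. With frag_shift = 0 the ID is the buffer index and the offset
is always 0.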
Fbnic also validates rx_page_size against its own queue geometry. The
core validates the zcrx request and checks that the imported memory
can be represented as rx_buf_len-sized DMA chunks, but fbnic still
needs to make sure the PPQ retains usable depth after expanding one
software buffer into multiple 4 KiB hardware fragments.
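The depth check can be sketched in userspace as below. This is a
simplified approximation of the rules the series describes (power of
two, at least PAGE_SIZE, 4 KiB aligned, at least two software buffers
left in the PPQ). The function name and the page_size parameter, which
stands in for the kernel's PAGE_SIZE, are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAG_SIZE 4096u	/* hardware BDQ fragment size */

static bool is_pow2(uint32_t v)
{
	return v && !(v & (v - 1));
}

/*
 * ppq_size counts 4 KiB hardware fragments. page_size stands in for
 * PAGE_SIZE so the sketch can be built outside the kernel.
 */
static bool rx_page_size_ok(uint32_t rx_page_size, uint32_t page_size,
			    uint32_t ppq_size)
{
	if (!is_pow2(rx_page_size))
		return false;
	if (rx_page_size < page_size)
		return false;
	if (rx_page_size % FRAG_SIZE)
		return false;

	/* Keep at least two software buffers after the expansion. */
	return ppq_size / (rx_page_size / FRAG_SIZE) >= 2;
}
```

For a PPQ of 64 fragments and 4 KiB pages, 8 KiB buffers leave 32
software buffers and pass, 128 KiB buffers leave exactly two and still
pass, and 256 KiB buffers leave one and are rejected.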
The normal open path uses the rendered per-queue rx_page_size as well.
This preserves a memory-provider binding made while the netdev is
down, instead of falling back to the default PPQ geometry on open.
The selftest change adds an optional iou-zcrx helper check for manual
driver testing. It is not wired into the generic large-chunk test
because different drivers may legitimately return different CQE
boundaries.
Manual testing
==============
The fbnic QEMU model and firmware setup are described here:
https://lore.kernel.org/netdev/20260309113852.2c654de5@kernel.org/
I use something like:
KERNEL=/path/to/linux
DISK=/path/to/fedora-qemu.raw
OVMF_CODE=/path/to/OVMF_CODE.fd
OVMF_VARS=/path/to/OVMF_VARS.fd
MODS=/tmp/fbnic-modules
QEMU=/path/to/fbnic-qemu/build/qemu-system-x86_64
$QEMU \
-machine type=q35,accel=kvm \
-drive if=pflash,format=raw,unit=0,file=$OVMF_CODE,readonly=on \
-drive if=pflash,format=raw,unit=1,file=$OVMF_VARS \
-smp 16 -m 16G \
-object memory-backend-memfd,id=mem,size=16G,share=on \
-numa node,memdev=mem \
-kernel $KERNEL/arch/x86/boot/bzImage \
-append "root=/dev/vda2 rw console=ttyS0 earlycon" \
-drive file=$DISK,format=raw,if=none,id=drive0 \
-device virtio-blk-pci,drive=drive0 \
-no-user-config -nodefaults -nographic \
-virtfs local,path=$MODS/lib/modules,mount_tag=modules,security_model=none,readonly=on \
-virtfs local,path=$KERNEL,mount_tag=hostshare,security_model=none,readonly=on \
-netdev user,id=hostnet0,hostfwd=tcp::9999-:9999 \
-netdev hubport,id=hub_uplink,hubid=0,netdev=hostnet0 \
-device virtio-net-pci,netdev=n1 \
-netdev hubport,id=n1,hubid=0 \
-device pcie-root-port,id=pcie.1,bus=pcie.0,chassis=1 \
-device fbnic,bus=pcie.1,id=fbnic.1,mac=00:de:ad:be:ef:01,netdev=n2,rbt=skt.0,bar4=ctrl.1 \
-netdev hubport,id=n2,hubid=0 \
-chardev socket,id=ctrl.1,path=/tmp/fbnic-ctrl-skt \
-netdev socket,id=skt.0,connect=localhost:9000 \
-serial mon:stdio
This gives you an fbnic device, host port forwarding for TCP port
9999, and 9p mounts for the kernel tree and modules.
Inside the guest:
mount -t 9p -o trans=virtio,version=9p2000.L hostshare /host
cd /host/tools/testing/selftests/drivers/net/hw
ethtool -L enp1s0 combined 2
ethtool -G enp1s0 tcp-data-split on hds-thresh 0 rx 64
ethtool -X enp1s0 equal 1
ethtool -N enp1s0 flow-type tcp4 dst-ip 10.0.2.15 dst-port 9999 action 1
echo 64 > /proc/sys/vm/nr_hugepages
./iou-zcrx -s -i enp1s0 -p 9999 -q 1 -x 2
On the host:
cd /path/to/linux/tools/testing/selftests/drivers/net/hw
./iou-zcrx -c -h 127.0.0.1 -p 9999 -l 12840
For fbnic-specific manual checking that traffic reaches the second 4
KiB fragment of an 8 KiB zcrx buffer, run the receiver with:
./iou-zcrx -s -i enp1s0 -p 9999 -q 1 -x 2 -F 4096
Björn Töpel (3):
fbnic: Track BDQ fragment geometry per ring
fbnic: Support larger zcrx receive buffers
selftests: drv-net: Add zcrx payload offset check
drivers/net/ethernet/meta/fbnic/fbnic_csr.h | 29 +--
.../net/ethernet/meta/fbnic/fbnic_debugfs.c | 5 +-
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 168 ++++++++++++++----
drivers/net/ethernet/meta/fbnic/fbnic_txrx.h | 6 +
.../selftests/drivers/net/hw/iou-zcrx.c | 28 ++-
5 files changed, 176 insertions(+), 60 deletions(-)
base-commit: 1a1f055318d82e64485a6ff8420e5f70b4267998
--
2.53.0
* [PATCH net-next 1/3] fbnic: Track BDQ fragment geometry per ring
From: Björn Töpel @ 2026-05-22 11:32 UTC (permalink / raw)
To: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Shuah Khan, netdev
Cc: Björn Töpel, Jacob Keller, Mohsin Bashir,
Mike Marciniszyn (Meta), Pavel Begunkov, linux-kernel,
linux-kselftest
Fbnic programs BDQs in 4 KiB fragments, but the driver has so far
decoded buffer IDs using PAGE_SIZE-derived constants. That works while
HPQ and PPQ both use PAGE_SIZE buffers, but it makes the fragment
layout global even though the layout really belongs to the queue.
Store the fragment shift on each BDQ and use it when programming
buffer descriptors and decoding receive completions. HPQ and PPQ still
get the same PAGE_SIZE-derived value, so this does not change behavior
yet.
This prepares PPQ to use a larger io_uring zcrx buffer size without
changing the HPQ layout.
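A minimal userspace sketch of the descriptor programming follows. It
assumes the field layout shown in fbnic_csr.h (address in bits 45:12,
ID in bits 63:48) and writes one descriptor per 4 KiB fragment,
advancing both the address and the ID by one fragment each time. The
helper name bd_fill() is hypothetical, and the little-endian conversion
done by the driver is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAG_SIZE	4096ULL
/* Bits 45:12 hold the 4 KiB-aligned DMA address. */
#define BD_ADDR_MASK	(((1ULL << 46) - 1) & ~((1ULL << 12) - 1))
/* Bits 63:48 hold the 16-bit buffer ID. */
#define BD_ID_SHIFT	48

/* Fill the 1 << frag_shift descriptors belonging to software buffer id. */
static void bd_fill(uint64_t *desc, uint16_t id, uint64_t dma,
		    uint8_t frag_shift)
{
	unsigned int frag_count = 1u << frag_shift;
	uint64_t *slot = &desc[(size_t)id * frag_count];
	unsigned int i;

	for (i = 0; i < frag_count; i++) {
		/* Low frag_shift bits of the ID select the fragment. */
		uint64_t frag_id = (((uint64_t)id << frag_shift) | i) & 0xffff;
		uint64_t addr = (dma + i * FRAG_SIZE) & BD_ADDR_MASK;

		slot[i] = addr | (frag_id << BD_ID_SHIFT);
	}
}
```

For id = 1, frag_shift = 1 and a DMA address of 0x10000, this fills
ring slots 2 and 3 with IDs 2 and 3 and addresses 0x10000 and 0x11000.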
Signed-off-by: Björn Töpel <bjorn@kernel.org>
---
drivers/net/ethernet/meta/fbnic/fbnic_csr.h | 29 ++------
.../net/ethernet/meta/fbnic/fbnic_debugfs.c | 5 +-
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 68 ++++++++++++-------
drivers/net/ethernet/meta/fbnic/fbnic_txrx.h | 6 ++
4 files changed, 58 insertions(+), 50 deletions(-)
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_csr.h b/drivers/net/ethernet/meta/fbnic/fbnic_csr.h
index 64b958df7774..0ff972f8febc 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_csr.h
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_csr.h
@@ -109,17 +109,13 @@ enum {
/* Rx Buffer Descriptor Format
*
- * The layout of this can vary depending on the page size of the system.
+ * Buffer descriptors describe 4 KiB BDQ fragments. A BDQ buffer may be one
+ * fragment, or a power-of-two number of fragments.
*
- * If the page size is 4K then the layout will simply consist of ID for
- * the 16 most significant bits, and the lower 46 are essentially the page
- * address with the lowest 12 bits being reserved 0 due to the fact that
- * a page will be aligned.
- *
- * If the page size is larger than 4K then the lower n bits of the ID and
- * page address will be reserved for the fragment ID. This fragment will
- * be 4K in size and will be used to index both the DMA address and the ID
- * by the same amount.
+ * The address field stores the 4 KiB-aligned DMA address. The ID field stores
+ * the software buffer ID, with the low n bits used as the fragment ID when a
+ * buffer spans multiple 4 KiB fragments. The driver increments both the
+ * address and ID by one fragment for each descriptor belonging to a buffer.
*/
#define FBNIC_BD_DESC_ADDR_MASK DESC_GENMASK(45, 12)
#define FBNIC_BD_DESC_ID_MASK DESC_GENMASK(63, 48)
@@ -127,16 +123,6 @@ enum {
(FBNIC_BD_DESC_ADDR_MASK & ~(FBNIC_BD_DESC_ADDR_MASK - 1))
#define FBNIC_BD_FRAG_COUNT \
(PAGE_SIZE / FBNIC_BD_FRAG_SIZE)
-#define FBNIC_BD_FRAG_ADDR_MASK \
- (FBNIC_BD_DESC_ADDR_MASK & \
- ~(FBNIC_BD_DESC_ADDR_MASK * FBNIC_BD_FRAG_COUNT))
-#define FBNIC_BD_FRAG_ID_MASK \
- (FBNIC_BD_DESC_ID_MASK & \
- ~(FBNIC_BD_DESC_ID_MASK * FBNIC_BD_FRAG_COUNT))
-#define FBNIC_BD_PAGE_ADDR_MASK \
- (FBNIC_BD_DESC_ADDR_MASK & ~FBNIC_BD_FRAG_ADDR_MASK)
-#define FBNIC_BD_PAGE_ID_MASK \
- (FBNIC_BD_DESC_ID_MASK & ~FBNIC_BD_FRAG_ID_MASK)
/* Rx Completion Queue Descriptors */
#define FBNIC_RCD_TYPE_MASK DESC_GENMASK(62, 61)
@@ -151,9 +137,6 @@ enum {
/* Address/Length Completion Descriptors */
#define FBNIC_RCD_AL_BUFF_ID_MASK DESC_GENMASK(15, 0)
-#define FBNIC_RCD_AL_BUFF_FRAG_MASK (FBNIC_BD_FRAG_COUNT - 1)
-#define FBNIC_RCD_AL_BUFF_PAGE_MASK \
- (FBNIC_RCD_AL_BUFF_ID_MASK & ~FBNIC_RCD_AL_BUFF_FRAG_MASK)
#define FBNIC_RCD_AL_BUFF_LEN_MASK DESC_GENMASK(28, 16)
#define FBNIC_RCD_AL_BUFF_OFF_MASK DESC_GENMASK(43, 32)
#define FBNIC_RCD_AL_PAGE_FIN DESC_BIT(60)
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_debugfs.c b/drivers/net/ethernet/meta/fbnic/fbnic_debugfs.c
index 3c4563c8f403..1cd9dbab423b 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_debugfs.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_debugfs.c
@@ -181,8 +181,8 @@ static int fbnic_dbg_tcq_desc_seq_show(struct seq_file *s, void *v)
static int fbnic_dbg_bdq_desc_seq_show(struct seq_file *s, void *v)
{
struct fbnic_ring *ring = s->private;
+ unsigned int desc_count, i;
char hdr[80];
- int i;
/* Generate header on first entry */
fbnic_dbg_ring_show(s);
@@ -197,7 +197,8 @@ static int fbnic_dbg_bdq_desc_seq_show(struct seq_file *s, void *v)
return 0;
}
- for (i = 0; i < (ring->size_mask + 1) * FBNIC_BD_FRAG_COUNT; i++) {
+ desc_count = (ring->size_mask + 1) * fbnic_bdq_frag_count(ring);
+ for (i = 0; i < desc_count; i++) {
u64 bd = le64_to_cpu(ring->desc[i]);
seq_printf(s, "%04x %#04llx %#014llx\n", i,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
index 9cd85a0d0c3a..9a9675d04c16 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
@@ -870,19 +870,31 @@ static void fbnic_clean_bdq(struct fbnic_ring *ring, unsigned int hw_head,
ring->head = head;
}
+static u16 fbnic_rcd_bdq_idx(const struct fbnic_ring *bdq, u64 rcd)
+{
+ return FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd) >> bdq->frag_shift;
+}
+
+static unsigned int fbnic_rcd_frag_offset(const struct fbnic_ring *bdq,
+ u64 rcd)
+{
+ return (FIELD_GET(FBNIC_RCD_AL_BUFF_ID_MASK, rcd) &
+ (fbnic_bdq_frag_count(bdq) - 1)) * FBNIC_BD_FRAG_SIZE;
+}
+
static void fbnic_bd_prep(struct fbnic_ring *bdq, u16 id, netmem_ref netmem)
{
- __le64 *bdq_desc = &bdq->desc[id * FBNIC_BD_FRAG_COUNT];
+ u16 frag_count = fbnic_bdq_frag_count(bdq);
+ __le64 *bdq_desc = &bdq->desc[id * frag_count];
dma_addr_t dma = page_pool_get_dma_addr_netmem(netmem);
- u64 bd, i = FBNIC_BD_FRAG_COUNT;
+ u64 bd, i = frag_count;
- bd = (FBNIC_BD_PAGE_ADDR_MASK & dma) |
- FIELD_PREP(FBNIC_BD_PAGE_ID_MASK, id);
+ bd = (FBNIC_BD_DESC_ADDR_MASK & dma) |
+ FIELD_PREP(FBNIC_BD_DESC_ID_MASK, (u64)id << bdq->frag_shift);
- /* In the case that a page size is larger than 4K we will map a
- * single page to multiple fragments. The fragments will be
- * FBNIC_BD_FRAG_COUNT in size and the lower n bits will be use
- * to indicate the individual fragment IDs.
+ /* In the case that the buffer is larger than 4K we will map it
+ * to multiple fragments. The lower n bits will be used to
+ * indicate the individual fragment IDs.
*/
do {
*bdq_desc = cpu_to_le64(bd);
@@ -927,7 +939,7 @@ static void fbnic_fill_bdq(struct fbnic_ring *bdq)
/* Force DMA writes to flush before writing to tail */
dma_wmb();
- writel(i * FBNIC_BD_FRAG_COUNT, bdq->doorbell);
+ writel(i * fbnic_bdq_frag_count(bdq), bdq->doorbell);
}
}
@@ -958,7 +970,8 @@ static void fbnic_pkt_prepare(struct fbnic_napi_vector *nv, u64 rcd,
struct fbnic_pkt_buff *pkt,
struct fbnic_q_triad *qt)
{
- unsigned int hdr_pg_idx = FIELD_GET(FBNIC_RCD_AL_BUFF_PAGE_MASK, rcd);
+ struct fbnic_ring *hpq = &qt->sub0;
+ unsigned int hdr_pg_idx = fbnic_rcd_bdq_idx(hpq, rcd);
unsigned int hdr_pg_off = FIELD_GET(FBNIC_RCD_AL_BUFF_OFF_MASK, rcd);
struct page *page = fbnic_page_pool_get_head(qt, hdr_pg_idx);
unsigned int len = FIELD_GET(FBNIC_RCD_AL_BUFF_LEN_MASK, rcd);
@@ -976,8 +989,7 @@ static void fbnic_pkt_prepare(struct fbnic_napi_vector *nv, u64 rcd,
headroom = hdr_pg_off - hdr_pg_start + FBNIC_RX_PAD;
frame_sz = hdr_pg_end - hdr_pg_start;
xdp_init_buff(&pkt->buff, frame_sz, &qt->xdp_rxq);
- hdr_pg_start += (FBNIC_RCD_AL_BUFF_FRAG_MASK & rcd) *
- FBNIC_BD_FRAG_SIZE;
+ hdr_pg_start += fbnic_rcd_frag_offset(hpq, rcd);
/* Sync DMA buffer */
dma_sync_single_range_for_cpu(nv->dev, page_pool_get_dma_addr(page),
@@ -998,7 +1010,8 @@ static void fbnic_add_rx_frag(struct fbnic_napi_vector *nv, u64 rcd,
struct fbnic_pkt_buff *pkt,
struct fbnic_q_triad *qt)
{
- unsigned int pg_idx = FIELD_GET(FBNIC_RCD_AL_BUFF_PAGE_MASK, rcd);
+ struct fbnic_ring *ppq = &qt->sub1;
+ unsigned int pg_idx = fbnic_rcd_bdq_idx(ppq, rcd);
unsigned int pg_off = FIELD_GET(FBNIC_RCD_AL_BUFF_OFF_MASK, rcd);
unsigned int len = FIELD_GET(FBNIC_RCD_AL_BUFF_LEN_MASK, rcd);
netmem_ref netmem = fbnic_page_pool_get_data(qt, pg_idx);
@@ -1008,12 +1021,11 @@ static void fbnic_add_rx_frag(struct fbnic_napi_vector *nv, u64 rcd,
truesize = FIELD_GET(FBNIC_RCD_AL_PAGE_FIN, rcd) ?
FBNIC_BD_FRAG_SIZE - pg_off : ALIGN(len, 128);
- pg_off += (FBNIC_RCD_AL_BUFF_FRAG_MASK & rcd) *
- FBNIC_BD_FRAG_SIZE;
+ pg_off += fbnic_rcd_frag_offset(ppq, rcd);
/* Sync DMA buffer */
- page_pool_dma_sync_netmem_for_cpu(qt->sub1.page_pool, netmem,
- pg_off, truesize);
+ page_pool_dma_sync_netmem_for_cpu(ppq->page_pool, netmem, pg_off,
+ truesize);
added = xdp_buff_add_frag(&pkt->buff, netmem, pg_off, len, truesize);
if (unlikely(!added)) {
@@ -1256,12 +1268,12 @@ static int fbnic_clean_rcq(struct fbnic_napi_vector *nv,
switch (FIELD_GET(FBNIC_RCD_TYPE_MASK, rcd)) {
case FBNIC_RCD_TYPE_HDR_AL:
- head0 = FIELD_GET(FBNIC_RCD_AL_BUFF_PAGE_MASK, rcd);
+ head0 = fbnic_rcd_bdq_idx(&qt->sub0, rcd);
fbnic_pkt_prepare(nv, rcd, pkt, qt);
break;
case FBNIC_RCD_TYPE_PAY_AL:
- head1 = FIELD_GET(FBNIC_RCD_AL_BUFF_PAGE_MASK, rcd);
+ head1 = fbnic_rcd_bdq_idx(&qt->sub1, rcd);
fbnic_add_rx_frag(nv, rcd, pkt, qt);
break;
@@ -1609,6 +1621,7 @@ static void fbnic_ring_init(struct fbnic_ring *ring, u32 __iomem *doorbell,
ring->doorbell = doorbell;
ring->q_idx = q_idx;
ring->flags = flags;
+ ring->frag_shift = ilog2(FBNIC_BD_FRAG_COUNT);
ring->deferred_head = -1;
}
@@ -1890,15 +1903,18 @@ static int fbnic_alloc_rx_ring_desc(struct fbnic_net *fbn,
size_t desc_size = sizeof(*rxr->desc);
u32 rxq_size;
size_t size;
+ u16 frag_count;
switch (rxr->doorbell - fbnic_ring_csr_base(rxr)) {
case FBNIC_QUEUE_BDQ_HPQ_TAIL:
- rxq_size = fbn->hpq_size / FBNIC_BD_FRAG_COUNT;
- desc_size *= FBNIC_BD_FRAG_COUNT;
+ frag_count = fbnic_bdq_frag_count(rxr);
+ rxq_size = fbn->hpq_size / frag_count;
+ desc_size *= frag_count;
break;
case FBNIC_QUEUE_BDQ_PPQ_TAIL:
- rxq_size = fbn->ppq_size / FBNIC_BD_FRAG_COUNT;
- desc_size *= FBNIC_BD_FRAG_COUNT;
+ frag_count = fbnic_bdq_frag_count(rxr);
+ rxq_size = fbn->ppq_size / frag_count;
+ desc_size *= frag_count;
break;
case FBNIC_QUEUE_RCQ_HEAD:
rxq_size = fbn->rcq_size;
@@ -2564,7 +2580,7 @@ static void fbnic_enable_bdq(struct fbnic_ring *hpq, struct fbnic_ring *ppq)
hpq->tail = 0;
hpq->head = 0;
- log_size = fls(hpq->size_mask) + ilog2(FBNIC_BD_FRAG_COUNT);
+ log_size = fls(hpq->size_mask) + hpq->frag_shift;
/* Store descriptor ring address and size */
fbnic_ring_wr32(hpq, FBNIC_QUEUE_BDQ_HPQ_BAL, lower_32_bits(hpq->dma));
@@ -2576,7 +2592,7 @@ static void fbnic_enable_bdq(struct fbnic_ring *hpq, struct fbnic_ring *ppq)
if (!ppq->size_mask)
goto write_ctl;
- log_size = fls(ppq->size_mask) + ilog2(FBNIC_BD_FRAG_COUNT);
+ log_size = fls(ppq->size_mask) + ppq->frag_shift;
/* Add enabling of PPQ to BDQ control */
bdq_ctl |= FBNIC_QUEUE_BDQ_CTL_PPQ_ENABLE;
@@ -2845,8 +2861,10 @@ static int fbnic_queue_mem_alloc(struct net_device *dev,
fbnic_ring_init(&qt->sub0, real->sub0.doorbell, real->sub0.q_idx,
real->sub0.flags);
+ qt->sub0.frag_shift = real->sub0.frag_shift;
fbnic_ring_init(&qt->sub1, real->sub1.doorbell, real->sub1.q_idx,
real->sub1.flags);
+ qt->sub1.frag_shift = real->sub1.frag_shift;
fbnic_ring_init(&qt->cmpl, real->cmpl.doorbell, real->cmpl.q_idx,
real->cmpl.flags);
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.h b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.h
index e03c9d2c38dc..332cd0e29e15 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.h
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.h
@@ -121,6 +121,7 @@ struct fbnic_ring {
u16 size_mask; /* Size of ring in descriptors - 1 */
u8 q_idx; /* Logical netdev ring index */
u8 flags; /* Ring flags (FBNIC_RING_F_*) */
+ u8 frag_shift; /* BDQ: ilog2(buf_size / 4096) */
u32 head, tail; /* Head/Tail of ring */
@@ -162,6 +163,11 @@ struct fbnic_napi_vector {
extern const struct netdev_queue_mgmt_ops fbnic_queue_mgmt_ops;
+static inline u16 fbnic_bdq_frag_count(const struct fbnic_ring *bdq)
+{
+ return 1U << bdq->frag_shift;
+}
+
netdev_tx_t fbnic_xmit_frame(struct sk_buff *skb, struct net_device *dev);
netdev_features_t
fbnic_features_check(struct sk_buff *skb, struct net_device *dev,
--
2.53.0
* [PATCH net-next 2/3] fbnic: Support larger zcrx receive buffers
From: Björn Töpel @ 2026-05-22 11:32 UTC (permalink / raw)
To: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Shuah Khan, netdev
Cc: Björn Töpel, Jacob Keller, Mohsin Bashir,
Mike Marciniszyn (Meta), Pavel Begunkov, linux-kernel,
linux-kselftest
io_uring zcrx can provide receive buffers larger than PAGE_SIZE
through QCFG_RX_PAGE_SIZE. Advertise the parameter and use the
configured size when creating the PPQ page pool.
The NIC still consumes PPQ buffers as 4 KiB BDQ fragments. For larger
zcrx buffers, allocate the page pool with the requested order and set
the PPQ fragment shift from rx_page_size, so one net_iov can cover
multiple hardware fragments.
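The two derived values can be illustrated with a small standalone
sketch. The helper names are hypothetical, and page_shift stands in
for the kernel's PAGE_SHIFT so the arithmetic can be checked for
different page sizes. Both helpers assume rx_page_size is a power of
two that is at least one page and at least 4 KiB.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_SIZE 4096u	/* hardware BDQ fragment size */

/* Integer log2 for non-zero values. */
static unsigned int log2_u32(uint32_t v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Page pool order: log2 of the number of system pages per buffer. */
static unsigned int pool_order(uint32_t rx_page_size, unsigned int page_shift)
{
	return log2_u32(rx_page_size) - page_shift;
}

/* PPQ fragment shift: log2 of the number of 4 KiB fragments per buffer. */
static unsigned int ppq_frag_shift(uint32_t rx_page_size)
{
	return log2_u32(rx_page_size / FRAG_SIZE);
}
```

On a 4 KiB page system an 8 KiB buffer gives order 1 and shift 1. On a
64 KiB page system a default 64 KiB buffer gives order 0 but shift 4,
which is why the order and the fragment shift are computed separately.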
The core validates the zcrx request and checks that the imported
memory can be represented as rx_buf_len-sized DMA chunks. Fbnic still
has to validate the rendered queue configuration against its own BDQ
geometry: larger receive buffers consume multiple 4 KiB PPQ entries,
and the PPQ must retain usable depth after that expansion.
Use the rendered per-queue rx_page_size on the normal open path as
well. This preserves a memory-provider binding made while the netdev
is down instead of falling back to the default PPQ geometry on open.
Signed-off-by: Björn Töpel <bjorn@kernel.org>
---
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 102 +++++++++++++++++--
1 file changed, 94 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
index 9a9675d04c16..57b3277fcd4e 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
@@ -1559,9 +1559,62 @@ void fbnic_free_napi_vectors(struct fbnic_net *fbn)
fbnic_free_napi_vector(fbn, fbn->napi[i]);
}
+static u32 fbnic_qcfg_rx_page_size(const struct netdev_queue_config *qcfg)
+{
+ return qcfg->rx_page_size ?: PAGE_SIZE;
+}
+
+static u32 fbnic_rx_page_frag_count(u32 rx_page_size)
+{
+ return rx_page_size / FBNIC_BD_FRAG_SIZE;
+}
+
+static u8 fbnic_rx_page_frag_shift(u32 rx_page_size)
+{
+ return ilog2(fbnic_rx_page_frag_count(rx_page_size));
+}
+
+static int fbnic_validate_rx_page_size(struct fbnic_net *fbn, u32 rx_page_size,
+ struct netlink_ext_ack *extack)
+{
+ u32 frag_count, ppq_bufs;
+
+ if (!is_power_of_2(rx_page_size)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "rx_page_size must be a power of 2");
+ return -EINVAL;
+ }
+
+ if (rx_page_size < PAGE_SIZE) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "rx_page_size must be at least PAGE_SIZE");
+ return -EINVAL;
+ }
+
+ if (!IS_ALIGNED(rx_page_size, FBNIC_BD_FRAG_SIZE)) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "rx_page_size must be 4K aligned");
+ return -EINVAL;
+ }
+
+ frag_count = fbnic_rx_page_frag_count(rx_page_size);
+ ppq_bufs = fbn->ppq_size / frag_count;
+ /* The PPQ is sized in 4K hardware fragments, but the software ring
+ * has one entry per page-pool allocation. Keep at least two entries so
+ * empty/full ring accounting still leaves one postable buffer.
+ */
+ if (ppq_bufs < 2) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "rx_page_size leaves too few PPQ buffers");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int
fbnic_alloc_qt_page_pools(struct fbnic_net *fbn, struct fbnic_q_triad *qt,
- unsigned int rxq_idx)
+ unsigned int rxq_idx, u32 rx_page_size)
{
struct page_pool_params pp_params = {
.order = 0,
@@ -1596,6 +1649,8 @@ fbnic_alloc_qt_page_pools(struct fbnic_net *fbn, struct fbnic_q_triad *qt,
qt->sub0.page_pool = pp;
if (netif_rxq_has_unreadable_mp(fbn->netdev, rxq_idx)) {
+ pp_params.order = ilog2(rx_page_size) - PAGE_SHIFT;
+ pp_params.max_len = rx_page_size;
pp_params.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
pp_params.dma_dir = DMA_FROM_DEVICE;
@@ -2018,12 +2073,19 @@ static int fbnic_alloc_tx_qt_resources(struct fbnic_net *fbn,
static int fbnic_alloc_rx_qt_resources(struct fbnic_net *fbn,
struct fbnic_napi_vector *nv,
- struct fbnic_q_triad *qt)
+ struct fbnic_q_triad *qt,
+ u32 rx_page_size)
{
struct device *dev = fbn->netdev->dev.parent;
int err;
- err = fbnic_alloc_qt_page_pools(fbn, qt, qt->cmpl.q_idx);
+ err = fbnic_validate_rx_page_size(fbn, rx_page_size, NULL);
+ if (err)
+ return err;
+
+ qt->sub1.frag_shift = fbnic_rx_page_frag_shift(rx_page_size);
+
+ err = fbnic_alloc_qt_page_pools(fbn, qt, qt->cmpl.q_idx, rx_page_size);
if (err)
return err;
@@ -2087,7 +2149,13 @@ static int fbnic_alloc_nv_resources(struct fbnic_net *fbn,
/* Allocate Rx Resources */
for (j = 0; j < nv->rxt_count; j++, i++) {
- err = fbnic_alloc_rx_qt_resources(fbn, nv, &nv->qt[i]);
+ struct netdev_queue_config qcfg;
+ u32 rx_page_size;
+
+ netdev_queue_config(fbn->netdev, nv->qt[i].cmpl.q_idx, &qcfg);
+ rx_page_size = fbnic_qcfg_rx_page_size(&qcfg);
+ err = fbnic_alloc_rx_qt_resources(fbn, nv, &nv->qt[i],
+ rx_page_size);
if (err)
goto free_qt_resources;
}
@@ -2852,9 +2920,16 @@ static int fbnic_queue_mem_alloc(struct net_device *dev,
const struct fbnic_q_triad *real;
struct fbnic_q_triad *qt = qmem;
struct fbnic_napi_vector *nv;
+ u32 rx_page_size = fbnic_qcfg_rx_page_size(qcfg);
+ int err;
- if (!netif_running(dev))
- return fbnic_alloc_qt_page_pools(fbn, qt, idx);
+ if (!netif_running(dev)) {
+ err = fbnic_validate_rx_page_size(fbn, rx_page_size, NULL);
+ if (err)
+ return err;
+
+ return fbnic_alloc_qt_page_pools(fbn, qt, idx, rx_page_size);
+ }
real = container_of(fbn->rx[idx], struct fbnic_q_triad, cmpl);
nv = fbn->napi[idx % fbn->num_napi];
@@ -2864,11 +2939,20 @@ static int fbnic_queue_mem_alloc(struct net_device *dev,
qt->sub0.frag_shift = real->sub0.frag_shift;
fbnic_ring_init(&qt->sub1, real->sub1.doorbell, real->sub1.q_idx,
real->sub1.flags);
- qt->sub1.frag_shift = real->sub1.frag_shift;
fbnic_ring_init(&qt->cmpl, real->cmpl.doorbell, real->cmpl.q_idx,
real->cmpl.flags);
- return fbnic_alloc_rx_qt_resources(fbn, nv, qt);
+ return fbnic_alloc_rx_qt_resources(fbn, nv, qt, rx_page_size);
+}
+
+static int fbnic_validate_qcfg(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ struct netlink_ext_ack *extack)
+{
+ struct fbnic_net *fbn = netdev_priv(dev);
+
+ return fbnic_validate_rx_page_size(fbn, fbnic_qcfg_rx_page_size(qcfg),
+ extack);
}
static void fbnic_queue_mem_free(struct net_device *dev, void *qmem)
@@ -2970,4 +3054,6 @@ const struct netdev_queue_mgmt_ops fbnic_queue_mgmt_ops = {
.ndo_queue_mem_free = fbnic_queue_mem_free,
.ndo_queue_start = fbnic_queue_start,
.ndo_queue_stop = fbnic_queue_stop,
+ .ndo_validate_qcfg = fbnic_validate_qcfg,
+ .supported_params = QCFG_RX_PAGE_SIZE,
};
--
2.53.0
* [PATCH net-next 3/3] selftests: drv-net: Add zcrx payload offset check
From: Björn Töpel @ 2026-05-22 11:32 UTC (permalink / raw)
To: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Shuah Khan, netdev
Cc: Björn Töpel, Jacob Keller, Mohsin Bashir,
Mike Marciniszyn (Meta), Pavel Begunkov, linux-kernel,
linux-kselftest
Add an optional iou-zcrx receiver check for payload CQE offsets. With
-F, the receiver fails if no zero-copy receive CQE lands at or beyond
the requested offset within an rx_buf_len-sized buffer.
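The per-CQE test reduces to a modulo on the area offset. The sketch
below isolates that arithmetic; cqe_reaches_offset() is a hypothetical
helper name, and area_off is assumed to be the CQE offset with the
area bits already masked off, as the selftest does before using it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the payload starts at or beyond min_off within its
 * rx_buf_len-sized buffer. rx_buf_len must be non-zero.
 */
static bool cqe_reaches_offset(uint64_t area_off, uint32_t rx_buf_len,
			       uint32_t min_off)
{
	return area_off % rx_buf_len >= min_off;
}
```

With 8 KiB buffers and -F 4096, a CQE at area offset 12288 lands in the
second 4 KiB fragment of the second buffer and satisfies the check,
while a CQE at offset 8192 starts a buffer and does not.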
This is useful for manual driver testing where the driver is expected
to split a larger zcrx buffer into smaller hardware receive fragments.
Do not wire it into the generic large-chunk test, since different
drivers may legitimately return different CQE boundaries.
Signed-off-by: Björn Töpel <bjorn@kernel.org>
---
.../selftests/drivers/net/hw/iou-zcrx.c | 28 +++++++++++++++++--
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 240d13dbc54e..0fb0410aaada 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -85,6 +85,8 @@ static int cfg_send_size = SEND_SIZE;
static struct sockaddr_in6 cfg_addr;
static unsigned int cfg_rx_buf_len;
static bool cfg_dry_run;
+static bool cfg_check_payload_offset, cfg_seen_payload_offset;
+static unsigned int cfg_min_payload_offset;
static char *payload;
static void *area_ptr;
@@ -298,6 +300,13 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
data = (char *)area_ptr + (rcqe->off & mask);
+ if (cfg_check_payload_offset) {
+ unsigned int rx_buf_len = cfg_rx_buf_len ?: page_size;
+
+ if ((rcqe->off & mask) % rx_buf_len >= cfg_min_payload_offset)
+ cfg_seen_payload_offset = true;
+ }
+
for (i = 0; i < n; i++) {
if (*(data + i) != payload[(received + i)])
error(1, 0, "payload mismatch at %d", i);
@@ -374,6 +383,9 @@ static void run_server(void)
if (!stop)
error(1, 0, "test failed\n");
+ if (cfg_check_payload_offset && !cfg_seen_payload_offset)
+ error(1, 0, "no payload CQE at offset >= %u\n",
+ cfg_min_payload_offset);
}
static void run_client(void)
@@ -406,8 +418,11 @@ static void run_client(void)
static void usage(const char *filepath)
{
- error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
- "-l<payload_size> -i<ifname> -q<rxq_id>", filepath);
+ error(1, 0,
+ "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port>\n"
+ "\t-l<payload_size> -i<ifname> -q<rxq_id>\n"
+ "\t[-x<rx_buf_pages>] [-F<min_payload_offset>] [-d]\n",
+ filepath);
}
static void parse_opts(int argc, char **argv)
@@ -425,7 +440,7 @@ static void parse_opts(int argc, char **argv)
usage(argv[0]);
cfg_payload_len = max_payload_len;
- while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:d")) != -1) {
+ while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:F:d")) != -1) {
switch (c) {
case 's':
if (cfg_client)
@@ -463,6 +478,10 @@ static void parse_opts(int argc, char **argv)
case 'x':
cfg_rx_buf_len = page_size * strtoul(optarg, NULL, 0);
break;
+ case 'F':
+ cfg_check_payload_offset = true;
+ cfg_min_payload_offset = strtoul(optarg, NULL, 0);
+ break;
case 'd':
cfg_dry_run = true;
break;
@@ -484,6 +503,9 @@ static void parse_opts(int argc, char **argv)
if (cfg_payload_len > max_payload_len)
error(1, 0, "-l: payload exceeds max (%d)", max_payload_len);
+ if (cfg_check_payload_offset &&
+ cfg_min_payload_offset >= (cfg_rx_buf_len ?: page_size))
+ error(1, 0, "-F: offset outside rx_buf_len");
}
int main(int argc, char **argv)
--
2.53.0