public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW
@ 2026-03-25 12:38 Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 1/6] soc: ti: k3-ringacc: Add helper to get realtime count of free elements Siddharth Vadapalli
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

Hello,

NOTE for MAINTAINERS:
Patches in this series span three subsystems, and I have posted them as
a single RFC series to make it easy for reviewers to understand the
complete implementation. I will eventually split the series and post the
patches sequentially to the respective subsystems' mailing lists:
1. SoC
2. DMAEngine
3. Netdev

The series is based on commit
d1e59a469737 tcp: add cwnd_event_tx_start to tcp_congestion_ops
of the main branch of net-next tree. When I split the series in the
future, I shall base the patches for SoC and DMAEngine on linux-next
and the patches for Netdev on net-next.

This series enables batch processing in the am65-cpsw-nuss.c driver on
the transmit path (ndo_start_xmit and ndo_xdp_xmit) and the transmit
completion path. Additionally, it recycles descriptors instead of
releasing them to the pool and reallocating them. The resulting
difference in memory footprint is hardly noticeable (under 1 MB).

Feedback on the implementation w.r.t. correctness, ease of use /
maintenance and configurability (a sysfs-based option for changing the
batch size) is appreciated.

The series has been tested in the following combinations to cover edge
cases:
1. Single-Port (CPSW2G on J784S4-EVM)
2. Multi-Port (CPSW3G on AM625-SK)
3. Bidirectional TCP Iperf followed by interfaces being brought down
   with traffic in flight (and TX / RX DMA Channel Teardown) followed
   by interfaces being brought up and ensuring that Iperf traffic
   resumes.

The primary motivation for this series is to improve performance:
lowering the CPU load and achieving higher throughput for gigabit and
multi-gigabit operation.

The upcoming features that I plan to implement are:
1. Enable batch processing on RX
2. Batch processing on ICSSG, similar to CPSW (since batch processing
   increases latency, it might not be desirable there and may be
   skipped).

The following sections capture the improvements brought about by this
series.

[1] AM625-SK with CPSW3G (multi-port / two netdevs) and single A53
processor (remaining CPUs are disabled) with each MAC Port operating
at 1 Gbps Full-Duplex.

===========================================================================
Baseline for [1]
===========================================================================
Dual TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
403 Mbps + 408 Mbps = 811 Mbps

Dual RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
336 Mbps + 331 Mbps = 667 Mbps

===========================================================================
With this series for [1]
===========================================================================
Dual TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
428 Mbps + 437 Mbps = 865 Mbps

Dual RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
332 Mbps + 337 Mbps = 669 Mbps



[2] J784S4-EVM with CPSW2G (single-port) and single A72 processor
(remaining CPUs are disabled) with the MAC Port operating at 1 Gbps Full-
Duplex.

===========================================================================
Baseline for [2]
===========================================================================
TX Iperf UDP traffic at 84% CPU Load averaged over 30 seconds:
956 Mbps

RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
941 Mbps

===========================================================================
With this series for [2]
===========================================================================
TX Iperf UDP traffic at 80% CPU Load averaged over 30 seconds:
956 Mbps

RX Iperf TCP traffic at 100% CPU Load averaged over 30 seconds:
941 Mbps



[3] J784S4-EVM with CPSW9G (multi-port) and single A72 processor
(remaining CPUs are disabled) with one MAC Port operating at 5 Gbps
Full-Duplex.

===========================================================================
Baseline for [3]
===========================================================================
TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
1.26 Gbps

RX Iperf TCP traffic at 75% CPU Load averaged over 30 seconds:
1.73 Gbps

===========================================================================
With this series for [3]
===========================================================================
TX Iperf UDP traffic at 100% CPU Load averaged over 30 seconds:
1.28 Gbps

RX Iperf TCP traffic at 75% CPU Load averaged over 30 seconds:
1.75 Gbps

Regards,
Siddharth.

Siddharth Vadapalli (6):
  soc: ti: k3-ringacc: Add helper to get realtime count of free elements
  soc: ti: k3-ringacc: Add helpers for batch push and pop operations
  dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX
    DMA
  net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB
    fragments
  net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors
  net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX
    CMPL

 drivers/dma/ti/k3-udma-glue.c            |  55 +++
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 441 +++++++++++++++++++----
 drivers/net/ethernet/ti/am65-cpsw-nuss.h |  31 ++
 drivers/soc/ti/k3-ringacc.c              |  99 +++++
 include/linux/dma/k3-udma-glue.h         |  12 +
 include/linux/soc/ti/k3-ringacc.h        |  35 ++
 6 files changed, 612 insertions(+), 61 deletions(-)

-- 
2.51.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/6] soc: ti: k3-ringacc: Add helper to get realtime count of free elements
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 2/6] soc: ti: k3-ringacc: Add helpers for batch push and pop operations Siddharth Vadapalli
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

The existing helper k3_ringacc_ring_get_free() updates the count of free
elements only when the software-maintained counter decrements to zero.
As a result, batch processing may read a lower count of free elements
than are actually available. To address this, introduce a new helper
that provides a realtime count of free elements.
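The idea can be sketched in plain user-space C (the struct and function
names below are illustrative stand-ins, not the kernel API): the helper
recomputes free space from a fresh occupancy read rather than trusting a
possibly stale cached counter.

```c
#include <assert.h>

struct mock_ring {
	unsigned int size;        /* total ring elements */
	unsigned int hw_occ;      /* stands in for the ring's OCC register */
	unsigned int cached_free; /* possibly stale software counter */
};

/* Models k3_ringacc_ring_read_occ(): a fresh occupancy read. */
static unsigned int read_occ(const struct mock_ring *r)
{
	return r->hw_occ;
}

/* Models the new helper: recompute free space from a fresh read
 * instead of trusting the cached counter. */
static unsigned int get_rt_free(struct mock_ring *r)
{
	r->cached_free = r->size - read_occ(r);
	return r->cached_free;
}
```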

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/soc/ti/k3-ringacc.c       | 11 +++++++++++
 include/linux/soc/ti/k3-ringacc.h |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/drivers/soc/ti/k3-ringacc.c b/drivers/soc/ti/k3-ringacc.c
index 7602b8a909b0..1751d42ee2d3 100644
--- a/drivers/soc/ti/k3-ringacc.c
+++ b/drivers/soc/ti/k3-ringacc.c
@@ -905,6 +905,17 @@ u32 k3_ringacc_ring_get_free(struct k3_ring *ring)
 }
 EXPORT_SYMBOL_GPL(k3_ringacc_ring_get_free);
 
+u32 k3_ringacc_ring_get_rt_free(struct k3_ring *ring)
+{
+	if (!ring || !(ring->flags & K3_RING_FLAG_BUSY))
+		return -EINVAL;
+
+	ring->state.free = ring->size - k3_ringacc_ring_read_occ(ring);
+
+	return ring->state.free;
+}
+EXPORT_SYMBOL_GPL(k3_ringacc_ring_get_rt_free);
+
 u32 k3_ringacc_ring_get_occ(struct k3_ring *ring)
 {
 	if (!ring || !(ring->flags & K3_RING_FLAG_BUSY))
diff --git a/include/linux/soc/ti/k3-ringacc.h b/include/linux/soc/ti/k3-ringacc.h
index 39b022b92598..091cf551932d 100644
--- a/include/linux/soc/ti/k3-ringacc.h
+++ b/include/linux/soc/ti/k3-ringacc.h
@@ -184,6 +184,14 @@ u32 k3_ringacc_ring_get_size(struct k3_ring *ring);
  */
 u32 k3_ringacc_ring_get_free(struct k3_ring *ring);
 
+/**
+ * k3_ringacc_ring_get_rt_free - get realtime value of free elements
+ * @ring: pointer on ring
+ *
+ * Returns realtime count of free elements in the ring.
+ */
+u32 k3_ringacc_ring_get_rt_free(struct k3_ring *ring);
+
 /**
  * k3_ringacc_ring_get_occ - get ring occupancy
  * @ring: pointer on ring
-- 
2.51.1



* [RFC PATCH 2/6] soc: ti: k3-ringacc: Add helpers for batch push and pop operations
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 1/6] soc: ti: k3-ringacc: Add helper to get realtime count of free elements Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 3/6] dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX DMA Siddharth Vadapalli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

To allow pushing and popping a batch of descriptors at once to improve
efficiency, introduce two helpers:
1. k3_ringacc_ring_push_batch
2. k3_ringacc_ring_pop_batch
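A minimal user-space sketch of the batch-push bookkeeping (illustrative
names; 8-byte elements assumed): elements are copied into the ring
starting at the write index with modulo wraparound, and the free count
and doorbell are updated once for the whole batch rather than per
element.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 4

struct mock_ring {
	unsigned long long elems[RING_SIZE];
	unsigned int windex;	/* write index */
	unsigned int free;	/* free elements */
};

/* Models k3_ringacc_ring_push_batch() for 8-byte elements. */
static int push_batch(struct mock_ring *r, const unsigned long long *arr,
		      unsigned int batch)
{
	unsigned int i;

	if (r->free < batch)
		return -1; /* would be -ENOMEM in the kernel */

	for (i = 0; i < batch; i++) {
		memcpy(&r->elems[r->windex], &arr[i], sizeof(arr[i]));
		r->windex = (r->windex + 1) % RING_SIZE;
	}
	r->free -= batch;	/* one doorbell write covers the batch */
	return 0;
}
```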

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/soc/ti/k3-ringacc.c       | 88 +++++++++++++++++++++++++++++++
 include/linux/soc/ti/k3-ringacc.h | 27 ++++++++++
 2 files changed, 115 insertions(+)

diff --git a/drivers/soc/ti/k3-ringacc.c b/drivers/soc/ti/k3-ringacc.c
index 1751d42ee2d3..33ae7db9c2a1 100644
--- a/drivers/soc/ti/k3-ringacc.c
+++ b/drivers/soc/ti/k3-ringacc.c
@@ -1223,6 +1223,41 @@ int k3_ringacc_ring_push(struct k3_ring *ring, void *elem)
 }
 EXPORT_SYMBOL_GPL(k3_ringacc_ring_push);
 
+int k3_ringacc_ring_push_batch(struct k3_ring *ring, void *elem_arr,
+			       u32 batch_size)
+{
+	void *elem_ptr, *elem;
+	int ret = 0;
+	u32 i;
+
+	if (!ring || !(ring->flags & K3_RING_FLAG_BUSY))
+		return -EINVAL;
+
+	if (k3_ringacc_ring_get_free(ring) < batch_size)
+		if (k3_ringacc_ring_get_rt_free(ring) < batch_size)
+			return -ENOMEM;
+
+	dev_dbg(ring->parent->dev, "ring_push_batch: free%d index%d\n",
+		ring->state.free, ring->state.windex);
+
+	for (i = 0; i < batch_size; i++) {
+		elem_ptr = k3_ringacc_get_elm_addr(ring, ring->state.windex);
+		elem = &((dma_addr_t *)elem_arr)[i];
+		memcpy(elem_ptr, elem, (4 << ring->elm_size));
+		if (ring->parent->dma_rings) {
+			u64 *addr = elem_ptr;
+			*addr |= ((u64)ring->asel << K3_ADDRESS_ASEL_SHIFT);
+		}
+		ring->state.windex = (ring->state.windex + 1) % ring->size;
+	}
+
+	ring->state.free -= batch_size;
+	writel(batch_size, &ring->rt->db);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(k3_ringacc_ring_push_batch);
+
 int k3_ringacc_ring_push_head(struct k3_ring *ring, void *elem)
 {
 	int ret = -EOPNOTSUPP;
@@ -1266,6 +1301,59 @@ int k3_ringacc_ring_pop(struct k3_ring *ring, void *elem)
 }
 EXPORT_SYMBOL_GPL(k3_ringacc_ring_pop);
 
+int k3_ringacc_ring_pop_batch(struct k3_ring *ring, void *elem_arr,
+			      u32 *batch_size, u32 max_batch)
+{
+	void *elem_ptr, *elem;
+	u32 ring_occupancy, i;
+	u32 num_to_pop;
+
+	if (!ring || !(ring->flags & K3_RING_FLAG_BUSY))
+		return -EINVAL;
+
+	if (!ring->state.occ || ring->state.occ < max_batch)
+		k3_ringacc_ring_update_occ(ring);
+
+	if (!ring->state.occ) {
+		dma_addr_t *value;
+
+		if (likely(!ring->state.tdown_complete))
+			return -ENODATA;
+		/* Handle teardown */
+		value = &((dma_addr_t *)elem_arr)[0];
+		*value = CPPI5_TDCM_MARKER;
+		writel(K3_DMARING_RT_DB_TDOWN_ACK, &ring->rt->db);
+		ring->state.tdown_complete = false;
+		*batch_size = 1;
+		return 0;
+	}
+
+	ring_occupancy = ring->state.occ;
+	if (ring_occupancy > max_batch)
+		num_to_pop = max_batch;
+	else
+		num_to_pop = ring_occupancy;
+
+	dev_dbg(ring->parent->dev, "ring_pop_batch: occ%d index%d\n",
+		ring->state.occ, ring->state.rindex);
+
+	for (i = 0; i < num_to_pop; i++) {
+		elem_ptr = k3_ringacc_get_elm_addr(ring, ring->state.rindex);
+		elem = &((dma_addr_t *)elem_arr)[i];
+		memcpy(elem, elem_ptr, (4 << ring->elm_size));
+		k3_dmaring_remove_asel_from_elem(elem);
+		ring->state.rindex = (ring->state.rindex + 1) % ring->size;
+		dev_dbg(ring->parent->dev, "occ%d index%d pos_ptr%p\n",
+			ring->state.occ, ring->state.rindex, elem_ptr);
+	}
+	ring->state.occ -= num_to_pop;
+	writel(-1 * num_to_pop, &ring->rt->db);
+	*batch_size = num_to_pop;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(k3_ringacc_ring_pop_batch);
+
 int k3_ringacc_ring_pop_tail(struct k3_ring *ring, void *elem)
 {
 	int ret = -EOPNOTSUPP;
diff --git a/include/linux/soc/ti/k3-ringacc.h b/include/linux/soc/ti/k3-ringacc.h
index 091cf551932d..6fffa65ee760 100644
--- a/include/linux/soc/ti/k3-ringacc.h
+++ b/include/linux/soc/ti/k3-ringacc.h
@@ -220,6 +220,19 @@ u32 k3_ringacc_ring_is_full(struct k3_ring *ring);
  */
 int k3_ringacc_ring_push(struct k3_ring *ring, void *elem);
 
+/**
+ * k3_ringacc_ring_push_batch - push a batch of elements to the ring tail
+ * @ring: pointer on ring
+ * @elem_arr: pointer to array of ring element buffers
+ * @batch_size: count of element buffers to be pushed
+ *
+ * Push the batch of element buffers to the ring tail.
+ *
+ * Returns 0 on success, errno otherwise.
+ */
+int k3_ringacc_ring_push_batch(struct k3_ring *ring, void *elem_arr,
+			       u32 batch_size);
+
 /**
  * k3_ringacc_ring_pop - pop element from the ring head
  * @ring: pointer on ring
@@ -232,6 +245,20 @@ int k3_ringacc_ring_push(struct k3_ring *ring, void *elem);
  */
 int k3_ringacc_ring_pop(struct k3_ring *ring, void *elem);
 
+/**
+ * k3_ringacc_ring_pop_batch - pop a batch of elements from the ring head
+ * @ring: pointer on ring
+ * @elem_arr: pointer to array of ring element buffers
+ * @batch_size: pointer to count of elements popped from the ring
+ * @max_batch: maximum number of elements to pop
+ *
+ * Pop a batch of element buffers from the ring head.
+ *
+ * Returns 0 on success, errno otherwise.
+ */
+int k3_ringacc_ring_pop_batch(struct k3_ring *ring, void *elem_arr,
+			      u32 *batch_size, u32 max_batch);
+
 /**
  * k3_ringacc_ring_push_head - push element to the ring head
  * @ring: pointer on ring
-- 
2.51.1



* [RFC PATCH 3/6] dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX DMA
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 1/6] soc: ti: k3-ringacc: Add helper to get realtime count of free elements Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 2/6] soc: ti: k3-ringacc: Add helpers for batch push and pop operations Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 4/6] net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB fragments Siddharth Vadapalli
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

To allow pushing and popping a batch of DMA Descriptors on the Transmit
and Receive DMA Channels (Flows), introduce four helpers:
1. k3_udma_glue_push_tx_chn_batch
2. k3_udma_glue_pop_tx_chn_batch
3. k3_udma_glue_push_rx_chn_batch
4. k3_udma_glue_pop_rx_chn_batch
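The TX credit accounting these helpers wrap can be sketched as follows
(illustrative names; the kernel side uses atomic_t for free_pkts): a
batch push reserves batch_size packet credits up front, and a completed
batch pop returns exactly that many.

```c
#include <assert.h>

/* Stand-in for the TX channel's free-packet credit counter. */
struct mock_txchn {
	int free_pkts;
};

/* Models the reservation in k3_udma_glue_push_tx_chn_batch():
 * fail if fewer credits remain than the batch needs. */
static int push_batch(struct mock_txchn *c, int batch)
{
	if (c->free_pkts < batch)
		return -1; /* -ENOMEM in the kernel */
	c->free_pkts -= batch;
	return 0;
}

/* Models k3_udma_glue_pop_tx_chn_batch(): completion returns credits. */
static void pop_batch(struct mock_txchn *c, int completed)
{
	c->free_pkts += completed;
}
```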

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/dma/ti/k3-udma-glue.c    | 55 ++++++++++++++++++++++++++++++++
 include/linux/dma/k3-udma-glue.h | 12 +++++++
 2 files changed, 67 insertions(+)

diff --git a/drivers/dma/ti/k3-udma-glue.c b/drivers/dma/ti/k3-udma-glue.c
index f87d244cc2d6..15835c521977 100644
--- a/drivers/dma/ti/k3-udma-glue.c
+++ b/drivers/dma/ti/k3-udma-glue.c
@@ -485,6 +485,27 @@ int k3_udma_glue_push_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
 }
 EXPORT_SYMBOL_GPL(k3_udma_glue_push_tx_chn);
 
+int k3_udma_glue_push_tx_chn_batch(struct k3_udma_glue_tx_channel *tx_chn,
+				   struct cppi5_host_desc_t **desc_tx,
+				   dma_addr_t *desc_dma, u32 batch_size)
+{
+	u32 ringtxcq_id;
+	int i;
+
+	if (atomic_sub_return(batch_size, &tx_chn->free_pkts) < 0) {
+		atomic_add(batch_size, &tx_chn->free_pkts);
+		return -ENOMEM;
+	}
+
+	ringtxcq_id = k3_ringacc_get_ring_id(tx_chn->ringtxcq);
+
+	for (i = 0; i < batch_size; i++)
+		cppi5_desc_set_retpolicy(&desc_tx[i]->hdr, 0, ringtxcq_id);
+
+	return k3_ringacc_ring_push_batch(tx_chn->ringtx, desc_dma, batch_size);
+}
+EXPORT_SYMBOL_GPL(k3_udma_glue_push_tx_chn_batch);
+
 int k3_udma_glue_pop_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
 			    dma_addr_t *desc_dma)
 {
@@ -498,6 +517,21 @@ int k3_udma_glue_pop_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
 }
 EXPORT_SYMBOL_GPL(k3_udma_glue_pop_tx_chn);
 
+int k3_udma_glue_pop_tx_chn_batch(struct k3_udma_glue_tx_channel *tx_chn,
+				  dma_addr_t *desc_dma, u32 *batch_size,
+				  u32 max_batch)
+{
+	int ret;
+
+	ret = k3_ringacc_ring_pop_batch(tx_chn->ringtxcq, desc_dma, batch_size,
+					max_batch);
+	if (!ret)
+		atomic_add(*batch_size, &tx_chn->free_pkts);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(k3_udma_glue_pop_tx_chn_batch);
+
 int k3_udma_glue_enable_tx_chn(struct k3_udma_glue_tx_channel *tx_chn)
 {
 	int ret;
@@ -1512,6 +1546,16 @@ int k3_udma_glue_push_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 }
 EXPORT_SYMBOL_GPL(k3_udma_glue_push_rx_chn);
 
+int k3_udma_glue_push_rx_chn_batch(struct k3_udma_glue_rx_channel *rx_chn,
+				   u32 flow_num, dma_addr_t *desc_dma,
+				   u32 batch_size)
+{
+	struct k3_udma_glue_rx_flow *flow = &rx_chn->flows[flow_num];
+
+	return k3_ringacc_ring_push_batch(flow->ringrxfdq, desc_dma, batch_size);
+}
+EXPORT_SYMBOL_GPL(k3_udma_glue_push_rx_chn_batch);
+
 int k3_udma_glue_pop_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 			    u32 flow_num, dma_addr_t *desc_dma)
 {
@@ -1521,6 +1565,17 @@ int k3_udma_glue_pop_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 }
 EXPORT_SYMBOL_GPL(k3_udma_glue_pop_rx_chn);
 
+int k3_udma_glue_pop_rx_chn_batch(struct k3_udma_glue_rx_channel *rx_chn,
+				  u32 flow_num, dma_addr_t *desc_dma,
+				  u32 *batch_size, u32 max_batch)
+{
+	struct k3_udma_glue_rx_flow *flow = &rx_chn->flows[flow_num];
+
+	return k3_ringacc_ring_pop_batch(flow->ringrx, desc_dma, batch_size,
+					 max_batch);
+}
+EXPORT_SYMBOL_GPL(k3_udma_glue_pop_rx_chn_batch);
+
 int k3_udma_glue_rx_get_irq(struct k3_udma_glue_rx_channel *rx_chn,
 			    u32 flow_num)
 {
diff --git a/include/linux/dma/k3-udma-glue.h b/include/linux/dma/k3-udma-glue.h
index 5d43881e6fb7..9fe3f51c230c 100644
--- a/include/linux/dma/k3-udma-glue.h
+++ b/include/linux/dma/k3-udma-glue.h
@@ -35,8 +35,14 @@ void k3_udma_glue_release_tx_chn(struct k3_udma_glue_tx_channel *tx_chn);
 int k3_udma_glue_push_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
 			     struct cppi5_host_desc_t *desc_tx,
 			     dma_addr_t desc_dma);
+int k3_udma_glue_push_tx_chn_batch(struct k3_udma_glue_tx_channel *tx_chn,
+				   struct cppi5_host_desc_t **desc_tx,
+				   dma_addr_t *desc_dma, u32 batch_size);
 int k3_udma_glue_pop_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
 			    dma_addr_t *desc_dma);
+int k3_udma_glue_pop_tx_chn_batch(struct k3_udma_glue_tx_channel *tx_chn,
+				  dma_addr_t *desc_dma, u32 *batch_size,
+				  u32 max_batch);
 int k3_udma_glue_enable_tx_chn(struct k3_udma_glue_tx_channel *tx_chn);
 void k3_udma_glue_disable_tx_chn(struct k3_udma_glue_tx_channel *tx_chn);
 void k3_udma_glue_tdown_tx_chn(struct k3_udma_glue_tx_channel *tx_chn,
@@ -127,8 +133,14 @@ void k3_udma_glue_tdown_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 int k3_udma_glue_push_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 		u32 flow_num, struct cppi5_host_desc_t *desc_tx,
 		dma_addr_t desc_dma);
+int k3_udma_glue_push_rx_chn_batch(struct k3_udma_glue_rx_channel *rx_chn,
+				   u32 flow_num, dma_addr_t *desc_dma,
+				   u32 batch_size);
 int k3_udma_glue_pop_rx_chn(struct k3_udma_glue_rx_channel *rx_chn,
 		u32 flow_num, dma_addr_t *desc_dma);
+int k3_udma_glue_pop_rx_chn_batch(struct k3_udma_glue_rx_channel *rx_chn,
+				  u32 flow_num, dma_addr_t *desc_dma,
+				  u32 *batch_size, u32 max_batch);
 int k3_udma_glue_rx_flow_init(struct k3_udma_glue_rx_channel *rx_chn,
 		u32 flow_idx, struct k3_udma_glue_rx_flow_cfg *flow_cfg);
 u32 k3_udma_glue_rx_flow_get_fdq_id(struct k3_udma_glue_rx_channel *rx_chn,
-- 
2.51.1



* [RFC PATCH 4/6] net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB fragments
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
                   ` (2 preceding siblings ...)
  2026-03-25 12:38 ` [RFC PATCH 3/6] dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX DMA Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 5/6] net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 6/6] net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX CMPL Siddharth Vadapalli
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

There are two kinds of descriptors:
1. Host Packet Descriptor
2. Host Buffer Descriptor

Unfragmented SKBs are always associated with a single Host Packet
Descriptor. Fragmented SKBs, on the other hand, have their Start-of-Packet
data associated with a single Host Packet Descriptor, while each remaining
fragment is associated with a Host Buffer Descriptor: the Host Packet
Descriptor is linked to a chain of as many Host Buffer Descriptors as
there are SKB fragments.

Since packet completion handling only uses the buffer type of the Host
Packet Descriptor, setting the buffer type of the linked Host Buffer
Descriptors is an unnecessary operation which wastes CPU cycles per SKB
fragment.

Hence, do not set buffer type for SKB fragments.
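A small sketch of the layout described above (illustrative structs, not
the actual CPPI descriptor format): completion handling only ever
inspects buf_type on the head Host Packet Descriptor, so the chained
Host Buffer Descriptors can leave it unset.

```c
#include <assert.h>
#include <stddef.h>

enum buf_type { BUF_TYPE_UNSET, BUF_TYPE_SKB };

/* One descriptor: a packet descriptor heads a chain of buffer
 * descriptors, one per SKB fragment. */
struct mock_desc {
	enum buf_type buf_type;
	struct mock_desc *next;	/* next buffer descriptor, or NULL */
};

/* Completion only consults the head descriptor's buf_type; the
 * chained descriptors never need theirs set. */
static enum buf_type completed_buf_type(const struct mock_desc *head)
{
	return head->buf_type;
}
```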

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index d9400599e80a..6df6cb52d952 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -1678,9 +1678,6 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 			goto busy_free_descs;
 		}
 
-		am65_cpsw_nuss_set_buf_type(tx_chn, next_desc,
-					    AM65_CPSW_TX_BUF_TYPE_SKB);
-
 		buf_dma = skb_frag_dma_map(tx_chn->dma_dev, frag, 0, frag_size,
 					   DMA_TO_DEVICE);
 		if (unlikely(dma_mapping_error(tx_chn->dma_dev, buf_dma))) {
-- 
2.51.1



* [RFC PATCH 5/6] net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
                   ` (3 preceding siblings ...)
  2026-03-25 12:38 ` [RFC PATCH 4/6] net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB fragments Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  2026-03-25 12:38 ` [RFC PATCH 6/6] net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX CMPL Siddharth Vadapalli
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

The existing implementation allocates CPPI Descriptors from the
Descriptor Pool of the TX and RX Channels on demand and returns them to
the Pool on completion. Recycle descriptors instead, to speed up the
transmit and receive paths. Use a Cyclic Queue (Ring) to hold the TX and
RX Descriptors for the respective TX Channel and RX Flow, and use atomic
operations to guard against concurrent modification.
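The cyclic queue can be sketched in user-space C11 (illustrative names;
stdatomic stands in for the kernel's atomic_t and cmpxchg): the ring
holds capacity + 1 slots, is empty when head == tail, and both indices
advance via compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_CAP 4	/* plays the role of AM65_CPSW_MAX_TX_DESC */

struct mock_ring {
	void *slots[RING_CAP + 1];
	atomic_int head;	/* consumer index */
	atomic_int tail;	/* producer index */
};

/* Models am65_cpsw_nuss_put_tx_desc(): claim the tail slot via CAS,
 * then store the recycled descriptor there. */
static void ring_put(struct mock_ring *r, void *desc)
{
	int tail, new_tail;

	do {
		tail = atomic_load(&r->tail);
		new_tail = (tail + 1) % (RING_CAP + 1);
	} while (!atomic_compare_exchange_weak(&r->tail, &tail, new_tail));
	r->slots[tail] = desc;
}

/* Models am65_cpsw_nuss_get_tx_desc(): empty when head == tail,
 * otherwise advance head via CAS and return the claimed slot. */
static void *ring_get(struct mock_ring *r)
{
	int head, new_head;

	do {
		head = atomic_load(&r->head);
		if (head == atomic_load(&r->tail))
			return NULL;	/* empty */
		new_head = (head + 1) % (RING_CAP + 1);
	} while (!atomic_compare_exchange_weak(&r->head, &head, new_head));
	return r->slots[head];
}
```

Note that, mirroring the driver code, the slot store in ring_put()
happens after the index CAS; this assumes a consumer never catches up
within that window.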

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 237 ++++++++++++++++++++---
 drivers/net/ethernet/ti/am65-cpsw-nuss.h |  19 ++
 2 files changed, 233 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 6df6cb52d952..fc165579a479 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -145,9 +145,6 @@
 	 AM65_CPSW_PN_TS_CTL_RX_ANX_F_EN)
 
 #define AM65_CPSW_ALE_AGEOUT_DEFAULT	30
-/* Number of TX/RX descriptors per channel/flow */
-#define AM65_CPSW_MAX_TX_DESC	500
-#define AM65_CPSW_MAX_RX_DESC	500
 
 #define AM65_CPSW_NAV_PS_DATA_SIZE 16
 #define AM65_CPSW_NAV_SW_DATA_SIZE 16
@@ -374,6 +371,122 @@ static void am65_cpsw_slave_set_promisc(struct am65_cpsw_port *port,
 	}
 }
 
+static size_t am65_cpsw_nuss_num_free_tx_desc(struct am65_cpsw_tx_chn *tx_chn)
+{
+	struct am65_cpsw_tx_ring *tx_ring = &tx_chn->tx_ring;
+	int head_idx, tail_idx, num_free;
+
+	/* Atomically read both head and tail indices */
+	head_idx = atomic_read(&tx_ring->tx_desc_ring_head_idx);
+	tail_idx = atomic_read(&tx_ring->tx_desc_ring_tail_idx);
+
+	/* Calculate number of available descriptors in circular queue */
+	num_free = (tail_idx - head_idx + (AM65_CPSW_MAX_TX_DESC + 1)) %
+		   (AM65_CPSW_MAX_TX_DESC + 1);
+
+	return num_free;
+}
+
+static void am65_cpsw_nuss_put_tx_desc(struct am65_cpsw_tx_chn *tx_chn,
+				       struct cppi5_host_desc_t *desc)
+{
+	struct am65_cpsw_tx_ring *tx_ring = &tx_chn->tx_ring;
+	int tail_idx, new_tail_idx;
+
+	/* Atomically get current tail index and calculate new wrapped index */
+	do {
+		tail_idx = atomic_read(&tx_ring->tx_desc_ring_tail_idx);
+		new_tail_idx = tail_idx + 1;
+		if (new_tail_idx > AM65_CPSW_MAX_TX_DESC)
+			new_tail_idx = 0;
+	} while (atomic_cmpxchg(&tx_ring->tx_desc_ring_tail_idx,
+				tail_idx, new_tail_idx) != tail_idx);
+
+	/* Store the descriptor at the tail position */
+	tx_ring->tx_descs[tail_idx] = desc;
+}
+
+static void am65_cpsw_nuss_put_rx_desc(struct am65_cpsw_rx_flow *flow,
+				       struct cppi5_host_desc_t *desc)
+{
+	struct am65_cpsw_rx_ring *rx_ring = &flow->rx_ring;
+	int tail_idx, new_tail_idx;
+
+	/* Atomically get current tail index and calculate new wrapped index */
+	do {
+		tail_idx = atomic_read(&rx_ring->rx_desc_ring_tail_idx);
+		new_tail_idx = tail_idx + 1;
+		if (new_tail_idx > AM65_CPSW_MAX_RX_DESC)
+			new_tail_idx = 0;
+	} while (atomic_cmpxchg(&rx_ring->rx_desc_ring_tail_idx,
+				tail_idx, new_tail_idx) != tail_idx);
+
+	/* Store the descriptor at the tail position */
+	rx_ring->rx_descs[tail_idx] = desc;
+}
+
+static void *am65_cpsw_nuss_get_tx_desc(struct am65_cpsw_tx_chn *tx_chn)
+{
+	struct am65_cpsw_tx_ring *tx_ring = &tx_chn->tx_ring;
+	int head_idx, tail_idx, new_head_idx;
+
+	/* Atomically get current head index and check if queue is empty */
+	do {
+		head_idx = atomic_read(&tx_ring->tx_desc_ring_head_idx);
+		tail_idx = atomic_read(&tx_ring->tx_desc_ring_tail_idx);
+
+		/* Queue is empty when head == tail */
+		if (head_idx == tail_idx)
+			return NULL;
+
+		/* Calculate new head with wraparound */
+		new_head_idx = head_idx + 1;
+		if (new_head_idx > AM65_CPSW_MAX_TX_DESC)
+			new_head_idx = 0;
+
+	} while (atomic_cmpxchg(&tx_ring->tx_desc_ring_head_idx,
+				head_idx, new_head_idx) != head_idx);
+
+	return tx_ring->tx_descs[head_idx];
+}
+
+static void *am65_cpsw_nuss_get_rx_desc(struct am65_cpsw_rx_flow *flow)
+{
+	struct am65_cpsw_rx_ring *rx_ring = &flow->rx_ring;
+	int head_idx, tail_idx, new_head_idx;
+
+	/* Atomically get current head index and check if queue is empty */
+	do {
+		head_idx = atomic_read(&rx_ring->rx_desc_ring_head_idx);
+		tail_idx = atomic_read(&rx_ring->rx_desc_ring_tail_idx);
+
+		/* Queue is empty when head == tail */
+		if (head_idx == tail_idx)
+			return NULL;
+
+		/* Calculate new head with wraparound */
+		new_head_idx = head_idx + 1;
+		if (new_head_idx > AM65_CPSW_MAX_RX_DESC)
+			new_head_idx = 0;
+
+	} while (atomic_cmpxchg(&rx_ring->rx_desc_ring_head_idx,
+				head_idx, new_head_idx) != head_idx);
+
+	return rx_ring->rx_descs[head_idx];
+}
+
+static inline int am65_cpsw_nuss_tx_descs_available(struct am65_cpsw_tx_chn *tx_chn)
+{
+	struct am65_cpsw_tx_ring *tx_ring = &tx_chn->tx_ring;
+	int head_idx, tail_idx;
+
+	head_idx = atomic_read(&tx_ring->tx_desc_ring_head_idx);
+	tail_idx = atomic_read(&tx_ring->tx_desc_ring_tail_idx);
+
+	return (tail_idx - head_idx + (AM65_CPSW_MAX_TX_DESC + 1)) %
+	       (AM65_CPSW_MAX_TX_DESC + 1);
+}
+
 static void am65_cpsw_nuss_ndo_slave_set_rx_mode(struct net_device *ndev)
 {
 	struct am65_cpsw_common *common = am65_ndev_to_common(ndev);
@@ -423,7 +536,7 @@ static void am65_cpsw_nuss_ndo_host_tx_timeout(struct net_device *ndev,
 		   netif_tx_queue_stopped(netif_txq),
 		   jiffies_to_msecs(jiffies - trans_start),
 		   netdev_queue_dql_avail(netif_txq),
-		   k3_cppi_desc_pool_avail(tx_chn->desc_pool));
+		   am65_cpsw_nuss_num_free_tx_desc(tx_chn));
 
 	if (netif_tx_queue_stopped(netif_txq)) {
 		/* try recover if stopped by us */
@@ -442,7 +555,7 @@ static int am65_cpsw_nuss_rx_push(struct am65_cpsw_common *common,
 	dma_addr_t desc_dma;
 	dma_addr_t buf_dma;
 
-	desc_rx = k3_cppi_desc_pool_alloc(rx_chn->desc_pool);
+	desc_rx = am65_cpsw_nuss_get_rx_desc(&rx_chn->flows[flow_idx]);
 	if (!desc_rx) {
 		dev_err(dev, "Failed to allocate RXFDQ descriptor\n");
 		return -ENOMEM;
@@ -453,7 +566,7 @@ static int am65_cpsw_nuss_rx_push(struct am65_cpsw_common *common,
 				 page_address(page) + AM65_CPSW_HEADROOM,
 				 AM65_CPSW_MAX_PACKET_SIZE, DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(rx_chn->dma_dev, buf_dma))) {
-		k3_cppi_desc_pool_free(rx_chn->desc_pool, desc_rx);
+		am65_cpsw_nuss_put_rx_desc(&rx_chn->flows[flow_idx], desc_rx);
 		dev_err(dev, "Failed to map rx buffer\n");
 		return -EINVAL;
 	}
@@ -508,6 +621,7 @@ static void am65_cpsw_nuss_tx_cleanup(void *data, dma_addr_t desc_dma);
 static void am65_cpsw_destroy_rxq(struct am65_cpsw_common *common, int id)
 {
 	struct am65_cpsw_rx_chn *rx_chn = &common->rx_chns;
+	struct cppi5_host_desc_t *rx_desc;
 	struct am65_cpsw_rx_flow *flow;
 	struct xdp_rxq_info *rxq;
 	int port;
@@ -515,6 +629,12 @@ static void am65_cpsw_destroy_rxq(struct am65_cpsw_common *common, int id)
 	flow = &rx_chn->flows[id];
 	napi_disable(&flow->napi_rx);
 	hrtimer_cancel(&flow->rx_hrtimer);
+	/* return descriptors to pool */
+	rx_desc = am65_cpsw_nuss_get_rx_desc(flow);
+	while (rx_desc) {
+		k3_cppi_desc_pool_free(rx_chn->desc_pool, rx_desc);
+		rx_desc = am65_cpsw_nuss_get_rx_desc(flow);
+	}
 	k3_udma_glue_reset_rx_chn(rx_chn->rx_chn, id, rx_chn,
 				  am65_cpsw_nuss_rx_cleanup);
 
@@ -603,7 +723,12 @@ static int am65_cpsw_create_rxq(struct am65_cpsw_common *common, int id)
 			goto err;
 	}
 
+	/* Preallocate all RX Descriptors */
+	atomic_set(&flow->rx_ring.rx_desc_ring_head_idx, 0);
+	atomic_set(&flow->rx_ring.rx_desc_ring_tail_idx, AM65_CPSW_MAX_RX_DESC);
+
 	for (i = 0; i < AM65_CPSW_MAX_RX_DESC; i++) {
+		flow->rx_ring.rx_descs[i] = k3_cppi_desc_pool_alloc(rx_chn->desc_pool);
 		page = page_pool_dev_alloc_pages(flow->page_pool);
 		if (!page) {
 			dev_err(common->dev, "cannot allocate page in flow %d\n",
@@ -661,9 +786,16 @@ static int am65_cpsw_create_rxqs(struct am65_cpsw_common *common)
 static void am65_cpsw_destroy_txq(struct am65_cpsw_common *common, int id)
 {
 	struct am65_cpsw_tx_chn *tx_chn = &common->tx_chns[id];
+	struct cppi5_host_desc_t *tx_desc;
 
 	napi_disable(&tx_chn->napi_tx);
 	hrtimer_cancel(&tx_chn->tx_hrtimer);
+	/* return descriptors to pool */
+	tx_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
+	while (tx_desc) {
+		k3_cppi_desc_pool_free(tx_chn->desc_pool, tx_desc);
+		tx_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
+	}
 	k3_udma_glue_reset_tx_chn(tx_chn->tx_chn, tx_chn,
 				  am65_cpsw_nuss_tx_cleanup);
 	k3_udma_glue_disable_tx_chn(tx_chn->tx_chn);
@@ -695,7 +827,13 @@ static void am65_cpsw_destroy_txqs(struct am65_cpsw_common *common)
 static int am65_cpsw_create_txq(struct am65_cpsw_common *common, int id)
 {
 	struct am65_cpsw_tx_chn *tx_chn = &common->tx_chns[id];
-	int ret;
+	int ret, i;
+
+	/* Preallocate all TX Descriptors */
+	atomic_set(&tx_chn->tx_ring.tx_desc_ring_head_idx, 0);
+	atomic_set(&tx_chn->tx_ring.tx_desc_ring_tail_idx, AM65_CPSW_MAX_TX_DESC);
+	for (i = 0; i < AM65_CPSW_MAX_TX_DESC; i++)
+		tx_chn->tx_ring.tx_descs[i] = k3_cppi_desc_pool_alloc(tx_chn->desc_pool);
 
 	ret = k3_udma_glue_enable_tx_chn(tx_chn->tx_chn);
 	if (ret)
@@ -1103,7 +1241,7 @@ static int am65_cpsw_xdp_tx_frame(struct net_device *ndev,
 	u32 pkt_len = xdpf->len;
 	int ret;
 
-	host_desc = k3_cppi_desc_pool_alloc(tx_chn->desc_pool);
+	host_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
 	if (unlikely(!host_desc)) {
 		ndev->stats.tx_dropped++;
 		return AM65_CPSW_XDP_CONSUMED;	/* drop */
@@ -1161,7 +1299,7 @@ static int am65_cpsw_xdp_tx_frame(struct net_device *ndev,
 	k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &dma_buf);
 	dma_unmap_single(tx_chn->dma_dev, dma_buf, pkt_len, DMA_TO_DEVICE);
 pool_free:
-	k3_cppi_desc_pool_free(tx_chn->desc_pool, host_desc);
+	am65_cpsw_nuss_put_tx_desc(tx_chn, host_desc);
 	return ret;
 }
 
@@ -1320,7 +1458,7 @@ static int am65_cpsw_nuss_rx_packets(struct am65_cpsw_rx_flow *flow,
 	dev_dbg(dev, "%s rx csum_info:%#x\n", __func__, csum_info);
 
 	dma_unmap_single(rx_chn->dma_dev, buf_dma, buf_dma_len, DMA_FROM_DEVICE);
-	k3_cppi_desc_pool_free(rx_chn->desc_pool, desc_rx);
+	am65_cpsw_nuss_put_rx_desc(flow, desc_rx);
 
 	if (port->xdp_prog) {
 		xdp_init_buff(&xdp, PAGE_SIZE, &port->xdp_rxq[flow->id]);
@@ -1444,13 +1582,48 @@ static void am65_cpsw_nuss_tx_wake(struct am65_cpsw_tx_chn *tx_chn, struct net_d
 		 */
 		__netif_tx_lock(netif_txq, smp_processor_id());
 		if (netif_running(ndev) &&
-		    (k3_cppi_desc_pool_avail(tx_chn->desc_pool) >= MAX_SKB_FRAGS))
+		    (am65_cpsw_nuss_num_free_tx_desc(tx_chn) >= MAX_SKB_FRAGS))
 			netif_tx_wake_queue(netif_txq);
 
 		__netif_tx_unlock(netif_txq);
 	}
 }
 
+static inline void am65_cpsw_nuss_xmit_recycle(struct am65_cpsw_tx_chn *tx_chn,
+					       struct cppi5_host_desc_t *desc)
+{
+	struct cppi5_host_desc_t *first_desc, *next_desc;
+	dma_addr_t buf_dma, next_desc_dma;
+	u32 buf_dma_len;
+
+	first_desc = desc;
+	next_desc = first_desc;
+
+	cppi5_hdesc_get_obuf(first_desc, &buf_dma, &buf_dma_len);
+	k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &buf_dma);
+
+	dma_unmap_single(tx_chn->dma_dev, buf_dma, buf_dma_len, DMA_TO_DEVICE);
+
+	next_desc_dma = cppi5_hdesc_get_next_hbdesc(first_desc);
+	k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &next_desc_dma);
+	while (next_desc_dma) {
+		next_desc = k3_cppi_desc_pool_dma2virt(tx_chn->desc_pool,
+						       next_desc_dma);
+		cppi5_hdesc_get_obuf(next_desc, &buf_dma, &buf_dma_len);
+		k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &buf_dma);
+
+		dma_unmap_page(tx_chn->dma_dev, buf_dma, buf_dma_len,
+			       DMA_TO_DEVICE);
+
+		next_desc_dma = cppi5_hdesc_get_next_hbdesc(next_desc);
+		k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &next_desc_dma);
+
+		am65_cpsw_nuss_put_tx_desc(tx_chn, next_desc);
+	}
+
+	am65_cpsw_nuss_put_tx_desc(tx_chn, first_desc);
+}
+
 static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 					   int chn, unsigned int budget, bool *tdown)
 {
@@ -1509,7 +1682,7 @@ static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 
 		total_bytes += pkt_len;
 		num_tx++;
-		am65_cpsw_nuss_xmit_free(tx_chn, desc_tx);
+		am65_cpsw_nuss_xmit_recycle(tx_chn, desc_tx);
 		dev_sw_netstats_tx_add(ndev, 1, pkt_len);
 		if (!single_port) {
 			/* as packets from multi ports can be interleaved
@@ -1624,7 +1797,7 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 		goto err_free_skb;
 	}
 
-	first_desc = k3_cppi_desc_pool_alloc(tx_chn->desc_pool);
+	first_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
 	if (!first_desc) {
 		dev_dbg(dev, "Failed to allocate descriptor\n");
 		dma_unmap_single(tx_chn->dma_dev, buf_dma, pkt_len,
@@ -1672,7 +1845,7 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		u32 frag_size = skb_frag_size(frag);
 
-		next_desc = k3_cppi_desc_pool_alloc(tx_chn->desc_pool);
+		next_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
 		if (!next_desc) {
 			dev_err(dev, "Failed to allocate descriptor\n");
 			goto busy_free_descs;
@@ -1682,7 +1855,7 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 					   DMA_TO_DEVICE);
 		if (unlikely(dma_mapping_error(tx_chn->dma_dev, buf_dma))) {
 			dev_err(dev, "Failed to map tx skb page\n");
-			k3_cppi_desc_pool_free(tx_chn->desc_pool, next_desc);
+			am65_cpsw_nuss_put_tx_desc(tx_chn, next_desc);
 			ndev->stats.tx_errors++;
 			goto err_free_descs;
 		}
@@ -1725,14 +1898,14 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 		goto err_free_descs;
 	}
 
-	if (k3_cppi_desc_pool_avail(tx_chn->desc_pool) < MAX_SKB_FRAGS) {
+	if (am65_cpsw_nuss_num_free_tx_desc(tx_chn) < MAX_SKB_FRAGS) {
 		netif_tx_stop_queue(netif_txq);
 		/* Barrier, so that stop_queue visible to other cpus */
 		smp_mb__after_atomic();
 		dev_dbg(dev, "netif_tx_stop_queue %d\n", q_idx);
 
 		/* re-check for smp */
-		if (k3_cppi_desc_pool_avail(tx_chn->desc_pool) >=
+		if (am65_cpsw_nuss_num_free_tx_desc(tx_chn) >=
 		    MAX_SKB_FRAGS) {
 			netif_tx_wake_queue(netif_txq);
 			dev_dbg(dev, "netif_tx_wake_queue %d\n", q_idx);
@@ -1742,14 +1915,14 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 	return NETDEV_TX_OK;
 
 err_free_descs:
-	am65_cpsw_nuss_xmit_free(tx_chn, first_desc);
+	am65_cpsw_nuss_xmit_recycle(tx_chn, first_desc);
 err_free_skb:
 	ndev->stats.tx_dropped++;
 	dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
 
 busy_free_descs:
-	am65_cpsw_nuss_xmit_free(tx_chn, first_desc);
+	am65_cpsw_nuss_xmit_recycle(tx_chn, first_desc);
 busy_stop_q:
 	netif_tx_stop_queue(netif_txq);
 	return NETDEV_TX_BUSY;
@@ -2195,7 +2368,7 @@ static void am65_cpsw_nuss_free_tx_chns(void *data)
 static void am65_cpsw_nuss_remove_tx_chns(struct am65_cpsw_common *common)
 {
 	struct device *dev = common->dev;
-	int i;
+	int i, j;
 
 	common->tx_ch_rate_msk = 0;
 	for (i = 0; i < common->tx_ch_num; i++) {
@@ -2205,6 +2378,10 @@ static void am65_cpsw_nuss_remove_tx_chns(struct am65_cpsw_common *common)
 			devm_free_irq(dev, tx_chn->irq, tx_chn);
 
 		netif_napi_del(&tx_chn->napi_tx);
+
+		for (j = 0; j < AM65_CPSW_MAX_TX_DESC; j++)
+			k3_cppi_desc_pool_free(tx_chn->desc_pool,
+					       tx_chn->tx_ring.tx_descs[j]);
 	}
 
 	am65_cpsw_nuss_free_tx_chns(common);
@@ -2260,7 +2437,7 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
 		.flags = 0
 	};
 	u32 hdesc_size, hdesc_size_out;
-	int i, ret = 0;
+	int i, j, ret = 0;
 
 	hdesc_size = cppi5_hdesc_calc_size(true, AM65_CPSW_NAV_PS_DATA_SIZE,
 					   AM65_CPSW_NAV_SW_DATA_SIZE);
@@ -2329,6 +2506,13 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
 	return 0;
 
 err:
+	/* Free descriptors */
+	while (i--) {
+		struct am65_cpsw_tx_chn *tx_chn = &common->tx_chns[i];
+
+		for (j = 0; j < AM65_CPSW_MAX_TX_DESC; j++)
+			k3_cppi_desc_pool_free(tx_chn->desc_pool, tx_chn->tx_ring.tx_descs[j]);
+	}
 	am65_cpsw_nuss_free_tx_chns(common);
 
 	return ret;
@@ -2353,7 +2537,7 @@ static void am65_cpsw_nuss_remove_rx_chns(struct am65_cpsw_common *common)
 	struct device *dev = common->dev;
 	struct am65_cpsw_rx_chn *rx_chn;
 	struct am65_cpsw_rx_flow *flows;
-	int i;
+	int i, j;
 
 	rx_chn = &common->rx_chns;
 	flows = rx_chn->flows;
@@ -2362,6 +2546,9 @@ static void am65_cpsw_nuss_remove_rx_chns(struct am65_cpsw_common *common)
 		if (!(flows[i].irq < 0))
 			devm_free_irq(dev, flows[i].irq, &flows[i]);
 		netif_napi_del(&flows[i].napi_rx);
+		for (j = 0; j < AM65_CPSW_MAX_RX_DESC; j++)
+			k3_cppi_desc_pool_free(rx_chn->desc_pool,
+					       flows[i].rx_ring.rx_descs[j]);
 	}
 
 	am65_cpsw_nuss_free_rx_chns(common);
@@ -2378,7 +2565,7 @@ static int am65_cpsw_nuss_init_rx_chns(struct am65_cpsw_common *common)
 	struct am65_cpsw_rx_flow *flow;
 	u32 hdesc_size, hdesc_size_out;
 	u32 fdqring_id;
-	int i, ret = 0;
+	int i, j, ret = 0;
 
 	hdesc_size = cppi5_hdesc_calc_size(true, AM65_CPSW_NAV_PS_DATA_SIZE,
 					   AM65_CPSW_NAV_SW_DATA_SIZE);
@@ -2498,10 +2685,14 @@ static int am65_cpsw_nuss_init_rx_chns(struct am65_cpsw_common *common)
 
 err_request_irq:
 	netif_napi_del(&flow->napi_rx);
+	for (j = 0; j < AM65_CPSW_MAX_RX_DESC; j++)
+		k3_cppi_desc_pool_free(rx_chn->desc_pool, flow->rx_ring.rx_descs[j]);
 
 err_flow:
 	for (--i; i >= 0; i--) {
 		flow = &rx_chn->flows[i];
+		for (j = 0; j < AM65_CPSW_MAX_RX_DESC; j++)
+			k3_cppi_desc_pool_free(rx_chn->desc_pool, flow->rx_ring.rx_descs[j]);
 		devm_free_irq(dev, flow->irq, flow);
 		netif_napi_del(&flow->napi_rx);
 	}
diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.h b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
index 7750448e4746..e64b4cfd6f2c 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.h
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
@@ -6,6 +6,7 @@
 #ifndef AM65_CPSW_NUSS_H_
 #define AM65_CPSW_NUSS_H_
 
+#include <linux/dma/ti-cppi5.h>
 #include <linux/if_ether.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
@@ -23,6 +24,10 @@ struct am65_cpts;
 
 #define AM65_CPSW_MAX_QUEUES	8	/* both TX & RX */
 
+/* Number of TX/RX descriptors per channel/flow */
+#define AM65_CPSW_MAX_TX_DESC	500
+#define AM65_CPSW_MAX_RX_DESC	500
+
 #define AM65_CPSW_PORT_VLAN_REG_OFFSET	0x014
 
 struct am65_cpsw_slave_data {
@@ -75,6 +80,12 @@ struct am65_cpsw_host {
 	u32				vid_context;
 };
 
+struct am65_cpsw_tx_ring {
+	struct cppi5_host_desc_t *tx_descs[AM65_CPSW_MAX_TX_DESC + 1];
+	atomic_t tx_desc_ring_head_idx; /* Points to dequeuing place for free descriptor */
+	atomic_t tx_desc_ring_tail_idx; /* Points to queuing place for freed descriptor */
+};
+
 struct am65_cpsw_tx_chn {
 	struct device *dma_dev;
 	struct napi_struct napi_tx;
@@ -82,6 +93,7 @@ struct am65_cpsw_tx_chn {
 	struct k3_cppi_desc_pool *desc_pool;
 	struct k3_udma_glue_tx_channel *tx_chn;
 	spinlock_t lock; /* protect TX rings in multi-port mode */
+	struct am65_cpsw_tx_ring tx_ring;
 	struct hrtimer tx_hrtimer;
 	unsigned long tx_pace_timeout;
 	int irq;
@@ -92,12 +104,19 @@ struct am65_cpsw_tx_chn {
 	u32 rate_mbps;
 };
 
+struct am65_cpsw_rx_ring {
+	struct cppi5_host_desc_t *rx_descs[AM65_CPSW_MAX_RX_DESC + 1];
+	atomic_t rx_desc_ring_head_idx; /* Points to dequeuing place for free descriptor */
+	atomic_t rx_desc_ring_tail_idx; /* Points to queuing place for freed descriptor */
+};
+
 struct am65_cpsw_rx_flow {
 	u32 id;
 	struct napi_struct napi_rx;
 	struct am65_cpsw_common	*common;
 	int irq;
 	bool irq_disabled;
+	struct am65_cpsw_rx_ring rx_ring;
 	struct hrtimer rx_hrtimer;
 	unsigned long rx_pace_timeout;
 	struct page_pool *page_pool;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread
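The descriptor-recycling patch above keeps preallocated CPPI descriptors in a per-channel/per-flow ring indexed by atomic head/tail counters, sized `MAX_DESC + 1` slots for `MAX_DESC` descriptors (a classic one-slot-open ring, so head == tail can mean "empty"). The `am65_cpsw_nuss_get_*_desc()` / `am65_cpsw_nuss_put_*_desc()` helpers are defined in a hunk not shown here; the following is a plausible user-space model of that scheme, with illustrative names rather than the driver's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative ring size; the driver uses AM65_CPSW_MAX_TX_DESC (500). */
#define RING_SIZE 8

/*
 * One-slot-open ring: head is where free descriptors are dequeued,
 * tail is where recycled descriptors are queued back. The ring is
 * empty when head == tail, which is why a ring of RING_SIZE + 1
 * slots can hold at most RING_SIZE descriptors.
 */
struct desc_ring {
	void *descs[RING_SIZE + 1];
	atomic_int head;	/* dequeue point for free descriptors */
	atomic_int tail;	/* enqueue point for freed descriptors */
};

/* Model of a get_desc() helper: take one free descriptor, or NULL. */
static void *ring_get(struct desc_ring *r)
{
	int head = atomic_load(&r->head);

	if (head == atomic_load(&r->tail))
		return NULL;	/* no free descriptor available */
	void *d = r->descs[head];
	atomic_store(&r->head, (head + 1) % (RING_SIZE + 1));
	return d;
}

/* Model of a put_desc() helper: recycle a descriptor into the ring. */
static void ring_put(struct desc_ring *r, void *desc)
{
	int tail = atomic_load(&r->tail);

	r->descs[tail] = desc;
	atomic_store(&r->tail, (tail + 1) % (RING_SIZE + 1));
}

/* Model of num_free_desc(): how many descriptors are available. */
static int ring_free_count(struct desc_ring *r)
{
	int head = atomic_load(&r->head);
	int tail = atomic_load(&r->tail);

	return (tail - head + RING_SIZE + 1) % (RING_SIZE + 1);
}
```

In this model, `ring_get()`/`ring_put()` replace pool alloc/free on the hot path, matching the patch's substitution of `k3_cppi_desc_pool_alloc()`/`k3_cppi_desc_pool_free()`; the real pool is only touched at queue create/destroy time.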

* [RFC PATCH 6/6] net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX CMPL
  2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
                   ` (4 preceding siblings ...)
  2026-03-25 12:38 ` [RFC PATCH 5/6] net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors Siddharth Vadapalli
@ 2026-03-25 12:38 ` Siddharth Vadapalli
  5 siblings, 0 replies; 7+ messages in thread
From: Siddharth Vadapalli @ 2026-03-25 12:38 UTC (permalink / raw)
  To: peter.ujfalusi, vkoul, Frank.Li, andrew+netdev, davem, edumazet,
	kuba, pabeni, nm, ssantosh, horms, c-vankar, mwalle
  Cc: dmaengine, linux-kernel, netdev, linux-arm-kernel, danishanwar,
	srk, s-vadapalli

Enable batch processing on the transmit and transmit-completion paths:
submit packet descriptors in batches from the transmit path
(ndo_start_xmit and ndo_xdp_xmit), and dequeue completed descriptors in
batches on the transmit-completion path.
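The flush policy used below (batch full, queue stopped, no more packets expected, or free descriptors running low) can be modelled with a small self-contained sketch. All names here (`tx_batch`, `batch_add`, `batch_flush`) are illustrative stand-ins, not the driver's API; `batch_flush()` stands in for `k3_udma_glue_push_tx_chn_batch()`, and per-queue/per-netdev fan-out and locking are omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_SIZE 4	/* driver uses AM65_CPSW_TX_BATCH_SIZE (128) */

struct tx_batch {
	void *descs[BATCH_SIZE];
	unsigned int idx;	/* queued-but-unsubmitted descriptors */
	unsigned int flushes;	/* how many times the batch was submitted */
};

/* Stand-in for the batched push to hardware: submit everything queued. */
static void batch_flush(struct tx_batch *b)
{
	if (!b->idx)
		return;
	b->flushes++;
	b->idx = 0;
}

/*
 * Queue one descriptor, then flush when any of the conditions from the
 * patch holds: batch is full, queue is stopped, no further packets are
 * expected (xmit_more == false), or free descriptors are running low.
 */
static void batch_add(struct tx_batch *b, void *desc, bool queue_stopped,
		      bool xmit_more, bool descs_low)
{
	b->descs[b->idx++] = desc;
	if (b->idx == BATCH_SIZE || queue_stopped || !xmit_more || descs_low)
		batch_flush(b);
}
```

The key property this sketch shows is that descriptors are never held back indefinitely: the `!xmit_more` and `descs_low` conditions bound both latency and descriptor starvation, while the batch-size condition bounds per-packet push overhead.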

Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 201 +++++++++++++++++++----
 drivers/net/ethernet/ti/am65-cpsw-nuss.h |  12 ++
 2 files changed, 178 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index fc165579a479..2b354af14cb7 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -1624,14 +1624,14 @@ static inline void am65_cpsw_nuss_xmit_recycle(struct am65_cpsw_tx_chn *tx_chn,
 	am65_cpsw_nuss_put_tx_desc(tx_chn, first_desc);
 }
 
-static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
-					   int chn, unsigned int budget, bool *tdown)
+static int am65_cpsw_nuss_tx_cmpl_free_batch(struct am65_cpsw_common *common, int chn,
+					     u32 batch_size, unsigned int budget,
+					     bool *tdown)
 {
 	bool single_port = AM65_CPSW_IS_CPSW2G(common);
 	enum am65_cpsw_tx_buf_type buf_type;
 	struct am65_cpsw_tx_swdata *swdata;
 	struct cppi5_host_desc_t *desc_tx;
-	struct device *dev = common->dev;
 	struct am65_cpsw_tx_chn *tx_chn;
 	struct netdev_queue *netif_txq;
 	unsigned int total_bytes = 0;
@@ -1640,21 +1640,13 @@ static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 	unsigned int pkt_len;
 	struct sk_buff *skb;
 	dma_addr_t desc_dma;
-	int res, num_tx = 0;
+	int num_tx = 0, i;
 
 	tx_chn = &common->tx_chns[chn];
 
-	while (true) {
-		if (!single_port)
-			spin_lock(&tx_chn->lock);
-		res = k3_udma_glue_pop_tx_chn(tx_chn->tx_chn, &desc_dma);
-		if (!single_port)
-			spin_unlock(&tx_chn->lock);
-
-		if (res == -ENODATA)
-			break;
-
-		if (cppi5_desc_is_tdcm(desc_dma)) {
+	for (i = 0; i < batch_size; i++) {
+		desc_dma = tx_chn->cmpl_desc_dma_array[i];
+		if (unlikely(cppi5_desc_is_tdcm(desc_dma))) {
 			if (atomic_dec_and_test(&common->tdown_cnt))
 				complete(&common->tdown_complete);
 			*tdown = true;
@@ -1701,7 +1693,34 @@ static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 		am65_cpsw_nuss_tx_wake(tx_chn, ndev, netif_txq);
 	}
 
-	dev_dbg(dev, "%s:%u pkt:%d\n", __func__, chn, num_tx);
+	return num_tx;
+}
+
+static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
+					   int chn, unsigned int budget, bool *tdown)
+{
+	bool single_port = AM65_CPSW_IS_CPSW2G(common);
+	struct am65_cpsw_tx_chn *tx_chn;
+	u32 batch_size = 0;
+	int res, num_tx;
+
+	tx_chn = &common->tx_chns[chn];
+
+	if (!single_port)
+		spin_lock(&tx_chn->lock);
+
+	res = k3_udma_glue_pop_tx_chn_batch(tx_chn->tx_chn, tx_chn->cmpl_desc_dma_array,
+					    &batch_size, AM65_CPSW_TX_BATCH_SIZE);
+	if (!batch_size) {
+		if (!single_port)
+			spin_unlock(&tx_chn->lock);
+		return 0;
+	}
+
+	num_tx = am65_cpsw_nuss_tx_cmpl_free_batch(common, chn, batch_size, budget, tdown);
+
+	if (!single_port)
+		spin_unlock(&tx_chn->lock);
 
 	return num_tx;
 }
@@ -1760,18 +1779,48 @@ static irqreturn_t am65_cpsw_nuss_tx_irq(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+static void am65_cpsw_nuss_submit_ndev_batch(struct am65_cpsw_common *common)
+{
+	bool single_port = AM65_CPSW_IS_CPSW2G(common);
+	struct am65_cpsw_tx_desc_batch *tx_desc_batch;
+	struct am65_cpsw_tx_chn *tx_chn;
+	int ret, i;
+
+	/* Submit packets across netdevs across TX Channels */
+	for (i = 0; i < AM65_CPSW_MAX_QUEUES; i++) {
+		if (common->tx_desc_batch[i].tx_batch_idx) {
+			tx_chn = &common->tx_chns[i];
+			tx_desc_batch = &common->tx_desc_batch[i];
+			if (!single_port)
+				spin_lock_bh(&tx_chn->lock);
+			ret = k3_udma_glue_push_tx_chn_batch(tx_chn->tx_chn,
+							     tx_desc_batch->desc_tx_array,
+							     tx_desc_batch->desc_dma_array,
+							     tx_desc_batch->tx_batch_idx);
+			if (!single_port)
+				spin_unlock_bh(&tx_chn->lock);
+			if (ret)
+				dev_err(common->dev, "failed to push %u pkts on queue %d\n",
+					tx_desc_batch->tx_batch_idx, i);
+			tx_desc_batch->tx_batch_idx = 0;
+		}
+	}
+	atomic_set(&common->tx_batch_count, 0);
+}
+
 static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 						 struct net_device *ndev)
 {
 	struct am65_cpsw_common *common = am65_ndev_to_common(ndev);
 	struct cppi5_host_desc_t *first_desc, *next_desc, *cur_desc;
 	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
+	struct am65_cpsw_tx_desc_batch *tx_desc_batch;
 	struct am65_cpsw_tx_swdata *swdata;
 	struct device *dev = common->dev;
 	struct am65_cpsw_tx_chn *tx_chn;
 	struct netdev_queue *netif_txq;
 	dma_addr_t desc_dma, buf_dma;
-	int ret, q_idx, i;
+	int q_idx, i;
 	u32 *psdata;
 	u32 pkt_len;
 
@@ -1883,20 +1932,31 @@ static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 
 	cppi5_hdesc_set_pktlen(first_desc, pkt_len);
 	desc_dma = k3_cppi_desc_pool_virt2dma(tx_chn->desc_pool, first_desc);
-	if (AM65_CPSW_IS_CPSW2G(common)) {
-		ret = k3_udma_glue_push_tx_chn(tx_chn->tx_chn, first_desc, desc_dma);
-	} else {
-		spin_lock_bh(&tx_chn->lock);
-		ret = k3_udma_glue_push_tx_chn(tx_chn->tx_chn, first_desc, desc_dma);
-		spin_unlock_bh(&tx_chn->lock);
-	}
-	if (ret) {
-		dev_err(dev, "can't push desc %d\n", ret);
-		/* inform bql */
-		netdev_tx_completed_queue(netif_txq, 1, pkt_len);
-		ndev->stats.tx_errors++;
-		goto err_free_descs;
-	}
+
+	/* Batch processing begins */
+	spin_lock_bh(&common->tx_batch_lock);
+
+	tx_desc_batch = &common->tx_desc_batch[q_idx];
+	tx_desc_batch->desc_tx_array[tx_desc_batch->tx_batch_idx] = first_desc;
+	tx_desc_batch->desc_dma_array[tx_desc_batch->tx_batch_idx] = desc_dma;
+	tx_desc_batch->tx_batch_idx++;
+
+	/* Push the batch across all queues and all netdevs in any of the
+	 * following scenarios:
+	 * 1. If we reach the batch size
+	 * 2. If queue is stopped
+	 * 3. No more packets are expected for ndev
+	 * 4. We do not have sufficient free descriptors for upcoming packets
+	 *    and need to push the batch to reclaim them via completion
+	 */
+	if ((atomic_inc_return(&common->tx_batch_count) == AM65_CPSW_TX_BATCH_SIZE) ||
+	    netif_xmit_stopped(netif_txq) ||
+	    !netdev_xmit_more() ||
+	    (am65_cpsw_nuss_num_free_tx_desc(tx_chn) < MAX_SKB_FRAGS))
+		am65_cpsw_nuss_submit_ndev_batch(common);
+
+	/* Batch processing ends */
+	spin_unlock_bh(&common->tx_batch_lock);
 
 	if (am65_cpsw_nuss_num_free_tx_desc(tx_chn) < MAX_SKB_FRAGS) {
 		netif_tx_stop_queue(netif_txq);
@@ -2121,19 +2181,88 @@ static int am65_cpsw_ndo_xdp_xmit(struct net_device *ndev, int n,
 				  struct xdp_frame **frames, u32 flags)
 {
 	struct am65_cpsw_common *common = am65_ndev_to_common(ndev);
+	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
+	struct am65_cpsw_tx_desc_batch *tx_desc_batch;
+	struct cppi5_host_desc_t *host_desc;
+	struct am65_cpsw_tx_swdata *swdata;
 	struct am65_cpsw_tx_chn *tx_chn;
 	struct netdev_queue *netif_txq;
+	dma_addr_t dma_desc, dma_buf;
 	int cpu = smp_processor_id();
-	int i, nxmit = 0;
+	int i, q_idx, nxmit = 0;
+	struct xdp_frame *xdpf;
+	u32 pkt_len;
 
-	tx_chn = &common->tx_chns[cpu % common->tx_ch_num];
+	q_idx = cpu % common->tx_ch_num;
+	tx_chn = &common->tx_chns[q_idx];
 	netif_txq = netdev_get_tx_queue(ndev, tx_chn->id);
 
 	__netif_tx_lock(netif_txq, cpu);
 	for (i = 0; i < n; i++) {
-		if (am65_cpsw_xdp_tx_frame(ndev, tx_chn, frames[i],
-					   AM65_CPSW_TX_BUF_TYPE_XDP_NDO))
+		host_desc = am65_cpsw_nuss_get_tx_desc(tx_chn);
+		if (unlikely(!host_desc)) {
+			ndev->stats.tx_dropped++;
+			break;
+		}
+
+		xdpf = frames[i];
+		pkt_len = xdpf->len;
+
+		am65_cpsw_nuss_set_buf_type(tx_chn, host_desc, AM65_CPSW_TX_BUF_TYPE_XDP_NDO);
+
+		dma_buf = dma_map_single(tx_chn->dma_dev, xdpf->data,
+					 pkt_len, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(tx_chn->dma_dev, dma_buf))) {
+			ndev->stats.tx_dropped++;
+			am65_cpsw_nuss_put_tx_desc(tx_chn, host_desc);
 			break;
+		}
+
+		cppi5_hdesc_init(host_desc, CPPI5_INFO0_HDESC_EPIB_PRESENT,
+				 AM65_CPSW_NAV_PS_DATA_SIZE);
+		cppi5_hdesc_set_pkttype(host_desc, AM65_CPSW_CPPI_TX_PKT_TYPE);
+		cppi5_hdesc_set_pktlen(host_desc, pkt_len);
+		cppi5_desc_set_pktids(&host_desc->hdr, 0, AM65_CPSW_CPPI_TX_FLOW_ID);
+		cppi5_desc_set_tags_ids(&host_desc->hdr, 0, port->port_id);
+
+		k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &dma_buf);
+		cppi5_hdesc_attach_buf(host_desc, dma_buf, pkt_len, dma_buf, pkt_len);
+
+		swdata = cppi5_hdesc_get_swdata(host_desc);
+		swdata->ndev = ndev;
+		swdata->xdpf = xdpf;
+
+		/* Report BQL before sending the packet */
+		netif_txq = netdev_get_tx_queue(ndev, tx_chn->id);
+		netdev_tx_sent_queue(netif_txq, pkt_len);
+
+		dma_desc = k3_cppi_desc_pool_virt2dma(tx_chn->desc_pool, host_desc);
+
+		/* Batch processing begins */
+		spin_lock_bh(&common->tx_batch_lock);
+
+		tx_desc_batch = &common->tx_desc_batch[q_idx];
+		tx_desc_batch->desc_tx_array[tx_desc_batch->tx_batch_idx] = host_desc;
+		tx_desc_batch->desc_dma_array[tx_desc_batch->tx_batch_idx] = dma_desc;
+		tx_desc_batch->tx_batch_idx++;
+
+		/* Push the batch across all queues and all netdevs in any of the
+		 * following scenarios:
+		 * 1. If we reach the batch size
+		 * 2. If queue is stopped
+		 * 3. We are at the last XDP frame in the batch
+		 * 4. We do not have sufficient free descriptors for upcoming packets
+		 *    and need to push the batch to reclaim them via completion
+		 */
+		if ((atomic_inc_return(&common->tx_batch_count) == AM65_CPSW_TX_BATCH_SIZE) ||
+		    netif_xmit_stopped(netif_txq) ||
+		    (i == (n - 1)) ||
+		    (am65_cpsw_nuss_num_free_tx_desc(tx_chn) < MAX_SKB_FRAGS))
+			am65_cpsw_nuss_submit_ndev_batch(common);
+
+		/* Batch processing ends */
+		spin_unlock_bh(&common->tx_batch_lock);
+
 		nxmit++;
 	}
 	__netif_tx_unlock(netif_txq);
@@ -2497,6 +2626,8 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
 			 dev_name(dev), tx_chn->id);
 	}
 
+	atomic_set(&common->tx_batch_count, 0);
+
 	ret = am65_cpsw_nuss_ndev_add_tx_napi(common);
 	if (ret) {
 		dev_err(dev, "Failed to add tx NAPI %d\n", ret);
diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.h b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
index e64b4cfd6f2c..81405e3bed79 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.h
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
@@ -28,6 +28,8 @@ struct am65_cpts;
 #define AM65_CPSW_MAX_TX_DESC	500
 #define AM65_CPSW_MAX_RX_DESC	500
 
+#define AM65_CPSW_TX_BATCH_SIZE	128
+
 #define AM65_CPSW_PORT_VLAN_REG_OFFSET	0x014
 
 struct am65_cpsw_slave_data {
@@ -93,6 +95,7 @@ struct am65_cpsw_tx_chn {
 	struct k3_cppi_desc_pool *desc_pool;
 	struct k3_udma_glue_tx_channel *tx_chn;
 	spinlock_t lock; /* protect TX rings in multi-port mode */
+	dma_addr_t cmpl_desc_dma_array[AM65_CPSW_TX_BATCH_SIZE];
 	struct am65_cpsw_tx_ring tx_ring;
 	struct hrtimer tx_hrtimer;
 	unsigned long tx_pace_timeout;
@@ -165,6 +168,12 @@ struct am65_cpsw_devlink {
 	struct am65_cpsw_common *common;
 };
 
+struct am65_cpsw_tx_desc_batch {
+	struct cppi5_host_desc_t *desc_tx_array[AM65_CPSW_TX_BATCH_SIZE];
+	dma_addr_t desc_dma_array[AM65_CPSW_TX_BATCH_SIZE];
+	u8 tx_batch_idx;
+};
+
 struct am65_cpsw_common {
 	struct device		*dev;
 	struct device		*mdio_dev;
@@ -188,6 +197,9 @@ struct am65_cpsw_common {
 	struct am65_cpsw_tx_chn	tx_chns[AM65_CPSW_MAX_QUEUES];
 	struct completion	tdown_complete;
 	atomic_t		tdown_cnt;
+	atomic_t		tx_batch_count;
+	spinlock_t		tx_batch_lock; /* protect TX batch operations */
+	struct am65_cpsw_tx_desc_batch tx_desc_batch[AM65_CPSW_MAX_QUEUES];
 
 	int			rx_ch_num_flows;
 	struct am65_cpsw_rx_chn	rx_chns;
-- 
2.51.1



Thread overview: 7+ messages
2026-03-25 12:38 [RFC PATCH 0/6] Descriptor Recycling and Batch processing for CPSW Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 1/6] soc: ti: k3-ringacc: Add helper to get realtime count of free elements Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 2/6] soc: ti: k3-ringacc: Add helpers for batch push and pop operations Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 3/6] dmaengine: ti: k3-udma-glue: Add helpers for batch operations on TX/RX DMA Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 4/6] net: ethernet: ti: am65-cpsw-nuss: Do not set buf_type for SKB fragments Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 5/6] net: ethernet: ti: am65-cpsw-nuss: Recycle TX and RX CPPI Descriptors Siddharth Vadapalli
2026-03-25 12:38 ` [RFC PATCH 6/6] net: ethernet: ti: am65-cpsw-nuss: Enable batch processing for TX / TX CMPL Siddharth Vadapalli
