Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Kuniyuki Iwashima @ 2026-06-25 21:45 UTC (permalink / raw)
  To: avimalin; +Cc: edumazet, kuniyu, netdev, vimal.agrawal, kuba
In-Reply-To: <20260625102020.92814-1-vimal.agrawal@sophos.com>

From: Vimal Agrawal <avimalin@gmail.com>
Date: Thu, 25 Jun 2026 10:20:20 +0000
> Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
> on every allocation attempt with no rate limiting. In workloads with mostly
> active/reachable entries, the GC walk traverses a large portion of the
> neighbour table without reclaiming entries, holding tbl->lock for an
> extended period. This causes severe lock contention and allocation
> latencies exceeding 16ms under sustained neighbour creation.
> 
> Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
> performed within the last second, avoiding repeated full table scans and
> lock acquisitions on the hot allocation path.
> 
> Profiling of neigh_create() shows ~3 orders of magnitude latency
> improvement with this change.
> 
> Link:https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/

From the thread, these look misconfigured.

---8<---
net.ipv6.neigh.default.gc_thresh2 = 32768
net.ipv6.neigh.default.gc_thresh3 = 32768
---8<---

If gc_thresh3 is larger enough, gc_thresh2 will give you 5s
rate limiting.

If the number of active neigh entries constantly exceeds
gc_thresh3, it will be the correct gc_thresh2 for you.

Also, I guess you want a new kernel param for the first
neigh_hash_alloc(), which is currently fixed for 3, which
is too small for some hosts.

50000 entries require neigh_hash_grow() 13 times.

Can you test this on your real workload, starting from
neigh_hash_shift=16 and appropriate gc_thresh2/3 ?

---8<---
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1349c0eedb64..a75b3750eec9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1817,6 +1817,22 @@ EXPORT_SYMBOL(neigh_parms_release);
 static struct lock_class_key neigh_table_proxy_queue_class;
 
 static struct neigh_table __rcu *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
+static __initdata unsigned long neigh_hash_shift = 3;
+
+static int __init neigh_set_hash_shift(char *str)
+{
+	ssize_t ret;
+
+	if (!str)
+		return 0;
+
+	ret = kstrtoul(str, 0, &neigh_hash_shift);
+	if (ret)
+		return 0;
+
+	return 1;
+}
+__setup("neigh_hash_shift=", neigh_set_hash_shift);
 
 void neigh_table_init(int index, struct neigh_table *tbl)
 {
@@ -1843,7 +1859,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 		panic("cannot create neighbour proc dir entry");
 #endif
 
-	RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(3));
+	RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(neigh_hash_shift));
 
 	phsize = (PNEIGH_HASHMASK + 1) * sizeof(struct pneigh_entry *);
 	tbl->phash_buckets = kzalloc(phsize, GFP_KERNEL);
---8<---



> Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
> ---
>  net/core/neighbour.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 1349c0eedb64..078842db3c5f 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>  	int shrunk = 0;
>  	int loop = 0;
>  
> +	if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
> +		return 0;
> +
>  	NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
>  
>  	spin_lock_bh(&tbl->lock);
> -- 
> 2.17.1
> v

^ permalink raw reply related

* Re: [PATCH v14 0/9] tls: Add TLS 1.3 hardware offload support
From: Rishikesh Jethwani @ 2026-06-25 22:57 UTC (permalink / raw)
  To: Nils Juenemann
  Cc: netdev, borisp, davem, edumazet, john.fastabend, kuba, leon,
	mbloch, saeedm, sd, tariqt
In-Reply-To: <CAMPsyauZ+jzG9AysO0FWv6ZY0kvCUpjX_U7o=oOjCuOQ87BCgg@mail.gmail.com>

On Tue, Jun 23, 2026 at 10:53 AM Nils Juenemann
<nils.juenemann@gmail.com> wrote:
>
> Hi Rishikesh, all,
>
> we have been testing the v14 TLS 1.3 HW offload series on a ConnectX-6
> DX and hit a sendfile() final-record loss on the device TX path. We
> reduced it to a self-contained C reproducer and characterized it;
> reporting it here with the analysis and a question on where a fix belongs.
>
> Setup:
>
> NIC: ConnectX-6 DX (crypto enabled), FW 22.47.1026, SR-IOV VF,
> TX offload only
>
> Kernel: net-next + this v14 series
>
> TLS 1.3, AES-128-GCM, kTLS installed via setsockopt(TLS_TX) on the
> sending side with fixed test crypto material and no handshake, like
> tools/testing/selftests/net/tls
>
> a server sends a file with the raw sendfile(2) syscall; a client on
> another host reads the decrypted stream and counts the bytes
>
> Trigger: sendfile(2) with a count larger than the bytes remaining in
> the file (count > EOF). This is what a generic copy loop / Go's
> net.TCPConn.ReadFrom passes for a file of unknown length (~2 GiB). The
> kernel sends up to EOF, but the connection's final TLS record then
> appears not to be put on the wire unless a subsequent write flushes it.
> An abrupt close() appears to drop it, and the peer receives the whole
> body except the last record's bytes.
>
> Reproducer results (two hosts over the ConnectX - a loopback/same-host
> connection stays on TLS_SW and does not show it). Same file, 226965
> bytes (= 13*16384 + 13973):
>
> TLS_HW count>EOF close() -> 212992 short
> TLS_HW count>EOF close(), no zerocopy -> 212992 same
> TLS_HW count==exact close() -> 226965 full
> TLS_HW count>EOF close_notify, then close() -> 226965 full
> TLS_SW count>EOF close(), hw-tx-offload off -> 226965 full
>
> So it is specific to the device-offload TX path: the final record of a
> count > EOF sendfile() appears not to be finalized/flushed at EOF, only
> by a following write. A bounded count, a trailing write (close_notify),
> or software kTLS all avoid it. TLS_TX_ZEROCOPY_RO makes no difference.
> We are currently using the exact-count workaround in a preview environment.
>
> We may be misreading the code, so this is only a pointer: with
> count > EOF tls_push_data() fills the last record without reaching the
> size==0 case; on the device path tls_device_record_close() for that
> pending record appears to run only on the next push, and an abrupt
> teardown appears to discard it. The software path seems to flush
> pending TX records on close (tls_sw_release_resources_tx), which would
> explain why it is unaffected.
>
> Reproducer:
> https://gist.github.com/totallyunknown/a8f0ad3c54e40befde2f5a8d360fa6be
>
> It installs kTLS with fixed test crypto material via
> setsockopt(TLS_TX/TLS_RX), sends a file using the raw sendfile(2)
> syscall, and compares count > EOF against exact-count and close_notify.
> The v14 selftest (patch 9/9) sends via send() only and ends cleanly, so
> it misses this; a sendfile() + count > EOF case reproduces it
> deterministically for us.
>
> Question: should the device offload finalize and flush the connection's
> final record at EOF / on close, the way software kTLS does, or is a
> trailing write required by contract? And should a fix live in net/tls
> (device record close on the final partial record / the close path) or
> on the mlx5 side?
>

close() should be sufficient here.
I will fix this in net/tls/tls_device.c:tls_device_splice_eof().
tls_device_splice_eof() only checked tls_is_partially_sent_record(),
but it also needs to handle tls_is_pending_open_record()
(pending_open_record_frags). On the device TX path, that state occurs
when tls_push_data() exits with MSG_MORE set, which is what
splice/sendfile does while count > EOF still leaves requested bytes
outstanding.
So EOF can be reached with a final open record still pending. The fix
is to close and push that record from tls_device_splice_eof(), so
close() remains sufficient and no trailing write is required.

^ permalink raw reply

* Re: [PATCH net] net: airoha: dma map xmit frags with skb_frag_dma_map()
From: Harshitha Ramamurthy @ 2026-06-25 22:59 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260625-airoha-eth-skb_frag_dma_map-v1-1-31d9e460aae6@kernel.org>

On Thu, Jun 25, 2026 at 2:43 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Map xmit skb fragments using skb_frag_dma_map() instead of
> dma_map_single(skb_frag_address()). skb_frag_address() relies on
> page_address() to obtain a kernel virtual address, which is not
> guaranteed to work for all page types (e.g. highmem pages or
> user-pinned pages from MSG_ZEROCOPY).
> skb_frag_dma_map() maps the fragment directly via its struct page and
> offset through dma_map_page(), avoiding the need for a kernel virtual
> address entirely.
> Introduce an enum airoha_dma_map_type to track how each queue entry was
> mapped (single vs page), so that the matching unmap function is called
> on completion and in error paths.
>
> Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>

> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 61 ++++++++++++++++++++------------
>  drivers/net/ethernet/airoha/airoha_eth.h |  7 ++++
>  2 files changed, 45 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 932b3a3df2e5..1caf6766f2c0 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -944,6 +944,25 @@ static void airoha_qdma_wake_netdev_txqs(struct airoha_queue *q)
>         q->txq_stopped = false;
>  }
>
> +static void airoha_unmap_xmit_buf(struct airoha_eth *eth,
> +                                 struct airoha_queue_entry *e)
> +{
> +       switch (e->dma_type) {
> +       case AIROHA_DMA_MAP_PAGE:
> +               dma_unmap_page(eth->dev, e->dma_addr, e->dma_len,
> +                              DMA_TO_DEVICE);
> +               break;
> +       case AIROHA_DMA_MAP_SINGLE:
> +               dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> +                                DMA_TO_DEVICE);
> +               break;
> +       case AIROHA_DMA_UNMAPPED:
> +       default:
> +               break;
> +       }
> +       e->dma_type = AIROHA_DMA_UNMAPPED;
> +}
> +
>  static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>  {
>         struct airoha_tx_irq_queue *irq_q;
> @@ -1006,9 +1025,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>                 skb = e->skb;
>                 e->skb = NULL;
>
> -               dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> -                                DMA_TO_DEVICE);
> -               e->dma_addr = 0;
> +               airoha_unmap_xmit_buf(eth, e);
>                 list_add_tail(&e->list, &q->tx_list);
>
>                 WRITE_ONCE(desc->msg0, 0);
> @@ -1177,12 +1194,10 @@ static void airoha_qdma_tx_cleanup(struct airoha_qdma *qdma)
>                         struct airoha_qdma_desc *desc = &q->desc[j];
>                         struct sk_buff *skb = e->skb;
>
> -                       if (!e->dma_addr)
> +                       if (e->dma_type == AIROHA_DMA_UNMAPPED)
>                                 continue;
>
> -                       dma_unmap_single(qdma->eth->dev, e->dma_addr,
> -                                        e->dma_len, DMA_TO_DEVICE);
> -                       e->dma_addr = 0;
> +                       airoha_unmap_xmit_buf(qdma->eth, e);
>                         list_add_tail(&e->list, &q->tx_list);
>
>                         WRITE_ONCE(desc->ctrl, 0);
> @@ -2193,8 +2208,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>         struct netdev_queue *txq;
>         struct airoha_queue *q;
>         LIST_HEAD(tx_list);
> +       dma_addr_t addr;
>         int i = 0, qid;
> -       void *data;
>         u16 index;
>         u8 fport;
>
> @@ -2250,24 +2265,22 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>                 return NETDEV_TX_BUSY;
>         }
>
> -       len = skb_headlen(skb);
> -       data = skb->data;
> -
>         e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
>                              list);
> +       len = skb_headlen(skb);
> +       addr = dma_map_single(netdev->dev.parent, skb->data, len,
> +                             DMA_TO_DEVICE);
> +       if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> +               goto error_unlock;
> +
> +       e->dma_type = AIROHA_DMA_MAP_SINGLE;
>         index = e - q->entry;
>
>         while (true) {
>                 struct airoha_qdma_desc *desc = &q->desc[index];
>                 skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> -               dma_addr_t addr;
>                 u32 val;
>
> -               addr = dma_map_single(netdev->dev.parent, data, len,
> -                                     DMA_TO_DEVICE);
> -               if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> -                       goto error_unmap;
> -
>                 list_move_tail(&e->list, &tx_list);
>                 e->skb = i == nr_frags - 1 ? skb : NULL;
>                 e->dma_addr = addr;
> @@ -2291,8 +2304,13 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>                 if (++i == nr_frags)
>                         break;
>
> -               data = skb_frag_address(frag);
>                 len = skb_frag_size(frag);
> +               addr = skb_frag_dma_map(netdev->dev.parent, frag, 0, len,
> +                                       DMA_TO_DEVICE);
> +               if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> +                       goto error_unmap;
> +
> +               e->dma_type = AIROHA_DMA_MAP_PAGE;
>         }
>         q->queued += i;
>
> @@ -2313,11 +2331,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>         return NETDEV_TX_OK;
>
>  error_unmap:
> -       list_for_each_entry(e, &tx_list, list) {
> -               dma_unmap_single(netdev->dev.parent, e->dma_addr, e->dma_len,
> -                                DMA_TO_DEVICE);
> -               e->dma_addr = 0;
> -       }
> +       list_for_each_entry(e, &tx_list, list)
> +               airoha_unmap_xmit_buf(dev->eth, e);
>         list_splice(&tx_list, &q->tx_list);
>  error_unlock:
>         spin_unlock_bh(&q->lock);
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
> index d7ff8c5200e2..2765244d937c 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.h
> +++ b/drivers/net/ethernet/airoha/airoha_eth.h
> @@ -170,12 +170,19 @@ enum trtcm_param {
>  #define TRTCM_TOKEN_RATE_MASK                  GENMASK(23, 6)
>  #define TRTCM_TOKEN_RATE_FRACTION_MASK         GENMASK(5, 0)
>
> +enum airoha_dma_map_type {
> +       AIROHA_DMA_UNMAPPED,
> +       AIROHA_DMA_MAP_SINGLE,
> +       AIROHA_DMA_MAP_PAGE,
> +};
> +
>  struct airoha_queue_entry {
>         union {
>                 void *buf;
>                 struct {
>                         struct list_head list;
>                         struct sk_buff *skb;
> +                       enum airoha_dma_map_type dma_type;
>                 };
>         };
>         dma_addr_t dma_addr;
>
> ---
> base-commit: 232c4ca2343d1181cbfc061f9856d9591e397579
> change-id: 20260625-airoha-eth-skb_frag_dma_map-bcccd5d6e4b1
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox