* Re: [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Kuniyuki Iwashima @ 2026-06-25 21:45 UTC (permalink / raw)
To: avimalin; +Cc: edumazet, kuniyu, netdev, vimal.agrawal, kuba
In-Reply-To: <20260625102020.92814-1-vimal.agrawal@sophos.com>
From: Vimal Agrawal <avimalin@gmail.com>
Date: Thu, 25 Jun 2026 10:20:20 +0000
> Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
> on every allocation attempt with no rate limiting. In workloads with mostly
> active/reachable entries, the GC walk traverses a large portion of the
> neighbour table without reclaiming entries, holding tbl->lock for an
> extended period. This causes severe lock contention and allocation
> latencies exceeding 16ms under sustained neighbour creation.
>
> Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
> performed within the last second, avoiding repeated full table scans and
> lock acquisitions on the hot allocation path.
>
> Profiling of neigh_create() shows ~3 orders of magnitude latency
> improvement with this change.
>
> Link:https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/
From the thread, these look misconfigured.
---8<---
net.ipv6.neigh.default.gc_thresh2 = 32768
net.ipv6.neigh.default.gc_thresh3 = 32768
---8<---
If gc_thresh3 is larger enough, gc_thresh2 will give you 5s
rate limiting.
If the number of active neigh entries constantly exceeds
gc_thresh3, it will be the correct gc_thresh2 for you.
Also, I guess you want a new kernel param for the first
neigh_hash_alloc(), which is currently fixed for 3, which
is too small for some hosts.
50000 entries require neigh_hash_grow() 13 times.
Can you test this on your real workload, starting from
neigh_hash_shift=16 and appropriate gc_thresh2/3 ?
---8<---
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1349c0eedb64..a75b3750eec9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1817,6 +1817,22 @@ EXPORT_SYMBOL(neigh_parms_release);
static struct lock_class_key neigh_table_proxy_queue_class;
static struct neigh_table __rcu *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
+static __initdata unsigned long neigh_hash_shift = 3;
+
+static int __init neigh_set_hash_shift(char *str)
+{
+ ssize_t ret;
+
+ if (!str)
+ return 0;
+
+ ret = kstrtoul(str, 0, &neigh_hash_shift);
+ if (ret)
+ return 0;
+
+ return 1;
+}
+__setup("neigh_hash_shift=", neigh_set_hash_shift);
void neigh_table_init(int index, struct neigh_table *tbl)
{
@@ -1843,7 +1859,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
panic("cannot create neighbour proc dir entry");
#endif
- RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(3));
+ RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(neigh_hash_shift));
phsize = (PNEIGH_HASHMASK + 1) * sizeof(struct pneigh_entry *);
tbl->phash_buckets = kzalloc(phsize, GFP_KERNEL);
---8<---
> Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
> ---
> net/core/neighbour.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 1349c0eedb64..078842db3c5f 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> int shrunk = 0;
> int loop = 0;
>
> + if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
> + return 0;
> +
> NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
>
> spin_lock_bh(&tbl->lock);
> --
> 2.17.1
> v
^ permalink raw reply related
* Re: [PATCH v14 0/9] tls: Add TLS 1.3 hardware offload support
From: Rishikesh Jethwani @ 2026-06-25 22:57 UTC (permalink / raw)
To: Nils Juenemann
Cc: netdev, borisp, davem, edumazet, john.fastabend, kuba, leon,
mbloch, saeedm, sd, tariqt
In-Reply-To: <CAMPsyauZ+jzG9AysO0FWv6ZY0kvCUpjX_U7o=oOjCuOQ87BCgg@mail.gmail.com>
On Tue, Jun 23, 2026 at 10:53 AM Nils Juenemann
<nils.juenemann@gmail.com> wrote:
>
> Hi Rishikesh, all,
>
> we have been testing the v14 TLS 1.3 HW offload series on a ConnectX-6
> DX and hit a sendfile() final-record loss on the device TX path. We
> reduced it to a self-contained C reproducer and characterized it;
> reporting it here with the analysis and a question on where a fix belongs.
>
> Setup:
>
> NIC: ConnectX-6 DX (crypto enabled), FW 22.47.1026, SR-IOV VF,
> TX offload only
>
> Kernel: net-next + this v14 series
>
> TLS 1.3, AES-128-GCM, kTLS installed via setsockopt(TLS_TX) on the
> sending side with fixed test crypto material and no handshake, like
> tools/testing/selftests/net/tls
>
> a server sends a file with the raw sendfile(2) syscall; a client on
> another host reads the decrypted stream and counts the bytes
>
> Trigger: sendfile(2) with a count larger than the bytes remaining in
> the file (count > EOF). This is what a generic copy loop / Go's
> net.TCPConn.ReadFrom passes for a file of unknown length (~2 GiB). The
> kernel sends up to EOF, but the connection's final TLS record then
> appears not to be put on the wire unless a subsequent write flushes it.
> An abrupt close() appears to drop it, and the peer receives the whole
> body except the last record's bytes.
>
> Reproducer results (two hosts over the ConnectX - a loopback/same-host
> connection stays on TLS_SW and does not show it). Same file, 226965
> bytes (= 13*16384 + 13973):
>
> TLS_HW count>EOF close() -> 212992 short
> TLS_HW count>EOF close(), no zerocopy -> 212992 same
> TLS_HW count==exact close() -> 226965 full
> TLS_HW count>EOF close_notify, then close() -> 226965 full
> TLS_SW count>EOF close(), hw-tx-offload off -> 226965 full
>
> So it is specific to the device-offload TX path: the final record of a
> count > EOF sendfile() appears not to be finalized/flushed at EOF, only
> by a following write. A bounded count, a trailing write (close_notify),
> or software kTLS all avoid it. TLS_TX_ZEROCOPY_RO makes no difference.
> We are currently using the exact-count workaround in a preview environment.
>
> We may be misreading the code, so this is only a pointer: with
> count > EOF tls_push_data() fills the last record without reaching the
> size==0 case; on the device path tls_device_record_close() for that
> pending record appears to run only on the next push, and an abrupt
> teardown appears to discard it. The software path seems to flush
> pending TX records on close (tls_sw_release_resources_tx), which would
> explain why it is unaffected.
>
> Reproducer:
> https://gist.github.com/totallyunknown/a8f0ad3c54e40befde2f5a8d360fa6be
>
> It installs kTLS with fixed test crypto material via
> setsockopt(TLS_TX/TLS_RX), sends a file using the raw sendfile(2)
> syscall, and compares count > EOF against exact-count and close_notify.
> The v14 selftest (patch 9/9) sends via send() only and ends cleanly, so
> it misses this; a sendfile() + count > EOF case reproduces it
> deterministically for us.
>
> Question: should the device offload finalize and flush the connection's
> final record at EOF / on close, the way software kTLS does, or is a
> trailing write required by contract? And should a fix live in net/tls
> (device record close on the final partial record / the close path) or
> on the mlx5 side?
>
close() should be sufficient here.
I will fix this in net/tls/tls_device.c:tls_device_splice_eof().
tls_device_splice_eof() only checked tls_is_partially_sent_record(),
but it also needs to handle tls_is_pending_open_record()
(pending_open_record_frags). On the device TX path, that state occurs
when tls_push_data() exits with MSG_MORE set, which is what
splice/sendfile does while count > EOF still leaves requested bytes
outstanding.
So EOF can be reached with a final open record still pending. The fix
is to close and push that record from tls_device_splice_eof(), so
close() remains sufficient and no trailing write is required.
^ permalink raw reply
* Re: [PATCH net] net: airoha: dma map xmit frags with skb_frag_dma_map()
From: Harshitha Ramamurthy @ 2026-06-25 22:59 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260625-airoha-eth-skb_frag_dma_map-v1-1-31d9e460aae6@kernel.org>
On Thu, Jun 25, 2026 at 2:43 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Map xmit skb fragments using skb_frag_dma_map() instead of
> dma_map_single(skb_frag_address()). skb_frag_address() relies on
> page_address() to obtain a kernel virtual address, which is not
> guaranteed to work for all page types (e.g. highmem pages or
> user-pinned pages from MSG_ZEROCOPY).
> skb_frag_dma_map() maps the fragment directly via its struct page and
> offset through dma_map_page(), avoiding the need for a kernel virtual
> address entirely.
> Introduce an enum airoha_dma_map_type to track how each queue entry was
> mapped (single vs page), so that the matching unmap function is called
> on completion and in error paths.
>
> Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
> ---
> drivers/net/ethernet/airoha/airoha_eth.c | 61 ++++++++++++++++++++------------
> drivers/net/ethernet/airoha/airoha_eth.h | 7 ++++
> 2 files changed, 45 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 932b3a3df2e5..1caf6766f2c0 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -944,6 +944,25 @@ static void airoha_qdma_wake_netdev_txqs(struct airoha_queue *q)
> q->txq_stopped = false;
> }
>
> +static void airoha_unmap_xmit_buf(struct airoha_eth *eth,
> + struct airoha_queue_entry *e)
> +{
> + switch (e->dma_type) {
> + case AIROHA_DMA_MAP_PAGE:
> + dma_unmap_page(eth->dev, e->dma_addr, e->dma_len,
> + DMA_TO_DEVICE);
> + break;
> + case AIROHA_DMA_MAP_SINGLE:
> + dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> + DMA_TO_DEVICE);
> + break;
> + case AIROHA_DMA_UNMAPPED:
> + default:
> + break;
> + }
> + e->dma_type = AIROHA_DMA_UNMAPPED;
> +}
> +
> static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
> {
> struct airoha_tx_irq_queue *irq_q;
> @@ -1006,9 +1025,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
> skb = e->skb;
> e->skb = NULL;
>
> - dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> - DMA_TO_DEVICE);
> - e->dma_addr = 0;
> + airoha_unmap_xmit_buf(eth, e);
> list_add_tail(&e->list, &q->tx_list);
>
> WRITE_ONCE(desc->msg0, 0);
> @@ -1177,12 +1194,10 @@ static void airoha_qdma_tx_cleanup(struct airoha_qdma *qdma)
> struct airoha_qdma_desc *desc = &q->desc[j];
> struct sk_buff *skb = e->skb;
>
> - if (!e->dma_addr)
> + if (e->dma_type == AIROHA_DMA_UNMAPPED)
> continue;
>
> - dma_unmap_single(qdma->eth->dev, e->dma_addr,
> - e->dma_len, DMA_TO_DEVICE);
> - e->dma_addr = 0;
> + airoha_unmap_xmit_buf(qdma->eth, e);
> list_add_tail(&e->list, &q->tx_list);
>
> WRITE_ONCE(desc->ctrl, 0);
> @@ -2193,8 +2208,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
> struct netdev_queue *txq;
> struct airoha_queue *q;
> LIST_HEAD(tx_list);
> + dma_addr_t addr;
> int i = 0, qid;
> - void *data;
> u16 index;
> u8 fport;
>
> @@ -2250,24 +2265,22 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
> return NETDEV_TX_BUSY;
> }
>
> - len = skb_headlen(skb);
> - data = skb->data;
> -
> e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
> list);
> + len = skb_headlen(skb);
> + addr = dma_map_single(netdev->dev.parent, skb->data, len,
> + DMA_TO_DEVICE);
> + if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> + goto error_unlock;
> +
> + e->dma_type = AIROHA_DMA_MAP_SINGLE;
> index = e - q->entry;
>
> while (true) {
> struct airoha_qdma_desc *desc = &q->desc[index];
> skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> - dma_addr_t addr;
> u32 val;
>
> - addr = dma_map_single(netdev->dev.parent, data, len,
> - DMA_TO_DEVICE);
> - if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> - goto error_unmap;
> -
> list_move_tail(&e->list, &tx_list);
> e->skb = i == nr_frags - 1 ? skb : NULL;
> e->dma_addr = addr;
> @@ -2291,8 +2304,13 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
> if (++i == nr_frags)
> break;
>
> - data = skb_frag_address(frag);
> len = skb_frag_size(frag);
> + addr = skb_frag_dma_map(netdev->dev.parent, frag, 0, len,
> + DMA_TO_DEVICE);
> + if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> + goto error_unmap;
> +
> + e->dma_type = AIROHA_DMA_MAP_PAGE;
> }
> q->queued += i;
>
> @@ -2313,11 +2331,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
> return NETDEV_TX_OK;
>
> error_unmap:
> - list_for_each_entry(e, &tx_list, list) {
> - dma_unmap_single(netdev->dev.parent, e->dma_addr, e->dma_len,
> - DMA_TO_DEVICE);
> - e->dma_addr = 0;
> - }
> + list_for_each_entry(e, &tx_list, list)
> + airoha_unmap_xmit_buf(dev->eth, e);
> list_splice(&tx_list, &q->tx_list);
> error_unlock:
> spin_unlock_bh(&q->lock);
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
> index d7ff8c5200e2..2765244d937c 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.h
> +++ b/drivers/net/ethernet/airoha/airoha_eth.h
> @@ -170,12 +170,19 @@ enum trtcm_param {
> #define TRTCM_TOKEN_RATE_MASK GENMASK(23, 6)
> #define TRTCM_TOKEN_RATE_FRACTION_MASK GENMASK(5, 0)
>
> +enum airoha_dma_map_type {
> + AIROHA_DMA_UNMAPPED,
> + AIROHA_DMA_MAP_SINGLE,
> + AIROHA_DMA_MAP_PAGE,
> +};
> +
> struct airoha_queue_entry {
> union {
> void *buf;
> struct {
> struct list_head list;
> struct sk_buff *skb;
> + enum airoha_dma_map_type dma_type;
> };
> };
> dma_addr_t dma_addr;
>
> ---
> base-commit: 232c4ca2343d1181cbfc061f9856d9591e397579
> change-id: 20260625-airoha-eth-skb_frag_dma_map-bcccd5d6e4b1
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
>
^ permalink raw reply
page: | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox