netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] net: better drop accounting
@ 2025-08-25 19:59 Eric Dumazet
  2025-08-25 19:59 ` [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-08-25 19:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Willem de Bruijn,
	Kuniyuki Iwashima, Eric Dumazet

Incrementing sk->sk_drops for every dropped packet can
cause serious cache line contention.

Move sk_drops into a separate cache line so that
consumers can read packets faster.

Add a new sk->sk_drops1 field for basic NUMA awareness
at a low memory cost.
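
For reference, the net effect of the series boils down to the
sketch below (condensed from patches 1 and 3, field placement
annotations omitted): producers pick one of two counters based
on their NUMA node, consumers sum both.

static inline void sk_drops_inc(struct sock *sk)
{
#if defined(CONFIG_NUMA)
	/* Odd memory nodes bump sk_drops1, even nodes bump sk_drops. */
	if (numa_node_id() % 2)
		atomic_inc(&sk->sk_drops1);
	else
		atomic_inc(&sk->sk_drops);
#else
	atomic_inc(&sk->sk_drops);
#endif
}

static inline int sk_drops_read(const struct sock *sk)
{
#if defined(CONFIG_NUMA)
	return atomic_read(&sk->sk_drops) + atomic_read(&sk->sk_drops1);
#else
	return atomic_read(&sk->sk_drops);
#endif
}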

Tested:
see the 2nd patch changelog for test setup.
(One UDP receiving socket)

Before:

Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:

Udp6InDatagrams                 914537             0.0
Udp6InErrors                    6888487            0.0
Udp6RcvbufErrors                6888487            0.0

Eric Dumazet (3):
  net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
  net: move sk_drops out of sock_write_rx group
  net: add new sk->sk_drops1 field

 include/net/sock.h                            | 43 +++++++++++++++++--
 include/net/tcp.h                             |  2 +-
 net/core/datagram.c                           |  2 +-
 net/core/sock.c                               | 15 +++----
 net/ipv4/ping.c                               |  2 +-
 net/ipv4/raw.c                                |  6 +--
 net/ipv4/udp.c                                | 14 +++---
 net/ipv6/datagram.c                           |  2 +-
 net/ipv6/raw.c                                |  8 ++--
 net/ipv6/udp.c                                |  6 +--
 net/iucv/af_iucv.c                            |  4 +-
 net/netlink/af_netlink.c                      |  4 +-
 net/packet/af_packet.c                        |  2 +-
 net/phonet/pep.c                              |  6 +--
 net/phonet/socket.c                           |  2 +-
 net/sctp/diag.c                               |  2 +-
 net/tipc/socket.c                             |  6 +--
 .../selftests/bpf/progs/bpf_iter_netlink.c    |  3 +-
 .../selftests/bpf/progs/bpf_iter_udp4.c       |  2 +-
 .../selftests/bpf/progs/bpf_iter_udp6.c       |  2 +-
 20 files changed, 84 insertions(+), 49 deletions(-)

-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
  2025-08-25 19:59 [PATCH net-next 0/3] net: better drop accounting Eric Dumazet
@ 2025-08-25 19:59 ` Eric Dumazet
  2025-08-25 20:15   ` Kuniyuki Iwashima
  2025-08-25 19:59 ` [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group Eric Dumazet
  2025-08-25 19:59 ` [PATCH net-next 3/3] net: add new sk->sk_drops1 field Eric Dumazet
  2 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-08-25 19:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Willem de Bruijn,
	Kuniyuki Iwashima, Eric Dumazet

We want to split sk->sk_drops in the future to reduce
potential contention on this field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h       | 17 ++++++++++++++++-
 include/net/tcp.h        |  2 +-
 net/core/datagram.c      |  2 +-
 net/core/sock.c          | 14 +++++++-------
 net/ipv4/ping.c          |  2 +-
 net/ipv4/raw.c           |  6 +++---
 net/ipv4/udp.c           | 14 +++++++-------
 net/ipv6/datagram.c      |  2 +-
 net/ipv6/raw.c           |  8 ++++----
 net/ipv6/udp.c           |  6 +++---
 net/iucv/af_iucv.c       |  4 ++--
 net/netlink/af_netlink.c |  4 ++--
 net/packet/af_packet.c   |  2 +-
 net/phonet/pep.c         |  6 +++---
 net/phonet/socket.c      |  2 +-
 net/sctp/diag.c          |  2 +-
 net/tipc/socket.c        |  6 +++---
 17 files changed, 57 insertions(+), 42 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48ad31abf05f5108886bac9831842a..34d7029eb622773e40e7c4ebd422d33b1c0a7836 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2682,11 +2682,26 @@ struct sock_skb_cb {
 #define sock_skb_cb_check_size(size) \
 	BUILD_BUG_ON((size) > SOCK_SKB_CB_OFFSET)
 
+static inline void sk_drops_inc(struct sock *sk)
+{
+	atomic_inc(&sk->sk_drops);
+}
+
+static inline int sk_drops_read(const struct sock *sk)
+{
+	return atomic_read(&sk->sk_drops);
+}
+
+static inline void sk_drops_reset(struct sock *sk)
+{
+	atomic_set(&sk->sk_drops, 0);
+}
+
 static inline void
 sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 {
 	SOCK_SKB_CB(skb)->dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ?
-						atomic_read(&sk->sk_drops) : 0;
+						sk_drops_read(sk) : 0;
 }
 
 static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2936b8175950faa777f81f3c6b7230bcc375d772..16dc9cebb9d25832eac7a6ad590a9e9e47e85142 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2612,7 +2612,7 @@ static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
  */
 static inline void tcp_listendrop(const struct sock *sk)
 {
-	atomic_inc(&((struct sock *)sk)->sk_drops);
+	sk_drops_inc((struct sock *)sk);
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
 }
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 94cc4705e91da6ba6629ae469ae6507e9c6fdae9..ba8253aa6e07c2b0db361c9dfdaf66243dc1024c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -345,7 +345,7 @@ int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
 		spin_unlock_bh(&sk_queue->lock);
 	}
 
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 	return err;
 }
 EXPORT_SYMBOL(__sk_queue_drop_skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 8002ac6293dcac694962be139eadfa6346b72d5b..75368823969a7992a55a6f40d87ffb8886de2f39 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -491,13 +491,13 @@ int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 
 	if (atomic_read(&sk->sk_rmem_alloc) >= READ_ONCE(sk->sk_rcvbuf)) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		trace_sock_rcvqueue_full(sk, skb);
 		return -ENOMEM;
 	}
 
 	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		return -ENOBUFS;
 	}
 
@@ -562,7 +562,7 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 	skb->dev = NULL;
 
 	if (sk_rcvqueues_full(sk, READ_ONCE(sk->sk_rcvbuf))) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		reason = SKB_DROP_REASON_SOCKET_RCVBUFF;
 		goto discard_and_relse;
 	}
@@ -585,7 +585,7 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			reason = SKB_DROP_REASON_PFMEMALLOC;
 		if (err == -ENOBUFS)
 			reason = SKB_DROP_REASON_SOCKET_BACKLOG;
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		goto discard_and_relse;
 	}
 
@@ -2505,7 +2505,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 	newsk->sk_wmem_queued	= 0;
 	newsk->sk_forward_alloc = 0;
 	newsk->sk_reserved_mem  = 0;
-	atomic_set(&newsk->sk_drops, 0);
+	sk_drops_reset(newsk);
 	newsk->sk_send_head	= NULL;
 	newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 	atomic_set(&newsk->sk_zckey, 0);
@@ -3713,7 +3713,7 @@ void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid)
 	 */
 	smp_wmb();
 	refcount_set(&sk->sk_refcnt, 1);
-	atomic_set(&sk->sk_drops, 0);
+	sk_drops_reset(sk);
 }
 EXPORT_SYMBOL(sock_init_data_uid);
 
@@ -3973,7 +3973,7 @@ void sk_get_meminfo(const struct sock *sk, u32 *mem)
 	mem[SK_MEMINFO_WMEM_QUEUED] = READ_ONCE(sk->sk_wmem_queued);
 	mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
 	mem[SK_MEMINFO_BACKLOG] = READ_ONCE(sk->sk_backlog.len);
-	mem[SK_MEMINFO_DROPS] = atomic_read(&sk->sk_drops);
+	mem[SK_MEMINFO_DROPS] = sk_drops_read(sk);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 031df4c19fcc5ca18137695c78358c3ad96a2c4a..f119da68fc301be00719213ad33615b6754e6272 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1119,7 +1119,7 @@ static void ping_v4_format_sock(struct sock *sp, struct seq_file *f,
 		from_kuid_munged(seq_user_ns(f), sk_uid(sp)),
 		0, sock_i_ino(sp),
 		refcount_read(&sp->sk_refcnt), sp,
-		atomic_read(&sp->sk_drops));
+		sk_drops_read(sp));
 }
 
 static int ping_v4_seq_show(struct seq_file *seq, void *v)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1d2c89d63cc71f39d742c8156879847fc4e53c71..0f9f02f6146eef6df3f5bbb4f564e16fbabd1ba2 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -178,7 +178,7 @@ static int raw_v4_input(struct net *net, struct sk_buff *skb,
 
 		if (atomic_read(&sk->sk_rmem_alloc) >=
 		    READ_ONCE(sk->sk_rcvbuf)) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			continue;
 		}
 
@@ -311,7 +311,7 @@ static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
 int raw_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb)) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_XFRM_POLICY);
 		return NET_RX_DROP;
 	}
@@ -1045,7 +1045,7 @@ static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 		0, 0L, 0,
 		from_kuid_munged(seq_user_ns(seq), sk_uid(sp)),
 		0, sock_i_ino(sp),
-		refcount_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
+		refcount_read(&sp->sk_refcnt), sp, sk_drops_read(sp));
 }
 
 static int raw_seq_show(struct seq_file *seq, void *v)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cc3ce0f762ec211a963464c2dd7ac329a6be1ffd..732bdad43626948168bdb9e40c151787f047bbfd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1787,7 +1787,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 
 drop:
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 	busylock_release(busy);
 	return err;
 }
@@ -1852,7 +1852,7 @@ static struct sk_buff *__first_packet_length(struct sock *sk,
 					IS_UDPLITE(sk));
 			__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
 					IS_UDPLITE(sk));
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			__skb_unlink(skb, rcvq);
 			*total += skb->truesize;
 			kfree_skb_reason(skb, SKB_DROP_REASON_UDP_CSUM);
@@ -2008,7 +2008,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 
 		__UDP_INC_STATS(net, UDP_MIB_CSUMERRORS, is_udplite);
 		__UDP_INC_STATS(net, UDP_MIB_INERRORS, is_udplite);
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		kfree_skb_reason(skb, SKB_DROP_REASON_UDP_CSUM);
 		goto try_again;
 	}
@@ -2078,7 +2078,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 
 	if (unlikely(err)) {
 		if (!peeking) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_INERRORS, is_udplite);
 		}
@@ -2449,7 +2449,7 @@ static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 drop:
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 	sk_skb_reason_drop(sk, skb, drop_reason);
 	return -1;
 }
@@ -2534,7 +2534,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		nskb = skb_clone(skb, GFP_ATOMIC);
 
 		if (unlikely(!nskb)) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			__UDP_INC_STATS(net, UDP_MIB_RCVBUFERRORS,
 					IS_UDPLITE(sk));
 			__UDP_INC_STATS(net, UDP_MIB_INERRORS,
@@ -3386,7 +3386,7 @@ static void udp4_format_sock(struct sock *sp, struct seq_file *f,
 		from_kuid_munged(seq_user_ns(f), sk_uid(sp)),
 		0, sock_i_ino(sp),
 		refcount_read(&sp->sk_refcnt), sp,
-		atomic_read(&sp->sk_drops));
+		sk_drops_read(sp));
 }
 
 int udp4_seq_show(struct seq_file *seq, void *v)
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 972bf0426d599af43bfd2d0e4236592f34ec7866..33ebe93d80e3cb6d897a3c7f714f94c395856023 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -1068,5 +1068,5 @@ void __ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
 		   0,
 		   sock_i_ino(sp),
 		   refcount_read(&sp->sk_refcnt), sp,
-		   atomic_read(&sp->sk_drops));
+		   sk_drops_read(sp));
 }
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 4c3f8245c40f155f3efde0d7b8af50e0bef431c7..4026192143ec9f1b071f43874185bc367c950c67 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -163,7 +163,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 
 		if (atomic_read(&sk->sk_rmem_alloc) >=
 		    READ_ONCE(sk->sk_rcvbuf)) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			continue;
 		}
 
@@ -361,7 +361,7 @@ static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	if ((raw6_sk(sk)->checksum || rcu_access_pointer(sk->sk_filter)) &&
 	    skb_checksum_complete(skb)) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM);
 		return NET_RX_DROP;
 	}
@@ -389,7 +389,7 @@ int rawv6_rcv(struct sock *sk, struct sk_buff *skb)
 	struct raw6_sock *rp = raw6_sk(sk);
 
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) {
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_XFRM_POLICY);
 		return NET_RX_DROP;
 	}
@@ -414,7 +414,7 @@ int rawv6_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (inet_test_bit(HDRINCL, sk)) {
 		if (skb_checksum_complete(skb)) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM);
 			return NET_RX_DROP;
 		}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 6a68f77da44b55baed42b44c936902f865754140..a35ee6d693a8080b9009f61d23fafd2465b8c625 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -524,7 +524,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	if (unlikely(err)) {
 		if (!peeking) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			SNMP_INC_STATS(mib, UDP_MIB_INERRORS);
 		}
 		kfree_skb(skb);
@@ -908,7 +908,7 @@ static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 	__UDP6_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 drop:
 	__UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 	sk_skb_reason_drop(sk, skb, drop_reason);
 	return -1;
 }
@@ -1013,7 +1013,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		}
 		nskb = skb_clone(skb, GFP_ATOMIC);
 		if (unlikely(!nskb)) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			__UDP6_INC_STATS(net, UDP_MIB_RCVBUFERRORS,
 					 IS_UDPLITE(sk));
 			__UDP6_INC_STATS(net, UDP_MIB_INERRORS,
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index cc2b3c44bc05a629d455e99369491b28b4b93884..6c717a7ef292831b49c1dca22ecc2bb7a7179b0f 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -1187,7 +1187,7 @@ static void iucv_process_message(struct sock *sk, struct sk_buff *skb,
 
 	IUCV_SKB_CB(skb)->offset = 0;
 	if (sk_filter(sk, skb)) {
-		atomic_inc(&sk->sk_drops);	/* skb rejected by filter */
+		sk_drops_inc(sk);	/* skb rejected by filter */
 		kfree_skb(skb);
 		return;
 	}
@@ -2011,7 +2011,7 @@ static int afiucv_hs_callback_rx(struct sock *sk, struct sk_buff *skb)
 	skb_reset_network_header(skb);
 	IUCV_SKB_CB(skb)->offset = 0;
 	if (sk_filter(sk, skb)) {
-		atomic_inc(&sk->sk_drops);	/* skb rejected by filter */
+		sk_drops_inc(sk);	/* skb rejected by filter */
 		kfree_skb(skb);
 		return NET_RX_SUCCESS;
 	}
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e2f7080dd5d7cd52722248b719c294cdccf70328..2b46c0cd752a313ad95cf17c46237883d6b85293 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -356,7 +356,7 @@ static void netlink_overrun(struct sock *sk)
 			sk_error_report(sk);
 		}
 	}
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 }
 
 static void netlink_rcv_wake(struct sock *sk)
@@ -2711,7 +2711,7 @@ static int netlink_native_seq_show(struct seq_file *seq, void *v)
 			   sk_wmem_alloc_get(s),
 			   READ_ONCE(nlk->cb_running),
 			   refcount_read(&s->sk_refcnt),
-			   atomic_read(&s->sk_drops),
+			   sk_drops_read(s),
 			   sock_i_ino(s)
 			);
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a7017d7f09272058106181e95367080dc821da69..9d42c4bd6e390c7212fc0a8dde5cc14ba7a00d53 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2265,7 +2265,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 
 drop_n_acct:
 	atomic_inc(&po->tp_drops);
-	atomic_inc(&sk->sk_drops);
+	sk_drops_inc(sk);
 	drop_reason = SKB_DROP_REASON_PACKET_SOCK_ERROR;
 
 drop_n_restore:
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 62527e1ebb883d2854bcdc5256cd48e85e5c5dbc..4db564d9d522b639e9527d48eaa42a1cd9fbfba7 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -376,7 +376,7 @@ static int pipe_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	case PNS_PEP_CTRL_REQ:
 		if (skb_queue_len(&pn->ctrlreq_queue) >= PNPIPE_CTRLREQ_MAX) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			break;
 		}
 		__skb_pull(skb, 4);
@@ -397,7 +397,7 @@ static int pipe_do_rcv(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (pn->rx_credits == 0) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			err = -ENOBUFS;
 			break;
 		}
@@ -567,7 +567,7 @@ static int pipe_handler_do_rcv(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (pn->rx_credits == 0) {
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			err = NET_RX_DROP;
 			break;
 		}
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 2b61a40b568e91e340130a9b589e2b7a9346643f..db2d552e9b32e384c332774b99199108abd464f2 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -587,7 +587,7 @@ static int pn_sock_seq_show(struct seq_file *seq, void *v)
 			from_kuid_munged(seq_user_ns(seq), sk_uid(sk)),
 			sock_i_ino(sk),
 			refcount_read(&sk->sk_refcnt), sk,
-			atomic_read(&sk->sk_drops));
+			sk_drops_read(sk));
 	}
 	seq_pad(seq, '\n');
 	return 0;
diff --git a/net/sctp/diag.c b/net/sctp/diag.c
index 23359e522273f0377080007c75eb2c276945f781..996c2018f0e611bd0da2df2f73e90e2f94c463d9 100644
--- a/net/sctp/diag.c
+++ b/net/sctp/diag.c
@@ -173,7 +173,7 @@ static int inet_sctp_diag_fill(struct sock *sk, struct sctp_association *asoc,
 		mem[SK_MEMINFO_WMEM_QUEUED] = sk->sk_wmem_queued;
 		mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
 		mem[SK_MEMINFO_BACKLOG] = READ_ONCE(sk->sk_backlog.len);
-		mem[SK_MEMINFO_DROPS] = atomic_read(&sk->sk_drops);
+		mem[SK_MEMINFO_DROPS] = sk_drops_read(sk);
 
 		if (nla_put(skb, INET_DIAG_SKMEMINFO, sizeof(mem), &mem) < 0)
 			goto errout;
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index e028bf6584992c5ab7307d81082fbe4582e78068..1574a83384f88533cfab330c559512d5878bf0aa 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2366,7 +2366,7 @@ static void tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb,
 		else if (sk_rmem_alloc_get(sk) + skb->truesize >= limit) {
 			trace_tipc_sk_dump(sk, skb, TIPC_DUMP_ALL,
 					   "err_overload2!");
-			atomic_inc(&sk->sk_drops);
+			sk_drops_inc(sk);
 			err = TIPC_ERR_OVERLOAD;
 		}
 
@@ -2458,7 +2458,7 @@ static void tipc_sk_enqueue(struct sk_buff_head *inputq, struct sock *sk,
 		trace_tipc_sk_dump(sk, skb, TIPC_DUMP_ALL, "err_overload!");
 		/* Overload => reject message back to sender */
 		onode = tipc_own_addr(sock_net(sk));
-		atomic_inc(&sk->sk_drops);
+		sk_drops_inc(sk);
 		if (tipc_msg_reverse(onode, &skb, TIPC_ERR_OVERLOAD)) {
 			trace_tipc_sk_rej_msg(sk, skb, TIPC_DUMP_ALL,
 					      "@sk_enqueue!");
@@ -3657,7 +3657,7 @@ int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
 	    nla_put_u32(skb, TIPC_NLA_SOCK_STAT_SENDQ,
 			skb_queue_len(&sk->sk_write_queue)) ||
 	    nla_put_u32(skb, TIPC_NLA_SOCK_STAT_DROP,
-			atomic_read(&sk->sk_drops)))
+			sk_drops_read(sk)))
 		goto stat_msg_cancel;
 
 	if (tsk->cong_link_cnt &&
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group
  2025-08-25 19:59 [PATCH net-next 0/3] net: better drop accounting Eric Dumazet
  2025-08-25 19:59 ` [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers Eric Dumazet
@ 2025-08-25 19:59 ` Eric Dumazet
  2025-08-25 20:17   ` Kuniyuki Iwashima
  2025-08-25 19:59 ` [PATCH net-next 3/3] net: add new sk->sk_drops1 field Eric Dumazet
  2 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-08-25 19:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Willem de Bruijn,
	Kuniyuki Iwashima, Eric Dumazet

Move sk_drops into a dedicated cache line.

When a packet flood hits one or more sockets, many cpus
have to update sk->sk_drops.

This slows down consumers, because sk_drops currently
sits in the sock_write_rx group.

Moving sk->sk_drops into a dedicated cache line
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.
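
To illustrate the false sharing being avoided, here is a generic
sketch with made-up field names (not the actual struct sock
layout): every increment of a counter invalidates the cache line
in the other cpus' caches, so read-mostly fields sharing that
line are penalized; ____cacheline_aligned_in_smp gives the
counter a line of its own.

/* Illustrative layout only (made-up names, not struct sock). */
struct rx_side {
	/* read-mostly fields used by the consumer */
	struct sk_buff_head	queue;
	__s32			peek_off;

	/* bumped by many cpus under flood; keep it on its own line */
	atomic_t		drops ____cacheline_aligned_in_smp;
};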

Tested with the following stress test, sending about 11 Mpps:

super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket will receive
packets on DUT.

Then measure the receiver (DUT) behavior. We can see that both
the consumer and the BH handlers process more packets per second.

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 855592             0.0
Udp6InErrors                    5621467            0.0
Udp6RcvbufErrors                5621467            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h | 6 +++---
 net/core/sock.c    | 1 -
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 34d7029eb622773e40e7c4ebd422d33b1c0a7836..f40e3c4883be32c8282694ab215bcf79eb87cbd7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,7 +390,6 @@ struct sock {
 
 	__cacheline_group_begin(sock_write_rx);
 
-	atomic_t		sk_drops;
 	__s32			sk_peek_off;
 	struct sk_buff_head	sk_error_queue;
 	struct sk_buff_head	sk_receive_queue;
@@ -564,13 +563,14 @@ struct sock {
 #ifdef CONFIG_BPF_SYSCALL
 	struct bpf_local_storage __rcu	*sk_bpf_storage;
 #endif
-	struct rcu_head		sk_rcu;
-	netns_tracker		ns_tracker;
 	struct xarray		sk_user_frags;
 
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module		*sk_owner;
 #endif
+	atomic_t		sk_drops ____cacheline_aligned_in_smp;
+	struct rcu_head		sk_rcu;
+	netns_tracker		ns_tracker;
 };
 
 struct sock_bh_locked {
diff --git a/net/core/sock.c b/net/core/sock.c
index 75368823969a7992a55a6f40d87ffb8886de2f39..cd7c7ed7ff51070d20658684ff796c58c8c09995 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -4436,7 +4436,6 @@ EXPORT_SYMBOL(sk_ioctl);
 
 static int __init sock_struct_check(void)
 {
-	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rx, sk_drops);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rx, sk_peek_off);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rx, sk_error_queue);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct sock, sock_write_rx, sk_receive_queue);
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-25 19:59 [PATCH net-next 0/3] net: better drop accounting Eric Dumazet
  2025-08-25 19:59 ` [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers Eric Dumazet
  2025-08-25 19:59 ` [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group Eric Dumazet
@ 2025-08-25 19:59 ` Eric Dumazet
  2025-08-25 20:19   ` Kuniyuki Iwashima
  2025-08-26  6:33   ` Paolo Abeni
  2 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-08-25 19:59 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Willem de Bruijn,
	Kuniyuki Iwashima, Eric Dumazet

sk->sk_drops can be heavily contended when
updated from many cpus.

Rather than using a per-cpu data structure, which would
be too expensive, add a second sk->sk_drops1 field and
change sk_drops_inc() to be NUMA aware.

This patch adds 64 bytes per socket.

For hosts having more than two memory nodes, sk_drops_inc()
might not be optimal and can be refined later.

Tested with the following stress test, sending about 11 Mpps
to a dual socket AMD EPYC 7B13 64-Core.

super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket will receive
packets on DUT.

Then measure the receiver (DUT) behavior. We can see that both
the consumer and the BH handlers process more packets per second.

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 855592             0.0
Udp6InErrors                    5621467            0.0
Udp6RcvbufErrors                5621467            0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 914537             0.0
Udp6InErrors                    6888487            0.0
Udp6RcvbufErrors                6888487            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h                            | 20 +++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_netlink.c    |  3 ++-
 .../selftests/bpf/progs/bpf_iter_udp4.c       |  2 +-
 .../selftests/bpf/progs/bpf_iter_udp6.c       |  2 +-
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index f40e3c4883be32c8282694ab215bcf79eb87cbd7..318169eb1a3d40eefac50147012551614abc6f7a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -282,6 +282,7 @@ struct sk_filter;
   *	@sk_err_soft: errors that don't cause failure but are the cause of a
   *		      persistent failure not just 'timed out'
   *	@sk_drops: raw/udp drops counter
+  *	@sk_drops1: second drops counter
   *	@sk_ack_backlog: current listen backlog
   *	@sk_max_ack_backlog: listen backlog set in listen()
   *	@sk_uid: user id of owner
@@ -571,6 +572,11 @@ struct sock {
 	atomic_t		sk_drops ____cacheline_aligned_in_smp;
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
+#if defined(CONFIG_NUMA)
+	atomic_t		sk_drops1 ____cacheline_aligned_in_smp;
+#else
+	atomic_t		sk_drops1;
+#endif
 };
 
 struct sock_bh_locked {
@@ -2684,17 +2690,31 @@ struct sock_skb_cb {
 
 static inline void sk_drops_inc(struct sock *sk)
 {
+#if defined(CONFIG_NUMA)
+	int n = numa_node_id() % 2;
+
+	if (n)
+		atomic_inc(&sk->sk_drops1);
+	else
+		atomic_inc(&sk->sk_drops);
+#else
 	atomic_inc(&sk->sk_drops);
+#endif
 }
 
 static inline int sk_drops_read(const struct sock *sk)
 {
+#if defined(CONFIG_NUMA)
+	return atomic_read(&sk->sk_drops) + atomic_read(&sk->sk_drops1);
+#else
 	return atomic_read(&sk->sk_drops);
+#endif
 }
 
 static inline void sk_drops_reset(struct sock *sk)
 {
 	atomic_set(&sk->sk_drops, 0);
+	atomic_set(&sk->sk_drops1, 0);
 }
 
 static inline void
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
index 00b2ceae81fb0914f2de3634eb342004e8bc3c5b..31ad9fcc6022d5d31b9c6a35daacaad7c887a51f 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
@@ -57,7 +57,8 @@ int dump_netlink(struct bpf_iter__netlink *ctx)
 		inode = SOCK_INODE(sk);
 		bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
 	}
-	BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
+	BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n",
+		       s->sk_drops.counter + s->sk_drops1.counter, ino);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
index ffbd4b116d17ffbb9f14440c788e50490fb0f4e0..192ab5693a7131c1ec5879e539651c21f6f3c9ae 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
@@ -64,7 +64,7 @@ int dump_udp4(struct bpf_iter__udp *ctx)
 		       0, 0L, 0, ctx->uid, 0,
 		       sock_i_ino(&inet->sk),
 		       inet->sk.sk_refcnt.refs.counter, udp_sk,
-		       inet->sk.sk_drops.counter);
+		       inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
index 47ff7754f4fda4c9db92fbf1dc2e6a68f044174e..5170bdf458fa1b9a4eea9240fbaa5934182a7776 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
@@ -72,7 +72,7 @@ int dump_udp6(struct bpf_iter__udp *ctx)
 		       0, 0L, 0, ctx->uid, 0,
 		       sock_i_ino(&inet->sk),
 		       inet->sk.sk_refcnt.refs.counter, udp_sk,
-		       inet->sk.sk_drops.counter);
+		       inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
 
 	return 0;
 }
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers
  2025-08-25 19:59 ` [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers Eric Dumazet
@ 2025-08-25 20:15   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-25 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Willem de Bruijn

On Mon, Aug 25, 2025 at 12:59 PM Eric Dumazet <edumazet@google.com> wrote:
>
> We want to split sk->sk_drops in the future to reduce
> potential contention on this field.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group
  2025-08-25 19:59 ` [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group Eric Dumazet
@ 2025-08-25 20:17   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-25 20:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Willem de Bruijn

On Mon, Aug 25, 2025 at 12:59 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Move sk_drops into a dedicated cache line.
>
> When a packet flood hits one or more sockets, many cpus
> have to update sk->sk_drops.
>
> This slows down consumers, because sk_drops currently
> sits in the sock_write_rx group.
>
> Moving sk->sk_drops into a dedicated cache line
> makes sure that consumers no longer suffer from
> false sharing if/when producers only change sk->sk_drops.
>
> Tested with the following stress test, sending about 11 Mpps:
>
> super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
> Note: due to socket lookup, only one UDP socket will receive
> packets on DUT.
>
> Then measure the receiver (DUT) behavior. We can see that both
> the consumer and the BH handlers process more packets per second.
>
> Before:
>
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams                 615091             0.0
> Udp6InErrors                    3904277            0.0
> Udp6RcvbufErrors                3904277            0.0
>
> After:
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams                 855592             0.0
> Udp6InErrors                    5621467            0.0
> Udp6RcvbufErrors                5621467            0.0
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-25 19:59 ` [PATCH net-next 3/3] net: add new sk->sk_drops1 field Eric Dumazet
@ 2025-08-25 20:19   ` Kuniyuki Iwashima
  2025-08-26  5:46     ` Eric Dumazet
  2025-08-26  6:33   ` Paolo Abeni
  1 sibling, 1 reply; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-25 20:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Willem de Bruijn

On Mon, Aug 25, 2025 at 12:59 PM Eric Dumazet <edumazet@google.com> wrote:
>
> sk->sk_drops can be heavily contended when
> updated from many cpus.
>
> Rather than using a per-cpu data structure, which would
> be too expensive, add a second sk->sk_drops1 field and
> change sk_drops_inc() to be NUMA aware.
>
> This patch adds 64 bytes per socket.
>
> For hosts having more than two memory nodes, sk_drops_inc()
> might not be optimal and can be refined later.
>
> Tested with the following stress test, sending about 11 Mpps
> to a dual socket AMD EPYC 7B13 64-Core.
>
> super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
> Note: due to socket lookup, only one UDP socket will receive
> packets on DUT.
>
> Then measure receiver (DUT) behavior. We can see
> consumer and BH handlers can process more packets per second.
>
> Before:
>
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams                 855592             0.0
> Udp6InErrors                    5621467            0.0
> Udp6RcvbufErrors                5621467            0.0
>
> After:
> nstat -n ; sleep 1 ; nstat | grep Udp
> Udp6InDatagrams                 914537             0.0
> Udp6InErrors                    6888487            0.0
> Udp6RcvbufErrors                6888487            0.0
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/sock.h                            | 20 +++++++++++++++++++
>  .../selftests/bpf/progs/bpf_iter_netlink.c    |  3 ++-
>  .../selftests/bpf/progs/bpf_iter_udp4.c       |  2 +-
>  .../selftests/bpf/progs/bpf_iter_udp6.c       |  2 +-
>  4 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f40e3c4883be32c8282694ab215bcf79eb87cbd7..318169eb1a3d40eefac50147012551614abc6f7a 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -282,6 +282,7 @@ struct sk_filter;
>    *    @sk_err_soft: errors that don't cause failure but are the cause of a
>    *                  persistent failure not just 'timed out'
>    *    @sk_drops: raw/udp drops counter
> +  *    @sk_drops1: second drops counter
>    *    @sk_ack_backlog: current listen backlog
>    *    @sk_max_ack_backlog: listen backlog set in listen()
>    *    @sk_uid: user id of owner
> @@ -571,6 +572,11 @@ struct sock {
>         atomic_t                sk_drops ____cacheline_aligned_in_smp;
>         struct rcu_head         sk_rcu;
>         netns_tracker           ns_tracker;
> +#if defined(CONFIG_NUMA)
> +       atomic_t                sk_drops1 ____cacheline_aligned_in_smp;
> +#else
> +       atomic_t                sk_drops1;

Is this #else to be friendly to bpf selftests ?


> +#endif
>  };
>
>  struct sock_bh_locked {
> @@ -2684,17 +2690,31 @@ struct sock_skb_cb {
>
>  static inline void sk_drops_inc(struct sock *sk)
>  {
> +#if defined(CONFIG_NUMA)
> +       int n = numa_node_id() % 2;
> +
> +       if (n)
> +               atomic_inc(&sk->sk_drops1);
> +       else
> +               atomic_inc(&sk->sk_drops);
> +#else
>         atomic_inc(&sk->sk_drops);
> +#endif
>  }
>
>  static inline int sk_drops_read(const struct sock *sk)
>  {
> +#if defined(CONFIG_NUMA)
> +       return atomic_read(&sk->sk_drops) + atomic_read(&sk->sk_drops1);
> +#else
>         return atomic_read(&sk->sk_drops);
> +#endif
>  }
>
>  static inline void sk_drops_reset(struct sock *sk)
>  {
>         atomic_set(&sk->sk_drops, 0);
> +       atomic_set(&sk->sk_drops1, 0);
>  }
>
>  static inline void
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> index 00b2ceae81fb0914f2de3634eb342004e8bc3c5b..31ad9fcc6022d5d31b9c6a35daacaad7c887a51f 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> @@ -57,7 +57,8 @@ int dump_netlink(struct bpf_iter__netlink *ctx)
>                 inode = SOCK_INODE(sk);
>                 bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
>         }
> -       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
> +       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n",
> +                      s->sk_drops.counter + s->sk_drops1.counter, ino);
>
>         return 0;
>  }
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> index ffbd4b116d17ffbb9f14440c788e50490fb0f4e0..192ab5693a7131c1ec5879e539651c21f6f3c9ae 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> @@ -64,7 +64,7 @@ int dump_udp4(struct bpf_iter__udp *ctx)
>                        0, 0L, 0, ctx->uid, 0,
>                        sock_i_ino(&inet->sk),
>                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> -                      inet->sk.sk_drops.counter);
> +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
>
>         return 0;
>  }
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> index 47ff7754f4fda4c9db92fbf1dc2e6a68f044174e..5170bdf458fa1b9a4eea9240fbaa5934182a7776 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> @@ -72,7 +72,7 @@ int dump_udp6(struct bpf_iter__udp *ctx)
>                        0, 0L, 0, ctx->uid, 0,
>                        sock_i_ino(&inet->sk),
>                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> -                      inet->sk.sk_drops.counter);
> +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
>
>         return 0;
>  }
> --
> 2.51.0.261.g7ce5a0a67e-goog
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-25 20:19   ` Kuniyuki Iwashima
@ 2025-08-26  5:46     ` Eric Dumazet
  2025-08-26  5:56       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-08-26  5:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Willem de Bruijn

On Mon, Aug 25, 2025 at 1:19 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Mon, Aug 25, 2025 at 12:59 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > sk->sk_drops can be heavily contended when
> > changed from many cpus.
> >
> > Instead of using a too expensive per-cpu data structure,
> > add a second sk->sk_drops1 field and change
> > sk_drops_inc() to be NUMA aware.
> >
> > This patch adds 64 bytes per socket.
> >
> > For hosts having more than two memory nodes, sk_drops_inc()
> > might not be optimal and can be refined later.
> >
> > Tested with the following stress test, sending about 11 Mpps
> > to a dual socket AMD EPYC 7B13 64-Core.
> >
> > super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
> > Note: due to socket lookup, only one UDP socket will receive
> > packets on DUT.
> >
> > Then measure receiver (DUT) behavior. We can see
> > consumer and BH handlers can process more packets per second.
> >
> > Before:
> >
> > nstat -n ; sleep 1 ; nstat | grep Udp
> > Udp6InDatagrams                 855592             0.0
> > Udp6InErrors                    5621467            0.0
> > Udp6RcvbufErrors                5621467            0.0
> >
> > After:
> > nstat -n ; sleep 1 ; nstat | grep Udp
> > Udp6InDatagrams                 914537             0.0
> > Udp6InErrors                    6888487            0.0
> > Udp6RcvbufErrors                6888487            0.0
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/sock.h                            | 20 +++++++++++++++++++
> >  .../selftests/bpf/progs/bpf_iter_netlink.c    |  3 ++-
> >  .../selftests/bpf/progs/bpf_iter_udp4.c       |  2 +-
> >  .../selftests/bpf/progs/bpf_iter_udp6.c       |  2 +-
> >  4 files changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index f40e3c4883be32c8282694ab215bcf79eb87cbd7..318169eb1a3d40eefac50147012551614abc6f7a 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -282,6 +282,7 @@ struct sk_filter;
> >    *    @sk_err_soft: errors that don't cause failure but are the cause of a
> >    *                  persistent failure not just 'timed out'
> >    *    @sk_drops: raw/udp drops counter
> > +  *    @sk_drops1: second drops counter
> >    *    @sk_ack_backlog: current listen backlog
> >    *    @sk_max_ack_backlog: listen backlog set in listen()
> >    *    @sk_uid: user id of owner
> > @@ -571,6 +572,11 @@ struct sock {
> >         atomic_t                sk_drops ____cacheline_aligned_in_smp;
> >         struct rcu_head         sk_rcu;
> >         netns_tracker           ns_tracker;
> > +#if defined(CONFIG_NUMA)
> > +       atomic_t                sk_drops1 ____cacheline_aligned_in_smp;
> > +#else
> > +       atomic_t                sk_drops1;
>
> Is this #else to be friendly to bpf selftests ?

Sure !

What problem are you seeing exactly ?


>
>
> > +#endif
> >  };
> >
> >  struct sock_bh_locked {
> > @@ -2684,17 +2690,31 @@ struct sock_skb_cb {
> >
> >  static inline void sk_drops_inc(struct sock *sk)
> >  {
> > +#if defined(CONFIG_NUMA)
> > +       int n = numa_node_id() % 2;
> > +
> > +       if (n)
> > +               atomic_inc(&sk->sk_drops1);
> > +       else
> > +               atomic_inc(&sk->sk_drops);
> > +#else
> >         atomic_inc(&sk->sk_drops);
> > +#endif
> >  }
> >
> >  static inline int sk_drops_read(const struct sock *sk)
> >  {
> > +#if defined(CONFIG_NUMA)
> > +       return atomic_read(&sk->sk_drops) + atomic_read(&sk->sk_drops1);
> > +#else
> >         return atomic_read(&sk->sk_drops);
> > +#endif
> >  }
> >
> >  static inline void sk_drops_reset(struct sock *sk)
> >  {
> >         atomic_set(&sk->sk_drops, 0);
> > +       atomic_set(&sk->sk_drops1, 0);
> >  }
> >
> >  static inline void
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > index 00b2ceae81fb0914f2de3634eb342004e8bc3c5b..31ad9fcc6022d5d31b9c6a35daacaad7c887a51f 100644
> > --- a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > @@ -57,7 +57,8 @@ int dump_netlink(struct bpf_iter__netlink *ctx)
> >                 inode = SOCK_INODE(sk);
> >                 bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
> >         }
> > -       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
> > +       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n",
> > +                      s->sk_drops.counter + s->sk_drops1.counter, ino);
> >
> >         return 0;
> >  }
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > index ffbd4b116d17ffbb9f14440c788e50490fb0f4e0..192ab5693a7131c1ec5879e539651c21f6f3c9ae 100644
> > --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > @@ -64,7 +64,7 @@ int dump_udp4(struct bpf_iter__udp *ctx)
> >                        0, 0L, 0, ctx->uid, 0,
> >                        sock_i_ino(&inet->sk),
> >                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> > -                      inet->sk.sk_drops.counter);
> > +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
> >
> >         return 0;
> >  }
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > index 47ff7754f4fda4c9db92fbf1dc2e6a68f044174e..5170bdf458fa1b9a4eea9240fbaa5934182a7776 100644
> > --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > @@ -72,7 +72,7 @@ int dump_udp6(struct bpf_iter__udp *ctx)
> >                        0, 0L, 0, ctx->uid, 0,
> >                        sock_i_ino(&inet->sk),
> >                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> > -                      inet->sk.sk_drops.counter);
> > +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
> >
> >         return 0;
> >  }
> > --
> > 2.51.0.261.g7ce5a0a67e-goog
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-26  5:46     ` Eric Dumazet
@ 2025-08-26  5:56       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 13+ messages in thread
From: Kuniyuki Iwashima @ 2025-08-26  5:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Willem de Bruijn

On Mon, Aug 25, 2025 at 10:46 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Aug 25, 2025 at 1:19 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 12:59 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > sk->sk_drops can be heavily contended when
> > > changed from many cpus.
> > >
> > > Instead of using a too expensive per-cpu data structure,
> > > add a second sk->sk_drops1 field and change
> > > sk_drops_inc() to be NUMA aware.
> > >
> > > This patch adds 64 bytes per socket.
> > >
> > > For hosts having more than two memory nodes, sk_drops_inc()
> > > might not be optimal and can be refined later.
> > >
> > > Tested with the following stress test, sending about 11 Mpps
> > > to a dual socket AMD EPYC 7B13 64-Core.
> > >
> > > super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
> > > Note: due to socket lookup, only one UDP socket will receive
> > > packets on DUT.
> > >
> > > Then measure receiver (DUT) behavior. We can see
> > > consumer and BH handlers can process more packets per second.
> > >
> > > Before:
> > >
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams                 855592             0.0
> > > Udp6InErrors                    5621467            0.0
> > > Udp6RcvbufErrors                5621467            0.0
> > >
> > > After:
> > > nstat -n ; sleep 1 ; nstat | grep Udp
> > > Udp6InDatagrams                 914537             0.0
> > > Udp6InErrors                    6888487            0.0
> > > Udp6RcvbufErrors                6888487            0.0
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/sock.h                            | 20 +++++++++++++++++++
> > >  .../selftests/bpf/progs/bpf_iter_netlink.c    |  3 ++-
> > >  .../selftests/bpf/progs/bpf_iter_udp4.c       |  2 +-
> > >  .../selftests/bpf/progs/bpf_iter_udp6.c       |  2 +-
> > >  4 files changed, 24 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index f40e3c4883be32c8282694ab215bcf79eb87cbd7..318169eb1a3d40eefac50147012551614abc6f7a 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -282,6 +282,7 @@ struct sk_filter;
> > >    *    @sk_err_soft: errors that don't cause failure but are the cause of a
> > >    *                  persistent failure not just 'timed out'
> > >    *    @sk_drops: raw/udp drops counter
> > > +  *    @sk_drops1: second drops counter
> > >    *    @sk_ack_backlog: current listen backlog
> > >    *    @sk_max_ack_backlog: listen backlog set in listen()
> > >    *    @sk_uid: user id of owner
> > > @@ -571,6 +572,11 @@ struct sock {
> > >         atomic_t                sk_drops ____cacheline_aligned_in_smp;
> > >         struct rcu_head         sk_rcu;
> > >         netns_tracker           ns_tracker;
> > > +#if defined(CONFIG_NUMA)
> > > +       atomic_t                sk_drops1 ____cacheline_aligned_in_smp;
> > > +#else
> > > +       atomic_t                sk_drops1;
> >
> > Is this #else to be friendly to bpf selftests ?
>
> Sure !
>
> What problem are you seeing exactly ?

No problem, sk_drops is aligned anyway, I was just curious :)

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Thank you, Eric !

>
>
> >
> >
> > > +#endif
> > >  };
> > >
> > >  struct sock_bh_locked {
> > > @@ -2684,17 +2690,31 @@ struct sock_skb_cb {
> > >
> > >  static inline void sk_drops_inc(struct sock *sk)
> > >  {
> > > +#if defined(CONFIG_NUMA)
> > > +       int n = numa_node_id() % 2;
> > > +
> > > +       if (n)
> > > +               atomic_inc(&sk->sk_drops1);
> > > +       else
> > > +               atomic_inc(&sk->sk_drops);
> > > +#else
> > >         atomic_inc(&sk->sk_drops);
> > > +#endif
> > >  }
> > >
> > >  static inline int sk_drops_read(const struct sock *sk)
> > >  {
> > > +#if defined(CONFIG_NUMA)
> > > +       return atomic_read(&sk->sk_drops) + atomic_read(&sk->sk_drops1);
> > > +#else
> > >         return atomic_read(&sk->sk_drops);
> > > +#endif
> > >  }
> > >
> > >  static inline void sk_drops_reset(struct sock *sk)
> > >  {
> > >         atomic_set(&sk->sk_drops, 0);
> > > +       atomic_set(&sk->sk_drops1, 0);
> > >  }
> > >
> > >  static inline void
> > > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > > index 00b2ceae81fb0914f2de3634eb342004e8bc3c5b..31ad9fcc6022d5d31b9c6a35daacaad7c887a51f 100644
> > > --- a/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_netlink.c
> > > @@ -57,7 +57,8 @@ int dump_netlink(struct bpf_iter__netlink *ctx)
> > >                 inode = SOCK_INODE(sk);
> > >                 bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
> > >         }
> > > -       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
> > > +       BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n",
> > > +                      s->sk_drops.counter + s->sk_drops1.counter, ino);
> > >
> > >         return 0;
> > >  }
> > > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > > index ffbd4b116d17ffbb9f14440c788e50490fb0f4e0..192ab5693a7131c1ec5879e539651c21f6f3c9ae 100644
> > > --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp4.c
> > > @@ -64,7 +64,7 @@ int dump_udp4(struct bpf_iter__udp *ctx)
> > >                        0, 0L, 0, ctx->uid, 0,
> > >                        sock_i_ino(&inet->sk),
> > >                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> > > -                      inet->sk.sk_drops.counter);
> > > +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
> > >
> > >         return 0;
> > >  }
> > > diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > > index 47ff7754f4fda4c9db92fbf1dc2e6a68f044174e..5170bdf458fa1b9a4eea9240fbaa5934182a7776 100644
> > > --- a/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > > +++ b/tools/testing/selftests/bpf/progs/bpf_iter_udp6.c
> > > @@ -72,7 +72,7 @@ int dump_udp6(struct bpf_iter__udp *ctx)
> > >                        0, 0L, 0, ctx->uid, 0,
> > >                        sock_i_ino(&inet->sk),
> > >                        inet->sk.sk_refcnt.refs.counter, udp_sk,
> > > -                      inet->sk.sk_drops.counter);
> > > +                      inet->sk.sk_drops.counter + inet->sk.sk_drops1.counter);
> > >
> > >         return 0;
> > >  }
> > > --
> > > 2.51.0.261.g7ce5a0a67e-goog
> > >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-25 19:59 ` [PATCH net-next 3/3] net: add new sk->sk_drops1 field Eric Dumazet
  2025-08-25 20:19   ` Kuniyuki Iwashima
@ 2025-08-26  6:33   ` Paolo Abeni
  2025-08-26  6:46     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2025-08-26  6:33 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, netdev, eric.dumazet, Willem de Bruijn,
	Kuniyuki Iwashima

On 8/25/25 9:59 PM, Eric Dumazet wrote:
> sk->sk_drops can be heavily contended when
> changed from many cpus.
> 
> Instead of using a too expensive per-cpu data structure,
> add a second sk->sk_drops1 field and change
> sk_drops_inc() to be NUMA aware.
> 
> This patch adds 64 bytes per socket.

I'm wondering: since the main target for dealing with drops is UDP
sockets, have you considered adding sk_drops1 to udp_sock, instead?

Plus an additional conditional/casting in sk_drops_{read,inc,reset}.

That would save some memory and also offer the opportunity to use more
memory to deal with NUMA hosts.

(I had the crazy idea to keep sk_drops on a contended cacheline and use 2
(or more) cacheline-aligned fields for udp_sock only).

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-26  6:33   ` Paolo Abeni
@ 2025-08-26  6:46     ` Eric Dumazet
  2025-08-26  7:16       ` Paolo Abeni
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-08-26  6:46 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, Simon Horman, netdev,
	eric.dumazet, Willem de Bruijn, Kuniyuki Iwashima

On Mon, Aug 25, 2025 at 11:34 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/25/25 9:59 PM, Eric Dumazet wrote:
> > sk->sk_drops can be heavily contended when
> > changed from many cpus.
> >
> > Instead of using a too expensive per-cpu data structure,
> > add a second sk->sk_drops1 field and change
> > sk_drops_inc() to be NUMA aware.
> >
> > This patch adds 64 bytes per socket.
>
> I'm wondering: since the main target for dealing with drops is UDP
> sockets, have you considered adding sk_drops1 to udp_sock, instead?

I actually saw the issues on RAW sockets: some applications were using them
in an inappropriate way. This was not an attack on single UDP sockets, but
a self-inflicted issue on RAW sockets.

Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 7 16:29:43 2024 +0000

    ipv6: raw: check sk->sk_rcvbuf earlier

    There is no point cloning an skb and having to free the clone
    if the receive queue of the raw socket is full.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Link: https://lore.kernel.org/r/20240307162943.2523817-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>


>
> Plus an additional conditional/casting in sk_drops_{read,inc,reset}.
>
> That would save some memory and also offer the opportunity to use more
> memory to deal with NUMA hosts.
>
> (I had the crazy idea to keep sk_drops on a contended cacheline and use 2
> (or more) cacheline-aligned fields for udp_sock only).

I am working on rmem_alloc batches on both producer and consumer
as a follow up of recent thread on netdev :

https://lore.kernel.org/netdev/aKh_yi0gASYajhev@bzorp3/T/#m392d5c87ab08d6ae005c23ffc8a3186cbac07cf2

Right now, when multiple cpus (running on different NUMA nodes) are
feeding packets to __udp_enqueue_schedule_skb(),
we are touching two cache lines; my plan is to reduce this to a single one.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-26  6:46     ` Eric Dumazet
@ 2025-08-26  7:16       ` Paolo Abeni
  2025-08-26 12:11         ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2025-08-26  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Simon Horman, netdev,
	eric.dumazet, Willem de Bruijn, Kuniyuki Iwashima

On 8/26/25 8:46 AM, Eric Dumazet wrote:
> On Mon, Aug 25, 2025 at 11:34 PM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 8/25/25 9:59 PM, Eric Dumazet wrote:
>>> sk->sk_drops can be heavily contended when
>>> changed from many cpus.
>>>
>>> Instead of using a too expensive per-cpu data structure,
>>> add a second sk->sk_drops1 field and change
>>> sk_drops_inc() to be NUMA aware.
>>>
>>> This patch adds 64 bytes per socket.
>>
>> I'm wondering: since the main target for dealing with drops is UDP
>> sockets, have you considered adding sk_drops1 to udp_sock, instead?
> 
> I actually saw the issues on RAW sockets: some applications were using them
> in an inappropriate way. This was not an attack on single UDP sockets, but
> a self-inflicted issue on RAW sockets.
> 
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Thu Mar 7 16:29:43 2024 +0000
> 
>     ipv6: raw: check sk->sk_rcvbuf earlier
> 
>     There is no point cloning an skb and having to free the clone
>     if the receive queue of the raw socket is full.
> 
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Reviewed-by: Willem de Bruijn <willemb@google.com>
>     Link: https://lore.kernel.org/r/20240307162943.2523817-1-edumazet@google.com
>     Signed-off-by: Jakub Kicinski <kuba@kernel.org>

I see, thanks for the pointer. Perhaps something like the following
(completely untested) could fit, with a similar delta for raw sockets?
sk_drops_{read,inc,reset} would check sk_drop_counters and, when it is
set, use it instead of sk->sk_drops; a rough sketch of those helpers
follows the diff below. Otherwise I have no objections at all!
---
diff --git a/include/net/sock.h b/include/net/sock.h
index 63a6a48afb48..3dd76c04bd86 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -102,6 +102,11 @@ struct net;
 typedef __u32 __bitwise __portpair;
 typedef __u64 __bitwise __addrpair;

+struct socket_drop_counters {
+	atomic_t sk_drops0 ____cacheline_aligned_in_smp;
+	atomic_t sk_drops1 ____cacheline_aligned_in_smp;
+};
+
 /**
  *	struct sock_common - minimal network layer representation of sockets
  *	@skc_daddr: Foreign IPv4 addr
@@ -449,6 +454,7 @@ struct sock {
 #ifdef CONFIG_XFRM
 	struct xfrm_policy __rcu *sk_policy[2];
 #endif
+	struct socket_drop_counters *sk_drop_counters;
 	__cacheline_group_end(sock_read_rxtx);

 	__cacheline_group_begin(sock_write_rxtx);
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 4e1a672af4c5..45eec01fbbb2 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -108,6 +108,8 @@ struct udp_sock {
 	 * the last UDP socket cacheline.
 	 */
 	struct hlist_node	tunnel_list;
+
+	struct socket_drop_counters drop_counters;
};

 #define udp_test_bit(nr, sk)			\
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cc3ce0f762ec..eff90755b6ac 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1818,6 +1818,7 @@ static void udp_destruct_sock(struct sock *sk)
 int udp_init_sock(struct sock *sk)
 {
 	udp_lib_init_sock(sk);
+	sk->sk_drop_counters = &udp_sk(sk)->drop_counters;
 	sk->sk_destruct = udp_destruct_sock;
 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	return 0;
---
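
For illustration only, the helpers could then branch on that pointer.
The sketch below is equally untested and relies on the hypothetical
socket_drop_counters / sk_drop_counters pieces from the diff above; it
is not code from the posted series:
---
/* Untested sketch, not from the posted series: fall back to the
 * existing sk->sk_drops when no dedicated counters are attached.
 */
static inline void sk_drops_inc(struct sock *sk)
{
	struct socket_drop_counters *sdc = sk->sk_drop_counters;

	if (!sdc) {
		atomic_inc(&sk->sk_drops);
		return;
	}
	/* Same NUMA split as the posted sk_drops_inc(), applied to the
	 * dedicated counters.
	 */
	if (numa_node_id() % 2)
		atomic_inc(&sdc->sk_drops1);
	else
		atomic_inc(&sdc->sk_drops0);
}

static inline int sk_drops_read(const struct sock *sk)
{
	const struct socket_drop_counters *sdc = sk->sk_drop_counters;

	if (sdc)
		return atomic_read(&sdc->sk_drops0) +
		       atomic_read(&sdc->sk_drops1);
	return atomic_read(&sk->sk_drops);
}

static inline void sk_drops_reset(struct sock *sk)
{
	struct socket_drop_counters *sdc = sk->sk_drop_counters;

	if (sdc) {
		atomic_set(&sdc->sk_drops0, 0);
		atomic_set(&sdc->sk_drops1, 0);
	}
	atomic_set(&sk->sk_drops, 0);
}
---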

>> Plus an additional conditional/casting in sk_drops_{read,inc,reset}.
>>
>> That would save some memory and also offer the opportunity to use more
>> memory to deal with NUMA hosts.
>>
>> (I had the crazy idea to keep sk_drops on a contended cacheline and use 2
>> (or more) cacheline-aligned fields for udp_sock only).
> 
> I am working on rmem_alloc batches on both producer and consumer
> as a follow up of recent thread on netdev :
> 
> https://lore.kernel.org/netdev/aKh_yi0gASYajhev@bzorp3/T/#m392d5c87ab08d6ae005c23ffc8a3186cbac07cf2
> 
> Right now, when multiple cpus (running on different NUMA nodes) are
> feeding packets to __udp_enqueue_schedule_skb(),
> we are touching two cache lines; my plan is to reduce this to a single one.

Obviously looking forward to it!

Thanks,

Paolo


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] net: add new sk->sk_drops1 field
  2025-08-26  7:16       ` Paolo Abeni
@ 2025-08-26 12:11         ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-08-26 12:11 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, Simon Horman, netdev,
	eric.dumazet, Willem de Bruijn, Kuniyuki Iwashima

On Tue, Aug 26, 2025 at 12:16 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/26/25 8:46 AM, Eric Dumazet wrote:
> > On Mon, Aug 25, 2025 at 11:34 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >> On 8/25/25 9:59 PM, Eric Dumazet wrote:
> >>> sk->sk_drops can be heavily contended when
> >>> changed from many cpus.
> >>>
> >>> Instead of using a too expensive per-cpu data structure,
> >>> add a second sk->sk_drops1 field and change
> >>> sk_drops_inc() to be NUMA aware.
> >>>
> >>> This patch adds 64 bytes per socket.
> >>
> >> I'm wondering: since the main target for dealing with drops is UDP
> >> sockets, have you considered adding sk_drops1 to udp_sock, instead?
> >
> > I actually saw the issues on RAW sockets: some applications were using them
> > in an inappropriate way. This was not an attack on single UDP sockets, but
> > a self-inflicted issue on RAW sockets.
> >
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Thu Mar 7 16:29:43 2024 +0000
> >
> >     ipv6: raw: check sk->sk_rcvbuf earlier
> >
> >     There is no point cloning an skb and having to free the clone
> >     if the receive queue of the raw socket is full.
> >
> >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> >     Reviewed-by: Willem de Bruijn <willemb@google.com>
> >     Link: https://lore.kernel.org/r/20240307162943.2523817-1-edumazet@google.com
> >     Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>
> I see, thanks for the pointer. Perhaps something like the following
> (completely untested) could fit, with a similar delta for raw sockets?
> sk_drops_{read,inc,reset} would check sk_drop_counters and, when it is
> set, use it instead of sk->sk_drops. Otherwise I have no objections at all!

Good idea.

In v2, I will only add the extra space for UDP and RAW sockets.
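
For the raw socket side, the hook-up could mirror the udp_init_sock()
line from Paolo's sketch. Purely illustrative, assuming a drop_counters
member gets added to struct raw_sock and that raw_sk_init() remains the
per-socket init hook (none of this is from the posted series):
---
/* Hypothetical sketch only: attach the dedicated drop counters to
 * IPv4 raw sockets; the real v2 placement may differ.
 */
static int raw_sk_init(struct sock *sk)
{
	struct raw_sock *rp = raw_sk(sk);

	sk->sk_drop_counters = &rp->drop_counters;	/* assumed new member */
	return 0;
}
---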

Thanks !

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-08-26 12:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
2025-08-25 19:59 [PATCH net-next 0/3] net: better drop accounting Eric Dumazet
2025-08-25 19:59 ` [PATCH net-next 1/3] net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers Eric Dumazet
2025-08-25 20:15   ` Kuniyuki Iwashima
2025-08-25 19:59 ` [PATCH net-next 2/3] net: move sk_drops out of sock_write_rx group Eric Dumazet
2025-08-25 20:17   ` Kuniyuki Iwashima
2025-08-25 19:59 ` [PATCH net-next 3/3] net: add new sk->sk_drops1 field Eric Dumazet
2025-08-25 20:19   ` Kuniyuki Iwashima
2025-08-26  5:46     ` Eric Dumazet
2025-08-26  5:56       ` Kuniyuki Iwashima
2025-08-26  6:33   ` Paolo Abeni
2025-08-26  6:46     ` Eric Dumazet
2025-08-26  7:16       ` Paolo Abeni
2025-08-26 12:11         ` Eric Dumazet
