All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/4] tcp: backlog processing optims
@ 2023-09-11 17:05 Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Eric Dumazet @ 2023-09-11 17:05 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet, Eric Dumazet

First patches are mostly preparing the ground for the last one.

Last patch of the series implements sort of ACK reduction
only for the cases a TCP receiver is under high stress,
which happens for high throughput flows.

This gives us a ~20% increase of single TCP flow (100Gbit -> 120Gbit)

Eric Dumazet (4):
  tcp: no longer release socket ownership in tcp_release_cb()
  net: sock_release_ownership() cleanup
  net: call prot->release_cb() when processing backlog
  tcp: defer regular ACK while processing socket backlog

 Documentation/networking/ip-sysctl.rst |  7 +++++++
 include/linux/tcp.h                    | 14 ++++++++------
 include/net/netns/ipv4.h               |  1 +
 include/net/sock.h                     |  9 ++++-----
 net/core/sock.c                        |  6 +++---
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_input.c                   |  8 ++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  | 15 ++++-----------
 9 files changed, 45 insertions(+), 25 deletions(-)

-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb()
  2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
@ 2023-09-11 17:05 ` Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 2/4] net: sock_release_ownership() cleanup Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2023-09-11 17:05 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet, Eric Dumazet

This partially reverts c3f9b01849ef ("tcp: tcp_release_cb()
should release socket ownership").

prequeue has been removed by Florian in commit e7942d0633c4
("tcp: remove prequeue support")

__tcp_checksum_complete_user() being gone, we no longer
have to release socket ownership in tcp_release_cb().

This is a prereq for third patch in the series
("net: call prot->release_cb() when processing backlog").

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c       |  3 ---
 net/ipv4/tcp_output.c | 10 ----------
 2 files changed, 13 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 16584e2dd6481a3fc28d796db785439f0446703b..21610e3845a5042f7c648ccb3e0d90126df20a0b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3519,9 +3519,6 @@ void release_sock(struct sock *sk)
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 
-	/* Warning : release_cb() might need to release sk ownership,
-	 * ie call sock_release_ownership(sk) before us.
-	 */
 	if (sk->sk_prot->release_cb)
 		sk->sk_prot->release_cb(sk);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ccfc8bbf745586cd23dcf02d755d6981dc92742e..b4cac12d0e6348aaa3a3957b0091ea7fe6553731 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1101,16 +1101,6 @@ void tcp_release_cb(struct sock *sk)
 		tcp_tsq_write(sk);
 		__sock_put(sk);
 	}
-	/* Here begins the tricky part :
-	 * We are called from release_sock() with :
-	 * 1) BH disabled
-	 * 2) sk_lock.slock spinlock held
-	 * 3) socket owned by us (sk->sk_lock.owned == 1)
-	 *
-	 * But following code is meant to be called from BH handlers,
-	 * so we should keep BH disabled, but early release socket ownership
-	 */
-	sock_release_ownership(sk);
 
 	if (flags & TCPF_WRITE_TIMER_DEFERRED) {
 		tcp_write_timer_handler(sk);
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 2/4] net: sock_release_ownership() cleanup
  2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
@ 2023-09-11 17:05 ` Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2023-09-11 17:05 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet, Eric Dumazet

sock_release_ownership() should only be called by user
owning the socket lock.

After prior commit, we can remove one condition.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b770261fbdaf59d4d1c0b30adb2592c56442e9e3..676146e9d18117e3a04fbbfb0a06e511a80d537c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1823,12 +1823,11 @@ static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
 
 static inline void sock_release_ownership(struct sock *sk)
 {
-	if (sock_owned_by_user_nocheck(sk)) {
-		sk->sk_lock.owned = 0;
+	DEBUG_NET_WARN_ON_ONCE(!sock_owned_by_user_nocheck(sk));
+	sk->sk_lock.owned = 0;
 
-		/* The sk_lock has mutex_unlock() semantics: */
-		mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
-	}
+	/* The sk_lock has mutex_unlock() semantics: */
+	mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
 }
 
 /* no reclassification while locks are held */
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog
  2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
  2023-09-11 17:05 ` [PATCH net-next 2/4] net: sock_release_ownership() cleanup Eric Dumazet
@ 2023-09-11 17:05 ` Eric Dumazet
  2023-09-12 16:58   ` Paolo Abeni
  2023-09-11 17:05 ` [PATCH net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
  2023-09-12 17:30 ` [PATCH net-next 0/4] tcp: backlog processing optims patchwork-bot+netdevbpf
  4 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2023-09-11 17:05 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet, Eric Dumazet

__sk_flush_backlog() / sk_flush_backlog() are used
when TCP recvmsg()/sendmsg() process large chunks,
to not let packets in the backlog too long.

It makes sense to call tcp_release_cb() to also
process actions held in sk->sk_tsq_flags for smoother
scheduling.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 21610e3845a5042f7c648ccb3e0d90126df20a0b..bb89b88bc1e8a042c4ee40b3c8345dc58cb1b369 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3001,6 +3001,9 @@ void __sk_flush_backlog(struct sock *sk)
 {
 	spin_lock_bh(&sk->sk_lock.slock);
 	__release_sock(sk);
+
+	if (sk->sk_prot->release_cb)
+		sk->sk_prot->release_cb(sk);
 	spin_unlock_bh(&sk->sk_lock.slock);
 }
 EXPORT_SYMBOL_GPL(__sk_flush_backlog);
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
                   ` (2 preceding siblings ...)
  2023-09-11 17:05 ` [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
@ 2023-09-11 17:05 ` Eric Dumazet
  2023-09-12 17:30 ` [PATCH net-next 0/4] tcp: backlog processing optims patchwork-bot+netdevbpf
  4 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2023-09-11 17:05 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet, Eric Dumazet, Dave Taht

This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)

Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.

This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.

The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
---
 Documentation/networking/ip-sysctl.rst |  7 +++++++
 include/linux/tcp.h                    | 14 ++++++++------
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_input.c                   |  8 ++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  |  5 ++++-
 7 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index a66054d0763a69d9e7cfae8e6242ac6d254e9169..5bfa1837968cee5eacafc77b216729b495bf65b8 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -745,6 +745,13 @@ tcp_comp_sack_nr - INTEGER
 
 	Default : 44
 
+tcp_backlog_ack_defer - BOOLEAN
+	If set, user thread processing socket backlog tries sending
+	one ACK for the whole queue. This helps to avoid potential
+	long latencies at end of a TCP socket syscall.
+
+	Default : true
+
 tcp_slow_start_after_idle - BOOLEAN
 	If set, provide RFC2861 behavior and time out the congestion
 	window after an idle period.  An idle period is defined at
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3c5efeeb024f651c90ae4a9ca704dcf16e4adb11..44d946161d4a7e52b05c196a1e1d37db25329650 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -463,15 +463,17 @@ enum tsq_enum {
 	TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
 				    * tcp_v{4|6}_mtu_reduced()
 				    */
+	TCP_ACK_DEFERRED,	   /* TX pure ack is deferred */
 };
 
 enum tsq_flags {
-	TSQF_THROTTLED			= (1UL << TSQ_THROTTLED),
-	TSQF_QUEUED			= (1UL << TSQ_QUEUED),
-	TCPF_TSQ_DEFERRED		= (1UL << TCP_TSQ_DEFERRED),
-	TCPF_WRITE_TIMER_DEFERRED	= (1UL << TCP_WRITE_TIMER_DEFERRED),
-	TCPF_DELACK_TIMER_DEFERRED	= (1UL << TCP_DELACK_TIMER_DEFERRED),
-	TCPF_MTU_REDUCED_DEFERRED	= (1UL << TCP_MTU_REDUCED_DEFERRED),
+	TSQF_THROTTLED			= BIT(TSQ_THROTTLED),
+	TSQF_QUEUED			= BIT(TSQ_QUEUED),
+	TCPF_TSQ_DEFERRED		= BIT(TCP_TSQ_DEFERRED),
+	TCPF_WRITE_TIMER_DEFERRED	= BIT(TCP_WRITE_TIMER_DEFERRED),
+	TCPF_DELACK_TIMER_DEFERRED	= BIT(TCP_DELACK_TIMER_DEFERRED),
+	TCPF_MTU_REDUCED_DEFERRED	= BIT(TCP_MTU_REDUCED_DEFERRED),
+	TCPF_ACK_DEFERRED		= BIT(TCP_ACK_DEFERRED),
 };
 
 #define tcp_sk(ptr) container_of_const(ptr, struct tcp_sock, inet_conn.icsk_inet.sk)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7a41c4791536732005cedbb80c223b86aa43249e..d96d05b0881973aafd064ffa9418a22038bbfbf4 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -132,6 +132,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_syncookies;
 	u8 sysctl_tcp_migrate_req;
 	u8 sysctl_tcp_comp_sack_nr;
+	u8 sysctl_tcp_backlog_ack_defer;
 	int sysctl_tcp_reordering;
 	u8 sysctl_tcp_retries1;
 	u8 sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6ac890b4073f4583b0f98ee3294babb91bbcf482..e7f024d93572a6682ac951f9cb1debbaa0450443 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1366,6 +1366,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dou8vec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "tcp_backlog_ack_defer",
+		.data		= &init_net.ipv4.sysctl_tcp_backlog_ack_defer,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname       = "tcp_reflect_tos",
 		.data           = &init_net.ipv4.sysctl_tcp_reflect_tos,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 06fe1cf645d5a386331548484de2beb68e846404..41b471748437b646709158339bd6f79719661198 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5553,6 +5553,14 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    tcp_in_quickack_mode(sk) ||
 	    /* Protocol state mandates a one-time immediate ACK */
 	    inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
+		/* If we are running from __release_sock() in user context,
+		 * Defer the ack until tcp_release_cb().
+		 */
+		if (sock_owned_by_user_nocheck(sk) &&
+		    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_backlog_ack_defer)) {
+			set_bit(TCP_ACK_DEFERRED, &sk->sk_tsq_flags);
+			return;
+		}
 send_now:
 		tcp_send_ack(sk);
 		return;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 27140e5cdc060ddcdc8973759f68ed612a60617a..f13eb7e23d03f3681055257e6ebea0612ae3f9b3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3263,6 +3263,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
 	net->ipv4.sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC;
 	net->ipv4.sysctl_tcp_comp_sack_nr = 44;
+	net->ipv4.sysctl_tcp_backlog_ack_defer = 1;
 	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 0;
 	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4cac12d0e6348aaa3a3957b0091ea7fe6553731..1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1077,7 +1077,8 @@ static void tcp_tasklet_func(struct tasklet_struct *t)
 #define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED |		\
 			  TCPF_WRITE_TIMER_DEFERRED |	\
 			  TCPF_DELACK_TIMER_DEFERRED |	\
-			  TCPF_MTU_REDUCED_DEFERRED)
+			  TCPF_MTU_REDUCED_DEFERRED |	\
+			  TCPF_ACK_DEFERRED)
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -1114,6 +1115,8 @@ void tcp_release_cb(struct sock *sk)
 		inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
 		__sock_put(sk);
 	}
+	if ((flags & TCPF_ACK_DEFERRED) && inet_csk_ack_scheduled(sk))
+		tcp_send_ack(sk);
 }
 EXPORT_SYMBOL(tcp_release_cb);
 
-- 
2.42.0.283.g2d96d420d3-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog
  2023-09-11 17:05 ` [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
@ 2023-09-12 16:58   ` Paolo Abeni
  2023-09-12 17:01     ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2023-09-12 16:58 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	eric.dumazet

On Mon, 2023-09-11 at 17:05 +0000, Eric Dumazet wrote:
> __sk_flush_backlog() / sk_flush_backlog() are used
> when TCP recvmsg()/sendmsg() process large chunks,
> to not let packets in the backlog too long.
> 
> It makes sense to call tcp_release_cb() to also
> process actions held in sk->sk_tsq_flags for smoother
> scheduling.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/sock.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 21610e3845a5042f7c648ccb3e0d90126df20a0b..bb89b88bc1e8a042c4ee40b3c8345dc58cb1b369 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3001,6 +3001,9 @@ void __sk_flush_backlog(struct sock *sk)
>  {
>  	spin_lock_bh(&sk->sk_lock.slock);
>  	__release_sock(sk);
> +
> +	if (sk->sk_prot->release_cb)
> +		sk->sk_prot->release_cb(sk);

Out of sheer curiosity, I'm wondering if adding an
indirect_call_wrapper here could make any difference?

I guess not much, and in any case it could be a follow-up.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog
  2023-09-12 16:58   ` Paolo Abeni
@ 2023-09-12 17:01     ` Eric Dumazet
  2023-09-12 17:19       ` Paolo Abeni
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2023-09-12 17:01 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, netdev, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, eric.dumazet

On Tue, Sep 12, 2023 at 6:59 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2023-09-11 at 17:05 +0000, Eric Dumazet wrote:
> > __sk_flush_backlog() / sk_flush_backlog() are used
> > when TCP recvmsg()/sendmsg() process large chunks,
> > to not let packets in the backlog too long.
> >
> > It makes sense to call tcp_release_cb() to also
> > process actions held in sk->sk_tsq_flags for smoother
> > scheduling.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/core/sock.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 21610e3845a5042f7c648ccb3e0d90126df20a0b..bb89b88bc1e8a042c4ee40b3c8345dc58cb1b369 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -3001,6 +3001,9 @@ void __sk_flush_backlog(struct sock *sk)
> >  {
> >       spin_lock_bh(&sk->sk_lock.slock);
> >       __release_sock(sk);
> > +
> > +     if (sk->sk_prot->release_cb)
> > +             sk->sk_prot->release_cb(sk);
>
> Out of sheer curiosity, I'm wondering if adding an
> indirect_call_wrapper here could make any difference?
>
> I guess not much, and in any case it could be a follow-up.
>

I think it would make sense, particularly from release_sock()

We have such a change in our kernel, for some reason its author never
upstreamed it :/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog
  2023-09-12 17:01     ` Eric Dumazet
@ 2023-09-12 17:19       ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2023-09-12 17:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, netdev, Soheil Hassas Yeganeh,
	Neal Cardwell, Yuchung Cheng, eric.dumazet

On Tue, 2023-09-12 at 19:01 +0200, Eric Dumazet wrote:
> On Tue, Sep 12, 2023 at 6:59 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Mon, 2023-09-11 at 17:05 +0000, Eric Dumazet wrote:
> > > __sk_flush_backlog() / sk_flush_backlog() are used
> > > when TCP recvmsg()/sendmsg() process large chunks,
> > > to not let packets in the backlog too long.
> > > 
> > > It makes sense to call tcp_release_cb() to also
> > > process actions held in sk->sk_tsq_flags for smoother
> > > scheduling.
> > > 
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  net/core/sock.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index 21610e3845a5042f7c648ccb3e0d90126df20a0b..bb89b88bc1e8a042c4ee40b3c8345dc58cb1b369 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -3001,6 +3001,9 @@ void __sk_flush_backlog(struct sock *sk)
> > >  {
> > >       spin_lock_bh(&sk->sk_lock.slock);
> > >       __release_sock(sk);
> > > +
> > > +     if (sk->sk_prot->release_cb)
> > > +             sk->sk_prot->release_cb(sk);
> > 
> > Out of sheer curiosity, I'm wondering if adding an
> > indirect_call_wrapper here could make any difference?
> > 
> > I guess not much, and in any case it could be a follow-up.
> > 
> 
> I think it would make sense, particularly from release_sock()
> 
> We have such a change in our kernel, for some reason its author never
> upstreamed it :/

Can be a follow-up, no need to resend the whole series, I'm applying it
now! The pw backlog is already scaring as is, I prefer the path with
less patches flying ;)

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/4] tcp: backlog processing optims
  2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
                   ` (3 preceding siblings ...)
  2023-09-11 17:05 ` [PATCH net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
@ 2023-09-12 17:30 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-09-12 17:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, netdev, soheil, ncardwell, ycheng,
	eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 11 Sep 2023 17:05:27 +0000 you wrote:
> First patches are mostly preparing the ground for the last one.
> 
> Last patch of the series implements sort of ACK reduction
> only for the cases a TCP receiver is under high stress,
> which happens for high throughput flows.
> 
> This gives us a ~20% increase of single TCP flow (100Gbit -> 120Gbit)
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] tcp: no longer release socket ownership in tcp_release_cb()
    https://git.kernel.org/netdev/net-next/c/b49d252216e4
  - [net-next,2/4] net: sock_release_ownership() cleanup
    https://git.kernel.org/netdev/net-next/c/11445469dec8
  - [net-next,3/4] net: call prot->release_cb() when processing backlog
    https://git.kernel.org/netdev/net-next/c/4505dc2a5228
  - [net-next,4/4] tcp: defer regular ACK while processing socket backlog
    https://git.kernel.org/netdev/net-next/c/133c4c0d3717

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-09-12 17:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-11 17:05 [PATCH net-next 0/4] tcp: backlog processing optims Eric Dumazet
2023-09-11 17:05 ` [PATCH net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
2023-09-11 17:05 ` [PATCH net-next 2/4] net: sock_release_ownership() cleanup Eric Dumazet
2023-09-11 17:05 ` [PATCH net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
2023-09-12 16:58   ` Paolo Abeni
2023-09-12 17:01     ` Eric Dumazet
2023-09-12 17:19       ` Paolo Abeni
2023-09-11 17:05 ` [PATCH net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
2023-09-12 17:30 ` [PATCH net-next 0/4] tcp: backlog processing optims patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.