* [RFC net-next 0/4] tcp: backlog processing optims
@ 2023-09-06 20:10 Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-06 20:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Eric Dumazet

The first patches mostly prepare the ground for the last one.

The last patch of the series implements a form of ACK reduction,
applied only when a TCP receiver is under high stress,
which happens for high-throughput flows.

This gives us a ~20% throughput increase for a single TCP flow
(100Gbit -> 120Gbit).

Eric Dumazet (4):
  tcp: no longer release socket ownership in tcp_release_cb()
  net: sock_release_ownership() cleanup
  net: call prot->release_cb() when processing backlog
  tcp: defer regular ACK while processing socket backlog

 Documentation/networking/ip-sysctl.rst |  7 +++++++
 include/linux/tcp.h                    | 14 ++++++++------
 include/net/netns/ipv4.h               |  1 +
 include/net/sock.h                     |  9 ++++-----
 net/core/sock.c                        |  6 +++---
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_input.c                   |  8 ++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  | 15 ++++-----------
 9 files changed, 45 insertions(+), 25 deletions(-)

-- 
2.42.0.283.g2d96d420d3-goog



* [RFC net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb()
  2023-09-06 20:10 [RFC net-next 0/4] tcp: backlog processing optims Eric Dumazet
@ 2023-09-06 20:10 ` Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 2/4] net: sock_release_ownership() cleanup Eric Dumazet
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-06 20:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Eric Dumazet

This partially reverts c3f9b01849ef ("tcp: tcp_release_cb()
should release socket ownership").

The prequeue has been removed by Florian in commit e7942d0633c4
("tcp: remove prequeue support").

With __tcp_checksum_complete_user() gone, we no longer
have to release socket ownership in tcp_release_cb().

This is a prerequisite for the third patch in the series
("net: call prot->release_cb() when processing backlog").

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c       |  3 ---
 net/ipv4/tcp_output.c | 10 ----------
 2 files changed, 13 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 16584e2dd6481a3fc28d796db785439f0446703b..21610e3845a5042f7c648ccb3e0d90126df20a0b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3519,9 +3519,6 @@ void release_sock(struct sock *sk)
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 
-	/* Warning : release_cb() might need to release sk ownership,
-	 * ie call sock_release_ownership(sk) before us.
-	 */
 	if (sk->sk_prot->release_cb)
 		sk->sk_prot->release_cb(sk);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ccfc8bbf745586cd23dcf02d755d6981dc92742e..b4cac12d0e6348aaa3a3957b0091ea7fe6553731 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1101,16 +1101,6 @@ void tcp_release_cb(struct sock *sk)
 		tcp_tsq_write(sk);
 		__sock_put(sk);
 	}
-	/* Here begins the tricky part :
-	 * We are called from release_sock() with :
-	 * 1) BH disabled
-	 * 2) sk_lock.slock spinlock held
-	 * 3) socket owned by us (sk->sk_lock.owned == 1)
-	 *
-	 * But following code is meant to be called from BH handlers,
-	 * so we should keep BH disabled, but early release socket ownership
-	 */
-	sock_release_ownership(sk);
 
 	if (flags & TCPF_WRITE_TIMER_DEFERRED) {
 		tcp_write_timer_handler(sk);
-- 
2.42.0.283.g2d96d420d3-goog



* [RFC net-next 2/4] net: sock_release_ownership() cleanup
  2023-09-06 20:10 [RFC net-next 0/4] tcp: backlog processing optims Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
@ 2023-09-06 20:10 ` Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
  3 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-06 20:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Eric Dumazet

sock_release_ownership() should only be called by a user
owning the socket lock.

After the prior commit, we can remove one condition.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b770261fbdaf59d4d1c0b30adb2592c56442e9e3..676146e9d18117e3a04fbbfb0a06e511a80d537c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1823,12 +1823,11 @@ static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
 
 static inline void sock_release_ownership(struct sock *sk)
 {
-	if (sock_owned_by_user_nocheck(sk)) {
-		sk->sk_lock.owned = 0;
+	DEBUG_NET_WARN_ON_ONCE(!sock_owned_by_user_nocheck(sk));
+	sk->sk_lock.owned = 0;
 
-		/* The sk_lock has mutex_unlock() semantics: */
-		mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
-	}
+	/* The sk_lock has mutex_unlock() semantics: */
+	mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
 }
 
 /* no reclassification while locks are held */
-- 
2.42.0.283.g2d96d420d3-goog



* [RFC net-next 3/4] net: call prot->release_cb() when processing backlog
  2023-09-06 20:10 [RFC net-next 0/4] tcp: backlog processing optims Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 1/4] tcp: no longer release socket ownership in tcp_release_cb() Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 2/4] net: sock_release_ownership() cleanup Eric Dumazet
@ 2023-09-06 20:10 ` Eric Dumazet
  2023-09-06 20:10 ` [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
  3 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-06 20:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Eric Dumazet

__sk_flush_backlog() / sk_flush_backlog() are used
when TCP recvmsg()/sendmsg() process large chunks,
to avoid keeping packets in the backlog for too long.

It makes sense to also call tcp_release_cb() there,
to process actions held in sk->sk_tsq_flags for smoother
scheduling.
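
For reference, sk_flush_backlog() is a thin inline wrapper around
__sk_flush_backlog(); a simplified sketch (abridged from
include/net/sock.h, not authoritative):

static inline void sk_flush_backlog(struct sock *sk)
{
	/* Only take the slow path if packets actually accumulated. */
	if (unlikely(READ_ONCE(sk->sk_backlog.tail)))
		__sk_flush_backlog(sk);
}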

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 21610e3845a5042f7c648ccb3e0d90126df20a0b..bb89b88bc1e8a042c4ee40b3c8345dc58cb1b369 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3001,6 +3001,9 @@ void __sk_flush_backlog(struct sock *sk)
 {
 	spin_lock_bh(&sk->sk_lock.slock);
 	__release_sock(sk);
+
+	if (sk->sk_prot->release_cb)
+		sk->sk_prot->release_cb(sk);
 	spin_unlock_bh(&sk->sk_lock.slock);
 }
 EXPORT_SYMBOL_GPL(__sk_flush_backlog);
-- 
2.42.0.283.g2d96d420d3-goog



* [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-06 20:10 [RFC net-next 0/4] tcp: backlog processing optims Eric Dumazet
                   ` (2 preceding siblings ...)
  2023-09-06 20:10 ` [RFC net-next 3/4] net: call prot->release_cb() when processing backlog Eric Dumazet
@ 2023-09-06 20:10 ` Eric Dumazet
  2023-09-06 20:31   ` Yuchung Cheng
                     ` (3 more replies)
  3 siblings, 4 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-06 20:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Yuchung Cheng, Eric Dumazet

This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high-throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, the bottleneck is usually the 'user' cpu.)

The problem is that the user thread can spend a lot of time
holding the socket lock, forcing the BH handler to queue most
incoming packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases the chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.
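
To make that cost concrete, here is a simplified sketch of the backlog
drain loop in __release_sock() (abridged from net/core/sock.c, not
verbatim); every queued skb goes through full TCP input processing on
the 'user' cpu:

	while ((skb = sk->sk_backlog.head) != NULL) {
		sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
		spin_unlock(&sk->sk_lock.slock);
		do {
			next = skb->next;
			skb_mark_not_on_list(skb);
			/* For TCP this ends up in tcp_v4_do_rcv(). */
			sk_backlog_rcv(sk, skb);
			cond_resched();
			skb = next;
		} while (skb != NULL);
		spin_lock(&sk->sk_lock.slock);
	}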

This patch takes advantage of backlog processing,
and of the fact that ACKs are mostly cumulative.

The idea is to detect that we are processing the backlog,
and to defer all eligible ACKs into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.
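
For context, tcp_release_cb() first claims all deferred action bits
atomically, so each deferred ACK is consumed exactly once; a simplified
sketch of that part (not verbatim):

	unsigned long flags = smp_load_acquire(&sk->sk_tsq_flags);
	unsigned long nflags;

	do {
		if (!(flags & TCP_DEFERRED_ALL))
			return;
		nflags = flags & ~TCP_DEFERRED_ALL;
	} while (!try_cmpxchg(&sk->sk_tsq_flags, &flags, nflags));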

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACKs per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Sapphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst |  7 +++++++
 include/linux/tcp.h                    | 14 ++++++++------
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_input.c                   |  8 ++++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 net/ipv4/tcp_output.c                  |  5 ++++-
 7 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index a66054d0763a69d9e7cfae8e6242ac6d254e9169..5bfa1837968cee5eacafc77b216729b495bf65b8 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -745,6 +745,13 @@ tcp_comp_sack_nr - INTEGER
 
 	Default : 44
 
+tcp_backlog_ack_defer - BOOLEAN
+	If set, a user thread processing the socket backlog tries
+	to send one ACK for the whole queue. This helps to avoid
+	potential long latencies at the end of a TCP socket syscall.
+
+	Default : true
+
 tcp_slow_start_after_idle - BOOLEAN
 	If set, provide RFC2861 behavior and time out the congestion
 	window after an idle period.  An idle period is defined at
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3c5efeeb024f651c90ae4a9ca704dcf16e4adb11..44d946161d4a7e52b05c196a1e1d37db25329650 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -463,15 +463,17 @@ enum tsq_enum {
 	TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
 				    * tcp_v{4|6}_mtu_reduced()
 				    */
+	TCP_ACK_DEFERRED,	   /* TX pure ack is deferred */
 };
 
 enum tsq_flags {
-	TSQF_THROTTLED			= (1UL << TSQ_THROTTLED),
-	TSQF_QUEUED			= (1UL << TSQ_QUEUED),
-	TCPF_TSQ_DEFERRED		= (1UL << TCP_TSQ_DEFERRED),
-	TCPF_WRITE_TIMER_DEFERRED	= (1UL << TCP_WRITE_TIMER_DEFERRED),
-	TCPF_DELACK_TIMER_DEFERRED	= (1UL << TCP_DELACK_TIMER_DEFERRED),
-	TCPF_MTU_REDUCED_DEFERRED	= (1UL << TCP_MTU_REDUCED_DEFERRED),
+	TSQF_THROTTLED			= BIT(TSQ_THROTTLED),
+	TSQF_QUEUED			= BIT(TSQ_QUEUED),
+	TCPF_TSQ_DEFERRED		= BIT(TCP_TSQ_DEFERRED),
+	TCPF_WRITE_TIMER_DEFERRED	= BIT(TCP_WRITE_TIMER_DEFERRED),
+	TCPF_DELACK_TIMER_DEFERRED	= BIT(TCP_DELACK_TIMER_DEFERRED),
+	TCPF_MTU_REDUCED_DEFERRED	= BIT(TCP_MTU_REDUCED_DEFERRED),
+	TCPF_ACK_DEFERRED		= BIT(TCP_ACK_DEFERRED),
 };
 
 #define tcp_sk(ptr) container_of_const(ptr, struct tcp_sock, inet_conn.icsk_inet.sk)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7a41c4791536732005cedbb80c223b86aa43249e..d96d05b0881973aafd064ffa9418a22038bbfbf4 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -132,6 +132,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_syncookies;
 	u8 sysctl_tcp_migrate_req;
 	u8 sysctl_tcp_comp_sack_nr;
+	u8 sysctl_tcp_backlog_ack_defer;
 	int sysctl_tcp_reordering;
 	u8 sysctl_tcp_retries1;
 	u8 sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2afb0870648bc583504714c7d31d8fb0848e18c7..472c3953f40854f3ae85950cadffcd0cb6c4f83c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1366,6 +1366,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dou8vec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "tcp_backlog_ack_defer",
+		.data		= &init_net.ipv4.sysctl_tcp_backlog_ack_defer,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname       = "tcp_reflect_tos",
 		.data           = &init_net.ipv4.sysctl_tcp_reflect_tos,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 06fe1cf645d5a386331548484de2beb68e846404..41b471748437b646709158339bd6f79719661198 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5553,6 +5553,14 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    tcp_in_quickack_mode(sk) ||
 	    /* Protocol state mandates a one-time immediate ACK */
 	    inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
+		/* If we are running from __release_sock() in user context,
+		 * defer the ack until tcp_release_cb().
+		 */
+		if (sock_owned_by_user_nocheck(sk) &&
+		    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_backlog_ack_defer)) {
+			set_bit(TCP_ACK_DEFERRED, &sk->sk_tsq_flags);
+			return;
+		}
 send_now:
 		tcp_send_ack(sk);
 		return;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 27140e5cdc060ddcdc8973759f68ed612a60617a..f13eb7e23d03f3681055257e6ebea0612ae3f9b3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3263,6 +3263,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
 	net->ipv4.sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC;
 	net->ipv4.sysctl_tcp_comp_sack_nr = 44;
+	net->ipv4.sysctl_tcp_backlog_ack_defer = 1;
 	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 0;
 	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4cac12d0e6348aaa3a3957b0091ea7fe6553731..1fc1f879cfd6c28cd655bb8f02eff6624eec2ffc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1077,7 +1077,8 @@ static void tcp_tasklet_func(struct tasklet_struct *t)
 #define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED |		\
 			  TCPF_WRITE_TIMER_DEFERRED |	\
 			  TCPF_DELACK_TIMER_DEFERRED |	\
-			  TCPF_MTU_REDUCED_DEFERRED)
+			  TCPF_MTU_REDUCED_DEFERRED |	\
+			  TCPF_ACK_DEFERRED)
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -1114,6 +1115,8 @@ void tcp_release_cb(struct sock *sk)
 		inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
 		__sock_put(sk);
 	}
+	if ((flags & TCPF_ACK_DEFERRED) && inet_csk_ack_scheduled(sk))
+		tcp_send_ack(sk);
 }
 EXPORT_SYMBOL(tcp_release_cb);
 
-- 
2.42.0.283.g2d96d420d3-goog



* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-06 20:10 ` [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
@ 2023-09-06 20:31   ` Yuchung Cheng
  2023-09-06 23:07     ` Stephen Hemminger
  2023-09-07 14:07   ` Neal Cardwell
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Yuchung Cheng @ 2023-09-06 20:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell

On Wed, Sep 6, 2023 at 1:10 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>

Looks great. Any reason any workload may want to switch this feature
off? I can't think of one...



* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-06 20:31   ` Yuchung Cheng
@ 2023-09-06 23:07     ` Stephen Hemminger
  0 siblings, 0 replies; 15+ messages in thread
From: Stephen Hemminger @ 2023-09-06 23:07 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell

On Wed, 6 Sep 2023 13:31:17 -0700
Yuchung Cheng <ycheng@google.com> wrote:

> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>  
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> 
> Looks great. Any reason any workload may want to switch this feature
> off? I can't think of one...

Never underestimate the likelihood that some TCP stack or middlebox
is broken and needs more ACKs to work.


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-06 20:10 ` [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
  2023-09-06 20:31   ` Yuchung Cheng
@ 2023-09-07 14:07   ` Neal Cardwell
  2023-09-07 14:35     ` Soheil Hassas Yeganeh
       [not found]   ` <CAA93jw7Fuov-vmxiZdW7My-AVWCOFQo4XVm9bNwAg4Td2CUNCA@mail.gmail.com>
  2023-09-07 17:09   ` Jakub Kicinski
  3 siblings, 1 reply; 15+ messages in thread
From: Neal Cardwell @ 2023-09-07 14:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	eric.dumazet, Soheil Hassas Yeganeh, Yuchung Cheng

On Wed, Sep 6, 2023 at 4:10 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

Yet another fantastic optimization. Thanks, Eric!

neal


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-07 14:07   ` Neal Cardwell
@ 2023-09-07 14:35     ` Soheil Hassas Yeganeh
  0 siblings, 0 replies; 15+ messages in thread
From: Soheil Hassas Yeganeh @ 2023-09-07 14:35 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, eric.dumazet, Yuchung Cheng

On Thu, Sep 7, 2023 at 10:08 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Wed, Sep 6, 2023 at 4:10 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [...]
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> Acked-by: Neal Cardwell <ncardwell@google.com>
>
> Yet another fantastic optimization. Thanks, Eric!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

This is really superb!  Thank you, Eric!


> neal


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
       [not found]   ` <CAA93jw7Fuov-vmxiZdW7My-AVWCOFQo4XVm9bNwAg4Td2CUNCA@mail.gmail.com>
@ 2023-09-07 16:54     ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2023-09-07 16:54 UTC (permalink / raw)
  To: Dave Taht
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	eric.dumazet, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Thu, Sep 7, 2023 at 6:50 PM Dave Taht <dave.taht@gmail.com> wrote:

> When you once said you were going to find a way to reduce the number of acks, I did not expect this.
> Has this been tested on arm or at lower rates, such as a gigabit? Or against WiFi?
>

Not yet, but my guess is that it could benefit wifi drivers that are
not using GRO,
particularly if the TCP payload is in skb->head instead of a page fragment,
defeating tcp_try_coalesce()'s ability to 'pretend GRO was there in the
first place'.

I sent an RFC so that others could perform tests ;)

> Regardless:
>
> Acked-by: Dave Taht <dave.taht@gmail.com>
>
> --
> Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
> Dave Täht CSO, LibreQos


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-06 20:10 ` [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog Eric Dumazet
                     ` (2 preceding siblings ...)
       [not found]   ` <CAA93jw7Fuov-vmxiZdW7My-AVWCOFQo4XVm9bNwAg4Td2CUNCA@mail.gmail.com>
@ 2023-09-07 17:09   ` Jakub Kicinski
  2023-09-07 17:16     ` Eric Dumazet
  3 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2023-09-07 17:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, netdev, eric.dumazet,
	Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Wed,  6 Sep 2023 20:10:46 +0000 Eric Dumazet wrote:
> This idea came after a particular workload requested
> the quickack attribute set on routes, and a performance
> drop was noticed for large bulk transfers.

Is it okay if I asked why quickack?
Is it related to delay-based CC?


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-07 17:09   ` Jakub Kicinski
@ 2023-09-07 17:16     ` Eric Dumazet
  2023-09-07 18:00       ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2023-09-07 17:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S . Miller, Paolo Abeni, netdev, eric.dumazet,
	Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Thu, Sep 7, 2023 at 7:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  6 Sep 2023 20:10:46 +0000 Eric Dumazet wrote:
> > This idea came after a particular workload requested
> > the quickack attribute set on routes, and a performance
> > drop was noticed for large bulk transfers.
>
> Is it okay if I asked why quickack?
> Is it related to delay-based CC?

Note the patch also helps the 'regular' mode, without "quickack 1".

This is not CC related in any way; it comes from a TCP tx zerocopy
workload, sending one chunk at a time and waiting for the TCP tx
zerocopy completion in order to proceed with the next chunk,
because the 'next chunk' is re-using the memory.
The receiver application is not sending back a message (otherwise the
'delayed ack' would be piggybacked in the reply),
and it also does not know what size of the message was expected (so no
SO_RCVLOWAT or anything could be attempted)

For this kind of workload, it is crucial the last ACK is not delayed, at all.
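
A minimal sketch of such a sender, for illustration only (assumes
SO_ZEROCOPY is already an option on the socket; fill_next_chunk() is a
hypothetical helper, and declarations of fd/buf/len plus error handling
are omitted). For TCP, the zerocopy completion only arrives once the
data has been acked, which is why the last ACK matters so much here:

	int one = 1;
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	for (;;) {
		fill_next_chunk(buf, len);	/* re-uses the same memory */
		send(fd, buf, len, MSG_ZEROCOPY);

		/* Block until the completion shows up on the error queue. */
		struct pollfd pfd = { .fd = fd, .events = 0 };
		poll(&pfd, 1, -1);		/* POLLERR => errqueue ready */

		char ctrl[CMSG_SPACE(sizeof(struct sock_extended_err))];
		struct msghdr msg = { .msg_control = ctrl,
				      .msg_controllen = sizeof(ctrl) };
		recvmsg(fd, &msg, MSG_ERRQUEUE);	/* consume notification */
	}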


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-07 17:16     ` Eric Dumazet
@ 2023-09-07 18:00       ` Jakub Kicinski
  2023-09-07 18:05         ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2023-09-07 18:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, netdev, eric.dumazet,
	Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Thu, 7 Sep 2023 19:16:01 +0200 Eric Dumazet wrote:
> > Is it okay if I asked why quickack?
> > Is it related to delay-based CC?  
> 
> Note the patch also helps the 'regular' mode, without "quickack 1".
> 
> This is not CC related in any way; it comes from a TCP tx zerocopy
> workload, sending one chunk at a time and waiting for the TCP tx
> zerocopy completion in order to proceed with the next chunk,
> because the 'next chunk' is re-using the memory.
> 
> The receiver application is not sending back a message (otherwise the
> 'delayed ack' would be piggybacked in the reply),
> and it also does not know what size of the message was expected (so no
> SO_RCVLOWAT or anything could be attempted)
> 
> For this kind of workload, it is crucial the last ACK is not delayed, at all.

Interesting. Some folks at Meta were recently looking into parsing RPCs
in the kernel to avoid unnecessary wakeups. Poor man's KCM using BPF
sockmaps. Passing message size hints from the sender would solve so
many problems...

In any case, I don't mean to question the patch :)


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-07 18:00       ` Jakub Kicinski
@ 2023-09-07 18:05         ` Eric Dumazet
  2023-09-07 18:10           ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2023-09-07 18:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S . Miller, Paolo Abeni, netdev, eric.dumazet,
	Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Thu, Sep 7, 2023 at 8:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 7 Sep 2023 19:16:01 +0200 Eric Dumazet wrote:
> > > Is it okay if I asked why quickack?
> > > Is it related to delay-based CC?
> >
> > Note the patch also helps the 'regular' mode, without "quickack 1".
> >
> > This is not CC related in any way; it comes from a TCP tx zerocopy
> > workload, sending one chunk at a time and waiting for the TCP tx
> > zerocopy completion in order to proceed with the next chunk,
> > because the 'next chunk' is re-using the memory.
> >
> > The receiver application is not sending back a message (otherwise the
> > 'delayed ack' would be piggybacked in the reply),
> > and it also does not know what size of the message was expected (so no
> > SO_RCVLOWAT or anything could be attempted)
> >
> > For this kind of workload, it is crucial the last ACK is not delayed, at all.
>
> Interesting. Some folks at Meta were recently looking into parsing RPCs
> in the kernel to avoid unnecessary wakeups. Poor man's KCM using BPF
> sockmaps. Passing message size hints from the sender would solve so
> many problems...

Yes, RPC headers make things easier for sure.

(we internally have something similar named autolowat, where we parse
headers to set sk->sk_rcvlowat dynamically)
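
For illustration, a userspace analogue of that trick, assuming a
hypothetical framing where each RPC starts with a 4-byte length
header ('body' is a preallocated buffer; error handling omitted):

	uint32_t hdr;
	recv(fd, &hdr, sizeof(hdr), MSG_WAITALL);
	int want = ntohl(hdr);

	/* Do not get woken up until the whole message has arrived. */
	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &want, sizeof(want));
	recv(fd, body, want, MSG_WAITALL);

	/* Restore the default before reading the next header. */
	int lowat = 1;
	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));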


* Re: [RFC net-next 4/4] tcp: defer regular ACK while processing socket backlog
  2023-09-07 18:05         ` Eric Dumazet
@ 2023-09-07 18:10           ` Jakub Kicinski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-09-07 18:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, netdev, eric.dumazet,
	Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng

On Thu, 7 Sep 2023 20:05:29 +0200 Eric Dumazet wrote:
> > Interesting. Some folks at Meta were recently looking into parsing RPCs
> > in the kernel to avoid unnecessary wakeups. Poor man's KCM using BPF
> > sockmaps. Passing message size hints from the sender would solve so
> > many problems..  
> 
> Yes, RPC headers make things easier for sure.
> 
> (we internally have something similar named autolowat, where we parse
> headers to set sk->sk_rcvlowat dynamically)

Could this be turned into a BPF hook, possibly?
That's basically what I suggested the BPF people did but they like 
to rewrite everything.

