* [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
@ 2026-02-25 7:46 Leon Hwang
2026-02-25 8:31 ` Eric Dumazet
2026-02-26 1:43 ` Jakub Kicinski
0 siblings, 2 replies; 13+ messages in thread
From: Leon Hwang @ 2026-02-25 7:46 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen, Leon Hwang,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
Leon Hwang, linux-doc, linux-kernel
Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
address a memory leak scenario related to TCP sockets.
Issue:
When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
current implementation does not clear the socket's receive queue. This
causes SKBs in the queue to remain allocated until the socket is
explicitly closed by the application. As a consequence:
1. The page pool pages held by these SKBs are not released.
2. The associated page pool cannot be freed.
RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
CLOSE_WAIT state, "all segment queues should be flushed." However, the
current implementation does not flush the receive queue.
Solution:
Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
when enabled, causes the kernel to purge the receive queue when a RST
packet is received in CLOSE_WAIT state. This allows immediate release
of SKBs and their associated memory resources.
The feature is disabled by default to maintain backward compatibility
with existing behavior.
Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
---
Documentation/networking/ip-sysctl.rst | 18 ++++++++++++++++++
.../net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/netns/ipv4.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
net/ipv4/tcp_input.c | 16 ++++++++++++++++
5 files changed, 45 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d1eeb5323af0..71a529462baa 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1441,6 +1441,24 @@ tcp_rto_max_ms - INTEGER
Default: 120,000
+tcp_purge_receive_queue - BOOLEAN
+ When a socket in the TCP_CLOSE_WAIT state receives a RST packet, the
+ default behavior is to not clear its receive queue. As a result,
+ any SKBs in the queue are not freed until the socket is closed.
+ Consequently, the pages held by these SKBs are not released, which
+ can also prevent the associated page pool from being freed.
+
+ If enabled, the receive queue is purged upon receiving the RST,
+ allowing the SKBs and their associated memory to be released
+ promptly.
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
UDP variables
=============
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index beaf1880a19b..f2c42e7d84a9 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -123,6 +123,7 @@ unsigned_long sysctl_tcp_comp_sack_delay_ns
unsigned_long sysctl_tcp_comp_sack_slack_ns __tcp_ack_snd_check
int sysctl_max_syn_backlog
int sysctl_tcp_fastopen
+u8 sysctl_tcp_purge_receive_queue
struct_tcp_congestion_ops tcp_congestion_control init_cc
struct_tcp_fastopen_context tcp_fastopen_ctx
unsigned_int sysctl_tcp_fastopen_blackhole_timeout
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e971c7bf164..ab973f30f502 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -220,6 +220,7 @@ struct netns_ipv4 {
u8 sysctl_tcp_nometrics_save;
u8 sysctl_tcp_no_ssthresh_metrics_save;
u8 sysctl_tcp_workaround_signed_windows;
+ u8 sysctl_tcp_purge_receive_queue;
int sysctl_tcp_challenge_ack_limit;
u8 sysctl_tcp_min_tso_segs;
u8 sysctl_tcp_reflect_tos;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 643763bc2142..da30970bb5d5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1641,6 +1641,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ONE_THOUSAND,
.extra2 = &tcp_rto_max_max,
},
+ {
+ .procname = "tcp_purge_receive_queue",
+ .data = &init_net.ipv4.sysctl_tcp_purge_receive_queue,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
};
static __net_init int ipv4_sysctl_init_net(struct net *net)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6c3f1d031444..43f32fb5831d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4895,6 +4895,7 @@ EXPORT_IPV6_MOD(tcp_done_with_error);
/* When we get a reset we do this. */
void tcp_reset(struct sock *sk, struct sk_buff *skb)
{
+ const struct net *net = sock_net(sk);
int err;
trace_tcp_receive_reset(sk);
@@ -4911,6 +4912,21 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
err = ECONNREFUSED;
break;
case TCP_CLOSE_WAIT:
+ /* RFC9293 3.10.7.4. Other States
+ * Second, check the RST bit:
+ * CLOSE-WAIT STATE
+ *
+ * If the RST bit is set, then any outstanding RECEIVEs and
+ * SEND should receive "reset" responses. All segment queues
+ * should be flushed. Users should also receive an unsolicited
+ * general "connection reset" signal. Enter the CLOSED state,
+ * delete the TCB, and return.
+ *
+ * If net.ipv4.tcp_purge_receive_queue is enabled,
+ * sk_receive_queue will be flushed too.
+ */
+ if (unlikely(net->ipv4.sysctl_tcp_purge_receive_queue))
+ skb_queue_purge(&sk->sk_receive_queue);
err = EPIPE;
break;
case TCP_CLOSE:
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-02-25 7:46 [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl Leon Hwang
@ 2026-02-25 8:31 ` Eric Dumazet
2026-02-25 9:48 ` Leon Hwang
2026-02-26 1:43 ` Jakub Kicinski
1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2026-02-25 8:31 UTC (permalink / raw)
To: Leon Hwang
Cc: netdev, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
Leon Hwang, linux-doc, linux-kernel
On Wed, Feb 25, 2026 at 8:46 AM Leon Hwang <leon.huangfu@shopee.com> wrote:
>
> Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
> address a memory leak scenario related to TCP sockets.
We use the term "memory leak" for a persistent loss of memory (until reboot).
Let's not abuse it and confuse various AI/human agents, which would
declare emergency situations caused by a nonexistent fatal error.
>
> Issue:
> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> current implementation does not clear the socket's receive queue. This
> causes SKBs in the queue to remain allocated until the socket is
> explicitly closed by the application. As a consequence:
>
> 1. The page pool pages held by these SKBs are not released.
This situation also applies to normal TCP_ESTABLISHED sockets, when
applications do not drain the receive queue.
As long as the application has not called close(), the kernel should not
assume the application will _not_ read the data that was received.
> 2. The associated page pool cannot be freed.
>
> RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
> CLOSE_WAIT state, "all segment queues should be flushed." However, the
> current implementation does not flush the receive queue.
Some buggy stacks send RST anyway after FIN. I think that forcibly
purging good data received before the RST would add many surprises.
>
> Solution:
> Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
> when enabled, causes the kernel to purge the receive queue when a RST
> packet is received in CLOSE_WAIT state. This allows immediate release
> of SKBs and their associated memory resources.
>
> The feature is disabled by default to maintain backward compatibility
> with existing behavior.
>
> Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
[...]
Please prepare a packetdrill test.
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-02-25 8:31 ` Eric Dumazet
@ 2026-02-25 9:48 ` Leon Hwang
0 siblings, 0 replies; 13+ messages in thread
From: Leon Hwang @ 2026-02-25 9:48 UTC (permalink / raw)
To: Eric Dumazet, Leon Hwang
Cc: netdev, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On 25/2/26 16:31, Eric Dumazet wrote:
> On Wed, Feb 25, 2026 at 8:46 AM Leon Hwang <leon.huangfu@shopee.com> wrote:
>>
>> Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
>> address a memory leak scenario related to TCP sockets.
>
> We use the term "memory leak" for a persistent loss of memory (until reboot)
>
Thanks for the clarification.
> Let's not abuse it and confuse various AI/human agents, which would
> declare emergency situations caused by a nonexistent fatal error.
>
I'll reword it in the next revision.
>>
>> Issue:
>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
>> current implementation does not clear the socket's receive queue. This
>> causes SKBs in the queue to remain allocated until the socket is
>> explicitly closed by the application. As a consequence:
>>
>> 1. The page pool pages held by these SKBs are not released.
>
> This situation also applies to normal TCP_ESTABLISHED sockets, when
> applications do not drain the receive queue.
>
> As long as the application has not called close(), the kernel should not
> assume the application will _not_ read the data that was received.
>
Understood.
This patch provides an option to drain the receive queue in the
CLOSE_WAIT + RST case, instead of purging it unconditionally upon
receiving a RST packet.
>
>> 2. The associated page pool cannot be freed.
>>
>> RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
>> CLOSE_WAIT state, "all segment queues should be flushed." However, the
>> current implementation does not flush the receive queue.
>
> Some buggy stacks send RST anyway after FIN. I think that forcibly
> purging good data received before the RST would add many surprises.
>
Understood.
There is a tcp_write_queue_purge(sk) call in tcp_done_with_error(),
which means sk_write_queue is always purged when a RST packet is
received. I assume the reason for purging sk_write_queue is that any
pending transmissions become meaningless once a RST is received.
Would it be better to defer skb_queue_purge(&sk->sk_receive_queue) until
after tcp_done_with_error()?
[...]
>>
>
> Please prepare a packetdrill test.
Ack.
I'll add a packetdrill test in the next revision.
Thanks,
Leon
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-02-25 7:46 [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl Leon Hwang
2026-02-25 8:31 ` Eric Dumazet
@ 2026-02-26 1:43 ` Jakub Kicinski
2026-03-02 9:55 ` Leon Hwang
1 sibling, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2026-02-26 1:43 UTC (permalink / raw)
To: Leon Hwang
Cc: netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Jonathan Corbet, Shuah Khan, David Ahern, Neal Cardwell,
Kuniyuki Iwashima, Ilpo Järvinen, Ido Schimmel,
kerneljasonxing, lance.yang, jiayuan.chen, Leon Hwang, linux-doc,
linux-kernel
On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
> Issue:
> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> current implementation does not clear the socket's receive queue. This
> causes SKBs in the queue to remain allocated until the socket is
> explicitly closed by the application. As a consequence:
>
> 1. The page pool pages held by these SKBs are not released.
On what kernel version and driver are you observing this?
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-02-26 1:43 ` Jakub Kicinski
@ 2026-03-02 9:55 ` Leon Hwang
2026-03-03 0:22 ` Jakub Kicinski
0 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-03-02 9:55 UTC (permalink / raw)
To: Jakub Kicinski, Leon Hwang
Cc: netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Jonathan Corbet, Shuah Khan, David Ahern, Neal Cardwell,
Kuniyuki Iwashima, Ilpo Järvinen, Ido Schimmel,
kerneljasonxing, lance.yang, jiayuan.chen, linux-doc,
linux-kernel
On 26/2/26 09:43, Jakub Kicinski wrote:
> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
>> Issue:
>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
>> current implementation does not clear the socket's receive queue. This
>> causes SKBs in the queue to remain allocated until the socket is
>> explicitly closed by the application. As a consequence:
>>
>> 1. The page pool pages held by these SKBs are not released.
>
> On what kernel version and driver are you observing this?
# uname -r
6.19.0-061900-generic
# ethtool -i eth0
driver: mlx5_core
version: 6.19.0-061900-generic
firmware-version: 26.43.2566 (MT_0000000531)
In addition, the Python scripts below reproduce the issue: the SKBs
remain in the receive queue.
Thanks,
Leon
---
server.py:
import socket
import time
HOST, PORT = "127.0.0.1", 9999
s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024)
s.bind((HOST, PORT))
s.listen(1)
conn, addr = s.accept()
print("accepted", addr)
time.sleep(1)
print("Read 1st:", conn.recv(1))
try:
    conn.send(b"A")
    print("sent 1 byte to client")
except Exception as e:
    print("send failed:", e)
time.sleep(1)
conn.settimeout(0.2)
try:
    b = conn.recv(1)
    print("recv(1) after RST:", b, "len=", len(b))
except Exception as e:
    print("recv(1) after RST raised:", repr(e))
print("Conn remains open...")
try:
    print("Press Ctrl+C to stop...")
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    print("\nProgram interrupted by user. Exiting.")
conn.close()
s.close()
client.py:
import socket
import time
HOST, PORT = "127.0.0.1", 9999
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect((HOST, PORT))
payload = b"x" * (4 * 1024) # 4KiB
c.sendall(payload)
time.sleep(0.1)
c.close()
time.sleep(3)
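For reference, a self-contained loopback variant of the two scripts (an illustrative sketch, not part of the patch) forces the RST deterministically: closing with SO_LINGER enabled and a zero timeout is standard Linux behavior that aborts the connection with an RST instead of a FIN, so the reset no longer depends on timing between the two processes.

```python
import socket
import struct
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))      # ephemeral port instead of 9999
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"x" * 1024)        # parks 1 KiB in conn's receive queue
time.sleep(0.2)                 # let the data land on the loopback

# l_onoff=1, l_linger=0: close() aborts the connection via RST
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()
time.sleep(0.2)

# The queued bytes survive the RST because tcp_reset() does not touch
# sk_receive_queue; only a subsequent recv() reports ECONNRESET.
data = conn.recv(2048)
```

The 1 KiB stays readable after the RST, which is exactly the behavior the patch is about.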
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-02 9:55 ` Leon Hwang
@ 2026-03-03 0:22 ` Jakub Kicinski
2026-03-03 2:12 ` Leon Hwang
0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2026-03-03 0:22 UTC (permalink / raw)
To: Leon Hwang
Cc: Leon Hwang, netdev, David S . Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
> On 26/2/26 09:43, Jakub Kicinski wrote:
> > On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
> >> Issue:
> >> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> >> current implementation does not clear the socket's receive queue. This
> >> causes SKBs in the queue to remain allocated until the socket is
> >> explicitly closed by the application. As a consequence:
> >>
> >> 1. The page pool pages held by these SKBs are not released.
> >
> > On what kernel version and driver are you observing this?
>
> # uname -r
> 6.19.0-061900-generic
>
> # ethtool -i eth0
> driver: mlx5_core
> version: 6.19.0-061900-generic
> firmware-version: 26.43.2566 (MT_0000000531)
Okay... this kernel + driver should just patiently wait for the page
pool to go away.
What is the actual, end-user problem that you're trying to solve?
A few kB of data waiting to be freed is not a huge problem.
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 0:22 ` Jakub Kicinski
@ 2026-03-03 2:12 ` Leon Hwang
2026-03-03 3:55 ` Eric Dumazet
0 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-03-03 2:12 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Leon Hwang, netdev, David S . Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On 3/3/26 08:22, Jakub Kicinski wrote:
> On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
>> On 26/2/26 09:43, Jakub Kicinski wrote:
>>> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
>>>> Issue:
>>>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
>>>> current implementation does not clear the socket's receive queue. This
>>>> causes SKBs in the queue to remain allocated until the socket is
>>>> explicitly closed by the application. As a consequence:
>>>>
>>>> 1. The page pool pages held by these SKBs are not released.
>>>
>>> On what kernel version and driver are you observing this?
>>
>> # uname -r
>> 6.19.0-061900-generic
>>
>> # ethtool -i eth0
>> driver: mlx5_core
>> version: 6.19.0-061900-generic
>> firmware-version: 26.43.2566 (MT_0000000531)
>
> Okay... this kernel + driver should just patiently wait for the page
> pool to go away.
>
> What is the actual, end user problem that you're trying to solve?
> A few kB of data waiting to be freed is not a huge problem..
Yes, it is not a huge problem.
The actual end-user issue was discussed in
"page_pool: Add page_pool_release_stalled tracepoint" [1].
I think it would be useful to provide a way for SREs to purge the
receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
NIC (e.g., a Mellanox one) flaps, the underlying page pool and its
pages can then be released at the same time.
Links:
[1]
https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@linux.dev/
Thanks,
Leon
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 2:12 ` Leon Hwang
@ 2026-03-03 3:55 ` Eric Dumazet
2026-03-03 6:26 ` Leon Hwang
0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2026-03-03 3:55 UTC (permalink / raw)
To: Leon Hwang
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On Tue, Mar 3, 2026 at 3:12 AM Leon Hwang <leon.hwang@linux.dev> wrote:
[...]
> Yes, it is not a huge problem.
>
> The actual end-user issue was discussed in
> "page_pool: Add page_pool_release_stalled tracepoint" [1].
>
> I think it would be useful to provide a way for SREs to purge the
> receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
> NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
> released at the same time.
>
> Links:
> [1]
> https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@linux.dev/
Perhaps SRE could use this in an emergency?
ss -t -a state close-wait -K
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 3:55 ` Eric Dumazet
@ 2026-03-03 6:26 ` Leon Hwang
2026-03-03 7:55 ` Leon Hwang
0 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-03-03 6:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On 3/3/26 11:55, Eric Dumazet wrote:
[...]
>
> Perhaps SRE could use this in an emergency?
>
> ss -t -a state close-wait -K
This ss command is acceptable in an emergency.
A sysctl option would be better for persistent SRE operations.
Thanks,
Leon
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 6:26 ` Leon Hwang
@ 2026-03-03 7:55 ` Leon Hwang
2026-03-03 8:17 ` Eric Dumazet
0 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-03-03 7:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On 3/3/26 14:26, Leon Hwang wrote:
>
>
> On 3/3/26 11:55, Eric Dumazet wrote:
[...]
>>
>> Perhaps SRE could use this in an emergency?
>>
>> ss -t -a state close-wait -K
>
> This ss command is acceptable in an emergency.
>
However, once a CLOSE_WAIT TCP socket receives an RST packet, it
transitions to the CLOSE state. A socket in the CLOSE state cannot be
killed using the ss approach.
The SKBs remain in the receive queue of the CLOSE socket until it is
closed by the user-space application.
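The CLOSE_WAIT-to-CLOSE transition can be watched from user space; a sketch below (Linux-specific, assuming tcpi_state is byte 0 of the TCP_INFO struct, which holds on current kernels):

```python
import socket
import time

TCP_CLOSE, TCP_CLOSE_WAIT = 7, 8   # Linux tcpi_state values

def tcp_state(sock):
    # Linux-specific: byte 0 of struct tcp_info is tcpi_state
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 1)[0]

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"x" * 1024)           # unread data parks in conn's queue
cli.shutdown(socket.SHUT_WR)       # FIN: conn enters CLOSE_WAIT
time.sleep(0.2)
state_after_fin = tcp_state(conn)

cli.close()                        # orphan the client side entirely
conn.send(b"A")                    # data toward the dead peer draws an RST
time.sleep(0.2)
state_after_rst = tcp_state(conn)  # CLOSE: no longer matched by
                                   # "ss ... state close-wait -K"
```

Once the second state read shows TCP_CLOSE, the ss state filter above cannot reach the socket, while the 1 KiB is still queued on it.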
Thanks,
Leon
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 7:55 ` Leon Hwang
@ 2026-03-03 8:17 ` Eric Dumazet
2026-03-03 8:54 ` Leon Hwang
0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2026-03-03 8:17 UTC (permalink / raw)
To: Leon Hwang
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On Tue, Mar 3, 2026 at 8:55 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 3/3/26 14:26, Leon Hwang wrote:
[...]
> >> Perhaps SRE could use this in an emergency?
> >>
> >> ss -t -a state close-wait -K
> >
> > This ss command is acceptable in an emergency.
> >
>
> However, once a CLOSE_WAIT TCP socket receives an RST packet, it
> transitions to the CLOSE state. A socket in the CLOSE state cannot be
> killed using the ss approach.
>
> The SKBs remain in the receive queue of the CLOSE socket until it is
> closed by the user-space application.
Why does the user-space application not drain the receive queue?
Is there a missing EPOLLIN notification or something?
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 8:17 ` Eric Dumazet
@ 2026-03-03 8:54 ` Leon Hwang
2026-03-03 8:56 ` Eric Dumazet
0 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-03-03 8:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On 3/3/26 16:17, Eric Dumazet wrote:
> On Tue, Mar 3, 2026 at 8:55 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>
>>
>> On 3/3/26 14:26, Leon Hwang wrote:
>>>
>>>
>>> On 3/3/26 11:55, Eric Dumazet wrote:
>>>> On Tue, Mar 3, 2026 at 3:12 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 3/3/26 08:22, Jakub Kicinski wrote:
>>>>>> On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
>>>>>>> On 26/2/26 09:43, Jakub Kicinski wrote:
>>>>>>>> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
>>>>>>>>> Issue:
>>>>>>>>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
>>>>>>>>> current implementation does not clear the socket's receive queue. This
>>>>>>>>> causes SKBs in the queue to remain allocated until the socket is
>>>>>>>>> explicitly closed by the application. As a consequence:
>>>>>>>>>
>>>>>>>>> 1. The page pool pages held by these SKBs are not released.
>>>>>>>>
>>>>>>>> On what kernel version and driver are you observing this?
>>>>>>>
>>>>>>> # uname -r
>>>>>>> 6.19.0-061900-generic
>>>>>>>
>>>>>>> # ethtool -i eth0
>>>>>>> driver: mlx5_core
>>>>>>> version: 6.19.0-061900-generic
>>>>>>> firmware-version: 26.43.2566 (MT_0000000531)
>>>>>>
>>>>>> Okay... this kernel + driver should just patiently wait for the page
>>>>>> pool to go away.
>>>>>>
>>>>>> What is the actual, end user problem that you're trying to solve?
>>>>>> A few kB of data waiting to be freed is not a huge problem..
>>>>>
>>>>> Yes, it is not a huge problem.
>>>>>
>>>>> The actual end-user issue was discussed in
>>>>> "page_pool: Add page_pool_release_stalled tracepoint" [1].
>>>>>
>>>>> I think it would be useful to provide a way for SREs to purge the
>>>>> receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
>>>>> NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
>>>>> released at the same time.
>>>>>
>>>>> Links:
>>>>> [1]
>>>>> https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@linux.dev/
>>>>
>>>> Perhaps SRE could use this in an emergency?
>>>>
>>>> ss -t -a state close-wait -K
>>>
>>> This ss command is acceptable in an emergency.
>>>
>>
>> However, once a CLOSE_WAIT TCP socket receives an RST packet, it
>> transitions to the CLOSE state. A socket in the CLOSE state cannot be
>> killed using the ss approach.
>>
>> The SKBs remain in the receive queue of the CLOSE socket until it is
>> closed by the user-space application.
>
> Why user-space application does not drain the receive queue ?
>
> Is there a missing EPOLLIN or something ?
The user-space application uses a TCP connection pool: it establishes
several TCP connections at startup and keeps them in the pool.
However, the application does not always drain the pooled connections'
receive queues. Instead, it selects one connection from the pool using a
hash algorithm to communicate with the TCP server. When it attempts to
write through a socket that is already in the CLOSE state, the write
fails with EPIPE and the application then closes the socket. As a
result, connections whose underlying socket is in the CLOSE state may
retain SKBs in their receive queues if they are never selected for
communication.
I proposed a solution to address this issue: close the TCP connection if
the underlying sk_err is non-zero.
Thanks,
Leon
* Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl
2026-03-03 8:54 ` Leon Hwang
@ 2026-03-03 8:56 ` Eric Dumazet
0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2026-03-03 8:56 UTC (permalink / raw)
To: Leon Hwang
Cc: Jakub Kicinski, Leon Hwang, netdev, David S . Miller, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, David Ahern,
Neal Cardwell, Kuniyuki Iwashima, Ilpo Järvinen,
Ido Schimmel, kerneljasonxing, lance.yang, jiayuan.chen,
linux-doc, linux-kernel
On Tue, Mar 3, 2026 at 9:54 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 3/3/26 16:17, Eric Dumazet wrote:
> > On Tue, Mar 3, 2026 at 8:55 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >>
> >>
> >> On 3/3/26 14:26, Leon Hwang wrote:
> >>>
> >>>
> >>> On 3/3/26 11:55, Eric Dumazet wrote:
> >>>> On Tue, Mar 3, 2026 at 3:12 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/3/26 08:22, Jakub Kicinski wrote:
> >>>>>> On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
> >>>>>>> On 26/2/26 09:43, Jakub Kicinski wrote:
> >>>>>>>> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
> >>>>>>>>> Issue:
> >>>>>>>>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> >>>>>>>>> current implementation does not clear the socket's receive queue. This
> >>>>>>>>> causes SKBs in the queue to remain allocated until the socket is
> >>>>>>>>> explicitly closed by the application. As a consequence:
> >>>>>>>>>
> >>>>>>>>> 1. The page pool pages held by these SKBs are not released.
> >>>>>>>>
> >>>>>>>> On what kernel version and driver are you observing this?
> >>>>>>>
> >>>>>>> # uname -r
> >>>>>>> 6.19.0-061900-generic
> >>>>>>>
> >>>>>>> # ethtool -i eth0
> >>>>>>> driver: mlx5_core
> >>>>>>> version: 6.19.0-061900-generic
> >>>>>>> firmware-version: 26.43.2566 (MT_0000000531)
> >>>>>>
> >>>>>> Okay... this kernel + driver should just patiently wait for the page
> >>>>>> pool to go away.
> >>>>>>
> >>>>>> What is the actual, end user problem that you're trying to solve?
> >>>>>> A few kB of data waiting to be freed is not a huge problem..
> >>>>>
> >>>>> Yes, it is not a huge problem.
> >>>>>
> >>>>> The actual end-user issue was discussed in
> >>>>> "page_pool: Add page_pool_release_stalled tracepoint" [1].
> >>>>>
> >>>>> I think it would be useful to provide a way for SREs to purge the
> >>>>> receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
> >>>>> NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
> >>>>> released at the same time.
> >>>>>
> >>>>> Links:
> >>>>> [1]
> >>>>> https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@linux.dev/
> >>>>
> >>>> Perhaps SRE could use this in an emergency?
> >>>>
> >>>> ss -t -a state close-wait -K
> >>>
> >>> This ss command is acceptable in an emergency.
> >>>
> >>
> >> However, once a CLOSE_WAIT TCP socket receives an RST packet, it
> >> transitions to the CLOSE state. A socket in the CLOSE state cannot be
> >> killed using the ss approach.
> >>
> >> The SKBs remain in the receive queue of the CLOSE socket until it is
> >> closed by the user-space application.
> >
> > Why user-space application does not drain the receive queue ?
> >
> > Is there a missing EPOLLIN or something ?
>
> The user-space application uses a TCP connection pool. It establishes
> several TCP connections at startup and keeps them in the pool.
>
> However, the application does not always drain their receive queues.
> Instead, it selects one connection from the pool using a hash algorithm
> for communication with the TCP server. When it attempts to write data
> through a socket in the CLOSE state, it receives -EPIPE and then closes
> it. As a result, TCP connections whose underlying socket state is CLOSE
> may retain an SKB in their receive queues if they are not selected for
> communication.
>
> I proposed a solution to address this issue: close the TCP connection if
> the underlying sk_err is non-zero.
Okay, makes sense to fix the root cause. Applications can be fixed in a
matter of hours, while kernels can stick to hosts for years.
Thread overview: 13+ messages
2026-02-25 7:46 [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl Leon Hwang
2026-02-25 8:31 ` Eric Dumazet
2026-02-25 9:48 ` Leon Hwang
2026-02-26 1:43 ` Jakub Kicinski
2026-03-02 9:55 ` Leon Hwang
2026-03-03 0:22 ` Jakub Kicinski
2026-03-03 2:12 ` Leon Hwang
2026-03-03 3:55 ` Eric Dumazet
2026-03-03 6:26 ` Leon Hwang
2026-03-03 7:55 ` Leon Hwang
2026-03-03 8:17 ` Eric Dumazet
2026-03-03 8:54 ` Leon Hwang
2026-03-03 8:56 ` Eric Dumazet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox