* [PATCH] sunrpc: hold a ref on netns for tcp sockets
@ 2024-03-19 19:49 Josef Bacik
2024-03-19 19:53 ` Chuck Lever III
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Josef Bacik @ 2024-03-19 19:49 UTC (permalink / raw)
To: chuck.lever, jlayton, linux-nfs, edumazet, kuba, kernel-team
We've been seeing variations of the following panic in production
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:ip6_pol_route+0x59/0x7a0
Call Trace:
<IRQ>
? __die+0x78/0xc0
? page_fault_oops+0x286/0x380
? fib6_table_lookup+0x95/0xf40
? exc_page_fault+0x5d/0x110
? asm_exc_page_fault+0x22/0x30
? ip6_pol_route+0x59/0x7a0
? unlink_anon_vmas+0x370/0x370
fib6_rule_lookup+0x56/0x1b0
? update_blocked_averages+0x2c6/0x6a0
ip6_route_output_flags+0xd2/0x130
ip6_dst_lookup_tail+0x3b/0x220
ip6_dst_lookup_flow+0x2c/0x80
inet6_sk_rebuild_header+0x14c/0x1e0
? tcp_release_cb+0x150/0x150
__tcp_retransmit_skb+0x68/0x6b0
? tcp_current_mss+0xca/0x150
? tcp_release_cb+0x150/0x150
tcp_send_loss_probe+0x8e/0x220
tcp_write_timer+0xbe/0x2d0
run_timer_softirq+0x272/0x840
? hrtimer_interrupt+0x2c9/0x5f0
? sched_clock_cpu+0xc/0x170
irq_exit_rcu+0x171/0x330
sysvec_apic_timer_interrupt+0x6d/0x80
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x16/0x20
RIP: 0010:cpuidle_enter_state+0xe7/0x243
Inspecting the vmcore with drgn you can see why this is a NULL pointer deref
>>> prog.crashed_thread().stack_trace()[0]
#0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in ip6_pol_route at net/ipv6/route.c:2212:40
2212 if (net->ipv6.devconf_all->forwarding == 0)
2213 strict |= RT6_LOOKUP_F_REACHABLE;
>>> prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all
(struct ipv6_devconf *)0x0
Looking at the socket you can see that it's been closed
>>> decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_flags, prog.type('enum sock_flags'))
'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE'
>>> decode_enum_type_flags(1 << prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.value_(), prog["TCPF_CLOSE"].type_, bit_numbers=False)
'TCPF_FIN_WAIT1'
This occurs in our container setup where we have an NFS mount that
belongs to the containers network namespace. On container shutdown our
netns goes away, which sets net->ipv6.defconf_all = NULL, and then we
panic. In the kernel we're responsible for destroying our sockets when
the network namespace exits, or holding a reference on the network
namespace for our sockets so this doesn't happen.
Even once we shutdown the socket we can still have TCP timers that fire
in the background, hence this panic. SUNRPC shuts down the socket and
throws away all knowledge of it, but it's still doing things in the
background.
Fix this by grabbing a reference on the network namespace for any tcp
sockets we open. With this patch I'm able to cycle my 500 node stress
tier over and over again without panicing, whereas previously I was
losing 10-20 nodes every shutdown cycle.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bb81050c870e..f02387751a94 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
if (!transport->inet) {
struct sock *sk = sock->sk;
+ struct net *net = sock_net(sk);
/* Avoid temporary address, they are bad for long-lived
* connections such as NFS mounts.
@@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
tcp_sock_set_nodelay(sk);
lock_sock(sk);
+ /*
+ * Because timers can fire after the fact we need to hold a
+ * reference on the netns for this socket.
+ */
+ if (!sk->sk_net_refcnt) {
+ if (!maybe_get_net(net)) {
+ release_sock(sk);
+ return -ENOTCONN;
+ }
+ /*
+ * For kernel sockets we have a tracker put in place for
+ * the tracing, we need to free this to maintaine
+ * consistent tracking info.
+ */
+ __netns_tracker_free(net, &sk->ns_tracker, false);
+ sk->sk_net_refcnt = 1;
+ netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
+ sock_inuse_add(net, 1);
+ }
xs_save_old_callbacks(transport, sk);
sk->sk_user_data = xprt;
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] sunrpc: hold a ref on netns for tcp sockets 2024-03-19 19:49 [PATCH] sunrpc: hold a ref on netns for tcp sockets Josef Bacik @ 2024-03-19 19:53 ` Chuck Lever III 2024-03-19 20:07 ` [PATCH][RESEND] " Josef Bacik 2024-03-19 21:59 ` Trond Myklebust 2 siblings, 0 replies; 10+ messages in thread From: Chuck Lever III @ 2024-03-19 19:53 UTC (permalink / raw) To: Josef Bacik Cc: Jeff Layton, Linux NFS Mailing List, Eric Dumazet, Jakub Kicinski, kernel-team@fb.com > On Mar 19, 2024, at 3:49 PM, Josef Bacik <josef@toxicpanda.com> wrote: > > We've been seeing variations of the following panic in production > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > RIP: 0010:ip6_pol_route+0x59/0x7a0 > Call Trace: > <IRQ> > ? __die+0x78/0xc0 > ? page_fault_oops+0x286/0x380 > ? fib6_table_lookup+0x95/0xf40 > ? exc_page_fault+0x5d/0x110 > ? asm_exc_page_fault+0x22/0x30 > ? ip6_pol_route+0x59/0x7a0 > ? unlink_anon_vmas+0x370/0x370 > fib6_rule_lookup+0x56/0x1b0 > ? update_blocked_averages+0x2c6/0x6a0 > ip6_route_output_flags+0xd2/0x130 > ip6_dst_lookup_tail+0x3b/0x220 > ip6_dst_lookup_flow+0x2c/0x80 > inet6_sk_rebuild_header+0x14c/0x1e0 > ? tcp_release_cb+0x150/0x150 > __tcp_retransmit_skb+0x68/0x6b0 > ? tcp_current_mss+0xca/0x150 > ? tcp_release_cb+0x150/0x150 > tcp_send_loss_probe+0x8e/0x220 > tcp_write_timer+0xbe/0x2d0 > run_timer_softirq+0x272/0x840 > ? hrtimer_interrupt+0x2c9/0x5f0 > ? sched_clock_cpu+0xc/0x170 > irq_exit_rcu+0x171/0x330 > sysvec_apic_timer_interrupt+0x6d/0x80 > </IRQ> > <TASK> > asm_sysvec_apic_timer_interrupt+0x16/0x20 > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > Inspecting the vmcore with drgn you can see why this is a NULL pointer deref > >>>> prog.crashed_thread().stack_trace()[0] > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in ip6_pol_route at net/ipv6/route.c:2212:40 > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > >>>> prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > (struct ipv6_devconf *)0x0 > > Looking at the socket you can see that it's been closed > >>>> decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_flags, prog.type('enum sock_flags')) > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' >>>> decode_enum_type_flags(1 << prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.value_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > 'TCPF_FIN_WAIT1' > > This occurs in our container setup where we have an NFS mount that > belongs to the containers network namespace. On container shutdown our > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > panic. In the kernel we're responsible for destroying our sockets when > the network namespace exits, or holding a reference on the network > namespace for our sockets so this doesn't happen. > > Even once we shutdown the socket we can still have TCP timers that fire > in the background, hence this panic. SUNRPC shuts down the socket and > throws away all knowledge of it, but it's still doing things in the > background. > > Fix this by grabbing a reference on the network namespace for any tcp > sockets we open. With this patch I'm able to cycle my 500 node stress > tier over and over again without panicing, whereas previously I was > losing 10-20 nodes every shutdown cycle. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > --- > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > index bb81050c870e..f02387751a94 100644 > --- a/net/sunrpc/xprtsock.c > +++ b/net/sunrpc/xprtsock.c > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) > > if (!transport->inet) { > struct sock *sk = sock->sk; > + struct net *net = sock_net(sk); > > /* Avoid temporary address, they are bad for long-lived > * connections such as NFS mounts. > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) > tcp_sock_set_nodelay(sk); > > lock_sock(sk); > + /* > + * Because timers can fire after the fact we need to hold a > + * reference on the netns for this socket. > + */ > + if (!sk->sk_net_refcnt) { > + if (!maybe_get_net(net)) { > + release_sock(sk); > + return -ENOTCONN; > + } > + /* > + * For kernel sockets we have a tracker put in place for > + * the tracing, we need to free this to maintaine > + * consistent tracking info. > + */ > + __netns_tracker_free(net, &sk->ns_tracker, false); > > + sk->sk_net_refcnt = 1; > + netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL); > + sock_inuse_add(net, 1); > + } > xs_save_old_callbacks(transport, sk); > > sk->sk_user_data = xprt; > -- > 2.43.0 > Hi Josef- This is an RPC client side fix. Can you resend to Trond and Anna? -- Chuck Lever ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-19 19:49 [PATCH] sunrpc: hold a ref on netns for tcp sockets Josef Bacik 2024-03-19 19:53 ` Chuck Lever III @ 2024-03-19 20:07 ` Josef Bacik 2024-03-19 21:59 ` Trond Myklebust 2 siblings, 0 replies; 10+ messages in thread From: Josef Bacik @ 2024-03-19 20:07 UTC (permalink / raw) To: trond.myklebust, anna, linux-nfs, edumazet, kuba, kernel-team We've been seeing variations of the following panic in production BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:ip6_pol_route+0x59/0x7a0 Call Trace: <IRQ> ? __die+0x78/0xc0 ? page_fault_oops+0x286/0x380 ? fib6_table_lookup+0x95/0xf40 ? exc_page_fault+0x5d/0x110 ? asm_exc_page_fault+0x22/0x30 ? ip6_pol_route+0x59/0x7a0 ? unlink_anon_vmas+0x370/0x370 fib6_rule_lookup+0x56/0x1b0 ? update_blocked_averages+0x2c6/0x6a0 ip6_route_output_flags+0xd2/0x130 ip6_dst_lookup_tail+0x3b/0x220 ip6_dst_lookup_flow+0x2c/0x80 inet6_sk_rebuild_header+0x14c/0x1e0 ? tcp_release_cb+0x150/0x150 __tcp_retransmit_skb+0x68/0x6b0 ? tcp_current_mss+0xca/0x150 ? tcp_release_cb+0x150/0x150 tcp_send_loss_probe+0x8e/0x220 tcp_write_timer+0xbe/0x2d0 run_timer_softirq+0x272/0x840 ? hrtimer_interrupt+0x2c9/0x5f0 ? sched_clock_cpu+0xc/0x170 irq_exit_rcu+0x171/0x330 sysvec_apic_timer_interrupt+0x6d/0x80 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x16/0x20 RIP: 0010:cpuidle_enter_state+0xe7/0x243 Inspecting the vmcore with drgn you can see why this is a NULL pointer deref >>> prog.crashed_thread().stack_trace()[0] #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in ip6_pol_route at net/ipv6/route.c:2212:40 2212 if (net->ipv6.devconf_all->forwarding == 0) 2213 strict |= RT6_LOOKUP_F_REACHABLE; >>> prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all (struct ipv6_devconf *)0x0 Looking at the socket you can see that it's been closed >>> decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_flags, prog.type('enum sock_flags')) 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' >>> decode_enum_type_flags(1 << prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.value_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) 'TCPF_FIN_WAIT1' This occurs in our container setup where we have an NFS mount that belongs to the containers network namespace. On container shutdown our netns goes away, which sets net->ipv6.defconf_all = NULL, and then we panic. In the kernel we're responsible for destroying our sockets when the network namespace exits, or holding a reference on the network namespace for our sockets so this doesn't happen. Even once we shutdown the socket we can still have TCP timers that fire in the background, hence this panic. SUNRPC shuts down the socket and throws away all knowledge of it, but it's still doing things in the background. Fix this by grabbing a reference on the network namespace for any tcp sockets we open. With this patch I'm able to cycle my 500 node stress tier over and over again without panicing, whereas previously I was losing 10-20 nodes every shutdown cycle. Signed-off-by: Josef Bacik <josef@toxicpanda.com> --- Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't realize there was a division of the client and server side of SUNRPC. net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index bb81050c870e..f02387751a94 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) if (!transport->inet) { struct sock *sk = sock->sk; + struct net *net = sock_net(sk); /* Avoid temporary address, they are bad for long-lived * connections such as NFS mounts. @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) tcp_sock_set_nodelay(sk); lock_sock(sk); + /* + * Because timers can fire after the fact we need to hold a + * reference on the netns for this socket. + */ + if (!sk->sk_net_refcnt) { + if (!maybe_get_net(net)) { + release_sock(sk); + return -ENOTCONN; + } + /* + * For kernel sockets we have a tracker put in place for + * the tracing, we need to free this to maintaine + * consistent tracking info. + */ + __netns_tracker_free(net, &sk->ns_tracker, false); + sk->sk_net_refcnt = 1; + netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL); + sock_inuse_add(net, 1); + } xs_save_old_callbacks(transport, sk); sk->sk_user_data = xprt; -- 2.43.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-19 19:49 [PATCH] sunrpc: hold a ref on netns for tcp sockets Josef Bacik 2024-03-19 19:53 ` Chuck Lever III 2024-03-19 20:07 ` [PATCH][RESEND] " Josef Bacik @ 2024-03-19 21:59 ` Trond Myklebust 2024-03-20 14:10 ` Josef Bacik 2 siblings, 1 reply; 10+ messages in thread From: Trond Myklebust @ 2024-03-19 21:59 UTC (permalink / raw) To: anna@kernel.org, kernel-team@fb.com, josef@toxicpanda.com, linux-nfs@vger.kernel.org, edumazet@google.com, kuba@kernel.org On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > We've been seeing variations of the following panic in production > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > RIP: 0010:ip6_pol_route+0x59/0x7a0 > Call Trace: > <IRQ> > ? __die+0x78/0xc0 > ? page_fault_oops+0x286/0x380 > ? fib6_table_lookup+0x95/0xf40 > ? exc_page_fault+0x5d/0x110 > ? asm_exc_page_fault+0x22/0x30 > ? ip6_pol_route+0x59/0x7a0 > ? unlink_anon_vmas+0x370/0x370 > fib6_rule_lookup+0x56/0x1b0 > ? update_blocked_averages+0x2c6/0x6a0 > ip6_route_output_flags+0xd2/0x130 > ip6_dst_lookup_tail+0x3b/0x220 > ip6_dst_lookup_flow+0x2c/0x80 > inet6_sk_rebuild_header+0x14c/0x1e0 > ? tcp_release_cb+0x150/0x150 > __tcp_retransmit_skb+0x68/0x6b0 > ? tcp_current_mss+0xca/0x150 > ? tcp_release_cb+0x150/0x150 > tcp_send_loss_probe+0x8e/0x220 > tcp_write_timer+0xbe/0x2d0 > run_timer_softirq+0x272/0x840 > ? hrtimer_interrupt+0x2c9/0x5f0 > ? sched_clock_cpu+0xc/0x170 > irq_exit_rcu+0x171/0x330 > sysvec_apic_timer_interrupt+0x6d/0x80 > </IRQ> > <TASK> > asm_sysvec_apic_timer_interrupt+0x16/0x20 > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > Inspecting the vmcore with drgn you can see why this is a NULL > pointer deref > > >>> prog.crashed_thread().stack_trace()[0] > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > ip6_pol_route at net/ipv6/route.c:2212:40 > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > >>> > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > (struct ipv6_devconf *)0x0 > > Looking at the socket you can see that it's been closed > > >>> > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > __sk_common.skc_flags, prog.type('enum sock_flags')) > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > >>> decode_enum_type_flags(1 << > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > 'TCPF_FIN_WAIT1' > > This occurs in our container setup where we have an NFS mount that > belongs to the containers network namespace. On container shutdown > our > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > panic. In the kernel we're responsible for destroying our sockets > when > the network namespace exits, or holding a reference on the network > namespace for our sockets so this doesn't happen. > > Even once we shutdown the socket we can still have TCP timers that > fire > in the background, hence this panic. SUNRPC shuts down the socket > and > throws away all knowledge of it, but it's still doing things in the > background. > > Fix this by grabbing a reference on the network namespace for any tcp > sockets we open. With this patch I'm able to cycle my 500 node > stress > tier over and over again without panicing, whereas previously I was > losing 10-20 nodes every shutdown cycle. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > --- > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > realize there was > a division of the client and server side of SUNRPC. > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > index bb81050c870e..f02387751a94 100644 > --- a/net/sunrpc/xprtsock.c > +++ b/net/sunrpc/xprtsock.c > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > rpc_xprt *xprt, struct socket *sock) > > if (!transport->inet) { > struct sock *sk = sock->sk; > + struct net *net = sock_net(sk); > > /* Avoid temporary address, they are bad for long- > lived > * connections such as NFS mounts. > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > rpc_xprt *xprt, struct socket *sock) > tcp_sock_set_nodelay(sk); > > lock_sock(sk); > + /* > + * Because timers can fire after the fact we need to > hold a > + * reference on the netns for this socket. > + */ > + if (!sk->sk_net_refcnt) { > + if (!maybe_get_net(net)) { > + release_sock(sk); > + return -ENOTCONN; > + } > + /* > + * For kernel sockets we have a tracker put > in place for > + * the tracing, we need to free this to > maintaine > + * consistent tracking info. > + */ > + __netns_tracker_free(net, &sk->ns_tracker, > false); > > + sk->sk_net_refcnt = 1; > + netns_tracker_alloc(net, &sk->ns_tracker, > GFP_KERNEL); > + sock_inuse_add(net, 1); > + } > xs_save_old_callbacks(transport, sk); > > sk->sk_user_data = xprt; Hmm... Doesn't this end up being more or less equivalent to calling __sock_create() with the kernel flag being set to 0? -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-19 21:59 ` Trond Myklebust @ 2024-03-20 14:10 ` Josef Bacik 2024-03-20 14:28 ` Eric Dumazet 0 siblings, 1 reply; 10+ messages in thread From: Josef Bacik @ 2024-03-20 14:10 UTC (permalink / raw) To: Trond Myklebust Cc: anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, edumazet@google.com, kuba@kernel.org On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > We've been seeing variations of the following panic in production > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > Call Trace: > > <IRQ> > > ? __die+0x78/0xc0 > > ? page_fault_oops+0x286/0x380 > > ? fib6_table_lookup+0x95/0xf40 > > ? exc_page_fault+0x5d/0x110 > > ? asm_exc_page_fault+0x22/0x30 > > ? ip6_pol_route+0x59/0x7a0 > > ? unlink_anon_vmas+0x370/0x370 > > fib6_rule_lookup+0x56/0x1b0 > > ? update_blocked_averages+0x2c6/0x6a0 > > ip6_route_output_flags+0xd2/0x130 > > ip6_dst_lookup_tail+0x3b/0x220 > > ip6_dst_lookup_flow+0x2c/0x80 > > inet6_sk_rebuild_header+0x14c/0x1e0 > > ? tcp_release_cb+0x150/0x150 > > __tcp_retransmit_skb+0x68/0x6b0 > > ? tcp_current_mss+0xca/0x150 > > ? tcp_release_cb+0x150/0x150 > > tcp_send_loss_probe+0x8e/0x220 > > tcp_write_timer+0xbe/0x2d0 > > run_timer_softirq+0x272/0x840 > > ? hrtimer_interrupt+0x2c9/0x5f0 > > ? sched_clock_cpu+0xc/0x170 > > irq_exit_rcu+0x171/0x330 > > sysvec_apic_timer_interrupt+0x6d/0x80 > > </IRQ> > > <TASK> > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > pointer deref > > > > >>> prog.crashed_thread().stack_trace()[0] > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > >>> > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > (struct ipv6_devconf *)0x0 > > > > Looking at the socket you can see that it's been closed > > > > >>> > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > >>> decode_enum_type_flags(1 << > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > 'TCPF_FIN_WAIT1' > > > > This occurs in our container setup where we have an NFS mount that > > belongs to the containers network namespace. On container shutdown > > our > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > panic. In the kernel we're responsible for destroying our sockets > > when > > the network namespace exits, or holding a reference on the network > > namespace for our sockets so this doesn't happen. > > > > Even once we shutdown the socket we can still have TCP timers that > > fire > > in the background, hence this panic. SUNRPC shuts down the socket > > and > > throws away all knowledge of it, but it's still doing things in the > > background. > > > > Fix this by grabbing a reference on the network namespace for any tcp > > sockets we open. With this patch I'm able to cycle my 500 node > > stress > > tier over and over again without panicing, whereas previously I was > > losing 10-20 nodes every shutdown cycle. > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > --- > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > realize there was > > a division of the client and server side of SUNRPC. > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > 1 file changed, 20 insertions(+) > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > index bb81050c870e..f02387751a94 100644 > > --- a/net/sunrpc/xprtsock.c > > +++ b/net/sunrpc/xprtsock.c > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > rpc_xprt *xprt, struct socket *sock) > > > > if (!transport->inet) { > > struct sock *sk = sock->sk; > > + struct net *net = sock_net(sk); > > > > /* Avoid temporary address, they are bad for long- > > lived > > * connections such as NFS mounts. > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > rpc_xprt *xprt, struct socket *sock) > > tcp_sock_set_nodelay(sk); > > > > lock_sock(sk); > > + /* > > + * Because timers can fire after the fact we need to > > hold a > > + * reference on the netns for this socket. > > + */ > > + if (!sk->sk_net_refcnt) { > > + if (!maybe_get_net(net)) { > > + release_sock(sk); > > + return -ENOTCONN; > > + } > > + /* > > + * For kernel sockets we have a tracker put > > in place for > > + * the tracing, we need to free this to > > maintaine > > + * consistent tracking info. > > + */ > > + __netns_tracker_free(net, &sk->ns_tracker, > > false); > > > > + sk->sk_net_refcnt = 1; > > + netns_tracker_alloc(net, &sk->ns_tracker, > > GFP_KERNEL); > > + sock_inuse_add(net, 1); > > + } > > xs_save_old_callbacks(transport, sk); > > > > sk->sk_user_data = xprt; > > Hmm... Doesn't this end up being more or less equivalent to calling > __sock_create() with the kernel flag being set to 0? AFAICT yes, but there are a lot of other things that happen with kern being set to 1, so I think this is a safer bet, and is analagous to this other fix 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, Josef ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-20 14:10 ` Josef Bacik @ 2024-03-20 14:28 ` Eric Dumazet 2024-03-20 14:56 ` Josef Bacik 0 siblings, 1 reply; 10+ messages in thread From: Eric Dumazet @ 2024-03-20 14:28 UTC (permalink / raw) To: Josef Bacik Cc: Trond Myklebust, anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, kuba@kernel.org On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > > We've been seeing variations of the following panic in production > > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > > Call Trace: > > > <IRQ> > > > ? __die+0x78/0xc0 > > > ? page_fault_oops+0x286/0x380 > > > ? fib6_table_lookup+0x95/0xf40 > > > ? exc_page_fault+0x5d/0x110 > > > ? asm_exc_page_fault+0x22/0x30 > > > ? ip6_pol_route+0x59/0x7a0 > > > ? unlink_anon_vmas+0x370/0x370 > > > fib6_rule_lookup+0x56/0x1b0 > > > ? update_blocked_averages+0x2c6/0x6a0 > > > ip6_route_output_flags+0xd2/0x130 > > > ip6_dst_lookup_tail+0x3b/0x220 > > > ip6_dst_lookup_flow+0x2c/0x80 > > > inet6_sk_rebuild_header+0x14c/0x1e0 > > > ? tcp_release_cb+0x150/0x150 > > > __tcp_retransmit_skb+0x68/0x6b0 > > > ? tcp_current_mss+0xca/0x150 > > > ? tcp_release_cb+0x150/0x150 > > > tcp_send_loss_probe+0x8e/0x220 > > > tcp_write_timer+0xbe/0x2d0 > > > run_timer_softirq+0x272/0x840 > > > ? hrtimer_interrupt+0x2c9/0x5f0 > > > ? sched_clock_cpu+0xc/0x170 > > > irq_exit_rcu+0x171/0x330 > > > sysvec_apic_timer_interrupt+0x6d/0x80 > > > </IRQ> > > > <TASK> > > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > > pointer deref > > > > > > >>> prog.crashed_thread().stack_trace()[0] > > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > > > >>> > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > > (struct ipv6_devconf *)0x0 > > > > > > Looking at the socket you can see that it's been closed > > > > > > >>> > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > > >>> decode_enum_type_flags(1 << > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > > 'TCPF_FIN_WAIT1' > > > > > > This occurs in our container setup where we have an NFS mount that > > > belongs to the containers network namespace. On container shutdown > > > our > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > > panic. In the kernel we're responsible for destroying our sockets > > > when > > > the network namespace exits, or holding a reference on the network > > > namespace for our sockets so this doesn't happen. > > > > > > Even once we shutdown the socket we can still have TCP timers that > > > fire > > > in the background, hence this panic. SUNRPC shuts down the socket > > > and > > > throws away all knowledge of it, but it's still doing things in the > > > background. > > > > > > Fix this by grabbing a reference on the network namespace for any tcp > > > sockets we open. With this patch I'm able to cycle my 500 node > > > stress > > > tier over and over again without panicing, whereas previously I was > > > losing 10-20 nodes every shutdown cycle. > > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > --- > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > > realize there was > > > a division of the client and server side of SUNRPC. > > > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > > 1 file changed, 20 insertions(+) > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > index bb81050c870e..f02387751a94 100644 > > > --- a/net/sunrpc/xprtsock.c > > > +++ b/net/sunrpc/xprtsock.c > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > > rpc_xprt *xprt, struct socket *sock) > > > > > > if (!transport->inet) { > > > struct sock *sk = sock->sk; > > > + struct net *net = sock_net(sk); > > > > > > /* Avoid temporary address, they are bad for long- > > > lived > > > * connections such as NFS mounts. > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > > rpc_xprt *xprt, struct socket *sock) > > > tcp_sock_set_nodelay(sk); > > > > > > lock_sock(sk); > > > + /* > > > + * Because timers can fire after the fact we need to > > > hold a > > > + * reference on the netns for this socket. > > > + */ > > > + if (!sk->sk_net_refcnt) { > > > + if (!maybe_get_net(net)) { > > > + release_sock(sk); > > > + return -ENOTCONN; > > > + } > > > + /* > > > + * For kernel sockets we have a tracker put > > > in place for > > > + * the tracing, we need to free this to > > > maintaine > > > + * consistent tracking info. > > > + */ > > > + __netns_tracker_free(net, &sk->ns_tracker, > > > false); > > > > > > + sk->sk_net_refcnt = 1; > > > + netns_tracker_alloc(net, &sk->ns_tracker, > > > GFP_KERNEL); > > > + sock_inuse_add(net, 1); > > > + } > > > xs_save_old_callbacks(transport, sk); > > > > > > sk->sk_user_data = xprt; > > > > Hmm... Doesn't this end up being more or less equivalent to calling > > __sock_create() with the kernel flag being set to 0? > > AFAICT yes, but there are a lot of other things that happen with kern being set > to 1, so I think this is a safer bet, and is analagous to this other fix > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, > Hmm... this would prevent a netns with one or more TCP flows owned by this layer to be dismantled, even if all other processes/sockets disappeared ? Have you looked at my suggestion instead ? https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ I never formally submitted this patch because I got no feedback. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-20 14:28 ` Eric Dumazet @ 2024-03-20 14:56 ` Josef Bacik 2024-03-20 15:00 ` Eric Dumazet 0 siblings, 1 reply; 10+ messages in thread From: Josef Bacik @ 2024-03-20 14:56 UTC (permalink / raw) To: Eric Dumazet Cc: Trond Myklebust, anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, kuba@kernel.org On Wed, Mar 20, 2024 at 03:28:15PM +0100, Eric Dumazet wrote: > On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > > > We've been seeing variations of the following panic in production > > > > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > > > Call Trace: > > > > <IRQ> > > > > ? __die+0x78/0xc0 > > > > ? page_fault_oops+0x286/0x380 > > > > ? fib6_table_lookup+0x95/0xf40 > > > > ? exc_page_fault+0x5d/0x110 > > > > ? asm_exc_page_fault+0x22/0x30 > > > > ? ip6_pol_route+0x59/0x7a0 > > > > ? unlink_anon_vmas+0x370/0x370 > > > > fib6_rule_lookup+0x56/0x1b0 > > > > ? update_blocked_averages+0x2c6/0x6a0 > > > > ip6_route_output_flags+0xd2/0x130 > > > > ip6_dst_lookup_tail+0x3b/0x220 > > > > ip6_dst_lookup_flow+0x2c/0x80 > > > > inet6_sk_rebuild_header+0x14c/0x1e0 > > > > ? tcp_release_cb+0x150/0x150 > > > > __tcp_retransmit_skb+0x68/0x6b0 > > > > ? tcp_current_mss+0xca/0x150 > > > > ? tcp_release_cb+0x150/0x150 > > > > tcp_send_loss_probe+0x8e/0x220 > > > > tcp_write_timer+0xbe/0x2d0 > > > > run_timer_softirq+0x272/0x840 > > > > ? hrtimer_interrupt+0x2c9/0x5f0 > > > > ? sched_clock_cpu+0xc/0x170 > > > > irq_exit_rcu+0x171/0x330 > > > > sysvec_apic_timer_interrupt+0x6d/0x80 > > > > </IRQ> > > > > <TASK> > > > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > > > pointer deref > > > > > > > > >>> prog.crashed_thread().stack_trace()[0] > > > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > > > > > >>> > > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > > > (struct ipv6_devconf *)0x0 > > > > > > > > Looking at the socket you can see that it's been closed > > > > > > > > >>> > > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > > > >>> decode_enum_type_flags(1 << > > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > > > 'TCPF_FIN_WAIT1' > > > > > > > > This occurs in our container setup where we have an NFS mount that > > > > belongs to the containers network namespace. On container shutdown > > > > our > > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > > > panic. In the kernel we're responsible for destroying our sockets > > > > when > > > > the network namespace exits, or holding a reference on the network > > > > namespace for our sockets so this doesn't happen. > > > > > > > > Even once we shutdown the socket we can still have TCP timers that > > > > fire > > > > in the background, hence this panic. SUNRPC shuts down the socket > > > > and > > > > throws away all knowledge of it, but it's still doing things in the > > > > background. > > > > > > > > Fix this by grabbing a reference on the network namespace for any tcp > > > > sockets we open. With this patch I'm able to cycle my 500 node > > > > stress > > > > tier over and over again without panicing, whereas previously I was > > > > losing 10-20 nodes every shutdown cycle. > > > > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > > --- > > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > > > realize there was > > > > a division of the client and server side of SUNRPC. > > > > > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > > > 1 file changed, 20 insertions(+) > > > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > > index bb81050c870e..f02387751a94 100644 > > > > --- a/net/sunrpc/xprtsock.c > > > > +++ b/net/sunrpc/xprtsock.c > > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > > > if (!transport->inet) { > > > > struct sock *sk = sock->sk; > > > > + struct net *net = sock_net(sk); > > > > > > > > /* Avoid temporary address, they are bad for long- > > > > lived > > > > * connections such as NFS mounts. > > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > > > rpc_xprt *xprt, struct socket *sock) > > > > tcp_sock_set_nodelay(sk); > > > > > > > > lock_sock(sk); > > > > + /* > > > > + * Because timers can fire after the fact we need to > > > > hold a > > > > + * reference on the netns for this socket. > > > > + */ > > > > + if (!sk->sk_net_refcnt) { > > > > + if (!maybe_get_net(net)) { > > > > + release_sock(sk); > > > > + return -ENOTCONN; > > > > + } > > > > + /* > > > > + * For kernel sockets we have a tracker put > > > > in place for > > > > + * the tracing, we need to free this to > > > > maintaine > > > > + * consistent tracking info. > > > > + */ > > > > + __netns_tracker_free(net, &sk->ns_tracker, > > > > false); > > > > > > > > + sk->sk_net_refcnt = 1; > > > > + netns_tracker_alloc(net, &sk->ns_tracker, > > > > GFP_KERNEL); > > > > + sock_inuse_add(net, 1); > > > > + } > > > > xs_save_old_callbacks(transport, sk); > > > > > > > > sk->sk_user_data = xprt; > > > > > > Hmm... Doesn't this end up being more or less equivalent to calling > > > __sock_create() with the kernel flag being set to 0? > > > > AFAICT yes, but there are a lot of other things that happen with kern being set > > to 1, so I think this is a safer bet, and is analagous to this other fix > > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, > > > > Hmm... this would prevent a netns with one or more TCP flows owned by > this layer to be dismantled, > even if all other processes/sockets disappeared ? Yeah but if sockets are still in use then we want the netns to still be up right? I personally am very confused about how the lifetime stuff works for sockets, I don't understand how shutting down the socket means it gets to stick around after the fact forever, but feels like if it's tied to a netns then it's completely valid to hold the netns open until we're done with the socket. > > Have you looked at my suggestion instead ? > > https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > I never formally submitted this patch because I got no feedback. I did something similar, tho not with _sync so maybe that was the problem, but this is what I did originally in production before I emailed you the first time diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 40a2aa9fd00e..73ae59a5a488 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2827,6 +2827,8 @@ void tcp_shutdown(struct sock *sk, int how) if (!(how & SEND_SHUTDOWN)) return; + tcp_clear_xmit_timers(sk); + /* If we've already sent a FIN, or it's a closed state, skip this. */ if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_SYN_SENT | But I still hit the panic. I followed this up with a patch where I set a special flag that I set on the socket _after_ i called kernel_sock_shutdown(), and then I added a WARN_ON() to the icsk_retransmit_timer re-arm stuff if my flag was set, and this fired alllll the time. Stopping the timer doesn't help, because after we close the socket we'll still get some incoming ACK that'll re-arm the timer and then we're in the same position we were originally. I'm happy to be wrong here, because this is clearly not my area of expertise, but I wandered down this road and it wasn't the right one. Thanks, Josef ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-20 14:56 ` Josef Bacik @ 2024-03-20 15:00 ` Eric Dumazet 2024-03-20 15:35 ` Josef Bacik 2024-03-21 18:49 ` Josef Bacik 0 siblings, 2 replies; 10+ messages in thread From: Eric Dumazet @ 2024-03-20 15:00 UTC (permalink / raw) To: Josef Bacik Cc: Trond Myklebust, anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, kuba@kernel.org On Wed, Mar 20, 2024 at 3:56 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Wed, Mar 20, 2024 at 03:28:15PM +0100, Eric Dumazet wrote: > > On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > > > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > > > > We've been seeing variations of the following panic in production > > > > > > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > > > > Call Trace: > > > > > <IRQ> > > > > > ? __die+0x78/0xc0 > > > > > ? page_fault_oops+0x286/0x380 > > > > > ? fib6_table_lookup+0x95/0xf40 > > > > > ? exc_page_fault+0x5d/0x110 > > > > > ? asm_exc_page_fault+0x22/0x30 > > > > > ? ip6_pol_route+0x59/0x7a0 > > > > > ? unlink_anon_vmas+0x370/0x370 > > > > > fib6_rule_lookup+0x56/0x1b0 > > > > > ? update_blocked_averages+0x2c6/0x6a0 > > > > > ip6_route_output_flags+0xd2/0x130 > > > > > ip6_dst_lookup_tail+0x3b/0x220 > > > > > ip6_dst_lookup_flow+0x2c/0x80 > > > > > inet6_sk_rebuild_header+0x14c/0x1e0 > > > > > ? tcp_release_cb+0x150/0x150 > > > > > __tcp_retransmit_skb+0x68/0x6b0 > > > > > ? tcp_current_mss+0xca/0x150 > > > > > ? tcp_release_cb+0x150/0x150 > > > > > tcp_send_loss_probe+0x8e/0x220 > > > > > tcp_write_timer+0xbe/0x2d0 > > > > > run_timer_softirq+0x272/0x840 > > > > > ? hrtimer_interrupt+0x2c9/0x5f0 > > > > > ? sched_clock_cpu+0xc/0x170 > > > > > irq_exit_rcu+0x171/0x330 > > > > > sysvec_apic_timer_interrupt+0x6d/0x80 > > > > > </IRQ> > > > > > <TASK> > > > > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > > > > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > > > > pointer deref > > > > > > > > > > >>> prog.crashed_thread().stack_trace()[0] > > > > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > > > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > > > > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > > > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > > > > > > > >>> > > > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > > > > (struct ipv6_devconf *)0x0 > > > > > > > > > > Looking at the socket you can see that it's been closed > > > > > > > > > > >>> > > > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > > > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > > > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > > > > >>> decode_enum_type_flags(1 << > > > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > > > > 'TCPF_FIN_WAIT1' > > > > > > > > > > This occurs in our container setup where we have an NFS mount that > > > > > belongs to the containers network namespace. On container shutdown > > > > > our > > > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > > > > panic. In the kernel we're responsible for destroying our sockets > > > > > when > > > > > the network namespace exits, or holding a reference on the network > > > > > namespace for our sockets so this doesn't happen. > > > > > > > > > > Even once we shutdown the socket we can still have TCP timers that > > > > > fire > > > > > in the background, hence this panic. SUNRPC shuts down the socket > > > > > and > > > > > throws away all knowledge of it, but it's still doing things in the > > > > > background. > > > > > > > > > > Fix this by grabbing a reference on the network namespace for any tcp > > > > > sockets we open. With this patch I'm able to cycle my 500 node > > > > > stress > > > > > tier over and over again without panicing, whereas previously I was > > > > > losing 10-20 nodes every shutdown cycle. > > > > > > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > > > --- > > > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > > > > realize there was > > > > > a division of the client and server side of SUNRPC. > > > > > > > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > > > > 1 file changed, 20 insertions(+) > > > > > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > > > index bb81050c870e..f02387751a94 100644 > > > > > --- a/net/sunrpc/xprtsock.c > > > > > +++ b/net/sunrpc/xprtsock.c > > > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > > > > > if (!transport->inet) { > > > > > struct sock *sk = sock->sk; > > > > > + struct net *net = sock_net(sk); > > > > > > > > > > /* Avoid temporary address, they are bad for long- > > > > > lived > > > > > * connections such as NFS mounts. > > > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > tcp_sock_set_nodelay(sk); > > > > > > > > > > lock_sock(sk); > > > > > + /* > > > > > + * Because timers can fire after the fact we need to > > > > > hold a > > > > > + * reference on the netns for this socket. > > > > > + */ > > > > > + if (!sk->sk_net_refcnt) { > > > > > + if (!maybe_get_net(net)) { > > > > > + release_sock(sk); > > > > > + return -ENOTCONN; > > > > > + } > > > > > + /* > > > > > + * For kernel sockets we have a tracker put > > > > > in place for > > > > > + * the tracing, we need to free this to > > > > > maintaine > > > > > + * consistent tracking info. > > > > > + */ > > > > > + __netns_tracker_free(net, &sk->ns_tracker, > > > > > false); > > > > > > > > > > + sk->sk_net_refcnt = 1; > > > > > + netns_tracker_alloc(net, &sk->ns_tracker, > > > > > GFP_KERNEL); > > > > > + sock_inuse_add(net, 1); > > > > > + } > > > > > xs_save_old_callbacks(transport, sk); > > > > > > > > > > sk->sk_user_data = xprt; > > > > > > > > Hmm... Doesn't this end up being more or less equivalent to calling > > > > __sock_create() with the kernel flag being set to 0? > > > > > > AFAICT yes, but there are a lot of other things that happen with kern being set > > > to 1, so I think this is a safer bet, and is analagous to this other fix > > > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, > > > > > > > Hmm... this would prevent a netns with one or more TCP flows owned by > > this layer to be dismantled, > > even if all other processes/sockets disappeared ? > > Yeah but if sockets are still in use then we want the netns to still be up > right? I personally am very confused about how the lifetime stuff works for > sockets, I don't understand how shutting down the socket means it gets to stick > around after the fact forever, but feels like if it's tied to a netns then it's > completely valid to hold the netns open until we're done with the socket. > > > > > Have you looked at my suggestion instead ? > > > > https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > I never formally submitted this patch because I got no feedback. > > I did something similar, tho not with _sync so maybe that was the problem, but > this is what I did originally in production before I emailed you the first time The _sync part is mandatory really for this context. Not that it needs to be done while the socket is not locked, or risk a deadlock. Note that modern trees have timer_shutdown_sync() which might even be better. > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 40a2aa9fd00e..73ae59a5a488 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2827,6 +2827,8 @@ void tcp_shutdown(struct sock *sk, int how) > if (!(how & SEND_SHUTDOWN)) > return; > > + tcp_clear_xmit_timers(sk); > + > /* If we've already sent a FIN, or it's a closed state, skip this. */ > if ((1 << sk->sk_state) & > (TCPF_ESTABLISHED | TCPF_SYN_SENT | > > But I still hit the panic. I followed this up with a patch where I set a > special flag that I set on the socket _after_ i called kernel_sock_shutdown(), > and then I added a WARN_ON() to the icsk_retransmit_timer re-arm stuff if my > flag was set, and this fired alllll the time. Stopping the timer doesn't help, > because after we close the socket we'll still get some incoming ACK that'll > re-arm the timer and then we're in the same position we were originally. timer_shutdown() to the rescue perhaps. > > I'm happy to be wrong here, because this is clearly not my area of expertise, > but I wandered down this road and it wasn't the right one. Thanks, > > Josef ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-20 15:00 ` Eric Dumazet @ 2024-03-20 15:35 ` Josef Bacik 2024-03-21 18:49 ` Josef Bacik 1 sibling, 0 replies; 10+ messages in thread From: Josef Bacik @ 2024-03-20 15:35 UTC (permalink / raw) To: Eric Dumazet Cc: Trond Myklebust, anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, kuba@kernel.org On Wed, Mar 20, 2024 at 04:00:09PM +0100, Eric Dumazet wrote: > On Wed, Mar 20, 2024 at 3:56 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Wed, Mar 20, 2024 at 03:28:15PM +0100, Eric Dumazet wrote: > > > On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > > > > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > > > > > We've been seeing variations of the following panic in production > > > > > > > > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > > > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > > > > > Call Trace: > > > > > > <IRQ> > > > > > > ? __die+0x78/0xc0 > > > > > > ? page_fault_oops+0x286/0x380 > > > > > > ? fib6_table_lookup+0x95/0xf40 > > > > > > ? exc_page_fault+0x5d/0x110 > > > > > > ? asm_exc_page_fault+0x22/0x30 > > > > > > ? ip6_pol_route+0x59/0x7a0 > > > > > > ? unlink_anon_vmas+0x370/0x370 > > > > > > fib6_rule_lookup+0x56/0x1b0 > > > > > > ? update_blocked_averages+0x2c6/0x6a0 > > > > > > ip6_route_output_flags+0xd2/0x130 > > > > > > ip6_dst_lookup_tail+0x3b/0x220 > > > > > > ip6_dst_lookup_flow+0x2c/0x80 > > > > > > inet6_sk_rebuild_header+0x14c/0x1e0 > > > > > > ? tcp_release_cb+0x150/0x150 > > > > > > __tcp_retransmit_skb+0x68/0x6b0 > > > > > > ? tcp_current_mss+0xca/0x150 > > > > > > ? tcp_release_cb+0x150/0x150 > > > > > > tcp_send_loss_probe+0x8e/0x220 > > > > > > tcp_write_timer+0xbe/0x2d0 > > > > > > run_timer_softirq+0x272/0x840 > > > > > > ? hrtimer_interrupt+0x2c9/0x5f0 > > > > > > ? sched_clock_cpu+0xc/0x170 > > > > > > irq_exit_rcu+0x171/0x330 > > > > > > sysvec_apic_timer_interrupt+0x6d/0x80 > > > > > > </IRQ> > > > > > > <TASK> > > > > > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > > > > > > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > > > > > pointer deref > > > > > > > > > > > > >>> prog.crashed_thread().stack_trace()[0] > > > > > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > > > > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > > > > > > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > > > > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > > > > > > > > > >>> > > > > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > > > > > (struct ipv6_devconf *)0x0 > > > > > > > > > > > > Looking at the socket you can see that it's been closed > > > > > > > > > > > > >>> > > > > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > > > > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > > > > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > > > > > >>> decode_enum_type_flags(1 << > > > > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > > > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > > > > > 'TCPF_FIN_WAIT1' > > > > > > > > > > > > This occurs in our container setup where we have an NFS mount that > > > > > > belongs to the containers network namespace. On container shutdown > > > > > > our > > > > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > > > > > panic. In the kernel we're responsible for destroying our sockets > > > > > > when > > > > > > the network namespace exits, or holding a reference on the network > > > > > > namespace for our sockets so this doesn't happen. > > > > > > > > > > > > Even once we shutdown the socket we can still have TCP timers that > > > > > > fire > > > > > > in the background, hence this panic. SUNRPC shuts down the socket > > > > > > and > > > > > > throws away all knowledge of it, but it's still doing things in the > > > > > > background. > > > > > > > > > > > > Fix this by grabbing a reference on the network namespace for any tcp > > > > > > sockets we open. With this patch I'm able to cycle my 500 node > > > > > > stress > > > > > > tier over and over again without panicing, whereas previously I was > > > > > > losing 10-20 nodes every shutdown cycle. > > > > > > > > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > > > > --- > > > > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > > > > > realize there was > > > > > > a division of the client and server side of SUNRPC. > > > > > > > > > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > > > > > 1 file changed, 20 insertions(+) > > > > > > > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > > > > index bb81050c870e..f02387751a94 100644 > > > > > > --- a/net/sunrpc/xprtsock.c > > > > > > +++ b/net/sunrpc/xprtsock.c > > > > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > > > > > > > if (!transport->inet) { > > > > > > struct sock *sk = sock->sk; > > > > > > + struct net *net = sock_net(sk); > > > > > > > > > > > > /* Avoid temporary address, they are bad for long- > > > > > > lived > > > > > > * connections such as NFS mounts. > > > > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > tcp_sock_set_nodelay(sk); > > > > > > > > > > > > lock_sock(sk); > > > > > > + /* > > > > > > + * Because timers can fire after the fact we need to > > > > > > hold a > > > > > > + * reference on the netns for this socket. > > > > > > + */ > > > > > > + if (!sk->sk_net_refcnt) { > > > > > > + if (!maybe_get_net(net)) { > > > > > > + release_sock(sk); > > > > > > + return -ENOTCONN; > > > > > > + } > > > > > > + /* > > > > > > + * For kernel sockets we have a tracker put > > > > > > in place for > > > > > > + * the tracing, we need to free this to > > > > > > maintaine > > > > > > + * consistent tracking info. > > > > > > + */ > > > > > > + __netns_tracker_free(net, &sk->ns_tracker, > > > > > > false); > > > > > > > > > > > > + sk->sk_net_refcnt = 1; > > > > > > + netns_tracker_alloc(net, &sk->ns_tracker, > > > > > > GFP_KERNEL); > > > > > > + sock_inuse_add(net, 1); > > > > > > + } > > > > > > xs_save_old_callbacks(transport, sk); > > > > > > > > > > > > sk->sk_user_data = xprt; > > > > > > > > > > Hmm... Doesn't this end up being more or less equivalent to calling > > > > > __sock_create() with the kernel flag being set to 0? > > > > > > > > AFAICT yes, but there are a lot of other things that happen with kern being set > > > > to 1, so I think this is a safer bet, and is analagous to this other fix > > > > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, > > > > > > > > > > Hmm... this would prevent a netns with one or more TCP flows owned by > > > this layer to be dismantled, > > > even if all other processes/sockets disappeared ? > > > > Yeah but if sockets are still in use then we want the netns to still be up > > right? I personally am very confused about how the lifetime stuff works for > > sockets, I don't understand how shutting down the socket means it gets to stick > > around after the fact forever, but feels like if it's tied to a netns then it's > > completely valid to hold the netns open until we're done with the socket. > > > > > > > > Have you looked at my suggestion instead ? > > > > > > https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > > > I never formally submitted this patch because I got no feedback. > > > > I did something similar, tho not with _sync so maybe that was the problem, but > > this is what I did originally in production before I emailed you the first time > > > The _sync part is mandatory really for this context. > > Not that it needs to be done while the socket is not locked, or risk a deadlock. > > Note that modern trees have timer_shutdown_sync() which might even be better. > Sounds good, I've reverted my patches and I've applied this patch, I should have results by the end of the day on wether or not it fixed the problem. Thanks, Josef ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets 2024-03-20 15:00 ` Eric Dumazet 2024-03-20 15:35 ` Josef Bacik @ 2024-03-21 18:49 ` Josef Bacik 1 sibling, 0 replies; 10+ messages in thread From: Josef Bacik @ 2024-03-21 18:49 UTC (permalink / raw) To: Eric Dumazet Cc: Trond Myklebust, anna@kernel.org, kernel-team@fb.com, linux-nfs@vger.kernel.org, kuba@kernel.org On Wed, Mar 20, 2024 at 04:00:09PM +0100, Eric Dumazet wrote: > On Wed, Mar 20, 2024 at 3:56 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Wed, Mar 20, 2024 at 03:28:15PM +0100, Eric Dumazet wrote: > > > On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote: > > > > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote: > > > > > > We've been seeing variations of the following panic in production > > > > > > > > > > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > > > > RIP: 0010:ip6_pol_route+0x59/0x7a0 > > > > > > Call Trace: > > > > > > <IRQ> > > > > > > ? __die+0x78/0xc0 > > > > > > ? page_fault_oops+0x286/0x380 > > > > > > ? fib6_table_lookup+0x95/0xf40 > > > > > > ? exc_page_fault+0x5d/0x110 > > > > > > ? asm_exc_page_fault+0x22/0x30 > > > > > > ? ip6_pol_route+0x59/0x7a0 > > > > > > ? unlink_anon_vmas+0x370/0x370 > > > > > > fib6_rule_lookup+0x56/0x1b0 > > > > > > ? update_blocked_averages+0x2c6/0x6a0 > > > > > > ip6_route_output_flags+0xd2/0x130 > > > > > > ip6_dst_lookup_tail+0x3b/0x220 > > > > > > ip6_dst_lookup_flow+0x2c/0x80 > > > > > > inet6_sk_rebuild_header+0x14c/0x1e0 > > > > > > ? tcp_release_cb+0x150/0x150 > > > > > > __tcp_retransmit_skb+0x68/0x6b0 > > > > > > ? tcp_current_mss+0xca/0x150 > > > > > > ? tcp_release_cb+0x150/0x150 > > > > > > tcp_send_loss_probe+0x8e/0x220 > > > > > > tcp_write_timer+0xbe/0x2d0 > > > > > > run_timer_softirq+0x272/0x840 > > > > > > ? hrtimer_interrupt+0x2c9/0x5f0 > > > > > > ? sched_clock_cpu+0xc/0x170 > > > > > > irq_exit_rcu+0x171/0x330 > > > > > > sysvec_apic_timer_interrupt+0x6d/0x80 > > > > > > </IRQ> > > > > > > <TASK> > > > > > > asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > > > RIP: 0010:cpuidle_enter_state+0xe7/0x243 > > > > > > > > > > > > Inspecting the vmcore with drgn you can see why this is a NULL > > > > > > pointer deref > > > > > > > > > > > > >>> prog.crashed_thread().stack_trace()[0] > > > > > > #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in > > > > > > ip6_pol_route at net/ipv6/route.c:2212:40 > > > > > > > > > > > > 2212 if (net->ipv6.devconf_all->forwarding == 0) > > > > > > 2213 strict |= RT6_LOOKUP_F_REACHABLE; > > > > > > > > > > > > >>> > > > > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all > > > > > > (struct ipv6_devconf *)0x0 > > > > > > > > > > > > Looking at the socket you can see that it's been closed > > > > > > > > > > > > >>> > > > > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk']. > > > > > > __sk_common.skc_flags, prog.type('enum sock_flags')) > > > > > > 'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE' > > > > > > >>> decode_enum_type_flags(1 << > > > > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v > > > > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False) > > > > > > 'TCPF_FIN_WAIT1' > > > > > > > > > > > > This occurs in our container setup where we have an NFS mount that > > > > > > belongs to the containers network namespace. On container shutdown > > > > > > our > > > > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we > > > > > > panic. In the kernel we're responsible for destroying our sockets > > > > > > when > > > > > > the network namespace exits, or holding a reference on the network > > > > > > namespace for our sockets so this doesn't happen. > > > > > > > > > > > > Even once we shutdown the socket we can still have TCP timers that > > > > > > fire > > > > > > in the background, hence this panic. SUNRPC shuts down the socket > > > > > > and > > > > > > throws away all knowledge of it, but it's still doing things in the > > > > > > background. > > > > > > > > > > > > Fix this by grabbing a reference on the network namespace for any tcp > > > > > > sockets we open. With this patch I'm able to cycle my 500 node > > > > > > stress > > > > > > tier over and over again without panicing, whereas previously I was > > > > > > losing 10-20 nodes every shutdown cycle. > > > > > > > > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > > > > --- > > > > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't > > > > > > realize there was > > > > > > a division of the client and server side of SUNRPC. > > > > > > > > > > > > net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++ > > > > > > 1 file changed, 20 insertions(+) > > > > > > > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > > > > index bb81050c870e..f02387751a94 100644 > > > > > > --- a/net/sunrpc/xprtsock.c > > > > > > +++ b/net/sunrpc/xprtsock.c > > > > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct > > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > > > > > > > if (!transport->inet) { > > > > > > struct sock *sk = sock->sk; > > > > > > + struct net *net = sock_net(sk); > > > > > > > > > > > > /* Avoid temporary address, they are bad for long- > > > > > > lived > > > > > > * connections such as NFS mounts. > > > > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct > > > > > > rpc_xprt *xprt, struct socket *sock) > > > > > > tcp_sock_set_nodelay(sk); > > > > > > > > > > > > lock_sock(sk); > > > > > > + /* > > > > > > + * Because timers can fire after the fact we need to > > > > > > hold a > > > > > > + * reference on the netns for this socket. > > > > > > + */ > > > > > > + if (!sk->sk_net_refcnt) { > > > > > > + if (!maybe_get_net(net)) { > > > > > > + release_sock(sk); > > > > > > + return -ENOTCONN; > > > > > > + } > > > > > > + /* > > > > > > + * For kernel sockets we have a tracker put > > > > > > in place for > > > > > > + * the tracing, we need to free this to > > > > > > maintaine > > > > > > + * consistent tracking info. > > > > > > + */ > > > > > > + __netns_tracker_free(net, &sk->ns_tracker, > > > > > > false); > > > > > > > > > > > > + sk->sk_net_refcnt = 1; > > > > > > + netns_tracker_alloc(net, &sk->ns_tracker, > > > > > > GFP_KERNEL); > > > > > > + sock_inuse_add(net, 1); > > > > > > + } > > > > > > xs_save_old_callbacks(transport, sk); > > > > > > > > > > > > sk->sk_user_data = xprt; > > > > > > > > > > Hmm... Doesn't this end up being more or less equivalent to calling > > > > > __sock_create() with the kernel flag being set to 0? > > > > > > > > AFAICT yes, but there are a lot of other things that happen with kern being set > > > > to 1, so I think this is a safer bet, and is analagous to this other fix > > > > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets"). Thanks, > > > > > > > > > > Hmm... this would prevent a netns with one or more TCP flows owned by > > > this layer to be dismantled, > > > even if all other processes/sockets disappeared ? > > > > Yeah but if sockets are still in use then we want the netns to still be up > > right? I personally am very confused about how the lifetime stuff works for > > sockets, I don't understand how shutting down the socket means it gets to stick > > around after the fact forever, but feels like if it's tied to a netns then it's > > completely valid to hold the netns open until we're done with the socket. > > > > > > > > Have you looked at my suggestion instead ? > > > > > > https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ > > > > > > I never formally submitted this patch because I got no feedback. > > > > I did something similar, tho not with _sync so maybe that was the problem, but > > this is what I did originally in production before I emailed you the first time > > > The _sync part is mandatory really for this context. > > Not that it needs to be done while the socket is not locked, or risk a deadlock. > > Note that modern trees have timer_shutdown_sync() which might even be better. > Your patch fixes the problem as well, I've been starting and stopping the task sporadically for a day and haven't tripped the panic. You can add my Tested-by when you send it. Thanks! Josef ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-03-21 18:49 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-03-19 19:49 [PATCH] sunrpc: hold a ref on netns for tcp sockets Josef Bacik 2024-03-19 19:53 ` Chuck Lever III 2024-03-19 20:07 ` [PATCH][RESEND] " Josef Bacik 2024-03-19 21:59 ` Trond Myklebust 2024-03-20 14:10 ` Josef Bacik 2024-03-20 14:28 ` Eric Dumazet 2024-03-20 14:56 ` Josef Bacik 2024-03-20 15:00 ` Eric Dumazet 2024-03-20 15:35 ` Josef Bacik 2024-03-21 18:49 ` Josef Bacik
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox