All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Eric Dumazet <edumazet@google.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	"anna@kernel.org" <anna@kernel.org>,
	"kernel-team@fb.com" <kernel-team@fb.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>
Subject: Re: [PATCH][RESEND] sunrpc: hold a ref on netns for tcp sockets
Date: Thu, 21 Mar 2024 14:49:18 -0400	[thread overview]
Message-ID: <20240321184918.GA3186943@perftesting> (raw)
In-Reply-To: <CANn89iJwh=qcOsSvkbCsnrccbGNkdOTORyrwM+13mSbgNqH5Nw@mail.gmail.com>

On Wed, Mar 20, 2024 at 04:00:09PM +0100, Eric Dumazet wrote:
> On Wed, Mar 20, 2024 at 3:56 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 03:28:15PM +0100, Eric Dumazet wrote:
> > > On Wed, Mar 20, 2024 at 3:10 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > On Tue, Mar 19, 2024 at 09:59:48PM +0000, Trond Myklebust wrote:
> > > > > On Tue, 2024-03-19 at 16:07 -0400, Josef Bacik wrote:
> > > > > > We've been seeing variations of the following panic in production
> > > > > >
> > > > > >   BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > > >   RIP: 0010:ip6_pol_route+0x59/0x7a0
> > > > > >   Call Trace:
> > > > > >    <IRQ>
> > > > > >    ? __die+0x78/0xc0
> > > > > >    ? page_fault_oops+0x286/0x380
> > > > > >    ? fib6_table_lookup+0x95/0xf40
> > > > > >    ? exc_page_fault+0x5d/0x110
> > > > > >    ? asm_exc_page_fault+0x22/0x30
> > > > > >    ? ip6_pol_route+0x59/0x7a0
> > > > > >    ? unlink_anon_vmas+0x370/0x370
> > > > > >    fib6_rule_lookup+0x56/0x1b0
> > > > > >    ? update_blocked_averages+0x2c6/0x6a0
> > > > > >    ip6_route_output_flags+0xd2/0x130
> > > > > >    ip6_dst_lookup_tail+0x3b/0x220
> > > > > >    ip6_dst_lookup_flow+0x2c/0x80
> > > > > >    inet6_sk_rebuild_header+0x14c/0x1e0
> > > > > >    ? tcp_release_cb+0x150/0x150
> > > > > >    __tcp_retransmit_skb+0x68/0x6b0
> > > > > >    ? tcp_current_mss+0xca/0x150
> > > > > >    ? tcp_release_cb+0x150/0x150
> > > > > >    tcp_send_loss_probe+0x8e/0x220
> > > > > >    tcp_write_timer+0xbe/0x2d0
> > > > > >    run_timer_softirq+0x272/0x840
> > > > > >    ? hrtimer_interrupt+0x2c9/0x5f0
> > > > > >    ? sched_clock_cpu+0xc/0x170
> > > > > >    irq_exit_rcu+0x171/0x330
> > > > > >    sysvec_apic_timer_interrupt+0x6d/0x80
> > > > > >    </IRQ>
> > > > > >    <TASK>
> > > > > >    asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > >   RIP: 0010:cpuidle_enter_state+0xe7/0x243
> > > > > >
> > > > > > Inspecting the vmcore with drgn you can see why this is a NULL
> > > > > > pointer deref
> > > > > >
> > > > > >     >>> prog.crashed_thread().stack_trace()[0]
> > > > > >     #0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in
> > > > > > ip6_pol_route at net/ipv6/route.c:2212:40
> > > > > >
> > > > > >     2212        if (net->ipv6.devconf_all->forwarding == 0)
> > > > > >     2213              strict |= RT6_LOOKUP_F_REACHABLE;
> > > > > >
> > > > > >     >>>
> > > > > > prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all
> > > > > >     (struct ipv6_devconf *)0x0
> > > > > >
> > > > > > Looking at the socket you can see that it's been closed
> > > > > >
> > > > > >     >>>
> > > > > > decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].
> > > > > > __sk_common.skc_flags, prog.type('enum sock_flags'))
> > > > > >     'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE'
> > > > > >     >>> decode_enum_type_flags(1 <<
> > > > > > prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.v
> > > > > > alue_(), prog["TCPF_CLOSE"].type_, bit_numbers=False)
> > > > > >     'TCPF_FIN_WAIT1'
> > > > > >
> > > > > > This occurs in our container setup where we have an NFS mount that
> > > > > > belongs to the containers network namespace.  On container shutdown
> > > > > > our
> > > > > > netns goes away, which sets net->ipv6.defconf_all = NULL, and then we
> > > > > > panic.  In the kernel we're responsible for destroying our sockets
> > > > > > when
> > > > > > the network namespace exits, or holding a reference on the network
> > > > > > namespace for our sockets so this doesn't happen.
> > > > > >
> > > > > > Even once we shutdown the socket we can still have TCP timers that
> > > > > > fire
> > > > > > in the background, hence this panic.  SUNRPC shuts down the socket
> > > > > > and
> > > > > > throws away all knowledge of it, but it's still doing things in the
> > > > > > background.
> > > > > >
> > > > > > Fix this by grabbing a reference on the network namespace for any tcp
> > > > > > sockets we open.  With this patch I'm able to cycle my 500 node
> > > > > > stress
> > > > > > tier over and over again without panicing, whereas previously I was
> > > > > > losing 10-20 nodes every shutdown cycle.
> > > > > >
> > > > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > > > > ---
> > > > > > Apologies, I just grepped for SUNRPC in MAINTAINERS and didn't
> > > > > > realize there was
> > > > > > a division of the client and server side of SUNRPC.
> > > > > >
> > > > > >  net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++
> > > > > >  1 file changed, 20 insertions(+)
> > > > > >
> > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > > > > index bb81050c870e..f02387751a94 100644
> > > > > > --- a/net/sunrpc/xprtsock.c
> > > > > > +++ b/net/sunrpc/xprtsock.c
> > > > > > @@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct
> > > > > > rpc_xprt *xprt, struct socket *sock)
> > > > > >
> > > > > >     if (!transport->inet) {
> > > > > >             struct sock *sk = sock->sk;
> > > > > > +           struct net *net = sock_net(sk);
> > > > > >
> > > > > >             /* Avoid temporary address, they are bad for long-
> > > > > > lived
> > > > > >              * connections such as NFS mounts.
> > > > > > @@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct
> > > > > > rpc_xprt *xprt, struct socket *sock)
> > > > > >             tcp_sock_set_nodelay(sk);
> > > > > >
> > > > > >             lock_sock(sk);
> > > > > > +           /*
> > > > > > +            * Because timers can fire after the fact we need to
> > > > > > hold a
> > > > > > +            * reference on the netns for this socket.
> > > > > > +            */
> > > > > > +           if (!sk->sk_net_refcnt) {
> > > > > > +                   if (!maybe_get_net(net)) {
> > > > > > +                          release_sock(sk);
> > > > > > +                          return -ENOTCONN;
> > > > > > +                  }
> > > > > > +                  /*
> > > > > > +                   * For kernel sockets we have a tracker put
> > > > > > in place for
> > > > > > +                   * the tracing, we need to free this to
> > > > > > maintaine
> > > > > > +                   * consistent tracking info.
> > > > > > +                   */
> > > > > > +                  __netns_tracker_free(net, &sk->ns_tracker,
> > > > > > false);
> > > > > >
> > > > > > +                  sk->sk_net_refcnt = 1;
> > > > > > +                  netns_tracker_alloc(net, &sk->ns_tracker,
> > > > > > GFP_KERNEL);
> > > > > > +                  sock_inuse_add(net, 1);
> > > > > > +           }
> > > > > >             xs_save_old_callbacks(transport, sk);
> > > > > >
> > > > > >             sk->sk_user_data = xprt;
> > > > >
> > > > > Hmm... Doesn't this end up being more or less equivalent to calling
> > > > > __sock_create() with the kernel flag being set to 0?
> > > >
> > > > AFAICT yes, but there are a lot of other things that happen with kern being set
> > > > to 1, so I think this is a safer bet, and is analagous to this other fix
> > > > 3a58f13a881e ("net: rds: acquire refcount on TCP sockets").  Thanks,
> > > >
> > >
> > > Hmm... this would prevent a netns with one or more TCP flows owned by
> > > this layer to be dismantled,
> > > even if all other processes/sockets disappeared ?
> >
> > Yeah but if sockets are still in use then we want the netns to still be up
> > right?  I personally am very confused about how the lifetime stuff works for
> > sockets, I don't understand how shutting down the socket means it gets to stick
> > around after the fact forever, but feels like if it's tied to a netns then it's
> > completely valid to hold the netns open until we're done with the socket.
> >
> > >
> > > Have you looked at my suggestion instead ?
> > >
> > > https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/
> > >
> > > I never formally submitted this patch because I got no feedback.
> >
> > I did something similar, tho not with _sync so maybe that was the problem, but
> > this is what I did originally in production before I emailed you the first time
> 
> 
> The _sync part is mandatory really for this context.
> 
> Not that it needs to be done while the socket is not locked, or risk a deadlock.
> 
> Note that modern trees have timer_shutdown_sync() which might even be better.
> 

Your patch fixes the problem as well, I've been starting and stopping the task
sporadically for a day and haven't tripped the panic.  You can add my Tested-by
when you send it.  Thanks!

Josef

      parent reply	other threads:[~2024-03-21 18:49 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-19 19:49 [PATCH] sunrpc: hold a ref on netns for tcp sockets Josef Bacik
2024-03-19 20:07 ` [PATCH][RESEND] " Josef Bacik
2024-03-19 19:53 ` [PATCH] " Chuck Lever III
2024-03-19 21:59 ` [PATCH][RESEND] " Trond Myklebust
2024-03-20 14:10   ` Josef Bacik
2024-03-20 14:28     ` Eric Dumazet
2024-03-20 14:56       ` Josef Bacik
2024-03-20 15:00         ` Eric Dumazet
2024-03-20 15:35           ` Josef Bacik
2024-03-21 18:49           ` Josef Bacik [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240321184918.GA3186943@perftesting \
    --to=josef@toxicpanda.com \
    --cc=anna@kernel.org \
    --cc=edumazet@google.com \
    --cc=kernel-team@fb.com \
    --cc=kuba@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.