All of lore.kernel.org
 help / color / mirror / Atom feed
From: Florian Westphal <fw@strlen.de>
To: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
	fw@strlen.de, kuba@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, pabeni@redhat.com,
	syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com,
	syzkaller-bugs@googlegroups.com
Subject: Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
Date: Mon, 12 Aug 2024 16:01:04 +0200	[thread overview]
Message-ID: <20240812140104.GA21559@breakpoint.cc> (raw)
In-Reply-To: <20240811230836.95914-1-kuniyu@amazon.com>

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > 
> > > > ... shows at two cores racing:
> > > > 
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > 
> > > > 
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > > 
> > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > 
> > > ... which doesn't work.  Does this look like a plausible
> > > theory/exlanation?
> > 
> > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.

I'm not so sure.  Once 'other' inet_twsk_purge() found the twsk and
called inet_twsk_kill(), 'our' task has to wait for that to complete.

We need to force proper ordering so that all twsk found

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

/*HERE*/tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);

.... have gone through inet_twsk_kill() so tw_refcount managed to
drop back to 1 before doing
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
.

> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?

Not sure, afaiu only one thread enters inet_twsk_kill()
(the one that manages to deactivate the timer).

> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >  	struct inet_bind_hashbucket *bhead, *bhead2;
> >  
> >  	spin_lock(lock);
> > -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> > +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > +		spin_unlock(lock);
> > +		return false;
> 
> forgot to remove false, just return :)

I don't see how this helps, we need to wait until 'stolen' twsk
has gone through inet_twsk_kill() and decremented tw_refcount.
Obviously It would be a bit simpler if we had a reliable reproducer :-)

Possible solutions I came up with so far:

1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").

This commit replaced a net_eq(twsk_net(tw) ... with a check for
dead netns (ns.count == 0),

Downside: We need to remove the purged_once trick that calls
inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
tcp_twsk_purge() as well.

As per b099ce2602d8 changelog, likely increases netns dismantle times.

Upside: simpler code, so this is my preferred solution.

No concurrent runoff anymore, by time tcp_twsk_purge() returns it has
called refcount_dec(->tw_refcount) for every twsk in the exiting netns
list, without other task stealing twsks owned by exiting netns.

Solution 2: change tcp_sk_exit_batch like this:

   tcp_twsk_purge(net_exit_list);

+  list_for_each_entry(net, net_exit_list, exit_list) {
+      while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
+         schedule();
+
+  }

    list_for_each_entry(net, net_exit_list, exit_list) {
       inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
       WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

This synchronizes two concurrent tcp_sk_exit_batch() calls via
existing refcount; if netns setup error unwinding ran off with one of
'our' twsk, it will wait until other task has completed the refcount decrement.

I don't expect it to increase netns dismantle times, else we'd have seen
the WARN_ON_ONCE splat frequently.

Solution 3:

Similar to 2), but via mutex_lock/unlock pair:

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

	mutex_lock(&tcp_exit_batch_mutex);

        tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
                tcp_fastopen_ctx_destroy(net);
        }
	mutex_unlock(&tcp_exit_batch_mutex);
}

Solution 4:

I have doubts wrt. tcp_twsk_purge() interaction with tw timer firing at
the 'wrong' time.  This is independent "problem", I might be
imagining things here.

Consider:
313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
314 {
[..]
321         for (slot = 0; slot <= ehash_mask; slot++, head++) {

tw sk timer fires on other cpu, inet_twsk_kill() does:

56         spin_lock(lock);
57         sk_nulls_del_node_init_rcu((struct sock *)tw);
58         spin_unlock(lock);

... then other cpu gets preempted.
inet_twsk_purge() resumes and hits empty chain head:

322                 if (hlist_nulls_empty(&head->chain))
323                         continue;

so we don't(can't) wait for the timer to run to completion.

If this sounds plausible to you, this gives us solution 4:

Restart inet_twsk_purge() loop until tw_dr->tw_refcount) has
dropped down to 1.

Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
needs to be moved down:

 48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
...
 58     /* Disassociate with bind bucket. */
...
 68     spin_unlock(&bhead->lock);

 70     refcount_dec(&tw->tw_dr->tw_refcount);

 +      spin_lock(lock);
 +      sk_nulls_del_node_init_rcu((struct sock *)tw);
 +      spin_unlock(lock);
71      inet_twsk_put(tw);
72 }

... so concurrent purge() call will find us
the node list (and then wait on timer_shutdown_sync())
until other cpu executing the timer is done.

If twsk was unlinked from table already before
inet_twsk_purge() had chance to find it sk, then in worst
case call to tcp_twsk_destructor() is missing, but I don't
see any ordering requirements that need us to wait for this.

  parent reply	other threads:[~2024-08-12 14:01 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-11  1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
2024-08-11  2:29 ` Kuniyuki Iwashima
2024-08-11  5:42   ` Jason Xing
2024-08-11 13:24   ` Florian Westphal
2024-08-11 14:54     ` Florian Westphal
2024-08-11 16:28       ` Florian Westphal
2024-08-11 23:00         ` Kuniyuki Iwashima
2024-08-11 23:08           ` Kuniyuki Iwashima
2024-08-12  0:36             ` Jason Xing
2024-08-12 14:01             ` Florian Westphal [this message]
2024-08-12 14:30               ` Jason Xing
2024-08-12 15:03                 ` Florian Westphal
2024-08-12 15:49                   ` Jason Xing
2024-08-12 20:00               ` Kuniyuki Iwashima
2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
2024-08-12 23:28                   ` Kuniyuki Iwashima
2024-08-12 23:52                     ` Florian Westphal
2024-08-13  0:01                       ` Kuniyuki Iwashima
2024-08-13  2:48                   ` Jason Xing
2024-08-15 10:47                   ` Paolo Abeni
2024-08-19 15:36                     ` Eric Dumazet
2024-08-19 15:50                   ` patchwork-bot+netdevbpf
2024-08-11 13:32   ` [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill Florian Westphal
2024-08-11 22:35     ` Kuniyuki Iwashima

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240812140104.GA21559@breakpoint.cc \
    --to=fw@strlen.de \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.