netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Josef Bacik <jbacik@fb.com>
To: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Tom Herbert <tom@herbertland.com>,
	Craig Gallek <kraigatgoog@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
Date: Fri, 16 Dec 2016 09:54:48 -0500	[thread overview]
Message-ID: <1481900088.24490.6@smtp.office365.com> (raw)
In-Reply-To: <aca378b5-aae4-49cf-0542-d1167a5768b8@stressinduktion.org>

On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
<hannes@stressinduktion.org> wrote:
> Hi Josef,
> 
> On 15.12.2016 19:53, Josef Bacik wrote:
>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> 
>> wrote:
>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>> <kraigatgoog@gmail.com>
>>>  wrote:
>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>> <tom@herbertland.com>
>>>>  wrote:
>>>>>   I think there may be some suspicious code in inet_csk_get_port. 
>>>>> At
>>>>>   tb_found there is:
>>>>> 
>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>                        (tb->fastreuseport > 0 &&
>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 
>>>>> &&
>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>  uid))) &&
>>>>>                       smallest_size == -1)
>>>>>                           goto success;
>>>>>                   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>  tb, true)) {
>>>>>                           if ((reuse ||
>>>>>                                (tb->fastreuseport > 0 &&
>>>>>                                 sk->sk_reuseport &&
>>>>> 
>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>                               smallest_size != -1 && --attempts 
>>>>> >= 0) {
>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>                                   goto again;
>>>>>                           }
>>>>>                           goto fail_unlock;
>>>>>                   }
>>>>> 
>>>>>   AFAICT there is redundancy in these two conditionals.  The same 
>>>>> clause
>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
>>>>> true the
>>>>>   first conditional should be hit, goto done,  and the second 
>>>>> will never
>>>>>   evaluate that part to true-- unless the sk is changed (do we 
>>>>> need
>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>   That's an interesting point... It looks like this function also
>>>>   changed in 4.6 from using a single local_bh_disable() at the 
>>>> beginning
>>>>   with several spin_lock(&head->lock) to exclusively
>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
>>>> full bh
>>>>   disable variant was preventing the timers in your stack trace 
>>>> from
>>>>   running interleaved with this function before?
>>> 
>>>  Could be, although dropping the lock shouldn't be able to affect 
>>> the
>>>  search state. TBH, I'm a little lost in reading function, the
>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
>>> that
>>>  function and also in every call to inet_csk_bind_conflict. I 
>>> wonder if
>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>  allowed if the port number (snum) is explicitly specified.
>> 
>>  Ok first I have data for you Hannes, here's the time distributions
>>  before during and after the lockup (with all the debugging in place 
>> the
>>  box eventually recovers).  I've attached it as a text file since it 
>> is
>>  long.
> 
> Thanks a lot!
> 
>>  Second is I was thinking about why we would spend so much time 
>> doing the
>>  ->owners list, and obviously it's because of the massive amount of
>>  timewait sockets on the owners list.  I wrote the following dumb 
>> patch
>>  and tested it and the problem has disappeared completely.  Now I 
>> don't
>>  know if this is right at all, but I thought it was weird we weren't
>>  copying the soreuseport option from the original socket onto the 
>> twsk.
>>  Is there are reason we aren't doing this currently?  Does this help
>>  explain what is happening?  Thanks,
> 
> The patch is interesting and a good clue, but I am immediately a bit
> concerned that we don't copy/tag the socket with the uid also to keep
> the security properties for SO_REUSEPORT. I have to think a bit more
> about this.
> 
> We have seen hangs during connect. I am afraid this patch wouldn't 
> help
> there while also guaranteeing uniqueness.


Yeah so I looked at the code some more and actually my patch is really 
bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, 
which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.

So back to the drawing board I guess.  I wonder if doing what craig 
suggested and batching the timewait timer expires so it hurts less 
would accomplish the same results.  Thanks,

Josef

  reply	other threads:[~2016-12-16 14:55 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert
2016-12-08 21:03 ` Hannes Frederic Sowa
2016-12-08 21:36   ` Josef Bacik
2016-12-09  0:30     ` Eric Dumazet
2016-12-09  1:01       ` Josef Bacik
2016-12-10  1:59         ` Josef Bacik
2016-12-10  3:47           ` Eric Dumazet
2016-12-10  4:14             ` Eric Dumazet
2016-12-12 18:05               ` Josef Bacik
2016-12-12 18:44                 ` Hannes Frederic Sowa
2016-12-12 21:23                   ` Josef Bacik
2016-12-12 22:24                   ` Josef Bacik
2016-12-13 20:51                     ` Tom Herbert
2016-12-13 23:03                       ` Craig Gallek
2016-12-13 23:32                         ` Tom Herbert
2016-12-15 18:53                           ` Josef Bacik
2016-12-15 22:39                             ` Tom Herbert
2016-12-15 23:25                               ` Craig Gallek
2016-12-16  0:07                             ` Hannes Frederic Sowa
2016-12-16 14:54                               ` Josef Bacik [this message]
2016-12-16 15:21                                 ` Josef Bacik
2016-12-16 22:08                                   ` Josef Bacik
2016-12-16 22:18                                     ` Tom Herbert
2016-12-16 22:50                                       ` Josef Bacik
2016-12-17 11:08                                         ` Hannes Frederic Sowa
2016-12-17 13:26                                           ` Josef Bacik
2016-12-20  1:56                                             ` David Miller
2016-12-20  2:07                                               ` Tom Herbert
2016-12-20  2:41                                                 ` Eric Dumazet
2016-12-20  3:40                                                   ` Josef Bacik
2016-12-20  4:52                                                     ` Eric Dumazet
2016-12-20  4:59                                                       ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1481900088.24490.6@smtp.office365.com \
    --to=jbacik@fb.com \
    --cc=eric.dumazet@gmail.com \
    --cc=hannes@stressinduktion.org \
    --cc=kraigatgoog@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=tom@herbertland.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).