From: Josef Bacik <jbacik@fb.com>
To: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Tom Herbert <tom@herbertland.com>,
Craig Gallek <kraigatgoog@gmail.com>,
Eric Dumazet <eric.dumazet@gmail.com>,
Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
Date: Fri, 16 Dec 2016 09:54:48 -0500
Message-ID: <1481900088.24490.6@smtp.office365.com>
In-Reply-To: <aca378b5-aae4-49cf-0542-d1167a5768b8@stressinduktion.org>
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Josef,
>
> On 15.12.2016 19:53, Josef Bacik wrote:
>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@gmail.com> wrote:
>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>> I think there may be some suspicious code in inet_csk_get_port. At
>>>>> tb_found there is:
>>>>>
>>>>>         if (((tb->fastreuse > 0 && reuse) ||
>>>>>              (tb->fastreuseport > 0 &&
>>>>>               !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>               sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>>>             smallest_size == -1)
>>>>>                 goto success;
>>>>>         if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>>>>>                 if ((reuse ||
>>>>>                      (tb->fastreuseport > 0 &&
>>>>>                       sk->sk_reuseport &&
>>>>>                       !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>                       uid_eq(tb->fastuid, uid))) &&
>>>>>                     smallest_size != -1 && --attempts >= 0) {
>>>>>                         spin_unlock_bh(&head->lock);
>>>>>                         goto again;
>>>>>                 }
>>>>>                 goto fail_unlock;
>>>>>         }
>>>>>
>>>>> AFAICT there is redundancy in these two conditionals. The same clause
>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>> uid_eq(tb->fastuid, uid)) && smallest_size == -1. If this is true the
>>>>> first conditional should be hit, goto success, and the second will
>>>>> never evaluate that part to true, unless the sk is changed (do we need
>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>> That's an interesting point... It looks like this function also
>>>> changed in 4.6 from using a single local_bh_disable() at the beginning
>>>> with several spin_lock(&head->lock) to exclusively
>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh
>>>> disable variant was preventing the timers in your stack trace from
>>>> running interleaved with this function before?
>>>
>>> Could be, although dropping the lock shouldn't be able to affect the
>>> search state. TBH, I'm a little lost in reading the function; the
>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>>> function and also in every call to inet_csk_bind_conflict. I wonder if
>>> we can simplify this under the assumption that SO_REUSEPORT is only
>>> allowed if the port number (snum) is explicitly specified.
>>
>> Ok, first I have data for you Hannes: here are the time distributions
>> before, during, and after the lockup (with all the debugging in place
>> the box eventually recovers). I've attached it as a text file since it
>> is long.
>
> Thanks a lot!
>
>> Second, I was thinking about why we would spend so much time walking
>> the ->owners list, and obviously it's because of the massive number of
>> timewait sockets on the owners list. I wrote the following dumb patch
>> and tested it, and the problem has disappeared completely. Now I don't
>> know if this is right at all, but I thought it was weird we weren't
>> copying the soreuseport option from the original socket onto the twsk.
>> Is there a reason we aren't doing this currently? Does this help
>> explain what is happening? Thanks,
>
> The patch is interesting and a good clue, but I am immediately a bit
> concerned that we don't copy/tag the socket with the uid also to keep
> the security properties for SO_REUSEPORT. I have to think a bit more
> about this.
>
> We have seen hangs during connect. I am afraid this patch wouldn't help
> there while also guaranteeing uniqueness.
Yeah, so I looked at the code some more and my patch is actually really
bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb,
which is outside of the timewait sock, so that's definitely bad. But we
should at least be setting sk_reuseport to 0 on the timewait sock so
that we don't do this normally. Unfortunately, simply setting it to 0
doesn't fix the problem. So for some reason having ->sk_reuseport set to
1 on a timewait socket makes this problem non-existent, which is strange.
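To illustrate why looking at sk2->sk_reuseport_cb on a timewait sock is
bad, here is a rough userspace sketch. It assumes the reuseport flag
lives in the header shared by full and timewait socks, while the
callback pointer exists only in the full sock. All type and function
names below are made up for illustration and are not the kernel's; the
guard is loosely modelled on the idea behind sk_fullsock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; names are illustrative. */
enum sock_state { ST_LISTEN, ST_ESTABLISHED, ST_TIME_WAIT };

/* Header shared by full and timewait socks (models struct sock_common). */
struct common_hdr {
    enum sock_state state;
    bool reuseport;            /* models sk_reuseport */
};

/* Full socket: the callback pointer lives past the shared header. */
struct full_sock {
    struct common_hdr hdr;
    const void *reuseport_cb;  /* models sk_reuseport_cb */
};

/* Timewait socket: shared header only, no reuseport_cb field at all. */
struct tw_sock {
    struct common_hdr hdr;
};

/* Only a non-timewait object is really a struct full_sock. */
static bool is_full_sock(const struct common_hdr *hdr)
{
    assert(hdr != NULL);
    return hdr->state != ST_TIME_WAIT;
}

/* Return the callback pointer, or NULL when the field does not exist. */
static const void *reuseport_cb_or_null(const struct common_hdr *hdr)
{
    if (!is_full_sock(hdr))
        return NULL;  /* reading past a tw_sock would be out of bounds */
    return ((const struct full_sock *)hdr)->reuseport_cb;
}
```

A walker over a mixed list of full and timewait socks would call
reuseport_cb_or_null() on each entry instead of casting blindly, so a
timewait entry with reuseport set never has memory past its end read.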
So back to the drawing board, I guess. I wonder if doing what Craig
suggested, batching the timewait timer expirations so they hurt less,
would accomplish the same result. Thanks,
Josef
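For reference, the two tb_found conditionals quoted earlier in this
message can be modelled as plain predicates to make their relationship
easier to see. This is only a sketch: the struct and function names are
hypothetical, and the shared fastreuseport/sk_reuseport/uid clause is
collapsed into a single boolean, which is an assumption made for
brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattening of the inputs to the two quoted checks. */
struct bind_check {
    bool fastreuse;        /* tb->fastreuse > 0 */
    bool reuse;            /* 'reuse' as used in the quoted code */
    bool reuseport_match;  /* the shared fastreuseport/uid clause */
    int smallest_size;     /* 'smallest_size' as in the quoted code */
};

/* Models the first conditional: the "goto success" fast path. */
static bool fast_path_success(const struct bind_check *c)
{
    return ((c->fastreuse && c->reuse) || c->reuseport_match) &&
           c->smallest_size == -1;
}

/*
 * Models the inner conditional reached after bind_conflict() reports a
 * conflict: the "goto again" retry. *attempts is only decremented when
 * the earlier terms are true, as with && short-circuiting in the
 * original expression.
 */
static bool retry_after_conflict(const struct bind_check *c, int *attempts)
{
    assert(attempts != NULL);
    return (c->reuse || c->reuseport_match) &&
           c->smallest_size != -1 && --(*attempts) >= 0;
}
```

In this model the two predicates differ on smallest_size: the fast path
requires it to be -1, while the retry requires it to be anything else,
so for a given set of inputs at most one of them can be true.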