netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Josef Bacik <jbacik@fb.com>
To: Tom Herbert <tom@herbertland.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>,
	Craig Gallek <kraigatgoog@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"Linux Kernel Network Developers" <netdev@vger.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
Date: Fri, 16 Dec 2016 17:50:10 -0500	[thread overview]
Message-ID: <1481928610.17731.0@smtp.office365.com> (raw)
In-Reply-To: <CALx6S34qacbD-zxvEJyDVsp1_U_tkL_t0hEXrtfffBALsDMxEw@mail.gmail.com>

On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> 
wrote:
> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>> 
>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>> 
>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>  <hannes@stressinduktion.org> wrote:
>>>>> 
>>>>>  Hi Josef,
>>>>> 
>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>> 
>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>>>> <tom@herbertland.com>
>>>>>>  wrote:
>>>>>>> 
>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>   wrote:
>>>>>>>> 
>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>>>>> <tom@herbertland.com>
>>>>>>>>   wrote:
>>>>>>>>> 
>>>>>>>>>    I think there may be some suspicious code in 
>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>    tb_found there is:
>>>>>>>>> 
>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>                         (tb->fastreuseport > 0 &&
>>>>>>>>>                          
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                          sk->sk_reuseport && 
>>>>>>>>> uid_eq(tb->fastuid,
>>>>>>>>>   uid))) &&
>>>>>>>>>                        smallest_size == -1)
>>>>>>>>>                            goto success;
>>>>>>>>>                    if 
>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>   tb, true)) {
>>>>>>>>>                            if ((reuse ||
>>>>>>>>>                                 (tb->fastreuseport > 0 &&
>>>>>>>>>                                  sk->sk_reuseport &&
>>>>>>>>> 
>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>                                smallest_size != -1 && 
>>>>>>>>> --attempts >=
>>>>>>>>>  0) {
>>>>>>>>>                                    
>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>                                    goto again;
>>>>>>>>>                            }
>>>>>>>>>                            goto fail_unlock;
>>>>>>>>>                    }
>>>>>>>>> 
>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The 
>>>>>>>>> same
>>>>>>>>>  clause
>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) && 
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this 
>>>>>>>>> is true
>>>>>>>>>  the
>>>>>>>>>    first conditional should be hit, goto done,  and the 
>>>>>>>>> second will
>>>>>>>>>  never
>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do 
>>>>>>>>> we need
>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>> 
>>>>>>>>    That's an interesting point... It looks like this function 
>>>>>>>> also
>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>  beginning
>>>>>>>>    with several spin_lock(&head->lock) to exclusively
>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps 
>>>>>>>> the full
>>>>>>>>  bh
>>>>>>>>    disable variant was preventing the timers in your stack 
>>>>>>>> trace from
>>>>>>>>    running interleaved with this function before?
>>>>>>> 
>>>>>>> 
>>>>>>>   Could be, although dropping the lock shouldn't be able to 
>>>>>>> affect the
>>>>>>>   search state. TBH, I'm a little lost in reading function, the
>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three 
>>>>>>> times in
>>>>>>>  that
>>>>>>>   function and also in every call to inet_csk_bind_conflict. I 
>>>>>>> wonder
>>>>>>>  if
>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is 
>>>>>>> only
>>>>>>>   allowed if the port number (snum) is explicitly specified.
>>>>>> 
>>>>>> 
>>>>>>   Ok first I have data for you Hannes, here's the time 
>>>>>> distributions
>>>>>>   before during and after the lockup (with all the debugging in 
>>>>>> place
>>>>>>  the
>>>>>>   box eventually recovers).  I've attached it as a text file 
>>>>>> since it is
>>>>>>   long.
>>>>> 
>>>>> 
>>>>>  Thanks a lot!
>>>>> 
>>>>>>   Second is I was thinking about why we would spend so much time 
>>>>>> doing
>>>>>>  the
>>>>>>   ->owners list, and obviously it's because of the massive 
>>>>>> amount of
>>>>>>   timewait sockets on the owners list.  I wrote the following 
>>>>>> dumb patch
>>>>>>   and tested it and the problem has disappeared completely.  Now 
>>>>>> I don't
>>>>>>   know if this is right at all, but I thought it was weird we 
>>>>>> weren't
>>>>>>   copying the soreuseport option from the original socket onto 
>>>>>> the twsk.
>>>>>>   Is there are reason we aren't doing this currently?  Does this 
>>>>>> help
>>>>>>   explain what is happening?  Thanks,
>>>>> 
>>>>> 
>>>>>  The patch is interesting and a good clue, but I am immediately a 
>>>>> bit
>>>>>  concerned that we don't copy/tag the socket with the uid also to 
>>>>> keep
>>>>>  the security properties for SO_REUSEPORT. I have to think a bit 
>>>>> more
>>>>>  about this.
>>>>> 
>>>>>  We have seen hangs during connect. I am afraid this patch 
>>>>> wouldn't help
>>>>>  there while also guaranteeing uniqueness.
>>>> 
>>>> 
>>>> 
>>>>  Yeah so I looked at the code some more and actually my patch is 
>>>> really
>>>>  bad.  If sk2->sk_reuseport is set we'll look at 
>>>> sk2->sk_reuseport_cb, which
>>>>  is outside of the timewait sock, so that's definitely bad.
>>>> 
>>>>  But we should at least be setting it to 0 so that we don't do this
>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the 
>>>> problem.  So
>>>>  for some reason having ->sk_reuseport set to 1 on a timewait 
>>>> socket makes
>>>>  this problem non-existent, which is strange.
>>>> 
>>>>  So back to the drawing board I guess.  I wonder if doing what 
>>>> craig
>>>>  suggested and batching the timewait timer expires so it hurts 
>>>> less would
>>>>  accomplish the same results.  Thanks,
>>> 
>>> 
>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  
>>> This is the
>>>  code
>>> 
>>>                         if ((!reuse || !sk2->sk_reuse ||
>>>                             sk2->sk_state == TCP_LISTEN) &&
>>>                             (!reuseport || !sk2->sk_reuseport ||
>>>                              
>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>                              (sk2->sk_state != TCP_TIME_WAIT &&
>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {
>>> 
>>>                                 if (!sk2->sk_rcv_saddr || 
>>> !sk->sk_rcv_saddr
>>>  ||
>>>                                     sk2->sk_rcv_saddr == 
>>> sk->sk_rcv_saddr)
>>>                                         break;
>>>                         }
>>> 
>>>  so in my patches case we now have reuseport == 1, 
>>> sk2->sk_reuseport == 1.
>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be 
>>> non-NULL
>>>  right?  So really setting the timewait sock's sk_reuseport should 
>>> have no
>>>  bearing on how this loop plays out right?  Thanks,
>> 
>> 
>> 
>>  So more messing around and I noticed that we basically don't do the
>>  tb->fastreuseport logic at all if we've ended up with a non 
>> SO_REUSEPORT
>>  socket on that tb.  So before I fully understood what I was doing I 
>> fixed it
>>  so that after we go through ->bind_conflict() once with a 
>> SO_REUSEPORT
>>  socket, we reset tb->fastreuseport to 1 and set the uid to match 
>> the uid of
>>  the socket.  This made the problem go away.  Tom pointed out that 
>> if we bind
>>  to the same port on a different address and we have a non 
>> SO_REUSEPORT
>>  socket with the same address on this tb then we'd be screwed with 
>> my code.
>> 
>>  Which brings me to his proposed solution.  We need another hash 
>> table that
>>  is indexed based on the binding address.  Then each node 
>> corresponds to one
>>  address/port binding, with non-SO_REUSEPORT entries having only one 
>> entry,
>>  and normal SO_REUSEPORT entries having many.  This cleans up the 
>> need to
>>  search all the possible sockets on any given tb, we just go and 
>> look at the
>>  one we care about.  Does this make sense?  Thanks,
>> 
> Hi Josef,
> 
> Thinking about it some more the hash table won't work because of the
> rules of binding different addresses to the same port. What I think we
> can do is to change inet_bind_bucket to be structure that contains all
> the information used to detect conflicts (reuse*, if, address, uid,
> etc.) and a list of sockets that share that exact same information--
> for instance all socket in timewait state create through some listener
> socket should wind up on single bucket. When we do the bind_conflict
> function we only should have to walk this buckets, not the full list
> of sockets.
> 
> Thoughts on this?

This sounds good, maybe tb->owners be a list of say

struct inet_unique_shit {
	struct sock_common sk;
	struct hlist socks;
};

Then we make inet_unique_shit like twsks', just copy the relevant 
information, then hang the real sockets off of the socks hlist.  
Something like that?  Thanks,

Josef

  reply	other threads:[~2016-12-16 23:25 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-06 23:06 Soft lockup in inet_put_port on 4.6 Tom Herbert
2016-12-08 21:03 ` Hannes Frederic Sowa
2016-12-08 21:36   ` Josef Bacik
2016-12-09  0:30     ` Eric Dumazet
2016-12-09  1:01       ` Josef Bacik
2016-12-10  1:59         ` Josef Bacik
2016-12-10  3:47           ` Eric Dumazet
2016-12-10  4:14             ` Eric Dumazet
2016-12-12 18:05               ` Josef Bacik
2016-12-12 18:44                 ` Hannes Frederic Sowa
2016-12-12 21:23                   ` Josef Bacik
2016-12-12 22:24                   ` Josef Bacik
2016-12-13 20:51                     ` Tom Herbert
2016-12-13 23:03                       ` Craig Gallek
2016-12-13 23:32                         ` Tom Herbert
2016-12-15 18:53                           ` Josef Bacik
2016-12-15 22:39                             ` Tom Herbert
2016-12-15 23:25                               ` Craig Gallek
2016-12-16  0:07                             ` Hannes Frederic Sowa
2016-12-16 14:54                               ` Josef Bacik
2016-12-16 15:21                                 ` Josef Bacik
2016-12-16 22:08                                   ` Josef Bacik
2016-12-16 22:18                                     ` Tom Herbert
2016-12-16 22:50                                       ` Josef Bacik [this message]
2016-12-17 11:08                                         ` Hannes Frederic Sowa
2016-12-17 13:26                                           ` Josef Bacik
2016-12-20  1:56                                             ` David Miller
2016-12-20  2:07                                               ` Tom Herbert
2016-12-20  2:41                                                 ` Eric Dumazet
2016-12-20  3:40                                                   ` Josef Bacik
2016-12-20  4:52                                                     ` Eric Dumazet
2016-12-20  4:59                                                       ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1481928610.17731.0@smtp.office365.com \
    --to=jbacik@fb.com \
    --cc=eric.dumazet@gmail.com \
    --cc=hannes@stressinduktion.org \
    --cc=kraigatgoog@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=tom@herbertland.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).