Re: tcp hang when socket fills up ?

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Dominique Martinet <asmadeus@codewreck.org>
To: netdev@vger.kernel.org
Subject: Re: tcp hang when socket fills up ?
Date: Fri, 13 Apr 2018 18:42:55 +0900	[thread overview]
Message-ID: <20180413094255.GA17273@nautica> (raw)
In-Reply-To: <20180406090720.GA31845@nautica>


Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)

Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in

With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.

First, abusing next, the flow seems to be something like
(I folded if/else not taken)

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{       
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_sacktag_state sack_state;
        struct rate_sample rs = { .prior_delivered = 0 };
        u32 prior_snd_una = tp->snd_una;
        bool is_sack_reneg = tp->is_sack_reneg;
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
        int prior_packets = tp->packets_out;
        u32 delivered = tp->delivered;
        u32 lost = tp->lost;
        int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
		losses */
        u32 prior_fack;

        sack_state.first_sackt = 0;
        sack_state.rate = &rs;

        /* We very likely will need to access rtx queue. */
        prefetch(sk->tcp_rtx_queue.rb_node);

        /* If the ack is older than previous acks
         * then we can probably ignore it.
         */
        if (before(ack, prior_snd_una)) {
		}

        /* If the ack includes data we haven't sent yet, discard
         * this segment (RFC793 Section 3.9).
         */
        if (after(ack, tp->snd_nxt))

        if (after(ack, prior_snd_una)) {
        }

        prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
		tp->snd_una;
        rs.prior_in_flight = tcp_packets_in_flight(tp);

        /* ts_recent update must be made after we are sure that the
		packet                           
         * is in window.
         */
        if (flag & FLAG_UPDATE_TS_RECENT)

        if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
        } else {
                u32 ack_ev_flags = CA_ACK_SLOWPATH;

                if (ack_seq != TCP_SKB_CB(skb)->end_seq)
                else
                        NET_INC_STATS(sock_net(sk),
						LINUX_MIB_TCPPUREACKS);

                flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

                if (TCP_SKB_CB(skb)->sacked)

                if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
                }

                if (flag & FLAG_WIN_UPDATE)

                tcp_in_ack_event(sk, ack_ev_flags);
        }
        
        /* We passed data and got it acked, remove any soft error
         * log. Something worked...
         */
        sk->sk_err_soft = 0;
        icsk->icsk_probes_out = 0;
        tp->rcv_tstamp = tcp_jiffies32;
        if (!prior_packets)
                goto no_queue;

no_queue:
        /* If data was DSACKed, see if we can undo a cwnd reduction. */
        if (flag & FLAG_DSACKING_ACK)
                tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
				&flag,
                                      &rexmit);
        /* If this ack opens up a zero window, clear backoff.  It was
         * being used to time the probes, and is probably far higher
		 than
         * it needs to be for normal retransmission.
         */
        tcp_ack_probe(sk);

        if (tp->tlp_high_seq)
                tcp_process_tlp_ack(sk, ack, flag);
        return 1;


And here is 'info local' towards the end of the function:
icsk = 0xffff88001740b800
tp = 0xffff88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0xffff88003fc83ae0, flag = 0, mss_now = 0}
rs = {prior_mstamp = 0, prior_delivered = 0, delivered = 0, interval_us = 0, rtt_us = 0, losses = 0, acked_sacked = 0, prior_in_flight = 0, is_app_limited = false, is_retrans = false, is_ack_delayed = false}
prior_snd_una = 2896339291
is_sack_reneg = false
ack_seq = 2651638615
ack = 2896339291
is_dupack = false
prior_packets = 0
delivered = 172
lost = 0
rexmit = 0
prior_fack = 2896339291
__vpp_verify = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>


So as I was seeing, we have ack = prior_fack = prior_snd_una (which is > ack_seq)

We're in the !prior_packets goto to no_queue though so we don't pass
tcp_clean_rtx_queue - so that explains why the replays don't help. I'm
ok with that. But all these traces actually don't tell me why the first
time we saw that packed didn't take the ack and decreate sk_wmem_queued
properly.

This isn't easy enough to reproduce that I feel confident grabbing the
first call to that function that fails though, so I guess I'll actually
have to read the code I'm debugging instead of trying to just plow
through it...

Hints still welcome, if anyone has an idea.

-- 
Dominique Martinet | Asmadeus

next prev parent reply	other threads:[~2018-04-13  9:43 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-06  9:07 tcp hang when socket fills up ? Dominique Martinet
2018-04-13  9:42 ` Dominique Martinet [this message]
2018-04-13 15:01 ` Eric Dumazet
2018-04-13 16:32 ` Michal Kubecek
2018-04-14  1:09   ` Dominique Martinet
2018-04-14  1:39     ` Eric Dumazet
2018-04-14  1:55       ` Dominique Martinet
2018-04-16  1:47         ` Dominique Martinet
2018-04-16  2:26           ` Eric Dumazet
2018-04-16  3:55             ` Dominique Martinet
2018-04-16  4:03               ` Dominique Martinet
2018-04-16 11:01               ` Florian Westphal
2018-04-17  3:52                 ` Dominique Martinet
2018-04-17  5:28                   ` Eric Dumazet
2018-04-17  6:52                     ` Michal Kubecek
2018-04-17  9:20                     ` Michal Kubecek
2018-04-17 12:34                       ` Dominique Martinet
2018-04-17 13:00                         ` Michal Kubecek
2018-04-17 13:29                         ` Florian Westphal
2018-04-18  8:13                           ` Jozsef Kadlecsik
2018-04-18  8:30                             ` Dominique Martinet
2018-04-18  9:36                               ` Dominique Martinet
2018-04-18 10:27                                 ` Jozsef Kadlecsik
2018-04-18 11:30                                   ` Dominique Martinet
2018-04-18 11:37                                     ` Jozsef Kadlecsik
2018-04-16 20:43         ` Marcelo Ricardo Leitner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180413094255.GA17273@nautica \
    --to=asmadeus@codewreck.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).