WARN_ON in TLP causing RT throttling

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* WARN_ON in TLP causing RT throttling
@ 2018-09-26 23:46 stranche
  2018-09-27  0:09 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: stranche @ 2018-09-26 23:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: soheil@google.com

Hi Eric,

Someone recently reported a crash to us on the 4.14.62 kernel where 
excessive
WARNING prints were spamming the logs and causing watchdog bites. The 
kernel
does have the following commit by Soheil:
bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"

Before this bug we see over 1 second of continuous WARN_ON prints from
tcp_send_loss_probe() like so:

7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
7795.539492:   <2>  tcp_write_timer+0x4c/0x74
7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
7795.547113:   <2>  run_timer_softirq+0x248/0x81c

Specifically, the prints come from the following check:

	/* Retransmit last segment. */
	if (WARN_ON(!skb))
		goto rearm_timer;

Since skb is always NULL, we know there's nothing on the write queue or 
the
retransmit queue, so we just keep resetting the timer, waiting for more 
data
to be queued. However, we were able to determine that the TCP socket is 
in the
TCP_FIN_WAIT1 state, so we will no longer be sending any data and these 
queues
remain empty.

Would it be appropriate to stop resetting the TLP timer if we detect 
that the
connection is starting to close and we have no more data to send the 
probe with,
or is there some way that this scenario should already be handled?

Unfortunately, we don't have a reproducer for this crash.

Thanks,
Sean

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-26 23:46 WARN_ON in TLP causing RT throttling stranche
@ 2018-09-27  0:09 ` Eric Dumazet
  2018-09-27 19:14   ` Yuchung Cheng
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2018-09-27  0:09 UTC (permalink / raw)
  To: stranche, eric.dumazet; +Cc: soheil@google.com



On 09/26/2018 04:46 PM, stranche@codeaurora.org wrote:
> Hi Eric,
> 
> Someone recently reported a crash to us on the 4.14.62 kernel where excessive
> WARNING prints were spamming the logs and causing watchdog bites. The kernel
> does have the following commit by Soheil:
> bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
> 
> Before this bug we see over 1 second of continuous WARN_ON prints from
> tcp_send_loss_probe() like so:
> 
> 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
> 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
> 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
> 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
> 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
> 
> Specifically, the prints come from the following check:
> 
>     /* Retransmit last segment. */
>     if (WARN_ON(!skb))
>         goto rearm_timer;
> 
> Since skb is always NULL, we know there's nothing on the write queue or the
> retransmit queue, so we just keep resetting the timer, waiting for more data
> to be queued. However, we were able to determine that the TCP socket is in the
> TCP_FIN_WAIT1 state, so we will no longer be sending any data and these queues
> remain empty.
> 
> Would it be appropriate to stop resetting the TLP timer if we detect that the
> connection is starting to close and we have no more data to send the probe with,
> or is there some way that this scenario should already be handled?
> 
> Unfortunately, we don't have a reproducer for this crash.
>

Something is fishy.

If there is no skb in the queues, then tp->packets_out should be 0,
therefore tcp_rearm_rto() should simply call inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);

I have never seen this report before.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-27  0:09 ` Eric Dumazet
@ 2018-09-27 19:14   ` Yuchung Cheng
  2018-09-28  0:16     ` stranche
  0 siblings, 1 reply; 7+ messages in thread
From: Yuchung Cheng @ 2018-09-27 19:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: stranche, soheil@google.com

On Wed, Sep 26, 2018 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 09/26/2018 04:46 PM, stranche@codeaurora.org wrote:
> > Hi Eric,
> >
> > Someone recently reported a crash to us on the 4.14.62 kernel where excessive
> > WARNING prints were spamming the logs and causing watchdog bites. The kernel
> > does have the following commit by Soheil:
> > bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
> >
> > Before this bug we see over 1 second of continuous WARN_ON prints from
> > tcp_send_loss_probe() like so:
> >
> > 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
> > 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
> > 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
> > 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
> > 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
> >
> > Specifically, the prints come from the following check:
> >
> >     /* Retransmit last segment. */
> >     if (WARN_ON(!skb))
> >         goto rearm_timer;
> >
> > Since skb is always NULL, we know there's nothing on the write queue or the
> > retransmit queue, so we just keep resetting the timer, waiting for more data
> > to be queued. However, we were able to determine that the TCP socket is in the
> > TCP_FIN_WAIT1 state, so we will no longer be sending any data and these queues
> > remain empty.
> >
> > Would it be appropriate to stop resetting the TLP timer if we detect that the
> > connection is starting to close and we have no more data to send the probe with,
> > or is there some way that this scenario should already be handled?
> >
> > Unfortunately, we don't have a reproducer for this crash.
> >
>
> Something is fishy.
>
> If there is no skb in the queues, then tp->packets_out should be 0,
> therefore tcp_rearm_rto() should simply call inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
>
> I have never seen this report before.
Do you use Fast Open? I am wondering if its a bug when a TFO server
closes the socket before the handshake finishes...

Either way, it's pretty safe to just stop TLP if write queue is empty
for any unexpected reason.

>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-27 19:14   ` Yuchung Cheng
@ 2018-09-28  0:16     ` stranche
  2018-09-28  0:25       ` Eric Dumazet
  2018-10-02 21:19       ` Yuchung Cheng
  0 siblings, 2 replies; 7+ messages in thread
From: stranche @ 2018-09-28  0:16 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Eric Dumazet, soheil, netdev

On 2018-09-27 13:14, Yuchung Cheng wrote:
> On Wed, Sep 26, 2018 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> 
> wrote:
>> 
>> 
>> 
>> On 09/26/2018 04:46 PM, stranche@codeaurora.org wrote:
>> > Hi Eric,
>> >
>> > Someone recently reported a crash to us on the 4.14.62 kernel where excessive
>> > WARNING prints were spamming the logs and causing watchdog bites. The kernel
>> > does have the following commit by Soheil:
>> > bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
>> >
>> > Before this bug we see over 1 second of continuous WARN_ON prints from
>> > tcp_send_loss_probe() like so:
>> >
>> > 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
>> > 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
>> > 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
>> > 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
>> > 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
>> >
>> > Specifically, the prints come from the following check:
>> >
>> >     /* Retransmit last segment. */
>> >     if (WARN_ON(!skb))
>> >         goto rearm_timer;
>> >
>> > Since skb is always NULL, we know there's nothing on the write queue or the
>> > retransmit queue, so we just keep resetting the timer, waiting for more data
>> > to be queued. However, we were able to determine that the TCP socket is in the
>> > TCP_FIN_WAIT1 state, so we will no longer be sending any data and these queues
>> > remain empty.
>> >
>> > Would it be appropriate to stop resetting the TLP timer if we detect that the
>> > connection is starting to close and we have no more data to send the probe with,
>> > or is there some way that this scenario should already be handled?
>> >
>> > Unfortunately, we don't have a reproducer for this crash.
>> >
>> 
>> Something is fishy.
>> 
>> If there is no skb in the queues, then tp->packets_out should be 0,
>> therefore tcp_rearm_rto() should simply call 
>> inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
>> 
>> I have never seen this report before.
> Do you use Fast Open? I am wondering if its a bug when a TFO server
> closes the socket before the handshake finishes...
> 
> Either way, it's pretty safe to just stop TLP if write queue is empty
> for any unexpected reason.
> 
>> 
Hi Yuchung,

Based on the dumps we were able to get, it appears that TFO was not used 
in this case.
We also tried some local experiments where we dropped incoming SYN 
packets after already
successful TFO connections on the receive side to see if TFO would 
trigger this scenario, but
have not been able to reproduce it.

One other interesting thing we found is that the socket never sent or 
received any data. It only
sent/received the packets for the initial handshake and the outgoing 
FIN.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-28  0:16     ` stranche
@ 2018-09-28  0:25       ` Eric Dumazet
  2018-09-28 16:20         ` stranche
  2018-10-02 21:19       ` Yuchung Cheng
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2018-09-28  0:25 UTC (permalink / raw)
  To: stranche, Yuchung Cheng; +Cc: Eric Dumazet, soheil, netdev



On 09/27/2018 05:16 PM, stranche@codeaurora.org wrote:

> Hi Yuchung,
> 
> Based on the dumps we were able to get, it appears that TFO was not used in this case.
> We also tried some local experiments where we dropped incoming SYN packets after already
> successful TFO connections on the receive side to see if TFO would trigger this scenario, but
> have not been able to reproduce it.
> 
> One other interesting thing we found is that the socket never sent or received any data. It only
> sent/received the packets for the initial handshake and the outgoing FIN.

Just to make sure : Was this some sort of syzkaller (or other fuzzer) run ?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-28  0:25       ` Eric Dumazet
@ 2018-09-28 16:20         ` stranche
  0 siblings, 0 replies; 7+ messages in thread
From: stranche @ 2018-09-28 16:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Yuchung Cheng, soheil, netdev

On 2018-09-27 18:25, Eric Dumazet wrote:
> On 09/27/2018 05:16 PM, stranche@codeaurora.org wrote:
> 
>> Hi Yuchung,
>> 
>> Based on the dumps we were able to get, it appears that TFO was not 
>> used in this case.
>> We also tried some local experiments where we dropped incoming SYN 
>> packets after already
>> successful TFO connections on the receive side to see if TFO would 
>> trigger this scenario, but
>> have not been able to reproduce it.
>> 
>> One other interesting thing we found is that the socket never sent or 
>> received any data. It only
>> sent/received the packets for the initial handshake and the outgoing 
>> FIN.
> 
> Just to make sure : Was this some sort of syzkaller (or other fuzzer) 
> run ?

No, this was a normal run.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: WARN_ON in TLP causing RT throttling
  2018-09-28  0:16     ` stranche
  2018-09-28  0:25       ` Eric Dumazet
@ 2018-10-02 21:19       ` Yuchung Cheng
  1 sibling, 0 replies; 7+ messages in thread
From: Yuchung Cheng @ 2018-10-02 21:19 UTC (permalink / raw)
  To: stranche; +Cc: Eric Dumazet, Soheil Hassas Yeganeh, netdev

On Thu, Sep 27, 2018 at 5:16 PM, <stranche@codeaurora.org> wrote:
>
> On 2018-09-27 13:14, Yuchung Cheng wrote:
>>
>> On Wed, Sep 26, 2018 at 5:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>>
>>>
>>>
>>> On 09/26/2018 04:46 PM, stranche@codeaurora.org wrote:
>>> > Hi Eric,
>>> >
>>> > Someone recently reported a crash to us on the 4.14.62 kernel where excessive
>>> > WARNING prints were spamming the logs and causing watchdog bites. The kernel
>>> > does have the following commit by Soheil:
>>> > bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"
>>> >
>>> > Before this bug we see over 1 second of continuous WARN_ON prints from
>>> > tcp_send_loss_probe() like so:
>>> >
>>> > 7795.530450:   <2>  tcp_send_loss_probe+0x194/0x1b8
>>> > 7795.534833:   <2>  tcp_write_timer_handler+0xf8/0x1c4
>>> > 7795.539492:   <2>  tcp_write_timer+0x4c/0x74
>>> > 7795.543348:   <2>  call_timer_fn+0xc0/0x1b4
>>> > 7795.547113:   <2>  run_timer_softirq+0x248/0x81c
>>> >
>>> > Specifically, the prints come from the following check:
>>> >
>>> >     /* Retransmit last segment. */
>>> >     if (WARN_ON(!skb))
>>> >         goto rearm_timer;
>>> >
>>> > Since skb is always NULL, we know there's nothing on the write queue or the
>>> > retransmit queue, so we just keep resetting the timer, waiting for more data
>>> > to be queued. However, we were able to determine that the TCP socket is in the
>>> > TCP_FIN_WAIT1 state, so we will no longer be sending any data and these queues
>>> > remain empty.
>>> >
>>> > Would it be appropriate to stop resetting the TLP timer if we detect that the
>>> > connection is starting to close and we have no more data to send the probe with,
>>> > or is there some way that this scenario should already be handled?
>>> >
>>> > Unfortunately, we don't have a reproducer for this crash.
>>> >
>>>
>>> Something is fishy.
>>>
>>> If there is no skb in the queues, then tp->packets_out should be 0,
>>> therefore tcp_rearm_rto() should simply call inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
>>>
>>> I have never seen this report before.
>>
>> Do you use Fast Open? I am wondering if its a bug when a TFO server
>> closes the socket before the handshake finishes...
>>
>> Either way, it's pretty safe to just stop TLP if write queue is empty
>> for any unexpected reason.
>>
>>>
> Hi Yuchung,
>
> Based on the dumps we were able to get, it appears that TFO was not used in this case.
> We also tried some local experiments where we dropped incoming SYN packets after already
> successful TFO connections on the receive side to see if TFO would trigger this scenario, but
> have not been able to reproduce it.
>
> One other interesting thing we found is that the socket never sent or received any data. It only
> sent/received the packets for the initial handshake and the outgoing FIN.
I wonder if there's a bug in tcp_rtx_queue. rtx_queue should have at
least the FIN packet pending to be acked in TCP_FIN_WAIT1. And the
warning should only show up once if packets_out is 0 due to the check
in tcp_rearm_rto. so it seems packets_out is non-zero (i.e. counting a
FIN) but tcp_rtx_queue has become empty. hmmm

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-10-03  4:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-09-26 23:46 WARN_ON in TLP causing RT throttling stranche
2018-09-27  0:09 ` Eric Dumazet
2018-09-27 19:14   ` Yuchung Cheng
2018-09-28  0:16     ` stranche
2018-09-28  0:25       ` Eric Dumazet
2018-09-28 16:20         ` stranche
2018-10-02 21:19       ` Yuchung Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).