From mboxrd@z Thu Jan  1 00:00:00 1970
From: stranche@codeaurora.org
Subject: WARN_ON in TLP causing RT throttling
Date: Wed, 26 Sep 2018 17:46:27 -0600
Message-ID: <7aa9932a59aad7a21c7f8a8146dd0542@codeaurora.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "soheil@google.com"
To: eric.dumazet@gmail.com
Received: from smtp.codeaurora.org ([198.145.29.96]:33606 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726469AbeI0GBv (ORCPT ); Thu, 27 Sep 2018 02:01:51 -0400
Sender: netdev-owner@vger.kernel.org

Hi Eric,

Someone recently reported a crash to us on the 4.14.62 kernel where
excessive WARNING prints were spamming the logs and causing watchdog
bites. The kernel does have the following commit by Soheil:

    bffd168c3fc5 "tcp: clear tp->packets_out when purging write queue"

Before the crash, we see more than one second of continuous WARN_ON
prints from tcp_send_loss_probe(), like so:

    7795.530450: <2> tcp_send_loss_probe+0x194/0x1b8
    7795.534833: <2> tcp_write_timer_handler+0xf8/0x1c4
    7795.539492: <2> tcp_write_timer+0x4c/0x74
    7795.543348: <2> call_timer_fn+0xc0/0x1b4
    7795.547113: <2> run_timer_softirq+0x248/0x81c

Specifically, the prints come from the following check:

    /* Retransmit last segment. */
    if (WARN_ON(!skb))
        goto rearm_timer;

Since skb is NULL every time, we know there is nothing on the write
queue or the retransmit queue, so we just keep resetting the timer,
waiting for more data to be queued. However, we were able to determine
that the TCP socket is in the TCP_FIN_WAIT1 state, so we will no longer
be sending any data and these queues remain empty.

Would it be appropriate to stop resetting the TLP timer if we detect
that the connection is starting to close and we have no more data to
send the probe with, or is there some way that this scenario should
already be handled?

Unfortunately, we don't have a reproducer for this crash.

Thanks,
Sean
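
To make the question above concrete, the proposed check can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions: struct model_sock, the MODEL_* states, and
model_should_rearm_tlp() are hypothetical names that do not exist in
the kernel, and the two counters merely stand in for the real write and
retransmit queues. It is not a patch against tcp_send_loss_probe().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel state (not struct sock). */
enum model_tcp_state {
	MODEL_ESTABLISHED,
	MODEL_FIN_WAIT1,
	MODEL_FIN_WAIT2,
	MODEL_CLOSING,
	MODEL_LAST_ACK,
};

struct model_sock {
	enum model_tcp_state state;
	size_t write_queue_len;	/* segments queued but not yet sent */
	size_t rtx_queue_len;	/* segments sent but not yet acknowledged */
};

/* True once the local side has started closing, so no new data is expected. */
bool model_sock_is_closing(const struct model_sock *sk)
{
	assert(sk != NULL);

	switch (sk->state) {
	case MODEL_FIN_WAIT1:
	case MODEL_FIN_WAIT2:
	case MODEL_CLOSING:
	case MODEL_LAST_ACK:
		return true;
	case MODEL_ESTABLISHED:
		return false;
	}
	return false;
}

/*
 * Decide whether the loss-probe timer should be rearmed.
 * Assumption: with a segment available, the probe path behaves as today.
 * With both queues empty, rearm only if new data could still be queued.
 */
bool model_should_rearm_tlp(const struct model_sock *sk)
{
	assert(sk != NULL);

	if (sk->write_queue_len > 0 || sk->rtx_queue_len > 0)
		return true;

	return !model_sock_is_closing(sk);
}
```

With both queues empty, the model declines to rearm for a socket in
MODEL_FIN_WAIT1 and still rearms for one in MODEL_ESTABLISHED. Whether
a real fix belongs at this check or earlier, where the timer is first
armed, is left open.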