From: Gasper Zejn
Subject: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
Date: Tue, 9 Oct 2018 18:38:17 +0200
Message-ID: <183b7fe0-0757-cf63-555c-925ea840c67f@gmail.com>
To: Kevin Yang, Eric Dumazet, netdev@vger.kernel.org

Hello,

I am seeing interrupt storms of 100k-900k local timer interrupts when
switching between network devices or networks while TCP connections are
open and sch_fq is not in use (I was using pfifo_fast). Using sch_fq makes
the interrupt storm go away.

According to perf, the interrupts all call tcp_pace_kick, which seems to
return HRTIMER_NORESTART, but apparently something along the way calls
another function that does restart the timer.

The bug is fairly easy to reproduce. Congestion control needs to be BBR,
the network scheduler pfifo_fast, and there need to be open TCP connections
when the network changes in such a way that those connections cannot
continue to work (e.g. a different client IP address). The more
connections, the more interrupts. The connection handling code will cause
an interrupt storm, which eventually settles down as the connections time
out. It is a bit annoying, as the high interrupt rate does not show up as
load.

I successfully reproduced this with 4.18.12, but it has been happening for
some time, with previous kernel versions too.

I'd like to thank you for the comment regarding the use of sch_fq with BBR
above the tcp_needs_internal_pacing function.
It pointed me toward the workaround.

Kind regards,
Gasper Zejn
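
P.S. Since the high interrupt rate does not show up as load, here is a
minimal sketch of one way to watch for it. It assumes an x86 Linux system
where /proc/interrupts has a "LOC:" row for local timer interrupts; the
function names and the one-second sampling interval are only illustrative
choices, not anything taken from the kernel.

```python
import time


def local_timer_total(interrupts_text):
    """Sum the per-CPU counters on the 'LOC:' row of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOC:":
            total = 0
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the description ("Local timer interrupts")
                total += int(field)
            return total
    raise ValueError("no LOC: row found (assumed x86 layout)")


def local_timer_rate(interval=1.0, path="/proc/interrupts"):
    """Sample the LOC counters twice and return interrupts per second."""
    with open(path) as f:
        before = local_timer_total(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = local_timer_total(f.read())
    return (after - before) / interval
```

Calling local_timer_rate() before and after the network change should make
the jump visible if the storm is present.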