linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Matthew Wilcox <willy@infradead.org>, Liao Chang <liaochang1@huawei.com>
Cc: mcgrof@kernel.org, keescook@chromium.org, yzaikin@google.com,
	clg@kaod.org, nitesh@redhat.com, edumazet@google.com,
	peterz@infradead.org, joshdon@google.com, masahiroy@kernel.org,
	nathan@kernel.org, akpm@linux-foundation.org, vbabka@suse.cz,
	gustavoars@kernel.org, arnd@arndb.de, chris@chrisdown.name,
	dmitry.torokhov@gmail.com, linux@rasmusvillemoes.dk,
	daniel@iogearbox.net, john.ogness@linutronix.de, will@kernel.org,
	dave@stgolabs.net, frederic@kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	heying24@huawei.com, guohanjun@huawei.com,
	weiyongjun1@huawei.com
Subject: Re: [RFC 0/3] softirq: Introduce softirq throttling
Date: Thu, 07 Apr 2022 14:47:20 +0200	[thread overview]
Message-ID: <877d81jc13.ffs@tglx> (raw)
In-Reply-To: <Yk2ppI60P98E2Qj5@casper.infradead.org>

On Wed, Apr 06 2022 at 15:54, Matthew Wilcox wrote:
> On Wed, Apr 06, 2022 at 10:52:38AM +0800, Liao Chang wrote:
>> Kernel check for pending softirqs periodically, they are performed in a
>> few points of kernel code, such as irq_exit() and __local_bh_enable_ip(),
>> softirqs that have been activated by a given CPU must be executed on the
>> same CPU, this characteristic of softirq is always a potentially
>> "dangerous" operation, because one CPU might be end up very busy while
>> the other are most idle.
>> 
>> Above concern is proven in a networking user case: recenlty, we
>> engineer find out the time used for connection re-establishment on
>> kernel v5.10 is 300 times larger than v4.19, meanwhile, softirq
>> monopolize almost 99% of CPU. This problem stem from that the connection
>> between Sender and Receiver node get lost, the NIC driver on Sender node
>> will keep raising NET_TX softirq before connection recovery. The system
>> log show that most of softirq is performed from __local_bh_enable_ip(),
>> since __local_bh_enable_ip is used widley in kernel code, it is very
>> easy to run out most of CPU, and the user-mode application can't obtain
>> enough CPU cycles to establish connection as soon as possible.
>
> Shouldn't you fix that bug instead?  This seems like papering over the
> bad effects of a bug and would make it harder to find bugs like this in
> the future.  Essentially, it's the same as a screaming hardware interrupt,
> except that it's a software interrupt, so we can fix the bug instead of
> working around broken hardware.

It's not necessarily broken hardware. It's a fundamental issue of our
softirq processing magic which can happen in those contexts:

   1) On return from interrupt

   2) In local_bh_enable()

   3) In ksoftirqd

We have heuristics in place which delegate processing to ksoftirqd,
which brings softirq processing under scheduler control to some extent,
but those heuristics are rather easy to evade.

Delegation to ksoftirqd happens when the runtime of the __do_softirq()
loop exceeds a threshold. But if that is not exceeded then you still can
get into a situation where softirq processing eats up a large quantity
of CPU time in #1 and #2 which is the real problem because it prevents
the scheduler from applying fairness.

That's a known issue and attempts to fix that are popping up on a
regular base. There are several issues here:

  1) The runtime check in __do_softirq() is jiffies based and depending
     on CONFIG_HZ it's easy to stay under the threshold for one
     invocation, but still eat a large amount of CPU time.

     Also the runtime check happens at the end of the loop, which means
     that if a single softirq callback runs too long we still process
     all other pending ones.

  2) The decision to process softirqs directly on return from interrupt
     or in local_bh_enable() is error prone.

     The logic is:

         if (!ksoftirq_running() ||
             local_softirq_pending() & (SOFTIRQ_HI | SOFTIRQ_TASKLET))
         	process_direct();

     The reason for the HI/TASKLET exception is documented here:

         3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")

     But this is nasty because tasklets are widely used in networking,
     crypto and quite some of them are self rearming when they take too
     long. See mlx5_cq_tasklet_cb() as one example, which also uses
     jiffies for time limiting...

     With the HI/TASKLET exception this means that there is no
     delegation to ksoftirqd simply because on the next return from
     interrupt or the next local_bh_enable() in task context softirqs
     are processed. And this processes _all_ pending bits not only the
     HI/TASKLET ones...

The approach of just not running softirqs at all via a throttle
mechanism as proposed with these patches here, is definitely wrong and
going nowhere.

The proper solution for load accounting is a moving average with
exponential decay based on sched_clock() and not on jiffies. That gives
a reasonable decision to enforce ksoftirqd processing, but of course
that does neither solve the tasklet issues nor any of the other problems
vs. softirqs at all.

We've tried splitting ksoftirqd into different threads a couple of years
ago in RT, but that turned out to be problematic in some cases. Frederic
did some experiments to make local_bh_disable() take a mask argument to
disable only particular softirqs, but that ran into a dead end and is
problematic because quite some code, esp. networking relies on multiple
softirqs being disabled.

Softirqs are semantically ill defined and that's known for a very long
time, but of course they are conveniant and with a few hacks piled on
top to address the most urgent horrors they work by some definition of
work. IOW, we are accumulating technical depth with a fast pace.

TBH, I have no real good plan how to address this proper, but it's about
time to tackle this in a concerted effort.

Thanks,

        tglx

  reply	other threads:[~2022-04-07 12:47 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-06  2:52 [RFC 0/3] softirq: Introduce softirq throttling Liao Chang
2022-04-06  2:52 ` [RFC 1/3] softirq: Add two parameters to control CPU bandwidth for use by softirq Liao Chang
2022-04-06  2:52 ` [RFC 2/3] softirq: Do throttling when softirqs use up its bandwidth Liao Chang
2022-04-06  2:52 ` [RFC 3/3] softirq: Introduce statistics about softirq throttling Liao Chang
2022-04-06 14:54 ` [RFC 0/3] softirq: Introduce " Matthew Wilcox
2022-04-07 12:47   ` Thomas Gleixner [this message]
2022-04-07 10:57 ` Thomas Gleixner
  -- strict thread matches above, loose matches on Subject: below --
2022-04-06  2:27 Liao Chang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877d81jc13.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=chris@chrisdown.name \
    --cc=clg@kaod.org \
    --cc=daniel@iogearbox.net \
    --cc=dave@stgolabs.net \
    --cc=dmitry.torokhov@gmail.com \
    --cc=edumazet@google.com \
    --cc=frederic@kernel.org \
    --cc=guohanjun@huawei.com \
    --cc=gustavoars@kernel.org \
    --cc=heying24@huawei.com \
    --cc=john.ogness@linutronix.de \
    --cc=joshdon@google.com \
    --cc=keescook@chromium.org \
    --cc=liaochang1@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=masahiroy@kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=nathan@kernel.org \
    --cc=nitesh@redhat.com \
    --cc=peterz@infradead.org \
    --cc=vbabka@suse.cz \
    --cc=weiyongjun1@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yzaikin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).