From: Martin Karsten <mkarsten@uwaterloo.ca>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>,
netdev@vger.kernel.org, Joe Damato <jdamato@fastly.com>,
amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
Alexander Lobakin <aleksander.lobakin@intel.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Breno Leitao <leitao@debian.org>,
Christian Brauner <brauner@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>, Jan Kara <jack@suse.cz>,
Jiri Pirko <jiri@resnulli.us>,
Johannes Berg <johannes.berg@intel.com>,
Jonathan Corbet <corbet@lwn.net>,
"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)"
<linux-fsdevel@vger.kernel.org>,
open list <linux-kernel@vger.kernel.org>,
Lorenzo Bianconi <lorenzo@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
Date: Tue, 13 Aug 2024 21:14:40 -0400
Message-ID: <15bec172-490f-4535-bd07-442c1be75ed9@uwaterloo.ca>
In-Reply-To: <20240813171015.425f239e@kernel.org>

On 2024-08-13 20:10, Jakub Kicinski wrote:
> On Mon, 12 Aug 2024 17:46:42 -0400 Martin Karsten wrote:
>>>> Here's how it is intended to work:
>>>> - An administrator sets the existing sysfs parameters for
>>>> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>
>>>> - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>> to a larger value than gro_flush_timeout to enable IRQ suspension.
>>>
>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
>>> prefer_busy_poll case (why?)? Because looking at the first two patches,
>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>> and I don't see how it helps.
>>
>> gro-flush-timeout (in combination with defer-hard-irqs) is the default
>> irq deferral mechanism and as such, always active when configured. Its
>> static periodic softirq processing leads to a situation where:
>>
>> - A long gro-flush-timeout causes high latencies when load is
>> sufficiently below capacity, or
>>
>> - a short gro-flush-timeout causes overhead when softirq execution
>> asynchronously competes with application processing at high load.
>>
>> The shortcomings of this are documented (to some extent) by our
>> experiments. See defer20 working well at low load, but having problems
>> at high load, while defer200 has higher latency at low load.
>>
>> irq-suspend-timeout is only active when an application uses
>> prefer-busy-polling and in that case, produces a nice alternating
>> pattern of application processing and networking processing (similar to
>> what we describe in the paper). This then works well with both low and
>> high load.
>
> What about NIC interrupt coalescing? defer_hard_irqs_count was supposed
> to be used with NICs which either don't have IRQ coalescing or have a
> broken implementation. The timeout of 200usec should be perfectly within
> range of what NICs can support.
>
> If the NIC IRQ coalescing works, instead of adding a new timeout value
> we could add a new deferral control (replacing defer_hard_irqs_count)
> which would always kick in after seeing prefer_busy_poll() but also
> not kick in if the busy poll harvested 0 packets.

Maybe I am missing something, but I believe this would have the same
problem that we describe for gro-flush-timeout + defer-hard-irqs. When
busy poll does not harvest packets and the application thread is idle and
goes to sleep, it would then take up to 200 us to get the next interrupt.
This considerably increases tail latencies under low load.

In order to get low latencies under low load, the NIC timeout would have
to be something like 20 us, but under high load the application thread
will be busy for longer than 20 us and the interrupt (and softirq) will
come too early and cause interference.
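
To put rough numbers on the trade-off described above, here is a toy model of a single fixed deferral timer. It is only a sketch: the helper name and the 100 us application busy period are made up for illustration, and nothing here is measured or taken from the patches.

```python
def fixed_timer_effects(timeout_us, app_busy_us):
    """Toy model of one fixed IRQ deferral timer (hypothetical helper).

    Returns a pair:
      - worst-case extra wait for a packet that arrives right after the
        application has gone idle (bounded by the timer length), and
      - number of timer expiries that fall inside one application busy
        period, each of which would trigger softirq processing that
        competes with the application.
    """
    worst_case_idle_wait_us = timeout_us
    expiries_while_busy = app_busy_us // timeout_us
    return worst_case_idle_wait_us, expiries_while_busy


# Assumed busy period of 100 us per batch (illustrative, not a measurement).
long_timer = fixed_timer_effects(200, 100)
short_timer = fixed_timer_effects(20, 100)
print(long_timer)   # (200, 0)
print(short_timer)  # (20, 5)
```

In this model the 200 us timer never interrupts the busy application but can add up to 200 us when idle, while the 20 us timer keeps the idle wait short but expires five times during a single busy period.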

The fundamental problem is that one fixed timer cannot handle dynamic
workloads, regardless of whether the timer is implemented in software or
in the NIC. However, the current software implementation of the timer
makes it easy to add our mechanism, which effectively switches between a
short and a long timeout. I assume it would be more difficult, and add
more overhead, to update the NIC timer all the time.

In other words, the complexity is always the same: A very long timeout
is needed to suspend irqs during periods of successful busy polling and
application processing. A short timeout is needed to receive the next
packet(s) with low latency during idle periods.
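
The choice between the two timeouts can be sketched as a single decision. This is a simplified illustration and not the kernel code: the function name, its parameters, and the example values (20 us and 20000 us) are assumptions made for this sketch.

```python
def next_irq_timeout_us(busy_poll_found_packets,
                        gro_flush_timeout_us,
                        irq_suspend_timeout_us):
    """Pick the deferral timeout for the next period (hypothetical helper).

    A successful busy poll means the application is about to process
    data, so IRQs stay suspended under the long timeout. An empty busy
    poll means the application is going idle, so the short timeout is
    used to pick up the next packet(s) with low latency.
    """
    if busy_poll_found_packets:
        return irq_suspend_timeout_us
    return gro_flush_timeout_us


# Assumed settings: short timeout 20 us, long timeout 20000 us.
print(next_irq_timeout_us(True, 20, 20000))   # 20000
print(next_irq_timeout_us(False, 20, 20000))  # 20
```

The long timeout acts only as a safety net in this picture: as long as the application keeps coming back to busy poll, it is never expected to expire.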

It is tempting to think of the second timeout as 0 and in fact re-enable
interrupts right away. We have tried it, but it leads to a lot of
interrupts and corresponding inefficiencies, since a system below
capacity frequently switches between busy and idle. Using a small
timeout (20 us) for modest deferral and batching when idle is a lot more
efficient.
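
The batching effect of a small idle timeout can be illustrated with a toy count of interrupts. This is a deliberately simplified model, not our experiment: the function name and the arrival times are invented, processing time is ignored, and the deferral window is simply re-armed whenever a timer-driven poll finds packets.

```python
def count_interrupts(arrivals_us, defer_us):
    """Count hardware interrupts for sorted packet arrival times (toy model).

    With IRQs enabled, the next packet raises one interrupt. IRQs then
    stay masked for defer_us; packets arriving inside that window are
    picked up by the timer-driven poll without a new interrupt. The
    window is re-armed while polls keep finding packets, otherwise IRQs
    are re-enabled. defer_us == 0 re-enables IRQs right away.
    """
    interrupts = 0
    i = 0
    n = len(arrivals_us)
    while i < n:
        window_end = arrivals_us[i] + defer_us
        interrupts += 1
        i += 1
        if defer_us == 0:
            continue
        while True:
            found = False
            while i < n and arrivals_us[i] <= window_end:
                i += 1
                found = True
            if not found:
                break
            window_end += defer_us
    return interrupts


arrivals = [0, 5, 12, 30, 100, 104]  # made-up arrival times in us
print(count_interrupts(arrivals, 0))   # 6
print(count_interrupts(arrivals, 20))  # 2
```

With these made-up arrivals, re-enabling interrupts immediately costs one interrupt per packet, while a 20 us deferral collapses each burst into a single interrupt.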

Thanks,
Martin