From: Stanislav Fomichev <sdf@fomichev.me>
To: Joe Damato <jdamato@fastly.com>
Cc: netdev@vger.kernel.org, mkarsten@uwaterloo.ca,
amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
Alexander Lobakin <aleksander.lobakin@intel.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Breno Leitao <leitao@debian.org>,
Christian Brauner <brauner@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Jan Kara <jack@suse.cz>,
Jiri Pirko <jiri@resnulli.us>,
Johannes Berg <johannes.berg@intel.com>,
Jonathan Corbet <corbet@lwn.net>,
"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)"
<linux-fsdevel@vger.kernel.org>,
open list <linux-kernel@vger.kernel.org>,
Lorenzo Bianconi <lorenzo@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
Date: Mon, 12 Aug 2024 13:19:36 -0700
Message-ID: <ZrpuWMoXHxzPvvhL@mini-arch>
In-Reply-To: <20240812125717.413108-1-jdamato@fastly.com>
On 08/12, Joe Damato wrote:
> Greetings:
>
> Martin Karsten (CC'd) and I have been collaborating on some ideas about
> ways of reducing tail latency when using epoll-based busy poll and we'd
> love to get feedback from the list on the code in this series. This is
> the idea I mentioned at netdev conf, for those who were there. Barring
> any major issues, we hope to submit this officially shortly after RFC.
>
> The basic idea for suspending IRQs in this manner was described in an
> earlier paper presented at Sigmetrics 2024 [1].
Let me explicitly call out the paper. Very nice analysis!
> Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for
> epoll_params") introduced the ability to enable or disable preferred
> busy poll mode on a specific epoll context using an ioctl
> (EPIOCSPARAMS).
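A minimal userspace sketch of that ioctl, assuming a Linux 6.9+ kernel built with CONFIG_NET_RX_BUSY_POLL. The struct mirrors the uapi struct epoll_params from linux/eventpoll.h (where the flag field is named prefer_busy). The demo_ prefixes and helper names are hypothetical, chosen only so the snippet compiles whether or not the installed headers already define the struct. Kernels without support are expected to reject the ioctl with an error such as ENOTTY, which the helpers return as -errno.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical local mirror of uapi struct epoll_params (8 bytes). */
struct demo_epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t  prefer_busy;   /* 0 or 1 */
	uint8_t  pad;           /* must be zero */
};

/* Same encoding as EPIOCSPARAMS: _IOW(0x8A, 0x01, struct epoll_params). */
#define DEMO_EPIOCSPARAMS _IOW(0x8A, 0x01, struct demo_epoll_params)

/* Enable preferred busy poll on an existing epoll fd; 0 or -errno. */
static int enable_prefer_busy_poll(int epfd, uint32_t usecs, uint16_t budget)
{
	struct demo_epoll_params p = {
		.busy_poll_usecs = usecs,
		.busy_poll_budget = budget,
		.prefer_busy = 1,
		.pad = 0,
	};

	if (ioctl(epfd, DEMO_EPIOCSPARAMS, &p) == -1)
		return -errno;
	return 0;
}

/* Create an epoll fd with preferred busy poll enabled; fd or -errno. */
static int epoll_create_busy(uint32_t usecs, uint16_t budget)
{
	int epfd = epoll_create1(EPOLL_CLOEXEC);
	if (epfd == -1)
		return -errno;

	int rc = enable_prefer_busy_poll(epfd, usecs, budget);
	if (rc < 0) {
		close(epfd);
		return rc;
	}
	return epfd;
}
```

For example, epoll_create_busy(64, 64) matches the "busy" scenario's per-context settings. Budgets above 64 may additionally require CAP_NET_ADMIN.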
>
> This series extends preferred busy poll mode by adding a sysfs parameter,
> irq_suspend_timeout, which, when used in combination with preferred busy
> poll, suspends device IRQs for up to irq_suspend_timeout nanoseconds.
>
> Important call outs:
> - Enabling per epoll-context preferred busy poll will now effectively
> lead to a nonblocking iteration through napi_busy_loop, even when
> busy_poll_usecs is 0. See patch 4.
>
> - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa:
> microchip: ksz9477: split half-duplex monitoring function"), but
> may need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate
> data-race of busy_poll_usecs"), which was picked up by the vfs folks, makes
> its way into net-next.
>
> - In the future, time permitting, I hope to enable support for
> napi_defer_hard_irqs, gro_flush_timeout (introduced in commit
> 6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and
> irq_suspend_timeout (introduced in this series) on a per-NAPI basis
> (presumably via netdev-genl).
>
> ~ Description of the changes
>
> The overall idea is that IRQ suspension is introduced via a sysfs
> parameter which controls the maximum time that IRQs can be suspended.
>
> Here's how it is intended to work:
> - An administrator sets the existing sysfs parameters for
> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>
> - An administrator sets the new sysfs parameter irq_suspend_timeout
> to a larger value than gro_flush_timeout to enable IRQ suspension.
Can you expand more on what the problem is with the existing gro_flush_timeout?
Is it defer_hard_irqs_count? Or do you want a separate timeout only for the
prefer_busy_poll case (and if so, why)? Looking at the first two patches,
you essentially replace all usages of gro_flush_timeout with a new variable,
and I don't see how that helps.
Maybe expand more on which code paths we are trying to improve? The existing
busy polling code is not super readable, so it would be nice to simplify
it a bit in the process (if possible) instead of adding one more tunable.
> - The user application issues the existing epoll ioctl to set the
> prefer_busy_poll flag on the epoll context.
>
> - The user application then calls epoll_wait to busy poll for network
> events, as it normally would.
>
> - If epoll_wait returns events to userland, IRQs are suspended for the
> duration of irq_suspend_timeout.
>
> - If epoll_wait finds no events and the thread is about to go to
> sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is
> resumed.
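The administrator step in the flow above could look roughly like this sketch. napi_defer_hard_irqs and gro_flush_timeout are existing per-device files under /sys/class/net/<dev>/; irq_suspend_timeout is the file this RFC proposes, so its exact name and location are assumptions until the series lands. The base directory is a parameter only so the helper can be exercised outside sysfs, and the "suspend timeout must exceed gro_flush_timeout" check is added by this sketch to mirror the guidance above, not something known to be enforced by the kernel.

```c
#include <errno.h>
#include <stdio.h>

/* Write one per-device tunable, e.g. <base>/<dev>/gro_flush_timeout.
 * base is normally "/sys/class/net". Returns 0 or -errno. */
static int write_netdev_param(const char *base, const char *dev,
			      const char *param, unsigned long value)
{
	char path[512];
	int n = snprintf(path, sizeof(path), "%s/%s/%s", base, dev, param);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	FILE *f = fopen(path, "w");
	if (!f)
		return -errno;

	int rc = 0;
	if (fprintf(f, "%lu\n", value) < 0)
		rc = -EIO;
	/* sysfs store errors typically surface when the buffer is flushed */
	if (fclose(f) != 0 && rc == 0)
		rc = -errno;
	return rc;
}

/* Deferral settings plus the RFC's (hypothetically named) suspend timeout.
 * Timeouts are in nanoseconds. */
static int setup_irq_suspend(const char *base, const char *dev,
			     unsigned long defer_hard_irqs,
			     unsigned long gro_flush_timeout_ns,
			     unsigned long irq_suspend_timeout_ns)
{
	int rc;

	if (irq_suspend_timeout_ns <= gro_flush_timeout_ns)
		return -EINVAL; /* sketch-level sanity check */
	rc = write_netdev_param(base, dev, "napi_defer_hard_irqs", defer_hard_irqs);
	if (rc)
		return rc;
	rc = write_netdev_param(base, dev, "gro_flush_timeout", gro_flush_timeout_ns);
	if (rc)
		return rc;
	return write_netdev_param(base, dev, "irq_suspend_timeout",
				  irq_suspend_timeout_ns);
}
```

As root, setup_irq_suspend("/sys/class/net", "eth0", 100, 20000, 20000000) would correspond to the suspend20 scenario below ("eth0" is a placeholder interface name).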
>
> As long as epoll_wait is retrieving events, IRQs (and softirq
> processing) for the NAPI being polled remain disabled. Unless IRQ
> suspension is continued by subsequent calls to epoll_wait, it
> automatically times out after the irq_suspend_timeout timer expires.
>
> When network traffic reduces, eventually a busy poll loop in the kernel
> will retrieve no data. When this occurs, regular deferral using
> gro_flush_timeout for the polled NAPI is immediately re-enabled. Regular
> deferral is also immediately re-enabled when the epoll context is
> destroyed.
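To make these transitions concrete, here is a hypothetical user-space model of the per-NAPI state; it is not the kernel code from the series, and every name in it is invented for illustration. Returning events re-arms suspension, an empty busy poll or closing the epoll context falls back to deferral immediately, and the timer only takes effect if neither of those happens first.

```c
#include <stdint.h>

/* Hypothetical model of one NAPI's interrupt mode; times in nanoseconds. */
enum irq_mode { IRQ_DEFERRAL, IRQ_SUSPENDED };

struct napi_model {
	enum irq_mode mode;
	uint64_t suspend_deadline;     /* meaningful while IRQ_SUSPENDED */
	uint64_t irq_suspend_timeout;  /* copy of the sysfs value */
};

/* epoll_wait returned events: (re)arm suspension for a full timeout. */
static void on_events_returned(struct napi_model *m, uint64_t now)
{
	m->mode = IRQ_SUSPENDED;
	m->suspend_deadline = now + m->irq_suspend_timeout;
}

/* Busy poll found no data and the thread is about to sleep, or the epoll
 * context was destroyed: resume gro_flush_timeout/defer_hard_irqs at once. */
static void on_idle_or_close(struct napi_model *m)
{
	m->mode = IRQ_DEFERRAL;
}

/* Fallback: suspension lapses by itself once the deadline passes. */
static void on_timer(struct napi_model *m, uint64_t now)
{
	if (m->mode == IRQ_SUSPENDED && now >= m->suspend_deadline)
		m->mode = IRQ_DEFERRAL;
}
```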
>
> ~ Benchmark configs & descriptions
>
> These changes were benchmarked with memcached [2] using the
> benchmarking tool mutilate [3].
>
> To facilitate benchmarking, a small patch [4] was applied to
> memcached 1.6.29 (the latest memcached release as of this RFC) to allow
> setting per-epoll context preferred busy poll and other settings
> via environment variables.
>
> Multiple scenarios were benchmarked as described below
> and the scripts used for producing these results can be found on
> github [5].
>
> (note: all scenarios use NAPI-based traffic splitting via SO_INCOMING_NAPI_ID
> by passing -N to memcached):
>
> - base: Other than NAPI-based traffic splitting, no other options are
> enabled.
> - busy:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to 200,000
> - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
> busy_poll_budget = 64, prefer_busy_poll = true)
> - deferX:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to X,000
[..]
> - suspendX:
> - set defer_hard_irqs to 100
> - set gro_flush_timeout to X,000
> - set irq_suspend_timeout to 20,000,000
> - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> busy_poll_budget = 64, prefer_busy_poll = true)
What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
to the busy_poll sysctl value?
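For reference, the scenario settings quoted above can be captured in a small helper, reading "X,000" as X thousand nanoseconds (gro_flush_timeout is specified in nanoseconds). This is an illustrative sketch with hypothetical struct and function names; because the quoted list is trimmed at "[..]", whatever that elides is not represented. Note that the suspend scenario sets busy_poll_usecs to 0, exactly as quoted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One benchmark scenario; timeouts in nanoseconds, 0 means "not set". */
struct scenario_cfg {
	uint32_t defer_hard_irqs;
	uint64_t gro_flush_timeout_ns;
	uint64_t irq_suspend_timeout_ns;
	bool     busy_poll;          /* per-epoll ioctl applied? */
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
};

/* Fill *out for "base", "busy", "defer" or "suspend"; x is the X in
 * deferX/suspendX. Returns 0, or -1 for an unknown scenario name. */
static int scenario_config(const char *name, uint32_t x, struct scenario_cfg *out)
{
	struct scenario_cfg c = {0};

	if (strcmp(name, "base") == 0) {
		/* only NAPI-based traffic splitting (memcached -N) */
	} else if (strcmp(name, "busy") == 0) {
		c.defer_hard_irqs = 100;
		c.gro_flush_timeout_ns = 200000;
		c.busy_poll = true;
		c.busy_poll_usecs = 64;
		c.busy_poll_budget = 64;
	} else if (strcmp(name, "defer") == 0) {
		c.defer_hard_irqs = 100;
		c.gro_flush_timeout_ns = (uint64_t)x * 1000;
	} else if (strcmp(name, "suspend") == 0) {
		c.defer_hard_irqs = 100;
		c.gro_flush_timeout_ns = (uint64_t)x * 1000;
		c.irq_suspend_timeout_ns = 20000000;
		c.busy_poll = true;
		c.busy_poll_usecs = 0;   /* as quoted for suspendX */
		c.busy_poll_budget = 64;
	} else {
		return -1;
	}
	*out = c;
	return 0;
}
```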