From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Joe Damato <jdamato@fastly.com>,
Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Martin Karsten <mkarsten@uwaterloo.ca>,
Stanislav Fomichev <sdf@fomichev.me>,
netdev@vger.kernel.org, amritha.nambiar@intel.com,
sridhar.samudrala@intel.com,
Alexander Lobakin <aleksander.lobakin@intel.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Breno Leitao <leitao@debian.org>,
Christian Brauner <brauner@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Jan Kara <jack@suse.cz>,
Jiri Pirko <jiri@resnulli.us>,
Johannes Berg <johannes.berg@intel.com>,
Jonathan Corbet <corbet@lwn.net>,
"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
"open list:FILESYSTEMS (VFS and infrastructure)"
<linux-fsdevel@vger.kernel.org>,
open list <linux-kernel@vger.kernel.org>,
Lorenzo Bianconi <lorenzo@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
Date: Wed, 14 Aug 2024 11:08:45 -0400 [thread overview]
Message-ID: <66bcc87d605_b1f942948@willemb.c.googlers.com.notmuch> (raw)
In-Reply-To: <Zry9AO5Im6rjW0jm@LQ3V64L9R2.home>
Joe Damato wrote:
> On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> > Martin Karsten wrote:
> > > On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > > On 08/12, Martin Karsten wrote:
> > > >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > > >>> On 08/12, Martin Karsten wrote:
> > > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > >>>>> On 08/12, Martin Karsten wrote:
> > > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > >>>>>>> On 08/12, Joe Damato wrote:
>
> [...]
>
> > > >>
> > > >> One of the goals of this patch set is to reduce parameter tuning and make
> > > >> the parameter setting independent of workload dynamics, so it should make
> > > >> things easier. This is of course notwithstanding that per-napi settings
> > > >> would be even better.
> >
> > I don't follow how adding another tunable reduces parameter tuning.
>
> Thanks for taking a look and providing valuable feedback, Willem.
>
> An early draft of the cover letter included some paragraphs which were removed
> for the sake of brevity that we can add back in, which addresses your comment
> above:
>
> The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout)
> is useful, but picking the correct values for these settings is difficult and
> the ideal values change as the type of traffic and workload changes.
>
> For example: picking a large timeout value is good for batching packet
> processing, but induces latency. Picking a small timeout value will interrupt
> user app processing and reduce efficiency. The value chosen would be different
> depending on whether the system is under high load (large timeout) or when less
> busy (small timeout).
>
> As such, adding the new tunable makes it much easier to use the existing ones
> and also produces better performance as shown in the results we presented.
>
> Please let me know if you have any questions; I know that the change we are
> introducing is very subtle and I am happy to expand the cover letter if it'd be
> helpful for you.
>
> My concern was that the cover letter was too long already, but a big takeaway
> for me thus far has been that we should expand the cover letter.
>
> [...]
>
> > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > > disabling napi during busy polling is super useful, but it would still
> > > > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > > a per-napi basis imho.
> > >
> > > Fingers crossed. I hope this patch will be accepted, because it has
> > > practical performance and efficiency benefits, and that this will
> > > further increase the motivation to re-design the entire irq
> > > defer(/suspend) infrastructure for per-napi settings.
> >
> > Overall, the idea of keeping interrupts disabled during event
> > processing is very interesting.
>
> Thanks; I'm happy to hear we are aligned on this.
>
> > Hopefully the interface can be made more intuitive. Or documented more
> > easily. I had to read the kernel patches to fully (perhaps) grasp it.
> >
> > Another +1 on the referenced paper. Pointing out a specific difference
> > in behavior that is unrelated to the protection domain, rather than a
> > straightforward kernel vs user argument. The paper also had some
> > explanation that may be clearer for a commit message than the current
> > cover letter:
> >
> > "user-level network stacks put the application in charge of the entire
> > network stack processing (cf. Section 2). Interrupts are disabled and
> > the application coordinates execution by alternating between
> > processing existing requests and polling the RX queues for new data"
> > " [This series extends this behavior to kernel busy polling, while
> > falling back onto interrupt processing to limit CPU overhead.]
> >
> > "Instead of re-enabling the respective interrupt(s) as soon as
> > epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
> > masked until a subsequent epoll_wait() call comes up empty, i.e., no
> > events of interest are found and the application thread is about to be
> > blocked."
> >
> > "A fallback technical approach would use a kernel timeout set on the
> > return path from epoll_wait(). If necessary, the timeout re-enables
> > interrupts regardless of the application´s (mis)behaviour."
> > [Where misbehavior is not calling epoll_wait again]
> >
> > "The resulting execution model mimics the execution model of typical
> > user-level network stacks and does not add any requirements compared
> > to user-level networking. In fact, it is slightly better, because it
> > can resort to blocking and interrupt delivery, instead of having to
> > continuously busyloop during idle times."
> >
> > This last part shows a preference on your part to a trade-off:
> > you want low latency, but also low cpu utilization if possible.
> > This also came up in this thread. Please state that design decision
> > explicitly.
>
> Sure, we can include that in the list of cover letter updates we
> need to make. I could have called it out more clearly, but in the cover
> letter [1] I mentioned that latency improved for compared CPU usage (i.e.
> CPU efficiency improved):
>
> The overall takeaway from the results below is that the new mechanism
> (suspend20, see below) results in reduced 99th percentile latency and
> increased QPS in the MAX QPS case (compared to the other cases), and
> reduced latency in the lower QPS cases for comparable CPU usage to the
> base case (and less CPU than the busy case).
>
> > There are plenty of workloads where burning a core is acceptable
> > (especially as core count continues increasing), not "slightly worse".
>
> Respectfully, I don't think I'm on board with this argument. "Burning a core"
> has side effects even as core counts increase (power, cooling, etc) and it
> seems the counter argument is equally valid, as well: there are plenty of
> workloads where burning a core is undesirable.
Even more, likely. As this might be usable for standard efficiency
focused production services.
> Using less CPU to get comparable performance is strictly better, even if a
> system can theoretically support the increased CPU/power/cooling load.
If it is always a strict win yes. But falling back onto interrupts
with standard moderation will not match busy polling in all cases.
Different solutions for different workloads. No need to stack rank
them. My request is just to be explicit which design point this
chooses, and that the other design point (continuous busy polling) is
already addressed in Linux kernel busypolling.
> Either way: this is not an either/or. Adding support for the code we've
> proposed will be very beneficial for an important set of workloads without
> taking anything anyway.
>
> - Joe
>
> [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
next prev parent reply other threads:[~2024-08-14 15:08 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-12 12:57 [RFC net-next 0/5] Suspend IRQs during preferred busy poll Joe Damato
2024-08-12 12:57 ` [RFC net-next 4/5] eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set Joe Damato
2024-08-12 13:19 ` Christoph Hellwig
2024-08-12 16:17 ` Matthew Wilcox
2024-08-12 17:49 ` Joe Damato
2024-08-12 17:46 ` Joe Damato
2024-08-12 12:57 ` [RFC net-next 5/5] eventpoll: Control irq suspension for prefer_busy_poll Joe Damato
2024-08-12 20:20 ` Stanislav Fomichev
2024-08-12 21:47 ` Martin Karsten
2024-08-12 20:19 ` [RFC net-next 0/5] Suspend IRQs during preferred busy poll Stanislav Fomichev
2024-08-12 21:46 ` Martin Karsten
2024-08-12 23:03 ` Stanislav Fomichev
2024-08-13 0:04 ` Martin Karsten
2024-08-13 1:54 ` Stanislav Fomichev
2024-08-13 2:35 ` Martin Karsten
2024-08-13 4:07 ` Stanislav Fomichev
2024-08-13 13:18 ` Martin Karsten
2024-08-14 3:16 ` Willem de Bruijn
2024-08-14 14:19 ` Joe Damato
2024-08-14 15:08 ` Willem de Bruijn [this message]
2024-08-14 15:46 ` Joe Damato
2024-08-14 19:53 ` Samiullah Khawaja
2024-08-14 20:42 ` Martin Karsten
2024-08-16 14:27 ` Willem de Bruijn
2024-08-16 14:59 ` Willem de Bruijn
2024-08-16 15:25 ` Joe Damato
2024-08-16 17:01 ` Willem de Bruijn
2024-08-16 20:03 ` Martin Karsten
2024-08-16 20:58 ` Willem de Bruijn
2024-08-17 18:15 ` Martin Karsten
2024-08-18 12:55 ` Willem de Bruijn
2024-08-18 14:51 ` Martin Karsten
2024-08-20 2:36 ` Jakub Kicinski
2024-08-20 14:28 ` Martin Karsten
2024-08-17 10:00 ` Joe Damato
2024-08-14 0:10 ` Jakub Kicinski
2024-08-14 1:14 ` Martin Karsten
2024-08-20 2:07 ` Jakub Kicinski
2024-08-20 14:27 ` Martin Karsten
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=66bcc87d605_b1f942948@willemb.c.googlers.com.notmuch \
--to=willemdebruijn.kernel@gmail.com \
--cc=aleksander.lobakin@intel.com \
--cc=amritha.nambiar@intel.com \
--cc=bigeasy@linutronix.de \
--cc=brauner@kernel.org \
--cc=corbet@lwn.net \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=jack@suse.cz \
--cc=jdamato@fastly.com \
--cc=jiri@resnulli.us \
--cc=johannes.berg@intel.com \
--cc=kuba@kernel.org \
--cc=leitao@debian.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lorenzo@kernel.org \
--cc=mkarsten@uwaterloo.ca \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sdf@fomichev.me \
--cc=sridhar.samudrala@intel.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).