From: Martin Karsten <mkarsten@uwaterloo.ca>
To: Stanislav Fomichev <sdf@fomichev.me>
Cc: netdev@vger.kernel.org, Joe Damato <jdamato@fastly.com>,
	amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
	Alexander Lobakin <aleksander.lobakin@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Breno Leitao <leitao@debian.org>,
	Christian Brauner <brauner@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Jan Kara <jack@suse.cz>,
	Jiri Pirko <jiri@resnulli.us>,
	Johannes Berg <johannes.berg@intel.com>,
	Jonathan Corbet <corbet@lwn.net>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:FILESYSTEMS (VFS and infrastructure)"
	<linux-fsdevel@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
Date: Mon, 12 Aug 2024 20:04:13 -0400
Message-ID: <d53e8aa6-a5eb-41f4-9a4c-70d04a5ca748@uwaterloo.ca>
In-Reply-To: <ZrqU3kYgL4-OI-qj@mini-arch>

On 2024-08-12 19:03, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>> On 08/12, Joe Damato wrote:
>>>> Greetings:
>>>>
>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>>>> ways of reducing tail latency when using epoll-based busy poll and we'd
>>>> love to get feedback from the list on the code in this series. This is
>>>> the idea I mentioned at netdev conf, for those who were there. Barring
>>>> any major issues, we hope to submit this officially shortly after RFC.
>>>>
>>>> The basic idea for suspending IRQs in this manner was described in an
>>>> earlier paper presented at Sigmetrics 2024 [1].
>>>
>>> Let me explicitly call out the paper. Very nice analysis!
>>
>> Thank you!
>>
>> [snip]
>>
>>>> Here's how it is intended to work:
>>>>     - An administrator sets the existing sysfs parameters for
>>>>       defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>
>>>>     - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>       to a larger value than gro_flush_timeout to enable IRQ suspension.
>>>
>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>> Is it defer_hard_irqs_count? Or do you want a separate timeout only for the
>>> prefer_busy_poll case (why?)? Because looking at the first two patches,
>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>> and I don't see how it helps.
>>
>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
>> deferral mechanism and as such, always active when configured. Its static
>> periodic softirq processing leads to a situation where:
>>
>> - A long gro-flush-timeout causes high latencies when load is sufficiently
>> below capacity, or
>>
>> - a short gro-flush-timeout causes overhead when softirq execution
>> asynchronously competes with application processing at high load.
>>
>> The shortcomings of this are documented (to some extent) by our experiments.
>> See defer20 working well at low load, but having problems at high load,
>> while defer200 has higher latency at low load.
>>
>> irq-suspend-timeout is only active when an application uses
>> prefer-busy-polling and in that case, produces a nice alternating pattern of
>> application processing and networking processing (similar to what we
>> describe in the paper). This then works well with both low and high load.
> 
> So you only want it for the prefer-busy-polling case, makes sense. I was
> a bit confused by the difference between defer200 and suspend200,
> but now I see that defer200 does not enable busypoll.
> 
> I'm assuming that if you enable busypoll in defer200 case, the numbers
> should be similar to suspend200 (ignoring potentially affecting
> non-busypolling queues due to higher gro_flush_timeout).

defer200 + napi busy poll is essentially what we labelled "busy" and it 
does not perform as well, since it still suffers interference between 
application and softirq processing.

>>> Maybe expand more on what code paths are we trying to improve? Existing
>>> busy polling code is not super readable, so would be nice to simplify
>>> it a bit in the process (if possible) instead of adding one more tunable.
>>
>> There are essentially three possible loops for network processing:
>>
>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>
>> 2) timer -> softirq -> napi poll; this is the deferred irq processing scheme
>> with the shortcomings described above
>>
>> 3) epoll -> busy-poll -> napi poll
>>
>> If a system is configured for 1), not much can be done, as it is difficult
>> to interject anything into this loop without adding state and side effects.
>> This is what we tried for the paper, but it ended up being a hack.
>>
>> If however the system is configured for irq deferral, Loops 2) and 3)
>> "wrestle" with each other for control. Injecting the larger
>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>> of Loop 3) and creates the nice pattern described above.
> 
> And you hit (2) when the epoll goes to sleep and/or when the userspace
> isn't fast enough to keep up with the timer, presumably? I wonder
> if we need to use this opportunity and do proper API as Joe hints in the
> cover letter. Something over netlink to say "I'm gonna busy-poll on
> this queue / napi_id and with this timeout". And then we can essentially make
> gro_flush_timeout per queue (and avoid
> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> too hacky already :-(

If someone implemented the necessary changes to make these
parameters per-napi, this would improve things further, but note that
the current proposal gives strong performance across a range of
workloads, which is otherwise difficult to impossible to achieve.

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake 
of an individual queue or application to make sure that IRQ suspension 
is enabled/disabled right away when the state of the system changes from 
busy to idle and back.
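
To illustrate the switch between the two timeouts, here is a minimal
Python sketch. It is a toy model, not the kernel code: the function
name and the app_busy flag are made up for illustration, and it
assumes that "busy" means the most recent busy-poll iteration found
events for the application.

```python
def timer_timeout_ns(app_busy: bool,
                     gro_flush_timeout_ns: int,
                     irq_suspend_timeout_ns: int) -> int:
    """Pick the timeout used to arm the 'timer' stage of Loop 2).

    Hypothetical helper for illustration only.
    app_busy=True  -> IRQs stay suspended; the large timeout is only a
                      safety net in case the application stalls.
    app_busy=False -> fall back to regular IRQ deferral.
    """
    if app_busy:
        return irq_suspend_timeout_ns
    return gro_flush_timeout_ns


# Example with the suspend20 values from the cover letter.
for busy in (True, False):
    print(busy, timer_timeout_ns(busy, 20_000, 20_000_000))
```

The switch is taken on every transition, so suspension starts or ends
as soon as the busy/idle state changes rather than after a timer
period has elapsed.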

>> [snip]
>>
>>>>     - suspendX:
>>>>       - set defer_hard_irqs to 100
>>>>       - set gro_flush_timeout to X,000
>>>>       - set irq_suspend_timeout to 20,000,000
>>>>       - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>>>         busy_poll_budget = 64, prefer_busy_poll = true)
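
The sysfs part of a suspendX setup could be scripted roughly as in
the Python sketch below. Assumptions: the helper names are made up,
napi_defer_hard_irqs and gro_flush_timeout are the existing per-device
files under /sys/class/net/<dev>/, and irq_suspend_timeout is assumed
to live next to them as proposed in this RFC (it does not exist on
kernels without the series). The sysfs root is a parameter so the
code can be tried against a scratch directory.

```python
from pathlib import Path


def suspend_settings(x: int) -> dict[str, int]:
    """Return the suspendX values; timeouts are in nanoseconds."""
    settings = {
        "napi_defer_hard_irqs": 100,
        "gro_flush_timeout": x * 1000,        # "X,000"
        "irq_suspend_timeout": 20_000_000,    # proposed parameter (assumed path)
    }
    # The suspend timeout is meant to be larger than gro_flush_timeout.
    if settings["irq_suspend_timeout"] <= settings["gro_flush_timeout"]:
        raise ValueError("irq_suspend_timeout must exceed gro_flush_timeout")
    return settings


def apply_settings(dev: str, settings: dict[str, int],
                   sysfs_root: str = "/sys/class/net") -> None:
    """Write each value to <sysfs_root>/<dev>/<name> (needs root on real sysfs)."""
    for name, value in settings.items():
        (Path(sysfs_root) / dev / name).write_text(f"{value}\n")


# Usage (hypothetical device name):
#   apply_settings("eth0", suspend_settings(20))
```
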
>>>
>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
>>> to busy_poll sysctl value?
>>
>> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
>> busy-polling even when the application does not actually need or want it.
>> Only one iteration through the busy loop is needed to make the new scheme
>> work. Additional napi busy polling over and above is optional.
> 
> Ack, thanks, was trying to understand why not stay with
> busy_poll_usecs=64 for consistency. But I guess you were just
> trying to show that patch 4/5 works.

Right, and we would potentially be wasting CPU cycles by adding more 
busy-looping.
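
For reference, the epoll settings used in the suspendX runs
(busy_poll_usecs = 0, busy_poll_budget = 64, prefer_busy_poll = true)
could be prepared as in the Python sketch below. Assumptions: the
layout follows struct epoll_params from the uapi eventpoll header
(u32, u16, u8, one pad byte), and the ioctl number uses the
asm-generic _IOW encoding found on x86 and arm64; other architectures
encode the direction bits differently. The helper names are made up.

```python
import struct

EPOLL_IOC_TYPE = 0x8A
EPOLL_PARAMS_FMT = "=IHBx"   # busy_poll_usecs, busy_poll_budget, prefer_busy_poll, pad


def _iow(ioc_type: int, nr: int, size: int) -> int:
    """asm-generic _IOW(): write direction (1) in bits 30-31, size in bits 16-29."""
    return (1 << 30) | (size << 16) | (ioc_type << 8) | nr


EPIOCSPARAMS = _iow(EPOLL_IOC_TYPE, 0x01, struct.calcsize(EPOLL_PARAMS_FMT))


def pack_epoll_params(busy_poll_usecs: int, busy_poll_budget: int,
                      prefer_busy_poll: bool) -> bytes:
    """Serialize the three fields into the 8-byte struct."""
    return struct.pack(EPOLL_PARAMS_FMT, busy_poll_usecs,
                       busy_poll_budget, int(prefer_busy_poll))


params = pack_epoll_params(0, 64, True)

# On a kernel that supports the ioctl, with ep = select.epoll():
#   import fcntl
#   fcntl.ioctl(ep.fileno(), EPIOCSPARAMS, params)
```

With busy_poll_usecs = 0 the application asks for no extra
busy-looping; prefer_busy_poll alone is what triggers the single pass
through the busy loop described above.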

Thanks,
Martin
