linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Martin Karsten <mkarsten@uwaterloo.ca>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>,
	netdev@vger.kernel.org, Joe Damato <jdamato@fastly.com>,
	amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
	Alexander Lobakin <aleksander.lobakin@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Breno Leitao <leitao@debian.org>,
	Christian Brauner <brauner@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>, Jan Kara <jack@suse.cz>,
	Jiri Pirko <jiri@resnulli.us>,
	Johannes Berg <johannes.berg@intel.com>,
	Jonathan Corbet <corbet@lwn.net>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:FILESYSTEMS (VFS and infrastructure)"
	<linux-fsdevel@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll
Date: Tue, 20 Aug 2024 10:27:40 -0400	[thread overview]
Message-ID: <3e68c3e0-cada-48e6-8a19-6d5fba86dd40@uwaterloo.ca> (raw)
In-Reply-To: <20240819190755.0ed0a959@kernel.org>

On 2024-08-19 22:07, Jakub Kicinski wrote:
> On Tue, 13 Aug 2024 21:14:40 -0400 Martin Karsten wrote:
>>> What about NIC interrupt coalescing. defer_hard_irqs_count was supposed
>>> to be used with NICs which either don't have IRQ coalescing or have a
>>> broken implementation. The timeout of 200usec should be perfectly within
>>> range of what NICs can support.
>>>
>>> If the NIC IRQ coalescing works, instead of adding a new timeout value
>>> we could add a new deferral control (replacing defer_hard_irqs_count)
>>> which would always kick in after seeing prefer_busy_poll() but also
>>> not kick in if the busy poll harvested 0 packets.
>> Maybe I am missing something, but I believe this would have the same
>> problem that we describe for gro-timeout + defer-irq. When busy poll
>> does not harvest packets and the application thread is idle and goes to
>> sleep, it would then take up to 200 us to get the next interrupt. This
>> considerably increases tail latencies under low load.
>>
>> In order get low latencies under low load, the NIC timeout would have to
>> be something like 20 us, but under high load the application thread will
>> be busy for longer than 20 us and the interrupt (and softirq) will come
>> too early and cause interference.
> 
> An FSM-like diagram would go a long way in clarifying things :)

I agree the suspend mechanism is not trivial and the implementation is 
subtle. It has frequently made our heads hurt while developing this. We 
will take a long hard look at our cover letter and produce other 
documentation to hopefully provide clear explanations.

>> It is tempting to think of the second timeout as 0 and in fact re-enable
>> interrupts right away. We have tried it, but it leads to a lot of
>> interrupts and corresponding inefficiencies, since a system below
>> capacity frequently switches between busy and idle. Using a small
>> timeout (20 us) for modest deferral and batching when idle is a lot more
>> efficient.
> 
> I see. I think we are on the same page. What I was suggesting is to use
> the HW timer instead of the short timer. But I suspect the NIC you're
> using isn't really good at clearing IRQs before unmasking. Meaning that
> when you try to reactivate HW control there's already an IRQ pending
> and it fires pointlessly. That matches my experience with mlx5.
> If the NIC driver was to clear the IRQ state before running the NAPI
> loop, we would have no pending IRQ by the time we unmask and activate
> HW IRQs.

I believe there are additional issues. The problem is that the long 
timeout must engage if and only if prefer-busy is active.

When using NIC coalescing for the short timeout (without gro/defer), an 
interrupt after an idle period will trigger softirq, which will run napi 
polling. At this point, prefer-busy is not active, so NIC interrupts 
would be re-enabled. Then it is not possible for the longer timeout to 
interject to switch control back to polling. In other words, only by 
using the software timer for the short timeout, it is possible to extend 
the timeout without having to reprogram the NIC timer or reach down 
directly and disable interrupts.

Using gro_flush_timeout for the long timeout also has problems, for the 
same underlying reason. In the current napi implementation, 
gro_flush_timeout is not tied to prefer-busy. We'd either have to change 
that and in the process modify the existing deferral mechanism, or 
introduce a state variable to determine whether gro_flush_timeout is 
used as long timeout for irq suspend or whether it is used for its 
default purpose. In an earlier version, we did try something similar to 
the latter and made it work, but it ends up being a lot more convoluted 
than our current proposal.

Thanks,
Martin


      reply	other threads:[~2024-08-20 14:27 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-12 12:57 [RFC net-next 0/5] Suspend IRQs during preferred busy poll Joe Damato
2024-08-12 12:57 ` [RFC net-next 4/5] eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set Joe Damato
2024-08-12 13:19   ` Christoph Hellwig
2024-08-12 16:17     ` Matthew Wilcox
2024-08-12 17:49       ` Joe Damato
2024-08-12 17:46     ` Joe Damato
2024-08-12 12:57 ` [RFC net-next 5/5] eventpoll: Control irq suspension for prefer_busy_poll Joe Damato
2024-08-12 20:20   ` Stanislav Fomichev
2024-08-12 21:47     ` Martin Karsten
2024-08-12 20:19 ` [RFC net-next 0/5] Suspend IRQs during preferred busy poll Stanislav Fomichev
2024-08-12 21:46   ` Martin Karsten
2024-08-12 23:03     ` Stanislav Fomichev
2024-08-13  0:04       ` Martin Karsten
2024-08-13  1:54         ` Stanislav Fomichev
2024-08-13  2:35           ` Martin Karsten
2024-08-13  4:07             ` Stanislav Fomichev
2024-08-13 13:18               ` Martin Karsten
2024-08-14  3:16                 ` Willem de Bruijn
2024-08-14 14:19                   ` Joe Damato
2024-08-14 15:08                     ` Willem de Bruijn
2024-08-14 15:46                       ` Joe Damato
2024-08-14 19:53                 ` Samiullah Khawaja
2024-08-14 20:42                   ` Martin Karsten
2024-08-16 14:27                     ` Willem de Bruijn
2024-08-16 14:59                       ` Willem de Bruijn
2024-08-16 15:25                         ` Joe Damato
2024-08-16 17:01                           ` Willem de Bruijn
2024-08-16 20:03                             ` Martin Karsten
2024-08-16 20:58                               ` Willem de Bruijn
2024-08-17 18:15                                 ` Martin Karsten
2024-08-18 12:55                                   ` Willem de Bruijn
2024-08-18 14:51                                     ` Martin Karsten
2024-08-20  2:36                                       ` Jakub Kicinski
2024-08-20 14:28                                         ` Martin Karsten
2024-08-17 10:00                             ` Joe Damato
2024-08-14  0:10     ` Jakub Kicinski
2024-08-14  1:14       ` Martin Karsten
2024-08-20  2:07         ` Jakub Kicinski
2024-08-20 14:27           ` Martin Karsten [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3e68c3e0-cada-48e6-8a19-6d5fba86dd40@uwaterloo.ca \
    --to=mkarsten@uwaterloo.ca \
    --cc=aleksander.lobakin@intel.com \
    --cc=amritha.nambiar@intel.com \
    --cc=bigeasy@linutronix.de \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=jack@suse.cz \
    --cc=jdamato@fastly.com \
    --cc=jiri@resnulli.us \
    --cc=johannes.berg@intel.com \
    --cc=kuba@kernel.org \
    --cc=leitao@debian.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lorenzo@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=sridhar.samudrala@intel.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).