From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: Thomas Gleixner <tglx@linutronix.de>, Ciju Rajan K <crajank@nvidia.com>
Cc: Hans de Goede <hdegoede@redhat.com>,
LKML <linux-kernel@vger.kernel.org>,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
christophe.jaillet@wanadoo.fr,
platform-driver-x86@vger.kernel.org, vadimp@nvidia.com
Subject: Re: [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm
Date: Mon, 29 Sep 2025 19:40:21 +0300 (EEST) [thread overview]
Message-ID: <57573eee-8b8f-2399-0723-f6b68c9d779e@linux.intel.com> (raw)
In-Reply-To: <87zfaf3nnl.ffs@tglx>
[-- Attachment #1: Type: text/plain, Size: 4738 bytes --]
On Sat, 27 Sep 2025, Thomas Gleixner wrote:
> On Tue, Sep 16 2025 at 13:00, Ilpo Järvinen wrote:
> > On Tue, 16 Sep 2025, Ciju Rajan K wrote:
> >> This patch provides a mechanism to detect device causing interrupt
> >> flooding and mask interrupt for this specific device, to isolate
> >> from interrupt handling flow. Use the following criteria: if the
> >> specific interrupt was generated 'N' times during 'T' seconds,
> >> such device is to be considered as broken and will be closed for
> >> getting interrupts. User will be notified through the log error
> >> and will be instructed to replace broken device.
> >>
> >> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
> >> Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
> >
> > I've a general question on this approach, probably more directed towards
> > Hans, Thomas, or others who might have some insight.
> >
> > Are drivers expected to build their own workarounds for interrupt storms
> > due to broken HW such as this? It sounds something that should be at least
> > in part handled by something generic, while the lower-most level masking
> > and detection might still need to be done by the driver to handle the HW
> > specific aspects, there seems to be a generic aspect in all this.
> >
> > Is there something generic for this already? If not, should there be
> > instead of adding this in full into an end-of-the-food-chain driver?
>
> We don't have a generic mechanism for that. The core handles only the
> case of unhandled runaway interrupts.
>
> I agree, that there should be some generic infrastructure for that.
>
> Something like the incomplete below, which obviously needs some thoughts
> vs. shared interrupts and other details, but you get the idea.
Thanks Thomas!
Ciju, please base this storm detection on this kind of approach. I also
started to wonder could some non-storming bit end up being detect as
storming because your v3 approach doesn't do counters per asserted bits so
if unlucky, non-storming bits might be asserted on the very moment the
detection threshold was crossed and gets marked as storming even if it's
not?
--
i.
> ---
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1927,6 +1927,10 @@ static struct irqaction *__free_irq(stru
> irq_release_resources(desc);
> chip_bus_sync_unlock(desc);
> irq_remove_timings(desc);
> + if (desc->irq_storm) {
> + kfree(desc->irq_storm);
> + desc->irq_storm = NULL;
> + }
> }
>
> mutex_unlock(&desc->request_mutex);
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -22,6 +22,44 @@ static DEFINE_TIMER(poll_spurious_irq_ti
> int irq_poll_cpu;
> static atomic_t irq_poll_active;
>
> +/* Runaway interrupt detection */
> +struct irq_storm {
> + unsigned long max_cnt;
> + unsigned long last_cnt;
> + irq_storm_cb_t cb;
> + void *cb_arg;
> +};
> +
> +bool irq_register_storm_detection(unsigned int irq, unsigned int max_freq,
> + irq_storm_cb_t cb, void *cb_arg)
> +{
> + struct irq_storm *is;
> + unsigned long cnt;
> + bool first;
> +
> + if (max_freq < 500)
> + return false;
> +
> + is = kzalloc(sizeof(*is), GFP_KERNEL);
> + if (!is)
> + return false;
> +
> + /* Adjust to cnt/10ms */
> + is->max_cnt = max_freq / 100;
> + is->cb_arg = cb->arg;
> + is->cb = cb;
> +
> + scoped_irqdesc_get_and_buslock(irq, 0) {
> + if (scoped_desc->action) {
> + is->last_cnt = scoped_desc->tot_cnt;
> + scoped_desc->irq_storm = is;
> + return true;
> + }
> + }
> + kfree(is);
> + return false;
> +}
> +
> /*
> * Recovery handler for misrouted interrupts.
> */
> @@ -217,6 +255,19 @@ static inline bool try_misrouted_irq(uns
> return action && (action->flags & IRQF_IRQPOLL);
> }
>
> +static void irq_storm_check(struct irq_desc *desc)
> +{
> + unsigned long delta, now = jiffies;
> +
> + if (!time_after_irq(now, desc->irq_storm.next_period))
> + return;
> + desc->irq_storm.next_period = now + msec_to_jiffies(10);
> + delta = desc->tot_cnt - desc->irq_storm.last_cnt;
> + desc->irq_storm.last_cnt = desc->tot_cnt;
> + if (delta > desc->irq_storm.max_cnt)
> + desc->irq_storm.cb(irq_desc_get_irq(desc), delta * 100, desc->irq_storm.dev_id);
> +}
> +
> #define SPURIOUS_DEFERRED 0x80000000
>
> void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
> @@ -231,6 +282,9 @@ void note_interrupt(struct irq_desc *des
> return;
> }
>
> + if (desc->irq_storm && action_ret == IRQ_HANDLED)
> + irq_storm_check(desc);
> +
> /*
> * We cannot call note_interrupt from the threaded handler
> * because we need to look at the compound of all handlers
>
prev parent reply other threads:[~2025-09-29 16:40 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20250916054731.1412031-1-crajank@nvidia.com>
[not found] ` <20250916054731.1412031-3-crajank@nvidia.com>
2025-09-16 10:00 ` [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm Ilpo Järvinen
2025-09-19 12:46 ` Ciju Rajan K
2025-09-27 15:45 ` Thomas Gleixner
2025-09-29 16:40 ` Ilpo Järvinen [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57573eee-8b8f-2399-0723-f6b68c9d779e@linux.intel.com \
--to=ilpo.jarvinen@linux.intel.com \
--cc=andriy.shevchenko@linux.intel.com \
--cc=christophe.jaillet@wanadoo.fr \
--cc=crajank@nvidia.com \
--cc=hdegoede@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=platform-driver-x86@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=vadimp@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox