Re: [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: Thomas Gleixner <tglx@linutronix.de>, Ciju Rajan K <crajank@nvidia.com>
Cc: Hans de Goede <hdegoede@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	 Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	 christophe.jaillet@wanadoo.fr,
	platform-driver-x86@vger.kernel.org,  vadimp@nvidia.com
Subject: Re: [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm
Date: Mon, 29 Sep 2025 19:40:21 +0300 (EEST)	[thread overview]
Message-ID: <57573eee-8b8f-2399-0723-f6b68c9d779e@linux.intel.com> (raw)
In-Reply-To: <87zfaf3nnl.ffs@tglx>

[-- Attachment #1: Type: text/plain, Size: 4738 bytes --]

On Sat, 27 Sep 2025, Thomas Gleixner wrote:

> On Tue, Sep 16 2025 at 13:00, Ilpo Järvinen wrote:
> > On Tue, 16 Sep 2025, Ciju Rajan K wrote:
> >> This patch provides a mechanism to detect device causing interrupt
> >> flooding and mask interrupt for this specific device, to isolate
> >> from interrupt handling flow. Use the following criteria: if the
> >> specific interrupt was generated 'N' times during 'T' seconds,
> >> such device is to be considered as broken and will be closed for
> >> getting interrupts. User will be notified through the log error
> >> and will be instructed to replace broken device.
> >> 
> >> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
> >> Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
> >
> > I've a general question on this approach, probably more directed towards 
> > Hans, Thomas, or others who might have some insight.
> >
> > Are drivers expected to build their own workarounds for interrupt storms 
> > due to broken HW such as this? It sounds something that should be at least 
> > in part handled by something generic, while the lower-most level masking 
> > and detection might still need to be done by the driver to handle the HW 
> > specific aspects, there seems to be a generic aspect in all this.
> >
> > Is there something generic for this already? If not, should there be 
> > instead of adding this in full into an end-of-the-food-chain driver?
> 
> We don't have a generic mechanism for that. The core handles only the
> case of unhandled runaway interrupts.
> 
> I agree, that there should be some generic infrastructure for that.
> 
> Something like the incomplete below, which obviously needs some thoughts
> vs. shared interrupts and other details, but you get the idea.

Thanks Thomas!

Ciju, please base this storm detection on this kind of approach. I also 
started to wonder could some non-storming bit end up being detect as 
storming because your v3 approach doesn't do counters per asserted bits so 
if unlucky, non-storming bits might be asserted on the very moment the 
detection threshold was crossed and gets marked as storming even if it's 
not?

-- 
 i.

> ---
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1927,6 +1927,10 @@ static struct irqaction *__free_irq(stru
>  		irq_release_resources(desc);
>  		chip_bus_sync_unlock(desc);
>  		irq_remove_timings(desc);
> +		if (desc->irq_storm) {
> +			kfree(desc->irq_storm);
> +			desc->irq_storm = NULL;
> +		}
>  	}
>  
>  	mutex_unlock(&desc->request_mutex);
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -22,6 +22,44 @@ static DEFINE_TIMER(poll_spurious_irq_ti
>  int irq_poll_cpu;
>  static atomic_t irq_poll_active;
>  
> +/* Runaway interrupt detection */
> +struct irq_storm {
> +	unsigned long		max_cnt;
> +	unsigned long		last_cnt;
> +	irq_storm_cb_t		cb;
> +	void			*cb_arg;
> +};
> +
> +bool irq_register_storm_detection(unsigned int irq, unsigned int max_freq,
> +				  irq_storm_cb_t cb, void *cb_arg)
> +{
> +	struct irq_storm *is;
> +	unsigned long cnt;
> +	bool first;
> +
> +	if (max_freq < 500)
> +		return false;
> +
> +	is = kzalloc(sizeof(*is), GFP_KERNEL);
> +	if (!is)
> +		return false;
> +
> +	/* Adjust to cnt/10ms */
> +	is->max_cnt = max_freq / 100;
> +	is->cb_arg = cb->arg;
> +	is->cb = cb;
> +
> +	scoped_irqdesc_get_and_buslock(irq, 0) {
> +		if (scoped_desc->action) {
> +			is->last_cnt = scoped_desc->tot_cnt;
> +			scoped_desc->irq_storm = is;
> +			return true;
> +		}
> +	}
> +	kfree(is);
> +	return false;
> +}
> +
>  /*
>   * Recovery handler for misrouted interrupts.
>   */
> @@ -217,6 +255,19 @@ static inline bool try_misrouted_irq(uns
>  	return action && (action->flags & IRQF_IRQPOLL);
>  }
>  
> +static void irq_storm_check(struct irq_desc *desc)
> +{
> +	unsigned long delta, now = jiffies;
> +
> +	if (!time_after_irq(now, desc->irq_storm.next_period))
> +		return;
> +	desc->irq_storm.next_period = now + msec_to_jiffies(10);
> +	delta = desc->tot_cnt - desc->irq_storm.last_cnt;
> +	desc->irq_storm.last_cnt = desc->tot_cnt;
> +	if (delta > desc->irq_storm.max_cnt)
> +		desc->irq_storm.cb(irq_desc_get_irq(desc), delta * 100, desc->irq_storm.dev_id);
> +}
> +
>  #define SPURIOUS_DEFERRED	0x80000000
>  
>  void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
> @@ -231,6 +282,9 @@ void note_interrupt(struct irq_desc *des
>  		return;
>  	}
>  
> +	if (desc->irq_storm && action_ret == IRQ_HANDLED)
> +		irq_storm_check(desc);
> +
>  	/*
>  	 * We cannot call note_interrupt from the threaded handler
>  	 * because we need to look at the compound of all handlers
>

     prev parent reply	other threads:[~2025-09-29 16:40 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250916054731.1412031-1-crajank@nvidia.com>
     [not found] ` <20250916054731.1412031-3-crajank@nvidia.com>
2025-09-16 10:00   ` [PATCH platform-next 2/2] platform/mellanox: mlxreg-hotplug: Add support for handling interrupt storm Ilpo Järvinen
2025-09-19 12:46     ` Ciju Rajan K
2025-09-27 15:45     ` Thomas Gleixner
2025-09-29 16:40       ` Ilpo Järvinen [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57573eee-8b8f-2399-0723-f6b68c9d779e@linux.intel.com \
    --to=ilpo.jarvinen@linux.intel.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=christophe.jaillet@wanadoo.fr \
    --cc=crajank@nvidia.com \
    --cc=hdegoede@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=platform-driver-x86@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=vadimp@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox