Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Long Li <longli@microsoft.com>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	"longli@linuxonhyperv.com" <longli@linuxonhyperv.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Keith Busch <keith.busch@intel.com>, Jens Axboe <axboe@fb.com>,
	Christoph Hellwig <hch@lst.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Hannes Reinecke <hare@suse.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
Date: Sat, 24 Aug 2019 20:55:54 +0800	[thread overview]
Message-ID: <20190824125553.GA8474@ming.t460p> (raw)
In-Reply-To: <CY4PR21MB074173E79C7FC3AC13C69CB3CEA70@CY4PR21MB0741.namprd21.prod.outlook.com>

On Sat, Aug 24, 2019 at 12:27:18AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> >>>>
> >>>> > From: Long Li <longli@microsoft.com>
> >>>> >
> >>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> > possible that the CPU this hardware queue is bound to is flooded by
> >>>> > returning I/O for other CPUs.
> >>>> >
> >>>> > For example, consider the following scenario:
> >>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
> >>>> > and 3 keep sending I/Os
> >>>> >
> >>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> > that CPU 0 spends all the time serving NVMe and other system
> >>>> > interrupts, but doesn't have a chance to run in process context.
> >>>> >
> >>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> > when it detects the scheduler is not making progress. This serves
> >>>multiple purposes:
> >>>> >
> >>>> > 1. This CPU has to be scheduled to complete the request. The other
> >>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
> >>>> > This helps this CPU get out of NVMe interrupts.
> >>>> >
> >>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
> >>>> > can not starve a CPU while servicing I/Os from other CPUs.
> >>>> >
> >>>> > 3. This CPU can make progress on RCU and other work items on its
> >>>queue.
> >>>>
> >>>> The problem is indeed real, but this is the wrong approach in my mind.
> >>>>
> >>>> We already have irqpoll which takes care proper budgeting polling
> >>>> cycles and not hogging the cpu.
> >>>
> >>>The issue isn't unique to NVMe, and can be any fast devices which
> >>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
> >>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
> >>>handler, especially in case that multiple submission CPUs vs. single
> >>>completion CPU.
> >>>
> >>>Some SCSI devices has the same problem too.
> >>>
> >>>Could we consider to add one generic mechanism to cover this kind of
> >>>problem?
> >>>
> >>>One approach I thought of is to allocate one backup thread for handling such
> >>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
> >>>
> >>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
> >>>check if this CPU has taken too long time for handling IRQ(interrupt or
> >>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup
> 
> How do you know if this CPU is spending all the time in do_IRQ()?
> 
> Is it something like:
> If (IRQ_time /elapsed_time > a threshold value)
> 	wake up the backup thread

Yeah, the above could work in theory.

Another approach I thought of is to monitor average irq gap time on each
CPU.

We could use EWMA(Exponential Weighted Moving Average) to do it simply,
such as:

	curr_irq_gap(cpu) = current start time of do_IRQ() on 'cpu' -
			end time of last do_IRQ() on 'cpu'
	avg_irq_gap(cpu) = weight_prev * avg_irq_gap(cpu) + weight_curr * curr_irq_gap(cpu) 

	note:
		weight_prev + weight_curr = 1

When avg_irq_gap(cpu) is close to one small enough threshold, we think irq flood is
detected.

'weight_prev' could be chosen as one big enough value for avoiding short-time flood.


Thanks,
Ming

next prev parent reply	other threads:[~2019-08-24 12:56 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
2019-08-20  9:38   ` Peter Zijlstra
2019-08-21  8:20     ` Long Li
2019-08-21 10:34       ` Peter Zijlstra
2019-08-20  9:39   ` Peter Zijlstra
2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
2019-08-20  9:52   ` Peter Zijlstra
2019-08-21  8:37     ` Long Li
2019-08-21 10:35       ` Peter Zijlstra
2019-08-20 17:33   ` Sagi Grimberg
2019-08-21  8:39     ` Long Li
2019-08-21 17:36       ` Long Li
2019-08-21 21:54         ` Sagi Grimberg
2019-08-24  0:13           ` Long Li
2019-08-23  3:21     ` Ming Lei
2019-08-24  0:27       ` Long Li
2019-08-24 12:55         ` Ming Lei [this message]
2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
2019-08-20  8:59   ` John Garry
2019-08-20 15:05     ` Keith Busch
2019-08-21  7:47     ` Long Li
2019-08-21  9:44       ` Ming Lei
2019-08-21 10:03         ` John Garry
2019-08-21 16:27         ` Long Li
2019-08-22  1:33           ` Ming Lei
2019-08-22  2:00             ` Keith Busch
2019-08-22  2:23               ` Ming Lei
2019-08-22  9:48               ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190824125553.GA8474@ming.t460p \
    --to=ming.lei@redhat.com \
    --cc=axboe@fb.com \
    --cc=hare@suse.com \
    --cc=hch@lst.de \
    --cc=keith.busch@intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=longli@linuxonhyperv.com \
    --cc=longli@microsoft.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.