From: ming.lei@redhat.com (Ming Lei)
Subject: [PATCH V2 3/3] nvme pci: introduce module parameter of 'default_queues'
Date: Thu, 3 Jan 2019 18:34:56 +0800 [thread overview]
Message-ID: <20190103103455.GB29693@ming.t460p> (raw)
In-Reply-To: <f275dc74-0c7a-a7eb-3b7e-783254f29922@oracle.com>
On Thu, Jan 03, 2019 at 12:36:42PM +0800, Shan Hai wrote:
>
>
> > On 2019/1/3 11:31 AM, Ming Lei wrote:
> > > On Thu, Jan 03, 2019 at 11:11:07AM +0800, Shan Hai wrote:
> >>
> >>
> >>> On 2019/1/3 10:52 AM, Shan Hai wrote:
> >>>
> >>>
> >>>> On 2019/1/3 10:12 AM, Ming Lei wrote:
> >>>>> On Wed, Jan 02, 2019 at 02:11:22PM -0600, Bjorn Helgaas wrote:
> >>>>> [Sorry about the quote corruption below. I'm responding with gmail in
> >>>>> plain text mode, but seems like it corrupted some of the quoting when
> >>>>> saving as a draft]
> >>>>>
> >>>>> On Mon, Dec 31, 2018 at 11:47 PM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>> >
> >>>>> > On Mon, Dec 31, 2018 at 03:24:55PM -0600, Bjorn Helgaas wrote:
> >>>>> > > On Fri, Dec 28, 2018 at 9:27 PM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>> > > >
> >>>>> > > > On big system with lots of CPU cores, it is easy to consume up irq
> >>>>> > > > vectors by assigning defaut queue with num_possible_cpus() irq vectors.
> >>>>> > > > Meantime it is often not necessary to allocate so many vectors for
> >>>>> > > > reaching NVMe's top performance under that situation.
> >>>>> > >
> >>>>> > > s/defaut/default/
> >>>>> > >
> >>>>> > > > This patch introduces module parameter of 'default_queues' to try
> >>>>> > > > to address this issue reported by Shan Hai.
> >>>>> > >
> >>>>> > > Is there a URL to this report by Shan?
> >>>>> >
> >>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021863.html
> >>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021862.html
> >>>>> >
> >>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021872.html
> >>>>>
> >>>>> It'd be good to include this. I think the first is the interesting
> >>>>> one. It'd be nicer to have an https://lore.kernel.org/... URL, but it
> >>>>> doesn't look like lore hosts linux-nvme yet. (Is anybody working on
> >>>>> that? I have some archives I could contribute, but other folks
> >>>>> probably have more.)
> >>>>>
> >>>>>>>
> >>>>>>> Is there some way you can figure this out automatically instead of
> >>>>>>> forcing the user to use a module parameter?
> >>>>>>
> >>>>>> Not yet, otherwise, I won't post this patch out.
> >>>>>>
> >>>>>>> If not, can you provide some guidance in the changelog for how a user
> >>>>>>> is supposed to figure out when it's needed and what the value should
> >>>>>>> be? If you add the parameter, I assume that will eventually have to
> >>>>>>> be mentioned in a release note, and it would be nice to have something
> >>>>>>> to start from.
> >>>>>>
> >>>>>> Ok, that is a good suggestion, how about documenting it via the
> >>>>>> following words:
> >>>>>>
> >>>>>> The number of IRQ vectors is a system-wide resource, and usually it is
> >>>>>> big enough for each device. However, we allocate num_possible_cpus() + 1
> >>>>>> irq vectors for each NVMe PCI controller. If the system has many CPU
> >>>>>> cores, or more than one NVMe controller, IRQ vectors can easily be
> >>>>>> exhausted by NVMe. When this issue is triggered, please try passing a
> >>>>>> smaller default queue count via the 'default_queues' module parameter.
> >>>>>> It usually has to be >= the number of NUMA nodes, and at the same time
> >>>>>> it needs to be big enough to reach NVMe's top performance, which is
> >>>>>> often less than num_possible_cpus() + 1.
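(For illustration only: a parameter like this is typically set either on the
kernel command line as nvme.default_queues=N or through a modprobe
configuration fragment. The value below is a made-up example, not a
recommendation; pick something >= the number of NUMA nodes on the machine.)

```
# /etc/modprobe.d/nvme.conf -- hypothetical example
options nvme default_queues=8
```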
> >>>>>
> >>>>> You say "when this issue is triggered." How does the user know when
> >>>>> this issue is triggered?
> >>>>
> >>>> Any PCI IRQ vector allocation fails.
> >>>>
> >>>>>
> >>>>> The failure in Shan's email (021863.html) is a pretty ugly hotplug
> >>>>> failure and it would take me personally a long time to connect it with
> >>>>> an IRQ exhaustion issue and even longer to dig out this module
> >>>>> parameter to work around it. I suppose if we run out of IRQ numbers,
> >>>>> NVMe itself might work fine, but some other random driver might be
> >>>>> broken?
> >>>>
> >>>> Yeah, seems that is true in Shan's report.
> >>>>
> >>>> However, Shan mentioned that the issue is only triggered in case of
> >>>> CPU hotplug, especially "The allocation is caused by IRQ migration of
> >>>> non-managed interrupts from dying to online CPUs."
> >>>>
> >>>> I still don't understand why new IRQ vector allocation is involved
> >>>> under CPU hotplug, since Shan mentioned that there was no IRQ
> >>>> exhaustion issue during booting.
> >>>>
> >>>
> >>> Yes, the bug can be reproduced easily by CPU hotplug.
> >>> First of all we have to separate PCI IRQs from CPU IRQ vectors. MSI-X
> >>> permits up to 2048 interrupts per device, but the CPU (x86 as an
> >>> example) provides at most 255 interrupt vectors, and the sad fact is
> >>> that these vectors are not all available for peripheral devices.
> >>>
> >>> So even though the controllers are rich in PCI IRQ space and have got
> >>> thousands of vectors to use, the heavy lifting is done by the scarce
> >>> CPU irq vectors.
> >>>
> >>> CPU hotplug causes the IRQ vector exhaustion problem because the
> >>> interrupt handlers of a controller are migrated from the dying CPU to
> >>> an online CPU whenever the driver's irq affinity is not managed by the
> >>> kernel; drivers whose smp_affinity can be set through the procfs
> >>> interface belong to this class.
> >>>
> >>> The irq migration does not do any irq free/realloc work, so the irqs
> >>> of such a controller are migrated to the target CPU cores according to
> >>> their irq affinity hints, and each consumes an irq vector on the
> >>> target core.
> >>>
> >>> If we try to offline 360 cores out of 384 in total on a NUMA system
> >>> attached to 6 NVMe controllers and 6 NICs, we are out of luck and
> >>> observe a kernel panic due to I/O failure.
> >>>
> >>
> >> Put simply, we ran out of CPU irq vectors on CPU hotplug rather than
> >> MSI-X vectors. Adding this knob to the NVMe driver lets it be a good
> >> citizen, considering the drivers out there whose irqs are still not
> >> managed by the kernel and get migrated between CPU cores on
> >> hot-plugging.
> >
> > Yeah, we all think this approach might address the issue, sort of.
> >
> > But in reality, this kind of workaround can be hard to use, given that
> > people may not easily conclude that this kind of failure should be
> > addressed by reducing 'nvme.default_queues'. At the very least, we
> > should provide a hint to the user about this solution when the failure
> > is triggered, as mentioned by Bjorn.
> >
> >>
> >> If all drivers' irq affinities were managed by the kernel I guess we
> >> would not be bitten by this bug, but we have not been so lucky to date.
> >
> > I am still not sure why changing affinities may introduce extra irq
> > vector allocation.
> >
>
> Below is some simple math to illustrate the problem:
>
> CPU = 384, NVMe = 6, NIC = 6
> 2 * 6 * 384 local irq vectors are assigned to the controllers' irqs.
>
> Offline 364 CPUs: 6 * 364 NIC irqs are migrated to the 20 remaining
> online CPUs, while the irqs of the NVMe controllers are not, which means
> an extra 6 * 364 local irq vectors on the 20 online CPUs need to be
> assigned to these migrated interrupt handlers.
But 6 * 364 Linux IRQs had already been allocated/assigned before, so why
is there IRQ exhaustion?
Thanks,
Ming
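For what it's worth, the arithmetic in Shan's example can be laid out as a
quick sketch (variable names and per-step comments are mine; the figures come
from the report above, and the sketch is purely illustrative):

```python
# Figures from the report: 384 CPUs, 6 NVMe controllers, 6 NICs,
# 364 CPUs offlined.
cpus = 384
nvme_ctrls = 6
nics = 6
offlined = 364
online = cpus - offlined                 # 20 CPUs stay online

# Boot-time allocation: roughly one irq per controller per CPU,
# i.e. 2 * 6 * 384 in Shan's notation.
total_irqs = (nvme_ctrls + nics) * cpus  # 4608 irqs, spread over all CPUs

# On hotplug, the non-managed NIC irqs of the offlined CPUs are migrated
# to the survivors; the kernel-managed NVMe irqs are not.
migrated_nic_irqs = nics * offlined      # 2184 handlers to re-place
extra_per_online_cpu = migrated_nic_irqs / online

print(total_irqs)            # 4608
print(migrated_nic_irqs)     # 2184
print(extra_per_online_cpu)  # 109.2 extra vectors per surviving core
```

Each migrated handler needs a free vector slot on its target core, so the
per-core load on the 20 survivors grows roughly tenfold, which is where the
pressure on the limited per-CPU vector space comes from.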
Thread overview: 24+ messages
2018-12-29 3:26 [PATCH V2 0/3] nvme pci: two fixes on nvme_setup_irqs Ming Lei
2018-12-29 3:26 ` [PATCH V2 1/3] PCI/MSI: preference to returning -ENOSPC from pci_alloc_irq_vectors_affinity Ming Lei
2018-12-31 22:00 ` Bjorn Helgaas
2018-12-31 22:41 ` Keith Busch
2019-01-01 5:24 ` Ming Lei
2019-01-02 21:02 ` Bjorn Helgaas
2019-01-02 22:46 ` Keith Busch
2018-12-29 3:26 ` [PATCH V2 2/3] nvme pci: fix nvme_setup_irqs() Ming Lei
2018-12-29 3:26 ` [PATCH V2 3/3] nvme pci: introduce module parameter of 'default_queues' Ming Lei
2018-12-31 21:24 ` Bjorn Helgaas
2019-01-01 5:47 ` Ming Lei
2019-01-02 2:14 ` Shan Hai
[not found] ` <20190102073607.GA25590@ming.t460p>
[not found] ` <d59007c6-af13-318c-5c9d-438ad7d9149d@oracle.com>
[not found] ` <20190102083901.GA26881@ming.t460p>
2019-01-03 2:04 ` Shan Hai
2019-01-02 20:11 ` Bjorn Helgaas
2019-01-03 2:12 ` Ming Lei
2019-01-03 2:52 ` Shan Hai
2019-01-03 3:11 ` Shan Hai
2019-01-03 3:31 ` Ming Lei
2019-01-03 4:36 ` Shan Hai
2019-01-03 10:34 ` Ming Lei [this message]
2019-01-04 2:53 ` Shan Hai
2019-01-03 4:51 ` Shan Hai
2019-01-03 3:21 ` Ming Lei
2019-01-14 13:13 ` [PATCH V2 0/3] nvme pci: two fixes on nvme_setup_irqs John Garry