RE: Affinity managed interrupts vs non-managed interrupts

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ming Lei <tom.leiming@gmail.com>,
	Sumit Saxena <sumit.saxena@broadcom.com>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Shivasharan Srikanteshwara
	<shivasharan.srikanteshwara@broadcom.com>,
	linux-block <linux-block@vger.kernel.org>
Subject: RE: Affinity managed interrupts vs non-managed interrupts
Date: Fri, 31 Aug 2018 17:37:22 -0600	[thread overview]
Message-ID: <602cee6381b9f435a938bbaf852d07f9@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1809010020520.1349@nanos.tec.linutronix.de>

> > > > It is not yet finalized, but it can be based on per sdev
outstanding,
> > > > shost_busy etc.
> > > > We want to use special 16 reply queue for IO acceleration (these
> > queues are
> > > > working interrupt coalescing mode. This is a h/w feature)
> > >
> > > TBH, this does not make any sense whatsoever. Why are you trying to
have
> > > extra interrupts for coalescing instead of doing the following:
> >
> > Thomas,
> >
> > We are using this feature mainly for performance and not for CPU
hotplug
> > issues.
> > I read your below #1 to #4 points are more of addressing CPU hotplug
> > stuffs. Right ? If we use all 72 reply queue (all are in interrupt
> > coalescing mode) without any extra reply queues, we don't have any
issue
> > with cpu-msix mapping and cpu hotplug issues.  Our major problem with
> > that method is latency is very bad on lower QD and/or single worker
case.
> >
> > To solve that problem we have added extra 16 reply queue (this is a
> > special h/w feature for performance only) which can be worked in
interrupt
> > coalescing mode vs existing 72 reply queue will work without any
interrupt
> > coalescing.   Best way to map additional 16 reply queue is map it to
the
> > local numa node.
>
> Ok. I misunderstood the whole thing a bit. So your real issue is that
you
> want to have reply queues which are instantaneous, the per cpu ones, and
> then the extra 16 which do batching and are shared over a set of CPUs,
> right?

Yes that is correct.  Extra 16 or whatever should be shared over set of
CPUs of *local* numa node of the PCI device.

>
> > I understand that, it is unique requirement but at the same time we
may
> > be able to do it gracefully (in irq sub system) as you mentioned "
> > irq_set_affinity_hint" should be avoided in low level driver.
>
> > Is it possible to have similar mapping in managed interrupt case as
below
> > ?
> >
> >     for (i = 0; i < 16 ; i++)
> >         irq_set_affinity_hint (pci_irq_vector(instance->pdev,
> > cpumask_of_node(local_numa_node));
> >
> > Currently we always see managed interrupts for pre-vectors are 0-71
and
> > effective cpu is always 0.
>
> The pre-vectors are not affinity managed. They get the default affinity
> assigned and at request_irq() the vectors are dynamically spread over
CPUs
> to avoid that the bulk of interrupts ends up on CPU0. That's handled
that
> way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

I am not sure if this is working on 4.18 kernel. I can double check. What
I remember is pre_vectors are mapped to 0-71 in my case and effective cpu
is always 0.
Ideally you mentioned that it should be spread..let me check that.

>
> > We want some changes in current API which can allow us to  pass flags
> > (like *local numa affinity*) and cpu-msix mapping are from local numa
node
> > + effective cpu are spread across local numa node.
>
> What you really want is to split the vector space for your device into
two
> blocks. One for the regular per cpu queues and the other (16 or how many
> ever) which are managed separately, i.e. spread out evenly. That needs
some
> extensions to the core allocation/management code, but that shouldn't be
a
> huge problem.

Yes this is correct understanding.  I can test any proposed patch if that
is what we want to use as best practice.
We attempted but due to lack of knowledge  in irq-subsystem, we are not
able to settle down anything which is close to our requirement.

We did something like below - "added new flag PCI_IRQ_PRE_VEC_NUMA which
will indicate that all pre and post vector should be shared within local
numa node."

    int irq_flags;
    struct irq_affinity desc;
    desc.pre_vectors = 16;
    desc.post_vectors = 0;

    irq_flags = PCI_IRQ_MSIX;

    i = pci_alloc_irq_vectors_affinity(instance->pdev,
                instance->high_iops_vector_start * 2,
                instance->msix_vectors,
                irq_flags | PCI_IRQ_AFFINITY | PCI_IRQ_PRE_VEC_NUMA,
&desc);

Somehow, I was not able to understand which part of irq subsystem should
have changes.

~ Kashyap


>
> Thanks,
>
> 	tglx

next prev parent reply	other threads:[~2018-09-01  3:47 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <eccc46e12890a1d033d9003837012502@mail.gmail.com>
2018-08-29  8:46 ` Affinity managed interrupts vs non-managed interrupts Ming Lei
2018-08-29 10:46   ` Sumit Saxena
2018-08-30 17:15     ` Kashyap Desai
2018-08-31  6:54     ` Ming Lei
2018-08-31  7:50       ` Kashyap Desai
2018-08-31 20:24         ` Thomas Gleixner
2018-08-31 21:49           ` Kashyap Desai
2018-08-31 22:48             ` Thomas Gleixner
2018-08-31 23:37               ` Kashyap Desai [this message]
2018-09-02 12:02                 ` Thomas Gleixner
2018-09-03  5:34                   ` Kashyap Desai
2018-09-03 16:28                     ` Thomas Gleixner
2018-09-04 10:29                       ` Kashyap Desai
2018-09-05  5:46                         ` Dou Liyang
2018-09-05  9:45                           ` Kashyap Desai
2018-09-05 10:38                             ` Thomas Gleixner
2018-09-06 10:14                               ` Dou Liyang
2018-09-06 11:46                                 ` Thomas Gleixner
2018-09-11  9:13                                   ` Christoph Hellwig
2018-09-11  9:38                                     ` Dou Liyang
2018-09-11  9:22               ` Christoph Hellwig
2018-09-03  2:13         ` Ming Lei
2018-09-03  6:10           ` Kashyap Desai
2018-09-03  9:21             ` Ming Lei
2018-09-03  9:50               ` Kashyap Desai
2018-09-11  9:21     ` Christoph Hellwig
2018-09-11  9:54       ` Kashyap Desai
2018-08-28  6:47 Sumit Saxena

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=602cee6381b9f435a938bbaf852d07f9@mail.gmail.com \
    --to=kashyap.desai@broadcom.com \
    --cc=hch@lst.de \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=shivasharan.srikanteshwara@broadcom.com \
    --cc=sumit.saxena@broadcom.com \
    --cc=tglx@linutronix.de \
    --cc=tom.leiming@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.