From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ming Lei <tom.leiming@gmail.com>,
	Sumit Saxena <sumit.saxena@broadcom.com>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>,
	linux-block <linux-block@vger.kernel.org>
Subject: RE: Affinity managed interrupts vs non-managed interrupts
Date: Fri, 31 Aug 2018 17:37:22 -0600
Message-ID: <602cee6381b9f435a938bbaf852d07f9@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.21.1809010020520.1349@nanos.tec.linutronix.de>

> > > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > > shost_busy etc.
> > > > We want to use special 16 reply queue for IO acceleration (these
> > > > queues are working interrupt coalescing mode. This is a h/w feature)
> > >
> > > TBH, this does not make any sense whatsoever. Why are you trying to
> > > have extra interrupts for coalescing instead of doing the following:
> >
> > Thomas,
> >
> > We are using this feature mainly for performance and not for CPU
> > hotplug issues.
> > I read your below #1 to #4 points are more of addressing CPU hotplug
> > stuffs. Right ? If we use all 72 reply queue (all are in interrupt
> > coalescing mode) without any extra reply queues, we don't have any
> > issue with cpu-msix mapping and cpu hotplug issues. Our major problem
> > with that method is latency is very bad on lower QD and/or single
> > worker case.
> >
> > To solve that problem we have added extra 16 reply queue (this is a
> > special h/w feature for performance only) which can be worked in
> > interrupt coalescing mode vs existing 72 reply queue will work without
> > any interrupt coalescing. Best way to map additional 16 reply queue is
> > map it to the local numa node.
>
> Ok. I misunderstood the whole thing a bit. So your real issue is that
> you want to have reply queues which are instantaneous, the per cpu ones,
> and then the extra 16 which do batching and are shared over a set of
> CPUs, right?

Yes, that is correct. The extra 16 (or however many) should be shared over
the set of CPUs on the *local* NUMA node of the PCI device.

>
> > I understand that, it is unique requirement but at the same time we
> > may be able to do it gracefully (in irq sub system) as you mentioned
> > "irq_set_affinity_hint" should be avoided in low level driver.
>
> > Is it possible to have similar mapping in managed interrupt case as
> > below?
> >
> > for (i = 0; i < 16; i++)
> > 	irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
> > 			      cpumask_of_node(local_numa_node));
> >
> > Currently we always see managed interrupts for pre-vectors are 0-71
> > and effective cpu is always 0.
>
> The pre-vectors are not affinity managed. They get the default affinity
> assigned and at request_irq() the vectors are dynamically spread over
> CPUs to avoid that the bulk of interrupts ends up on CPU0. That's handled
> that way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

I am not sure whether this is working on the 4.18 kernel; I can
double-check. What I remember is that the pre_vectors are mapped to 0-71 in
my case and the effective CPU is always 0.
You mentioned that they should be spread, so let me check that.

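One way to double-check this from userspace is to compare
/proc/irq/<N>/smp_affinity_list with /proc/irq/<N>/effective_affinity_list
for each vector. The sketch below is illustrative only: cpulist_weight() and
irq_effective_weight() are hypothetical helper names, and it assumes the
running kernel exposes effective_affinity_list (which depends on the kernel
configuration). A weight of 1 on CPU 0 for every pre-vector would match what
I remember seeing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named by a kernel cpulist string such as "0-71" or
 * "0,4-7". A trailing newline is accepted. Returns -1 on malformed input.
 */
int cpulist_weight(const char *s)
{
	int count = 0;

	assert(s != NULL);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s || lo < 0)
			return -1;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		count += (int)(hi - lo + 1);
		if (*end == ',')
			end++;			/* next list element */
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return count;
}

/*
 * Hypothetical helper: number of CPUs in the effective affinity of one
 * IRQ, or -1 if the file is missing, empty or unparsable.
 */
int irq_effective_weight(int irq)
{
	char path[64], buf[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/proc/irq/%d/effective_affinity_list", irq);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return cpulist_weight(buf);
}
```

Calling irq_effective_weight() for each IRQ number that belongs to the
device (as listed in /proc/interrupts) shows whether the vectors are
actually spread or all land on a single CPU.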
>
> > We want some changes in current API which can allow us to pass flags
> > (like *local numa affinity*) and cpu-msix mapping are from local numa
> > node + effective cpu are spread across local numa node.
>
> What you really want is to split the vector space for your device into
> two blocks. One for the regular per cpu queues and the other (16 or how
> many ever) which are managed separately, i.e. spread out evenly. That
> needs some extensions to the core allocation/management code, but that
> shouldn't be a huge problem.

Yes, this is the correct understanding. I can test any proposed patch if
that is what we want to use as best practice.
We attempted this ourselves, but due to our lack of knowledge of the irq
subsystem we were not able to settle on anything close to our requirement.
We did something like the below: we added a new flag, PCI_IRQ_PRE_VEC_NUMA,
which indicates that all pre and post vectors should be shared within the
local NUMA node.

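	int irq_flags = PCI_IRQ_MSIX;
	/* Designated initializer zeroes any field not listed here. */
	struct irq_affinity desc = {
		.pre_vectors = 16,	/* the extra reply queues */
		.post_vectors = 0,
	};

	/* PCI_IRQ_PRE_VEC_NUMA is our experimental flag, not an upstream one. */
	i = pci_alloc_irq_vectors_affinity(instance->pdev,
			instance->high_iops_vector_start * 2,
			instance->msix_vectors,
			irq_flags | PCI_IRQ_AFFINITY | PCI_IRQ_PRE_VEC_NUMA,
			&desc);
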
However, I was not able to work out which part of the irq subsystem needs
to change.
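
To make the intended layout concrete, here is a small userspace model (not
kernel code) of the mapping we are after: the first block of vectors is
shared by every CPU of the device's local NUMA node, and each remaining
vector targets a single CPU. All names (struct vec_map, build_vec_map(),
vec_weight()) are hypothetical, and the round-robin placement of the per-CPU
block is an assumption made for illustration only.

```c
#include <assert.h>
#include <string.h>

#define MAX_CPUS 128
#define MAX_VECS 128

/* cpu[v][c] != 0 means vector v is allowed to target CPU c. */
struct vec_map {
	unsigned char cpu[MAX_VECS][MAX_CPUS];
};

/*
 * Vectors [0, nshared) get all CPUs of local_node; vectors
 * [nshared, nvecs) get one CPU each, assigned round robin.
 * node_of[c] is the NUMA node of CPU c. Returns 0 or -1 on bad input.
 */
int build_vec_map(struct vec_map *m, int nvecs, int nshared,
		  const int *node_of, int ncpus, int local_node)
{
	int v, c;

	if (ncpus <= 0 || ncpus > MAX_CPUS || nvecs > MAX_VECS ||
	    nshared < 0 || nshared > nvecs)
		return -1;

	memset(m, 0, sizeof(*m));

	/* Shared block: every CPU of the local node. */
	for (v = 0; v < nshared; v++)
		for (c = 0; c < ncpus; c++)
			if (node_of[c] == local_node)
				m->cpu[v][c] = 1;

	/* Per-CPU block: exactly one CPU per vector. */
	for (v = nshared; v < nvecs; v++)
		m->cpu[v][(v - nshared) % ncpus] = 1;

	return 0;
}

/* Number of CPUs that vector v may target. */
int vec_weight(const struct vec_map *m, int v)
{
	int c, w = 0;

	assert(v >= 0 && v < MAX_VECS);
	for (c = 0; c < MAX_CPUS; c++)
		w += m->cpu[v][c] != 0;
	return w;
}
```

For a 72-CPU system with two equally sized nodes and 16 + 72 vectors, this
model gives each of the first 16 vectors a 36-CPU mask on the local node and
each of the remaining 72 vectors a single CPU.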
~ Kashyap
>
> Thanks,
>
> tglx