From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ming Lei <tom.leiming@gmail.com>,
	Sumit Saxena <sumit.saxena@broadcom.com>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Shivasharan Srikanteshwara 
	<shivasharan.srikanteshwara@broadcom.com>,
	linux-block <linux-block@vger.kernel.org>
Subject: RE: Affinity managed interrupts vs non-managed interrupts
Date: Mon, 3 Sep 2018 11:04:44 +0530	[thread overview]
Message-ID: <66256272c020be186becdd7a3f049302@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1809021357000.1349@nanos.tec.linutronix.de>

> On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > Ok. I misunderstood the whole thing a bit. So your real issue is that
> > > you want to have reply queues which are instantaneous, the per cpu
> > > ones, and then the extra 16 which do batching and are shared over a
> > > set of CPUs, right?
> >
> > Yes that is correct.  Extra 16 or whatever should be shared over set of
> > CPUs of *local* numa node of the PCI device.
>
> Why restricting it to the local NUMA node of the device? That doesn't
> really make sense if you queue lots of requests from CPUs on a different
> node.

The penalty of crossing NUMA nodes is minimal when higher interrupt
coalescing is used in the h/w.  We see the cross-NUMA traffic penalty only
for lower-IOPs type workloads.
In this particular case we are taking care of cross-NUMA traffic via
higher interrupt coalescing.

>
> Why don't you spread these extra interrupts accross all nodes and keep
> the locality for the request/reply?

I assume you are referring to spreading the MSI-x vectors across all NUMA
nodes the way "pci_alloc_irq_vectors" does.

Having the extra 16 reply queues spread across nodes will have a negative
impact.  Take the example of an 8-node system (128 logical CPUs in total).
If the 16 reply queues are spread across the NUMA nodes, 8 logical CPUs
map to each reply queue (128 CPUs / 16 queues = 8), and each NUMA node
ends up with only 2 reply queues (16 queues / 8 nodes = 2).

Running IO from one NUMA node will then only use 2 reply queues, and
performance drops drastically in that case.  This is the typical problem
when the cpu-to-msix mapping becomes N:1, i.e. fewer msix vectors than
online CPUs.

Mapping the extra 16 reply queues to the local NUMA node always makes sure
that the driver round-robins across all 16 reply queues irrespective of
which CPU the IO originated from.  We validated this method by sending IOs
from a remote node and did not observe a performance penalty.
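For illustration, that round robin is conceptually nothing more than the
following sketch (made-up names, not the actual driver code):

#include <linux/atomic.h>

#define EXTRA_REPLY_QUEUES	16

static atomic_t extra_rq_counter = ATOMIC_INIT(0);

/*
 * Sketch only: pick one of the 16 shared (pre_vectors) reply queues in
 * round-robin fashion, regardless of which CPU submitted the IO.  The
 * per-cpu managed queues are still used for the low-latency path.
 */
static unsigned int example_pick_extra_reply_queue(void)
{
	return atomic_inc_return(&extra_rq_counter) % EXTRA_REPLY_QUEUES;
}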

>
> That also would allow to make them properly managed interrupts as you
> could
> shutdown the per node batching interrupts when all CPUs of that node are
> offlined and you'd avoid the whole affinity hint irq balancer hackery.

One more clarification -

I am using "for-4.19/block" and this particular patch "a0c9259
irq/matrix: Spread interrupts on allocation" is included.
I can see that the 16 extra reply queues allocated via pre_vectors are
still assigned to CPU 0 (effective affinity).

irq 33, cpu list 0-71
irq 34, cpu list 0-71
irq 35, cpu list 0-71
irq 36, cpu list 0-71
irq 37, cpu list 0-71
irq 38, cpu list 0-71
irq 39, cpu list 0-71
irq 40, cpu list 0-71
irq 41, cpu list 0-71
irq 42, cpu list 0-71
irq 43, cpu list 0-71
irq 44, cpu list 0-71
irq 45, cpu list 0-71
irq 46, cpu list 0-71
irq 47, cpu list 0-71
irq 48, cpu list 0-71


# cat /sys/kernel/debug/irq/irqs/34
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300001
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x40000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x22
         chip:    APIC
          flags:   0x0
         Vector:    46
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

#cat /sys/kernel/debug/irq/irqs/35
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300002
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x50000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x23
         chip:    APIC
          flags:   0x0
         Vector:    47
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

Ideally, what we are looking for is that the "effective affinity" of the
16 extra pre_vectors reply queues stays within the local NUMA node as long
as that node has online CPUs.  If not, we are OK with the effective CPU
coming from any node.
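As a rough illustration of that intent, something along the lines of the
sketch below is what we effectively want for the 16 pre_vectors queues
(made-up function name; this is the affinity-hint style workaround rather
than a properly managed solution):

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/topology.h>

/*
 * Sketch only: point the 16 extra (pre_vectors) reply-queue interrupts
 * at the CPUs of the device's local NUMA node via the affinity hint.
 */
static void example_hint_extra_queues_to_local_node(struct pci_dev *pdev)
{
	const struct cpumask *mask = cpumask_of_node(dev_to_node(&pdev->dev));
	int i;

	for (i = 0; i < 16; i++)
		irq_set_affinity_hint(pci_irq_vector(pdev, i), mask);
}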

>
> Thanks,
>
> 	tglx
>
>
