* NVMe and IRQ Affinity
@ 2016-02-02 23:31 Mark Jacobson
  2016-02-02 23:45 ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Jacobson @ 2016-02-02 23:31 UTC (permalink / raw)


Hello there,

I had a question regarding the NVMe driver and IRQ affinity. For
context, I'm running CentOS 7.1 with the 4.2.3-1.el7.elrepo.x86_64
kernel on a system with two E5-2650 v3 CPUs.

I've been battling soft core lockups in a system with 24 NVMe drives
and noticed that the drives I'm working with (Samsung PM953) will by
default only route interrupts to CPU0 despite having affinity for all
cores and I figured I'd ask here since that seemed like a driver
issue.

I've tried adjusting the smp_affinity for the various IRQ handlers,
but 1) that seems to essentially force (most of) the interrupts onto
the higher MSI-X vectors that still have CPU 0 available, and 2)
there seems to be a huge performance hit when moving to another CPU (say
CPU2).  Is this something that makes any kind of sense from a driver
perspective? (I know high-speed NICs commonly distribute interrupts
across many cores, so it doesn't seem normal to me.) Is there a newer
kernel version that would help with this problem? Or barring any of
that, does this seem like a hardware issue?

(And yes, I realize eventually my IRQ masks should only point to cores
associated with the NUMA node they're connected to. I've been ignoring
that for debugging.)

Trimmed interrupt table below. Each row has 40 per-CPU columns; only
the columns with a nonzero count are shown, and every omitted column is
0. The counts on CPU4, CPU8 and CPU20 for interrupt 159 are from when I
forced interrupts away from CPU0 by masking off some of the lower-order
IRQ affinity bits.

 IRQ        CPU0      CPU4     CPU8    CPU20  Type                      Queues
 159:  145078914  11994376  5139627  2766162  IR-PCI-MSI 12058624-edge  nvme7q0, nvme7q1
 160:   12514715         0        0        0  IR-PCI-MSI 12058625-edge  nvme7q2
 161:     985758         0        0        0  IR-PCI-MSI 12058626-edge  nvme7q3
 162:    6155130         0        0        0  IR-PCI-MSI 12058627-edge  nvme7q4
 163:      61111         0        0        0  IR-PCI-MSI 12058628-edge  nvme7q5
 164:      53257         0        0        0  IR-PCI-MSI 12058629-edge  nvme7q6
 165:      52186         0        0        0  IR-PCI-MSI 12058630-edge  nvme7q7
 166:      28943         0        0        0  IR-PCI-MSI 12058631-edge  nvme7q8
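
A row with 40 per-CPU columns is hard to read by eye. As a minimal
sketch, the following lists only the CPUs that actually serviced each
IRQ. The helper name nonzero_cpus is hypothetical and not part of any
tool; it assumes the usual /proc/interrupts layout of an IRQ label, one
count per CPU, then the controller and action names.

```shell
#!/usr/bin/env bash
# nonzero_cpus (hypothetical helper): read /proc/interrupts-style rows on
# stdin and print "IRQ: CPUn=count" for every CPU column with a nonzero count.
nonzero_cpus() {
  awk '{
    # Field 1 is the IRQ label; the numeric fields after it are per-CPU counts.
    for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
      if ($i != 0)
        printf "%s CPU%d=%s\n", $1, i - 2, $i
  }'
}

# Example: show where the nvme7 queue interrupts are landing.
# grep nvme7q /proc/interrupts | nonzero_cpus
```

The loop stops at the first non-numeric field, so the chip name and
queue names at the end of each row are ignored.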

Any help would be appreciated!

Thank you,

Mark Jacobson
Software Test Engineer
Stack Velocity

^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-02 23:31 NVMe and IRQ Affinity Mark Jacobson
@ 2016-02-02 23:45 ` Keith Busch
  2016-02-02 23:50   ` Mark Jacobson
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2016-02-02 23:45 UTC (permalink / raw)


On Wed, Feb 03, 2016 at 12:31:22AM +0100, Mark Jacobson wrote:
> and noticed that the drives I'm working with (Samsung PM953) will by
> default only route interrupts to CPU0 despite having affinity for all
> cores and I figured I'd ask here since that seemed like a driver
> issue.

Sounds like the affinity hints are either messed up in this distro, or
just not being used by irqbalance. Could you run the following script
and send the output?

---
cat /sys/block/nvme0n1/mq/*/cpu_list

for i in $(grep nvme0q /proc/interrupts  | cut -d":" -f1 | sed "s/ //g"); do
  echo "IRQ:  $i";
  echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
  echo -n "SMP:  " && cat /proc/irq/$i/smp_affinity
done

^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-02 23:45 ` Keith Busch
@ 2016-02-02 23:50   ` Mark Jacobson
  2016-02-02 23:58     ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Jacobson @ 2016-02-02 23:50 UTC (permalink / raw)


Output is below. I'm aware the distro hints are fairly invalid.
Luckily, I've had to implement PCIe endpoints (in FPGAs) in the past,
so I knew roughly where to look. Note that despite the 00,3ff003ff,
only CPU0 ever gets hit unless I force-disable that bit.

root# cat /sys/block/nvme0n1/mq/*/cpu_list
0, 1, 2, 20, 21, 22
3, 4, 23, 24
5, 6, 7, 25, 26, 27
8, 9, 28, 29
10, 11, 12, 30, 31, 32
13, 14, 33, 34
15, 16, 17, 35, 36, 37
18, 19, 38, 39
root#
root# for i in $(grep nvme0q /proc/interrupts  | cut -d":" -f1 | sed "s/ //g"); do
>   echo "IRQ:  $i";
>   echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
>   echo -n "SMP:  " && cat /proc/irq/$i/smp_affinity
> done
IRQ:  87
HINT: ff,ffffffff
SMP:  ff,ffffffff
IRQ:  88
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  89
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  90
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  91
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  92
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  93
HINT: 00,00000000
SMP:  00,3ff003ff
IRQ:  94
HINT: 00,00000000
SMP:  00,3ff003ff
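
The hex masks in this output are easier to compare with the cpu_list
lines once they are decoded into CPU numbers. A minimal sketch, assuming
the format shown here (comma-separated hex groups, most significant
first); the function name mask_to_cpus is hypothetical:

```shell
#!/usr/bin/env bash
# mask_to_cpus (hypothetical helper): decode an smp_affinity-style mask
# such as "00,3ff003ff" into a space-separated list of CPU numbers.
mask_to_cpus() {
  local hex=${1//,/}          # drop the group separators
  local cpus=() cpu=0 i digit bit
  # Walk the hex digits from least significant (rightmost) to most.
  for (( i = ${#hex} - 1; i >= 0; i-- )); do
    digit=$(( 16#${hex:i:1} ))
    for (( bit = 0; bit < 4; bit++ )); do
      if (( digit & (1 << bit) )); then
        cpus+=("$cpu")
      fi
      cpu=$(( cpu + 1 ))
    done
  done
  echo "${cpus[*]}"
}

# Example:
# mask_to_cpus 00,3ff003ff
```

For 00,3ff003ff this yields CPUs 0-9 and 20-29. That would be consistent
with one socket's ten cores plus their hyperthread siblings, although
nothing in this thread confirms the topology.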
Thank you,

Mark Jacobson
Software Test Engineer
Stack Velocity



^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-02 23:50   ` Mark Jacobson
@ 2016-02-02 23:58     ` Keith Busch
  2016-02-03  0:13       ` Mark Jacobson
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2016-02-02 23:58 UTC (permalink / raw)


On Wed, Feb 03, 2016 at 12:50:06AM +0100, Mark Jacobson wrote:
> Output is below. I'm aware the distro hints are fairly invalid.

They're all invalid. This kernel must have forked before the affinity
hints were fixed for a blk-mq nvme driver. A more optimal affinity hint
would match the mq's cpu_list, which is how it looks upstream.

I guess your platform strongly prefers CPU 0 when allowed. You can
either manually override the smp_affinity, or use an out-of-tree driver
with a fix and let irqbalance handle it.
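
For the manual override, one way to derive a mask that matches a queue's
cpu_list is sketched below. This is an illustration under assumptions,
not the driver's method: cpu_list_to_mask is a hypothetical helper, it
relies on bash 64-bit arithmetic and so only covers CPUs 0-63, and which
IRQ belongs to which queue still has to be read from /proc/interrupts.

```shell
#!/usr/bin/env bash
# cpu_list_to_mask (hypothetical helper): turn a cpu_list line such as
# "0, 1, 2, 20, 21, 22" into a comma-separated hex mask (high 32 bits,
# then low 32 bits), the same grouping /proc/irq/N/smp_affinity displays.
cpu_list_to_mask() {
  local list=${1//,/ } cpu mask=0
  for cpu in $list; do        # unquoted on purpose: split on whitespace
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x,%08x\n' $(( (mask >> 32) & 0xffffffff )) $(( mask & 0xffffffff ))
}

# Example (as root; IRQ 88 is taken from the output earlier in the thread,
# and pairing it with the first cpu_list line is an assumption):
# cpu_list_to_mask "$(cat /sys/block/nvme0n1/mq/0/cpu_list)" > /proc/irq/88/smp_affinity
```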

^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-02 23:58     ` Keith Busch
@ 2016-02-03  0:13       ` Mark Jacobson
  2016-02-03 14:30         ` Keith Busch
  2016-02-03 16:14         ` Kim Kyungsan
  0 siblings, 2 replies; 9+ messages in thread
From: Mark Jacobson @ 2016-02-03  0:13 UTC (permalink / raw)


In that case, please forgive the silly questions, as I am not an
experienced kernel developer by any means...  (I'm just looking for
enough information to go Googling. I won't ask much more down that line
of questioning, as I know this list is not for that purpose.)

1. When you say out-of-tree, do you mean a loadable kernel module?
(My understanding is that the NVMe driver is now part of the mainline
Linux kernel source tree, so I'm a bit confused as to where to nab
that from.)
2. Does the upstream 4.4.1 kernel have any of these fixes if I were to
build it myself with the appropriate support ticked off?

Also, thank you very much for the quick response and assistance. I
really appreciate the help. :)

Thank you,

Mark Jacobson
Software Test Engineer
Stack Velocity



^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-03  0:13       ` Mark Jacobson
@ 2016-02-03 14:30         ` Keith Busch
  2016-02-03 18:31           ` Azher Mughal
  2016-02-03 16:14         ` Kim Kyungsan
  1 sibling, 1 reply; 9+ messages in thread
From: Keith Busch @ 2016-02-03 14:30 UTC (permalink / raw)


On Wed, Feb 03, 2016 at 01:13:23AM +0100, Mark Jacobson wrote:
> 1. When you say out-of-tree, do you mean a loadable kernel module?
> (My understanding is that the NVMe driver is now part of the mainline
> Linux kernel source tree, so I'm a bit confused as to where to nab
> that from.)

Yes, a loadable kernel module or a package that can build one. I am not
aware of any that are publicly available. Perhaps your vendor provides one.

> 2. Does the upstream 4.4.1 kernel have any of these fixes if I were to
> build it myself with the appropriate support ticked off?

The 4.3 kernel has the fixes, and anything newer than that should also
work. I think it was just the 4.2 kernel that had this wrong (will see
if there are any stable patch candidates).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-03  0:13       ` Mark Jacobson
  2016-02-03 14:30         ` Keith Busch
@ 2016-02-03 16:14         ` Kim Kyungsan
  1 sibling, 0 replies; 9+ messages in thread
From: Kim Kyungsan @ 2016-02-03 16:14 UTC (permalink / raw)


Hi, I had a similar experience with NVMe interrupts.
As you said, the kernel by default uses CPU0 for interrupt handling.
Without setting IRQ affinity or running the irqbalance daemon, this can
degrade performance and trigger CPU soft lockups.
The symptom showed up on systems under heavy load with multiple
NVMe devices.

We solved it by setting IRQ affinity so that interrupts are spread
evenly across CPUs, one IRQ per CPU, like below.

       Cpu0 - nvme irq0
       Cpu1 - nvme irq1
       Cpu2 - nvme irq2
       Cpu3 - nvme irq3
       Cpu4 - nvme irq4
       Cpu5 - nvme irq5
       Cpu6 - nvme irq6
       Cpu7 - nvme irq7
       ....
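
The mapping above can be sketched as a small script. This is only an
illustration: one_cpu_per_irq is a hypothetical helper, it covers CPUs
0-63 because it uses bash 64-bit arithmetic, and it prints the masks
rather than applying them so the result can be checked first.

```shell
#!/usr/bin/env bash
# one_cpu_per_irq (hypothetical helper): print "IRQ MASK" pairs, giving
# each IRQ a single CPU in round-robin order.
# Usage: one_cpu_per_irq NCPUS IRQ...
one_cpu_per_irq() {
  local ncpus=$1 i=0 irq mask
  shift
  for irq in "$@"; do
    mask=$(( 1 << (i % ncpus) ))
    # Comma-separated 32-bit hex groups, as smp_affinity displays them.
    printf '%s %x,%08x\n' "$irq" $(( (mask >> 32) & 0xffffffff )) $(( mask & 0xffffffff ))
    i=$(( i + 1 ))
  done
}

# To apply the result (as root), something like:
# one_cpu_per_irq 8 87 88 89 | while read -r irq mask; do
#   echo "$mask" > "/proc/irq/$irq/smp_affinity"
# done
```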

There are two ways to set IRQ affinity. The first is the proc
interface, as you mentioned; the other is to use the inbox driver from
kernel 4.3 or later. In fact, the nvme driver before 4.3 also tried to
set the IRQ affinity hint during device initialization, but that did
not work because of a bug, which Keith Busch fixed in kernel 4.3.
From kernel 4.3, the nvme driver sets the IRQ affinity as well as the
affinity_hint during device initialization by calling
irq_set_affinity_hint().

Please refer below.

/* kernel 4.3 nvme-core.c */
nvme_dev_scan()
     + nvme_set_irq_hints()
         + irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
                                 blk_mq_tags_cpumask(*nvmeq->tags));


/* kernel/irq/manage.c */
int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
        unsigned long flags;
        struct irq_desc *desc = irq_get_desc_lock(irq, &flags,
                                                  IRQ_GET_DESC_CHECK_GLOBAL);

        if (!desc)
                return -EINVAL;
        desc->affinity_hint = m;
        irq_put_desc_unlock(desc, flags);
        /* set the initial affinity to prevent every interrupt being on CPU0 */
        if (m)
                __irq_set_affinity(irq, m, false);
        return 0;
}
EXPORT_SYMBOL_GPL(irq_set_affinity_hint);

One last note: you should disable the irqbalance daemon after setting
IRQ affinity yourself, because the daemon will otherwise readjust the
affinity, which can unbalance interrupt handling again.




-- 
------------------------------------------------------------
the person who practices a truth goes toward light.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-03 14:30         ` Keith Busch
@ 2016-02-03 18:31           ` Azher Mughal
  2016-02-03 18:58             ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Azher Mughal @ 2016-02-03 18:31 UTC (permalink / raw)


Hi Keith,

I have similar affinity issues in CentOS 7.2. The nvme module is the generic
one that came along with this distro. Is there a git repo which can be used
with the kernel "3.10.0-327.4.5" ?

Thanks
-Azher

# modinfo nvme
filename:       /lib/modules/3.10.0-327.4.5.el7.x86_64/kernel/drivers/block/nvme.ko
version:        1.0
license:        GPL
author:         Matthew Wilcox <willy at linux.intel.com>
rhelversion:    7.2
srcversion:     6FE34EC5F6A703F8EDE6C77
alias:          pci:v*d*sv*sd*bc01sc08i02*
depends:       
intree:         Y
vermagic:       3.10.0-327.4.5.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        10:5D:A1:3D:CA:AA:74:AE:50:00:17:E7:D5:2C:DA:9B:7C:C5:10:93
sig_hashalgo:   sha256
parm:           admin_timeout:timeout in seconds for admin commands (byte)
parm:           io_timeout:timeout in seconds for I/O (byte)
parm:           shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm:           nvme_major:int
parm:           nvme_char_major:int
parm:           use_threaded_interrupts:int



^ permalink raw reply	[flat|nested] 9+ messages in thread

* NVMe and IRQ Affinity
  2016-02-03 18:31           ` Azher Mughal
@ 2016-02-03 18:58             ` Keith Busch
  0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2016-02-03 18:58 UTC (permalink / raw)


On Wed, Feb 03, 2016 at 10:31:45AM -0800, Azher Mughal wrote:
> I have similar affinity issues in CentOS 7.2. The nvme module is the generic
> one that came along with this distro.

Yes, it's the same issue.

> Is there a git repo which can be used
> with the kernel "3.10.0-327.4.5" ?

Not that I know of.

^ permalink raw reply	[flat|nested] 9+ messages in thread
