Linux-NVME Archive on lore.kernel.org
* NVMe and IRQ Affinity, another problem
@ 2018-04-05  0:28 Young Yu
  2018-04-05  1:00 ` Keith Busch
  0 siblings, 1 reply; 4+ messages in thread
From: Young Yu @ 2018-04-05  0:28 UTC (permalink / raw)


Hello,

I know this is another run at an old topic, but I'm still wondering
what the right way is to bind the IRQs of NVMe PCI devices to cores in
the local NUMA node.  I'm using kernel 4.16.0-1.el7 on CentOS 7.4, and
the machine has 2 NUMA nodes:

$ lscpu|grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

I have 16 NVMe devices, 8 per NUMA node: nvme0 to nvme7 on NUMA node 0
and nvme8 to nvme15 on NUMA node 1. irqbalance was on by default.  The
IRQs of these devices are all bound to cores 0 and 1 regardless of
where the devices are physically attached. affinity_hint still looks
invalid, but there is an effective_affinity that matches where some of
the interrupts are actually delivered. cpu_list under mq points to the
wrong cores for the NVMe devices on NUMA node 1. I read this was fixed
in kernel 4.3, so I'm not sure whether I'm looking at it the right way.

Eventually I'd like to know whether there is a way to distribute the
IRQs of each NVMe device across different cores local to the NUMA node
it is attached to.
e.g. nvme0 - cpu 0
     nvme1 - cpu 2
     ...
     nvme8 - cpu 1
     nvme9 - cpu 3
     ...

Here is the output:

$ cat /sys/block/nvme0n1/device/device/numa_node 
0

$ cat /sys/block/nvme8n1/device/device/numa_node 
1

$ cat /proc/interrupts |grep nvme0q
 143: 1777 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331648-edge nvme0q0, nvme0q1
 152: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331649-edge nvme0q2
 157: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331650-edge nvme0q3
 160: 0 12773 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331651-edge nvme0q4
 161: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331652-edge nvme0q5
 162: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331653-edge nvme0q6
 163: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331654-edge nvme0q7

$  cat /proc/interrupts |grep nvme8q
  51: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827462-edge nvme8q7
  54: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827457-edge nvme8q2
  65: 13931 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827456-edge nvme8q0, nvme8q1
  76: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827458-edge nvme8q3
  87: 0 13380 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  PCI-MSI 71827459-edge nvme8q4
  102: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827460-edge nvme8q5
  117: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  71827461-edge nvme8q6

$ for i in $(grep nvme0q /proc/interrupts | cut -d":" -f1 | sed "s/
//g"); do echo "IRQ: $i"; echo -n "HINT: " && cat
/proc/irq/$i/affinity_hint; echo -n "SMP: " && cat
/proc/irq/$i/smp_affinity && echo -n "EFF: " && cat
/proc/irq/$i/effective_affinity; done

IRQ:  143
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00055555,55555555,55555555
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  152
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000015,55555555,55555555,55500000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  157
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  555555,55555555,55555540,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  160
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,2aaaaaaa,aaaaaaaa
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  161
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,0aaaaaaa,aaaaaaaa,80000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  162
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,02aaaaaa,aaaaaaaa,a0000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  163
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  aaaaaa,aaaaaaaa,a8000000,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001

$ for i in $(grep nvme8q /proc/interrupts | cut -d":" -f1 | sed "s/
//g"); do echo "IRQ: $i"; echo -n "HINT: " && cat
/proc/irq/$i/affinity_hint; echo -n "SMP: " && cat
/proc/irq/$i/smp_affinity && echo -n "EFF: " && cat
/proc/irq/$i/effective_affinity; done

IRQ:  51
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  aaaaaa,aaaaaaaa,a8000000,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  54
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000015,55555555,55555555,55500000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  65
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00055555,55555555,55555555
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  76
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  555555,55555555,55555540,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  87
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,2aaaaaaa,aaaaaaaa
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  102
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,0aaaaaaa,aaaaaaaa,80000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  117
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,02aaaaaa,aaaaaaaa,a0000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
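[For readers decoding these bitmaps: smp_affinity is a comma-separated
hex cpumask, so the "...55555555" values above select the even-numbered
CPUs (NUMA node 0 on this box) and "...aaaaaaaa" the odd ones (node 1),
while the effective_affinity values 00000001 and 00000002 are CPUs 0
and 1, matching the /proc/interrupts counts. A small helper, offered
only as a sketch, expands such a mask nibble by nibble:]

```shell
# Expand a comma-separated hex cpumask (as in /proc/irq/*/smp_affinity)
# into a CPU list. Works nibble-by-nibble so masks wider than 64 bits
# (like the 248-bit ones above) don't overflow shell arithmetic.
# Assumes bash and lowercase/uppercase hex digits only.
mask_to_cpus() {
    local hex="${1//,/}" out="" bitpos=0 i nibble b
    for (( i=${#hex}-1; i>=0; i-- )); do      # walk from the low end
        nibble=$(( 16#${hex:i:1} ))
        for b in 0 1 2 3; do
            if (( nibble & (1 << b) )); then
                out+="${out:+,}$(( bitpos + b ))"
            fi
        done
        (( bitpos += 4 ))
    done
    echo "$out"
}

mask_to_cpus 00000055    # prints 0,2,4,6 (even CPUs)
mask_to_cpus 000000aa    # prints 1,3,5,7 (odd CPUs)
```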

$ cat /sys/block/nvme8n1/mq/*/cpu_list 
0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36,
38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70,
72, 74, 76, 78, 80, 82
84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112,
114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140,
142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164
166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192,
194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220,
222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37,
39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61
63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95,
97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123
125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151,
153, 155, 157, 159, 161, 163, 165, 167, 169, 171, 173, 175, 177, 179,
181, 183, 185
187, 189, 191, 193, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213,
215, 217, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 241,
243, 245, 247

$ echo 000000,00000000,00000000,00000000,00000000,000aaaaa,aaaaaaaa,aaaaaaaa
> /proc/irq/143/smp_affinity
bash: echo: write error: Input/output error
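[In principle the per-device binding asked about above could be
scripted from sysfs; a sketch follows, with the caveat that on kernels
that manage MSI-X affinity each write fails with the same EIO shown
here. The helper name and the nvmeXqY matching are assumptions, not
something from this thread.]

```shell
# Sketch: bind every NVMe controller's queue IRQs to the CPUs of its
# local NUMA node. Relies on the nvmeXqY naming seen in
# /proc/interrupts; on kernels with managed IRQ affinity each write
# fails with EIO, exactly as in the transcript above.
bind_nvme_irqs() {
    local dev node cpus irq
    for dev in /sys/class/nvme/nvme*; do
        [ -e "$dev" ] || continue                 # no NVMe devices here
        node=$(cat "$dev/device/numa_node")
        [ "$node" -ge 0 ] 2>/dev/null || node=0   # -1 means unknown
        cpus=$(cat "/sys/devices/system/node/node$node/cpulist")
        for irq in $(grep "$(basename "$dev")q" /proc/interrupts |
                         cut -d: -f1 | tr -d ' '); do
            echo "$cpus" > "/proc/irq/$irq/smp_affinity_list" 2>/dev/null ||
                echo "IRQ $irq: kernel-managed affinity, write refused"
        done
    done
    return 0
}

bind_nvme_irqs
```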


After looking at this, I built a 4.13.16 kernel myself from source to
see whether it behaves any differently from the ELRepo one. However,
the hint is still invalid, and although the interrupts are spread out
more, some are still bound to cores in the other NUMA node. I was not
able to manually fix smp_affinity on either kernel.


$ cat /proc/interrupts |grep nvme0q
  62: 3687 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
  50331648-edge nvme0q0, nvme0q1
 123: 0 0 0 0 0 0 0 0 6642 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331649-edge nvme0q2
 129: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 335 0 0 0 0 0 0 0 PCI-MSI
 50331650-edge nvme0q3
 142: 0 6426 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331651-edge nvme0q4
 155: 0 0 0 0 0 0 0 4842 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331652-edge nvme0q5
 167: 0 0 0 0 0 0 0 0 0 0 0 0 0 2895 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 50331653-edge nvme0q6
 179: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3063 0 0 0 0 PCI-MSI
 50331654-edge nvme0q7

$ cat /proc/interrupts |grep nvme8q
 134: 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 71827456-edge nvme8q0, nvme8q1
 147: 0 0 0 0 0 0 0 0 1102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 71827457-edge nvme8q2
 160: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 950 0 0 0 0 0 0 0 PCI-MSI
 71827458-edge nvme8q3
 172: 0 468 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 71827459-edge nvme8q4
 181: 0 0 0 0 0 0 0 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 71827460-edge nvme8q5
 187: 0 0 0 0 0 0 0 0 0 0 0 0 0 552 0 0 0 0 0 0 0 0 0 0 PCI-MSI
 71827461-edge nvme8q6
 191: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 470 0 0 0 0 PCI-MSI
 71827462-edge nvme8q7

$ for i in $(grep nvme0q /proc/interrupts | cut -d":" -f1 | sed "s/
//g"); do echo "IRQ: $i"; echo -n "HINT: " && cat
/proc/irq/$i/affinity_hint; echo -n "SMP: " && cat
/proc/irq/$i/smp_affinity && echo -n "EFF: " && cat
/proc/irq/$i/effective_affinity; done

IRQ:  62
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000055
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  123
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00005500
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
IRQ:  129
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00550000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
IRQ:  142
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0000002a
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  155
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000a80
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
IRQ:  167
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0002a000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
IRQ:  179
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00a80000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000

$ for i in $(grep nvme8q /proc/interrupts | cut -d":" -f1 | sed "s/
//g"); do echo "IRQ: $i"; echo -n "HINT: " && cat
/proc/irq/$i/affinity_hint; echo -n "SMP: " && cat
/proc/irq/$i/smp_affinity && echo -n "EFF: " && cat
/proc/irq/$i/effective_affinity; done

IRQ:  134
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000055
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  147
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00005500
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
IRQ:  160
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00550000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
IRQ:  172
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0000002a
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  181
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000a80
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
IRQ:  187
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0002a000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
IRQ:  191
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00a80000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000

$ cat /sys/block/nvme8n1/mq/*/cpu_list 
0, 2, 4, 6
8, 10, 12, 14
16, 18, 20, 22
1, 3, 5
7, 9, 11
13, 15, 17
19, 21, 23

# echo  000000,00000000,00000000,00000000,00000000,00000000,00000000,000000aa
> /proc/irq/134/smp_affinity
bash: echo: write error: Input/output error


I have tried the manual configuration on one of the other machines we
have, but I see the same problem everywhere except on kernel 4.4,
where I can set smp_affinity manually.  With the same hardware setup,
I cannot get it to work on kernel 4.11 and still get the same
Input/output error.

Se-young Yu
Northwestern University


* NVMe and IRQ Affinity, another problem
  2018-04-05  0:28 NVMe and IRQ Affinity, another problem Young Yu
@ 2018-04-05  1:00 ` Keith Busch
  2018-04-05  2:31   ` Young Yu
  0 siblings, 1 reply; 4+ messages in thread
From: Keith Busch @ 2018-04-05  1:00 UTC (permalink / raw)


On Thu, Apr 05, 2018 at 12:28:05AM +0000, Young Yu wrote:
> Hello,
> 
> I know this is another run at an old topic, but I'm still wondering
> what the right way is to bind the IRQs of NVMe PCI devices to cores in
> the local NUMA node.  I'm using kernel 4.16.0-1.el7 on CentOS 7.4, and
> the machine has 2 NUMA nodes:
> 
> $ lscpu|grep NUMA
> NUMA node(s):          2
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
> 
> I have 16 NVMe devices, 8 per NUMA node: nvme0 to nvme7 on NUMA node 0
> and nvme8 to nvme15 on NUMA node 1. irqbalance was on by default.  The
> IRQs of these devices are all bound to cores 0 and 1 regardless of
> where the devices are physically attached. affinity_hint still looks
> invalid, but there is an effective_affinity that matches where some of
> the interrupts are actually delivered. cpu_list under mq points to the
> wrong cores for the NVMe devices on NUMA node 1. I read this was fixed
> in kernel 4.3, so I'm not sure whether I'm looking at it the right way.
> 
> Eventually I'd like to know whether there is a way to distribute the
> IRQs of each NVMe device across different cores local to the NUMA node
> it is attached to.

Bad things happened for a lot of servers when the irq spread used
"present" rather than the "online" CPUs, with the "present" CPUs being
oh-so-much larger than what is actually possible.

I'm guessing there's no chance more than 24 CPUs will actually ever
come online in this system, but your platform says 248 may come online,
so we're getting a poor spread for what is actually there.

I believe Ming Lei has an IRQ affinity patch set that may be going in
4.17 that fixes that.

In the meantime, I think if you add the kernel parameter "nr_cpus=24",
that should get you a much much better affinity for submission and
completion sides.
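[For reference, on a grub2-based CentOS/RHEL box the parameter can be
added with grubby, the stock boot-config tool there; a reboot is
required afterwards. This is the editor's sketch, not part of the
original advice.]

```shell
# Add nr_cpus=24 to every installed kernel's command line, then reboot.
grubby --update-kernel=ALL --args="nr_cpus=24"

# After reboot, confirm the parameter took effect:
cat /proc/cmdline
```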


* NVMe and IRQ Affinity, another problem
  2018-04-05  1:00 ` Keith Busch
@ 2018-04-05  2:31   ` Young Yu
  2018-04-05  2:48     ` Keith Busch
  0 siblings, 1 reply; 4+ messages in thread
From: Young Yu @ 2018-04-05  2:31 UTC (permalink / raw)


Thank you for the quick reply Keith,

The nr_cpus=24 kernel parameter definitely limited the present CPUs
and helped spread the queue interrupts.

If you could forgive me asking another question: the admin queue and
half of the I/O queues of every NVMe device are allocated to cores in
one NUMA node (in my case NUMA node 0, since the admin queue wants to
stay on CPU 0), and the other half of the I/O queues are allocated to
the other node. This happens regardless of whether the device is
attached to NUMA node 0 or 1.

I'm trying to read from the NVMe devices and send the data to the
NIC, and both are attached to the same NUMA node (1). Is it possible
to manually bind the first half of nvme8's queues so they all belong
to cores in that same NUMA node, letting me avoid the slow QPI link
between the NUMA nodes? (Or perhaps excluding the vector shared with
the admin queue, since there will be a patch to separate the admin
queue and the I/O queues soon.)


> On Apr 4, 2018, at 8:00 PM, Keith Busch <keith.busch@intel.com> wrote:
> 
> On Thu, Apr 05, 2018 at 12:28:05AM +0000, Young Yu wrote:
>> Hello,
>> 
>> I know this is another run at an old topic, but I'm still wondering
>> what the right way is to bind the IRQs of NVMe PCI devices to cores in
>> the local NUMA node.  I'm using kernel 4.16.0-1.el7 on CentOS 7.4, and
>> the machine has 2 NUMA nodes:
>> 
>> $ lscpu|grep NUMA
>> NUMA node(s):          2
>> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
>> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
>> 
>> I have 16 NVMe devices, 8 per NUMA node: nvme0 to nvme7 on NUMA node 0
>> and nvme8 to nvme15 on NUMA node 1. irqbalance was on by default.  The
>> IRQs of these devices are all bound to cores 0 and 1 regardless of
>> where the devices are physically attached. affinity_hint still looks
>> invalid, but there is an effective_affinity that matches where some of
>> the interrupts are actually delivered. cpu_list under mq points to the
>> wrong cores for the NVMe devices on NUMA node 1. I read this was fixed
>> in kernel 4.3, so I'm not sure whether I'm looking at it the right way.
>> 
>> Eventually I'd like to know whether there is a way to distribute the
>> IRQs of each NVMe device across different cores local to the NUMA node
>> it is attached to.
> 
> Bad things happened for a lot of servers when the irq spread used
> "present" rather than the "online" CPUs, with the "present" CPUs being
> oh-so-much larger than what is actually possible.
> 
> I'm guessing there's no chance more than 24 CPUs will actually ever
> come online in this system, but your platform says 248 may come online,
> so we're getting a poor spread for what is actually there.
> 
> I believe Ming Lei has an IRQ affinity patch set that may be going in
> 4.17 that fixes that.
> 
> In the meantime, I think if you add the kernel parameter "nr_cpus=24",
> that should get you a much much better affinity for submission and
> completion sides.


* NVMe and IRQ Affinity, another problem
  2018-04-05  2:31   ` Young Yu
@ 2018-04-05  2:48     ` Keith Busch
  0 siblings, 0 replies; 4+ messages in thread
From: Keith Busch @ 2018-04-05  2:48 UTC (permalink / raw)


On Thu, Apr 05, 2018 at 02:31:21AM +0000, Young Yu wrote:
> Thank you for the quick reply Keith,
> 
> The nr_cpus=24 kernel parameter definitely limited the present CPUs
> and helped spread the queue interrupts.
> 
> If you could forgive me asking another question: the admin queue and
> half of the I/O queues of every NVMe device are allocated to cores in
> one NUMA node (in my case NUMA node 0, since the admin queue wants to
> stay on CPU 0), and the other half of the I/O queues are allocated to
> the other node. This happens regardless of whether the device is
> attached to NUMA node 0 or 1.
> 
> I'm trying to read from the NVMe devices and send the data to the
> NIC, and both are attached to the same NUMA node (1). Is it possible
> to manually bind the first half of nvme8's queues so they all belong
> to cores in that same NUMA node, letting me avoid the slow QPI link
> between the NUMA nodes? (Or perhaps excluding the vector shared with
> the admin queue, since there will be a patch to separate the admin
> queue and the I/O queues soon.)

If you are getting interrupts on NUMA node 0, that means your request
originated from a thread running on a CPU in NUMA node 0. If you want
interrupts to wake up a CPU in NUMA node 1, you'll need to pin your IO
submission processes to the CPUs in that node.
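[For example, the submission side can be pinned with numactl or
taskset. "nvme_reader" below is a hypothetical stand-in for whatever
process does the NVMe-to-NIC copy; neither name comes from this
thread.]

```shell
# Hypothetical invocations: bind the submitting process to NUMA node
# 1's CPUs and memory so completions stay local to nvme8 and the NIC.
#
#   numactl --cpunodebind=1 --membind=1 ./nvme_reader /dev/nvme8n1
#   taskset -c 1,3,5,7 ./nvme_reader /dev/nvme8n1
#
# Runnable demonstration with a stand-in command pinned to CPU 0:
taskset -c 0 echo "submitting from a pinned CPU"
```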

