* Receive side performance issue with multi-10-GigE and NUMA
@ 2009-08-07 21:06 Bill Fink
2009-08-07 21:18 ` Brice Goglin
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-07 21:06 UTC (permalink / raw)
To: Linux Network Developers; +Cc: brice, gallatin
I've run into a major receive side performance issue with multi-10-GigE
on a NUMA system. The system is using a SuperMicro X8DAH+-F motherboard
with two 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
1333 MHz DDR3 memory. It is a Fedora 10 system but using the latest
2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel
from Fedora 10).
The test setup is:
i7test1----(6)----xeontest1----(6)----i7test2
         10-GigE             10-GigE
So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total
of 12 10-GigE interfaces. eth2 through eth7 (which are on the
second Intel 5520 I/O Hub) are connected to i7test1 while
eth8 through eth13 (which are on the first Intel 5520 I/O Hub)
are connected to i7test2.
Previous direct testing between i7test1 and i7test2 (which use an
Asus P6T6 WS Revolution motherboard) demonstrated that they could
achieve ~70 Gbps performance for either transmit or receive using
8 10-GigE interfaces.
The transmit side performance of xeontest1 is fantastic:
[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -xc6/3 -p5012 192.168.12.11 &
n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
n9: 11130.5320 MB / 10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
n11: 9418.1250 MB / 10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT
n10: 9279.4758 MB / 10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT
n8: 11142.6574 MB / 10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT
n13: 9422.1492 MB / 10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT
n3: 11471.2500 MB / 10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT
n6: 9339.6354 MB / 10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT
n4: 9093.2500 MB / 10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT
n5: 9121.8367 MB / 10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT
n7: 9292.2500 MB / 10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT
n2: 11487.1150 MB / 10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT
Aggregate performance: 100.4637 Gbps
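The aggregate figure is simply the sum of the per-stream Mbps values. A minimal Python sketch of that arithmetic, assuming nuttcp summary lines shaped like the ones above (the name aggregate_gbps and the regular expression are illustrative, not part of nuttcp):

```python
import re

# Throughput field of a nuttcp summary line, e.g. "= 8091.4066 Mbps".
MBPS_RE = re.compile(r"=\s*(\d+(?:\.\d+)?)\s+Mbps")


def aggregate_gbps(nuttcp_output):
    """Sum the Mbps value of every nuttcp summary line and return Gbps."""
    total_mbps = sum(float(m.group(1)) for m in MBPS_RE.finditer(nuttcp_output))
    return total_mbps / 1000.0


# Two of the twelve transmit results above, used only as sample input.
sample = (
    "n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT\n"
    "n9: 11130.5320 MB / 10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT\n"
)
print("%.4f Gbps" % aggregate_gbps(sample))
```

Feeding all twelve lines of a run through the same function should reproduce the aggregate quoted for that run.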
The problem is with the receive side performance.
[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n11: 6983.6359 MB / 10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT
n10: 7000.1557 MB / 10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT
n9: 2451.7206 MB / 10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT
n13: 2453.0887 MB / 10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT
n12: 2446.5303 MB / 10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT
n8: 2462.5890 MB / 10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT
n4: 2763.5091 MB / 10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT
n5: 2770.0887 MB / 10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT
n2: 1777.7277 MB / 10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT
n6: 1772.7962 MB / 10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT
n3: 1779.4535 MB / 10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT
n7: 1770.8359 MB / 10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT
Aggregate performance: 29.9492 Gbps
I suspected that this was because the memory being allocated by the
myri10ge driver was not being allocated on the optimum NUMA node.
BTW the NUMA nodes on the system are 0 and 2 instead of 0 and 1 which
is what I would have expected, but this is my first experience with
a NUMA system.
Based upon a patch by Peter Zijlstra that I discovered through Google
searching, I tried patching the myri10ge driver to change its page
allocation from alloc_pages() to alloc_pages_node(), specifying the
NUMA node of the parent device of the Myricom 10-GigE device, which
IIUC should be the PCIe switch. This didn't help.
This could be because I discovered that if I did:
find /sys -name numa_node -exec grep . {} /dev/null \;
that the numa_node associated with all the PCI devices was always 0,
and if IIUC then I believe some of the PCI devices should have been
associated with NUMA node 2. Perhaps this is what is causing all
the memory pages allocated by the myri10ge driver to be on NUMA
node 0, and thus causing the major performance issue.
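The same check can be scripted. A minimal Python sketch, assuming each PCI device directory under /sys/bus/pci/devices exposes a numa_node file (the helper names numa_nodes and all_on_one_node are made up for illustration, and the demo uses a throwaway directory tree rather than the real sysfs):

```python
import os
import tempfile


def numa_nodes(root="/sys/bus/pci/devices"):
    """Map each device directory under root to the integer in its numa_node file."""
    nodes = {}
    if not os.path.isdir(root):
        return nodes
    for dev in sorted(os.listdir(root)):
        try:
            with open(os.path.join(root, dev, "numa_node")) as f:
                nodes[dev] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # no numa_node attribute, or unreadable contents
    return nodes


def all_on_one_node(nodes):
    """True when every device reports the same node (the symptom described above)."""
    return len(set(nodes.values())) == 1


# Demo on a fake tree so the sketch runs on any machine; names are placeholders.
with tempfile.TemporaryDirectory() as fake_sysfs:
    for dev, node in (("nic_a", "0"), ("nic_b", "0")):
        os.makedirs(os.path.join(fake_sysfs, dev))
        with open(os.path.join(fake_sysfs, dev, "numa_node"), "w") as f:
            f.write(node + "\n")
    demo = numa_nodes(fake_sysfs)
    print(demo, all_on_one_node(demo))
```

On a two-node system where NICs hang off both I/O hubs, all_on_one_node() returning True for the real sysfs tree would point at the misreporting suspected here.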
To kludge around this, I made a different patch to the myri10ge driver.
This time I hardcoded the NUMA node in the call to alloc_pages_node()
to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
This is of course very specific to our system (NUMA node ids
and Myricom 10-GigE device IRQs), and is not something that would be
generically applicable. But it was useful as a test, and it did
improve the receive side performance substantially!
[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n5: 8221.2911 MB / 10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT
n4: 8237.9524 MB / 10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT
n11: 7935.3750 MB / 10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT
n2: 4543.1621 MB / 10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT
n10: 7916.3925 MB / 10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT
n7: 4558.4817 MB / 10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT
n13: 4390.1875 MB / 10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT
n3: 4572.6478 MB / 10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT
n6: 4564.4776 MB / 10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT
n8: 4409.8551 MB / 10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT
n9: 4412.7836 MB / 10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT
n12: 4413.4061 MB / 10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT
Aggregate performance: 56.4703 Gbps
This is nearly double (about 1.9x) the receive side performance
obtained without the patch.
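For clarity, the hardcoded IRQ-to-node mapping used in that test amounts to the lookup below. This is only a Python rendering of the rule stated above; the actual change was made in C inside the myri10ge driver and is not shown in this message, and the names IRQ_RANGE_TO_NODE and node_for_irq are hypothetical:

```python
# Hypothetical rendering of the hardcoded rule; IRQ numbers and node ids are
# specific to this one system and come from the description above.
IRQ_RANGE_TO_NODE = (
    (range(113, 119), 2),  # IRQs 113..118: eth2 through eth7
    (range(119, 125), 0),  # IRQs 119..124: eth8 through eth13
)


def node_for_irq(irq, fallback=None):
    """Return the NUMA node hardcoded for this IRQ, or fallback if it is not listed."""
    for irqs, node in IRQ_RANGE_TO_NODE:
        if irq in irqs:
            return node
    return fallback


for irq in (113, 118, 119, 124, 125):
    print(irq, node_for_irq(irq))
```

A table like this is obviously not portable, which is why a proper fix has to come from correct numa_node information rather than from the driver.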
I don't know if this is fundamentally a myri10ge driver issue or
some underlying Linux kernel issue, so it's not clear to me what
a proper fix would be.
Finally, while definitely a major improvement, I think it should be
possible to do even better, since we achieved 70 Gbps in the i7 to i7
tests, and probably could have done 80 Gbps except for an Asus
motherboard restriction with the interconnect between the Intel X58
and Nvidia NF200 chips. It's definitely a big step in the right
direction though if this issue can be resolved.
Any help greatly appreciated in advance.
-Thanks
-Bill
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
@ 2009-08-07 21:18 ` Brice Goglin
2009-08-07 21:51 ` Bill Fink
2009-08-07 22:12 ` Neil Horman
2009-08-12 23:29 ` David Miller
2 siblings, 1 reply; 89+ messages in thread
From: Brice Goglin @ 2009-08-07 21:18 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
> This could be because I discovered that if I did:
>
> find /sys -name numa_node -exec grep . {} /dev/null \;
>
> that the numa_node associated with all the PCI devices was always 0,
> and if IIUC then I believe some of the PCI devices should have been
> associated with NUMA node 2. Perhaps this is what is causing all
> the memory pages allocated by the myri10ge driver to be on NUMA
> node 0, and thus causing the major performance issue.
>

I've seen some cases in the past where numa_node was always 0 on
quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
later kernels thanks to patches from Yinghai Lu (CC'ed).

Is the corresponding local_cpus sysfs file wrong as well ?

Maybe your kernel doesn't properly handle the NUMA location of PCI
devices on Nehalem machines yet?

Brice

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:18 ` Brice Goglin
@ 2009-08-07 21:51 ` Bill Fink
2009-08-07 21:53 ` Brice Goglin
2009-08-08 1:03 ` Andrew Gallatin
0 siblings, 2 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-07 21:51 UTC (permalink / raw)
To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Fri, 07 Aug 2009, Brice Goglin wrote:
> Bill Fink wrote:
> > This could be because I discovered that if I did:
> >
> > find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > that the numa_node associated with all the PCI devices was always 0,
> > and if IIUC then I believe some of the PCI devices should have been
> > associated with NUMA node 2. Perhaps this is what is causing all
> > the memory pages allocated by the myri10ge driver to be on NUMA
> > node 0, and thus causing the major performance issue.
> >
>
> I've seen some cases in the past where numa_node was always 0 on
> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
> later kernels thanks to patches from Yinghai Lu (CC'ed).

By later kernels do you mean 2.6.30 or 2.6.31?

> Is the corresponding local_cpus sysfs file wrong as well ?

All sysfs local_cpus values are the same (00000000,000000ff),
so yes they are also wrong.

> Maybe your kernel doesn't properly handle the NUMA location of PCI
> devices on Nehalem machines yet?

I assume so, unless there's some secret NUMA system setting that
I'm unaware of that would affect this and needs changing for my setup.

-Thanks
-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:51 ` Bill Fink
@ 2009-08-07 21:53 ` Brice Goglin
2009-08-07 22:08 ` Bill Fink
2009-08-08 1:03 ` Andrew Gallatin
1 sibling, 1 reply; 89+ messages in thread
From: Brice Goglin @ 2009-08-07 21:53 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
>> I've seen some cases in the past where numa_node was always 0 on
>> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
>> later kernels thanks to patches from Yinghai Lu (CC'ed).
>>
>
> By later kernels do you mean 2.6.30 or 2.6.31?
>

No, I meant "later than when the problem occurred". I was using 2.6.22 at
that point and the problem was fixed somewhere around 2.6.25.

>> Is the corresponding local_cpus sysfs file wrong as well ?
>>
>
> All sysfs local_cpus values are the same (00000000,000000ff),
> so yes they are also wrong.
>

And hyperthreading is enabled, right?

Brice

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:53 ` Brice Goglin
@ 2009-08-07 22:08 ` Bill Fink
2009-08-07 22:17 ` Brice Goglin
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-07 22:08 UTC (permalink / raw)
To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Fri, 07 Aug 2009, Brice Goglin wrote:
> Bill Fink wrote:
> >> I've seen some cases in the past where numa_node was always 0 on
> >> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
> >> later kernels thanks to patches from Yinghai Lu (CC'ed).
> >
> > By later kernels do you mean 2.6.30 or 2.6.31?
>
> No, I meant "later than when the problem occurred". I was using 2.6.22 at
> that point and the problem was fixed somewhere around 2.6.25.

OK. The tests were run on a 2.6.29.6 kernel so presumably should
have included the fix you mentioned.

> >> Is the corresponding local_cpus sysfs file wrong as well ?
> >
> > All sysfs local_cpus values are the same (00000000,000000ff),
> > so yes they are also wrong.
>
> And hyperthreading is enabled, right?

No, hyperthreading is disabled. It's a dual quad-core system so there
are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node 2.

-Thanks
-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 22:08 ` Bill Fink
@ 2009-08-07 22:17 ` Brice Goglin
2009-08-07 22:55 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Brice Goglin @ 2009-08-07 22:17 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
> OK. The tests were run on a 2.6.29.6 kernel so presumably should
> have included the fix you mentioned.
>

Yes, but I wanted to emphasize that new platforms sometimes need new
code to handle this kind of thing. Some Nehalem-specific changes might
be needed now.

>>>> Is the corresponding local_cpus sysfs file wrong as well ?
>>>>
>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>> so yes they are also wrong.
>>>
>> And hyperthreading is enabled, right?
>>
>
> No, hyperthreading is disabled. It's a dual quad-core system so there
> are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node 2.
>

So numa_node says that the device is close to node 0 while local_cpus
says that it's close to all 8 cores, i.e. close to both node 0 and node 2
(which may well be wrong as well).

Brice

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 22:17 ` Brice Goglin
@ 2009-08-07 22:55 ` Bill Fink
0 siblings, 0 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-07 22:55 UTC (permalink / raw)
To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Sat, 08 Aug 2009, Brice Goglin wrote:
> Bill Fink wrote:
> > OK. The tests were run on a 2.6.29.6 kernel so presumably should
> > have included the fix you mentioned.
>
> Yes, but I wanted to emphasize that new platforms sometimes need new
> code to handle this kind of thing. Some Nehalem-specific changes might
> be needed now.

Thanks for the clarification.

> >>>> Is the corresponding local_cpus sysfs file wrong as well ?
> >>>>
> >>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>> so yes they are also wrong.
> >>>
> >> And hyperthreading is enabled, right?
> >>
> >
> > No, hyperthreading is disabled. It's a dual quad-core system so there
> > are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node 2.
>
> So numa_node says that the device is close to node 0 while local_cpus
> says that it's close to all 8 cores, i.e. close to both node 0 and node 2
> (which may well be wrong as well).

I believe it is wrong. The basic system architecture is:

Memory----CPU1----QPI----CPU2----Memory
           |              |
           |              |
          QPI            QPI
           |              |
           |              |
          5520----QPI----5520
          ||||           ||||
          ||||           ||||
          ||||           ||||
          PCIe           PCIe

There are 2 x8, 1 x16, and 1 x4 PCIe 2.0 interfaces on each of the
Intel 5520 I/O Hubs. The Myricom dual-port 10-GigE NICs are in the
six x8 or better slots. eth2 through eth7 are on the second
Intel 5520 I/O Hub, so they should presumably show up on NUMA node 2,
and have local CPUs 1, 3, 5, and 7. eth8 through eth13 are on the
first Intel 5520 I/O Hub, and thus should be on NUMA node 0 with
local CPUs 0, 2, 4, and 6 (CPU info derived from /proc/cpuinfo).

-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
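To make the local_cpus values discussed in the message above concrete: the sysfs mask is a comma-separated list of 32-bit hex words, most significant word first, with one bit per CPU. A minimal Python sketch of the decoding and encoding (helper names are made up for illustration, and the encoder assumes the CPU numbers fit in the requested number of words):

```python
def cpus_in_mask(mask):
    """Decode a sysfs CPU mask such as "00000000,000000ff" into a sorted CPU list."""
    # The comma-separated groups are 32-bit hex words, most significant first.
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def mask_for_cpus(cpus, words=2):
    """Encode a CPU list in the same format (assumes every CPU < 32 * words)."""
    value = sum(1 << cpu for cpu in set(cpus))
    hexstr = "%0*x" % (8 * words, value)
    return ",".join(hexstr[i:i + 8] for i in range(0, len(hexstr), 8))


print(cpus_in_mask("00000000,000000ff"))  # the value reported for every device
print(mask_for_cpus([1, 3, 5, 7]))        # CPUs expected for the second I/O hub
print(mask_for_cpus([0, 2, 4, 6]))        # CPUs expected for the first I/O hub
```

Under the CPU numbering given above, correct local_cpus files would therefore read 00000000,000000aa for eth2 through eth7 and 00000000,00000055 for eth8 through eth13, rather than 00000000,000000ff everywhere.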
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:51 ` Bill Fink
2009-08-07 21:53 ` Brice Goglin
@ 2009-08-08 1:03 ` Andrew Gallatin
2009-08-08 1:35 ` Bill Fink
1 sibling, 1 reply; 89+ messages in thread
From: Andrew Gallatin @ 2009-08-08 1:03 UTC (permalink / raw)
To: Bill Fink; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink wrote:

> All sysfs local_cpus values are the same (00000000,000000ff),
> so yes they are also wrong.

How were you handling IRQ binding? If local_cpus is wrong,
irqbalance will not be able to make good decisions about
where to bind the NICs' IRQs. Did you try manually binding
each NIC's interrupt to a separate CPU on the correct node?

Regards,

Drew

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 1:03 ` Andrew Gallatin
@ 2009-08-08 1:35 ` Bill Fink
2009-08-08 11:08 ` Andrew Gallatin
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-08 1:35 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> Bill Fink wrote:
>
> > All sysfs local_cpus values are the same (00000000,000000ff),
> > so yes they are also wrong.
>
> How were you handling IRQ binding? If local_cpus is wrong,
> irqbalance will not be able to make good decisions about
> where to bind the NICs' IRQs. Did you try manually binding
> each NIC's interrupt to a separate CPU on the correct node?

Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
and the nuttcp application had its CPU affinity set to the same
CPU with its memory affinity bound to the same local NUMA node.
And the irqbalance daemon wasn't running.

-Thanks
-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
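Manual IRQ binding of the kind described in the message above is normally done by writing a hex CPU bitmask to /proc/irq/<irq>/smp_affinity. A minimal Python sketch that only generates the commands, assuming one CPU per IRQ (the helper names and the IRQ-to-CPU pairing are hypothetical, since the thread does not say which CPU each IRQ was bound to):

```python
def smp_affinity_value(cpu):
    """Hex bitmask (no 0x prefix) selecting a single CPU, as written to smp_affinity."""
    return "%x" % (1 << cpu)


def binding_commands(irq_to_cpu):
    """Build shell commands that would pin each IRQ to one CPU (run them as root)."""
    return [
        "echo %s > /proc/irq/%d/smp_affinity" % (smp_affinity_value(cpu), irq)
        for irq, cpu in sorted(irq_to_cpu.items())
    ]


# Hypothetical pairing, for illustration only.
example = {113: 1, 119: 0}
for line in binding_commands(example):
    print(line)
```

The irqbalance daemon has to be stopped first, as it was here, or it may rewrite these settings.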
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 1:35 ` Bill Fink
@ 2009-08-08 11:08 ` Andrew Gallatin
2009-08-08 11:26 ` Neil Horman
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Gallatin @ 2009-08-08 11:08 UTC (permalink / raw)
To: Bill Fink; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink wrote:
> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>
>> Bill Fink wrote:
>>
>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>> so yes they are also wrong.
>> How were you handling IRQ binding? If local_cpus is wrong,
>> irqbalance will not be able to make good decisions about
>> where to bind the NICs' IRQs. Did you try manually binding
>> each NIC's interrupt to a separate CPU on the correct node?
>
> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> and the nuttcp application had its CPU affinity set to the same
> CPU with its memory affinity bound to the same local NUMA node.
> And the irqbalance daemon wasn't running.

I must be misunderstanding something. I had thought that
alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
would allocate based on default policy which (if not interleaved)
should allocate from the current NUMA node. And since restocking the
RX ring happens from the driver's NAPI softirq context, then it
should always be restocking on the same node the memory is destined to
be consumed on.

Do I just not understand how alloc_pages() works on NUMA?

Thanks,

Drew

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 11:08 ` Andrew Gallatin
@ 2009-08-08 11:26 ` Neil Horman
2009-08-08 18:21 ` Andrew Gallatin
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-08 11:26 UTC (permalink / raw)
To: Andrew Gallatin
Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> Bill Fink wrote:
> > On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> >
> >> Bill Fink wrote:
> >>
> >>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>> so yes they are also wrong.
> >> How were you handling IRQ binding? If local_cpus is wrong,
> >> irqbalance will not be able to make good decisions about
> >> where to bind the NICs' IRQs. Did you try manually binding
> >> each NIC's interrupt to a separate CPU on the correct node?
> >
> > Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> > and the nuttcp application had its CPU affinity set to the same
> > CPU with its memory affinity bound to the same local NUMA node.
> > And the irqbalance daemon wasn't running.
>
> I must be misunderstanding something. I had thought that
> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> would allocate based on default policy which (if not interleaved)
> should allocate from the current NUMA node. And since restocking the
> RX ring happens from the driver's NAPI softirq context, then it
> should always be restocking on the same node the memory is destined to
> be consumed on.
>
> Do I just not understand how alloc_pages() works on NUMA?
>

That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill their rx
ring in their napi context. netdev_alloc_skb specifically allocates an skb from
memory in the node that the actual NIC is local to (rather than the cpu that
the interrupt is running on). That cuts out cross numa node chatter when the
device is dma-ing a frame from the hardware to the allocated skb. The offshoot
of that however (especially in 10G cards with lots of rx queues whose interrupts
are spread out through the system) is that the irq affinity for a given irq has
an increased risk of not being on the same node as the skb memory. The ftrace
module I referenced earlier will help illustrate this, as well as cases where
it's causing applications to run on processors that create lots of cross-node
chatter.

Neil

> Thanks,
>
> Drew
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 11:26 ` Neil Horman
@ 2009-08-08 18:21 ` Andrew Gallatin
2009-08-08 18:32 ` Neil Horman
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Gallatin @ 2009-08-08 18:21 UTC (permalink / raw)
To: Neil Horman; +Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

Neil Horman wrote:
> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>> Bill Fink wrote:
>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>
>>>> Bill Fink wrote:
>>>>
>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>> so yes they are also wrong.
>>>> How were you handling IRQ binding? If local_cpus is wrong,
>>>> irqbalance will not be able to make good decisions about
>>>> where to bind the NICs' IRQs. Did you try manually binding
>>>> each NIC's interrupt to a separate CPU on the correct node?
>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
>>> and the nuttcp application had its CPU affinity set to the same
>>> CPU with its memory affinity bound to the same local NUMA node.
>>> And the irqbalance daemon wasn't running.
>> I must be misunderstanding something. I had thought that
>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
>> would allocate based on default policy which (if not interleaved)
>> should allocate from the current NUMA node. And since restocking the
>> RX ring happens from the driver's NAPI softirq context, then it
>> should always be restocking on the same node the memory is destined to
>> be consumed on.
>>
>> Do I just not understand how alloc_pages() works on NUMA?
>>
>
> That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill their rx
> ring in their napi context. netdev_alloc_skb specifically allocates an skb from
> memory in the node that the actual NIC is local to (rather than the cpu that
> the interrupt is running on). That cuts out cross numa node chatter when the
> device is dma-ing a frame from the hardware to the allocated skb. The offshoot
> of that however (especially in 10G cards with lots of rx queues whose interrupts
> are spread out through the system) is that the irq affinity for a given irq has
> an increased risk of not being on the same node as the skb memory. The ftrace
> module I referenced earlier will help illustrate this, as well as cases where
> it's causing applications to run on processors that create lots of cross-node
> chatter.

One thing worth noting is that myri10ge is rather unusual in that
it fills its RX rings with pages, then attaches them to skbs after
the receive is done. Given how (I think) alloc_page() works, I
don't understand why correct CPU binding does not have the same
benefit as Bill's patch to assign the NUMA node manually.

I'm certainly willing to change myri10ge to use alloc_pages_node()
based on NIC locality, if that provides a benefit, but I'd really
like to understand why CPU binding is not helping.

Drew

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 18:21 ` Andrew Gallatin
@ 2009-08-08 18:32 ` Neil Horman
2009-08-11 7:32 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-08 18:32 UTC (permalink / raw)
To: Andrew Gallatin
Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> Neil Horman wrote:
>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>>> Bill Fink wrote:
>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>>
>>>>> Bill Fink wrote:
>>>>>
>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>>> so yes they are also wrong.
>>>>> How were you handling IRQ binding? If local_cpus is wrong,
>>>>> irqbalance will not be able to make good decisions about
>>>>> where to bind the NICs' IRQs. Did you try manually binding
>>>>> each NIC's interrupt to a separate CPU on the correct node?
>>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
>>>> and the nuttcp application had its CPU affinity set to the same
>>>> CPU with its memory affinity bound to the same local NUMA node.
>>>> And the irqbalance daemon wasn't running.
>>> I must be misunderstanding something. I had thought that
>>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
>>> would allocate based on default policy which (if not interleaved)
>>> should allocate from the current NUMA node. And since restocking the
>>> RX ring happens from the driver's NAPI softirq context, then it
>>> should always be restocking on the same node the memory is destined to
>>> be consumed on.
>>>
>>> Do I just not understand how alloc_pages() works on NUMA?
>>>
>>
>> That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill their rx
>> ring in their napi context. netdev_alloc_skb specifically allocates an skb from
>> memory in the node that the actual NIC is local to (rather than the cpu that
>> the interrupt is running on). That cuts out cross numa node chatter when the
>> device is dma-ing a frame from the hardware to the allocated skb. The offshoot
>> of that however (especially in 10G cards with lots of rx queues whose interrupts
>> are spread out through the system) is that the irq affinity for a given irq has
>> an increased risk of not being on the same node as the skb memory. The ftrace
>> module I referenced earlier will help illustrate this, as well as cases where
>> it's causing applications to run on processors that create lots of cross-node
>> chatter.
>
> One thing worth noting is that myri10ge is rather unusual in that
> it fills its RX rings with pages, then attaches them to skbs after
> the receive is done. Given how (I think) alloc_page() works, I
> don't understand why correct CPU binding does not have the same
> benefit as Bill's patch to assign the NUMA node manually.
>
> I'm certainly willing to change myri10ge to use alloc_pages_node()
> based on NIC locality, if that provides a benefit, but I'd really
> like to understand why CPU binding is not helping.
>

That's hard to say. If binding the app to a cpu on the same node doesn't help,
that would suggest to me:

1) That the process binding isn't being honored
2) The cpu you're binding to isn't actually on the same node
3) The node which the skbs are allocated on is not the one you think it is
4) The cross numa chatter is improved, but another problem has taken its place
(like cpu contention between the process and the interrupt handler on the same
cpu)
5) The problem is something else entirely.

Either way, I'd suggest applying and running the patch set that I referenced
previously. It will give you a good table representation of how skbs for this
process are being allocated and consumed, and let you confirm or eliminate
items 1-4 above.

Neil

> Drew
>

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 18:32 ` Neil Horman
@ 2009-08-11 7:32 ` Bill Fink
2009-08-11 11:02 ` Neil Horman
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-11 7:32 UTC (permalink / raw)
To: Neil Horman
Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, 8 Aug 2009, Neil Horman wrote:

> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > Neil Horman wrote:
> >> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> >>> Bill Fink wrote:
> >>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> >>>>
> >>>>> Bill Fink wrote:
> >>>>>
> >>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>>>>> so yes they are also wrong.
> >>>>> How were you handling IRQ binding? If local_cpus is wrong,
> >>>>> the irqbalance will not be able to make good decisions about
> >>>>> where to bind the NICs' IRQs. Did you try manually binding
> >>>>> each NIC's interrupt to a separate CPU on the correct node?
> >>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> >>>> and the nuttcp application had its CPU affinity set to the same
> >>>> CPU with its memory affinity bound to the same local NUMA node.
> >>>> And the irqbalance daemon wasn't running.
> >>> I must be misunderstanding something. I had thought that
> >>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> >>> would allocate based on default policy which (if not interleaved)
> >>> should allocate from the current NUMA node. And since restocking the
> >>> RX ring happens from the driver's NAPI softirq context, then it
> >>> should always be restocking on the same node the memory is destined to
> >>> be consumed on.
> >>>
> >>> Do I just not understand how alloc_pages() works on NUMA?
> >>>
> >>
> >> That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill their rx
> >> ring in their napi context. netdev_alloc_skb specifically allocates an skb from
> >> memory in the node that the actual NIC is local to (rather than the cpu that
> >> the interrupt is running on). That cuts out cross numa node chatter when the
> >> device is dma-ing a frame from the hardware to the allocated skb. The offshoot
> >> of that however (especially in 10G cards with lots of rx queues whose interrupts
> >> are spread out through the system) is that the irq affinity for a given irq has
> >> an increased risk of not being on the same node as the skb memory. The ftrace
> >> module I referenced earlier will help illustrate this, as well as cases where
> >> it's causing applications to run on processors that create lots of cross-node
> >> chatter.
> >
> > One thing worth noting is that myri10ge is rather unusual in that
> > it fills its RX rings with pages, then attaches them to skbs after
> > the receive is done. Given how (I think) alloc_page() works, I
> > don't understand why correct CPU binding does not have the same
> > benefit as Bill's patch to assign the NUMA node manually.
> >
> > I'm certainly willing to change myri10ge to use alloc_pages_node()
> > based on NIC locality, if that provides a benefit, but I'd really
> > like to understand why CPU binding is not helping.

I originally tried to just use alloc_pages_node() instead of alloc_pages(),
but it didn't help. As mentioned in an earlier e-mail, that seems to
be because I discovered that doing:

	find /sys -name numa_node -exec grep . {} /dev/null \;

revealed that the NUMA node associated with _all_ the PCI devices was
always 0, when at least some of them should have been associated with
NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.

I discovered today that the NUMA node cpulist/cpumap is also wrong.
A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
cpumask of 00000000,000000ff), while the cpulist for node2 is empty
(with a cpumask of 00000000,00000000). The distance is correct,
with "10 20" for node 0 and "20 10" for node2.

Since there seems to be an underlying kernel issue here, what would
be the proper place to address the apparently incorrect assignment
of NUMA node information for this system?

Even with my hacked workaround, which roughly doubled the receive side
performance compared to running without it, the performance level was
still below what I would have expected to be possible, based on some
other tests I ran, such as the following single and multiple parallel
nuttcp loopback tests.

On Asus P6T6 motherboard with single Intel i7 965 3.2 GHz (overclocked
to 3.4 GHz) quad-core processor (non-NUMA):

Single nuttcp loopback test using CPUs 0 and 1:

[root@i7test1 ~]# nuttcp -xc0/1 192.168.1.10
44948.3125 MB / 10.04 sec = 37554.1394 Mbps 99 %TX 75 %RX 0 retrans 0.04 msRTT

Two parallel nuttcp loopback tests using CPUs 0, 1, 2, and 3:

[root@i7test1 ~]# nuttcp -xc0/1 -p5101 192.168.1.10 & nuttcp -xc2/3 -p5102 192.168.1.10 &
43595.0000 MB / 10.04 sec = 36423.4339 Mbps 99 %TX 82 %RX 0 retrans 0.04 msRTT
43384.5000 MB / 10.04 sec = 36247.5115 Mbps 99 %TX 74 %RX 0 retrans 0.02 msRTT

Aggregate performance: 72.6709 Gbps

On SuperMicro X8DAH+-F motherboard with dual Intel Xeon 5580 3.2 GHz
quad-core processors (NUMA):

Single nuttcp loopback test using CPUs 0 and 2 on NUMA node 0:

[root@xeontest1 ~]# nuttcp -xc0/2 192.168.1.14
39348.0000 MB / 10.04 sec = 32875.4865 Mbps 99 %TX 59 %RX 0 retrans 0.06 msRTT

Two parallel nuttcp loopback tests using CPUs 0, 2, 4, and 6 on NUMA node 0:

[root@xeontest1 ~]# nuttcp -xc0/2 -p5101 192.168.1.14 & nuttcp -xc4/6 -p5102 192.168.1.14 &
36197.0625 MB / 10.04 sec = 30245.0918 Mbps 99 %TX 59 %RX 0 retrans 0.06 msRTT
38153.5000 MB / 10.04 sec = 31876.4556 Mbps 99 %TX 75 %RX 0 retrans 0.04 msRTT

Aggregate performance: 62.1215 Gbps

While the performance using a single Xeon 5580 quad-core processor
on the SuperMicro system was 12.5 % to 14.5 % slower than the single
i7 965 quad-core processor on the Asus system, when you use both of
the Xeon 5580 quad-core processors:

Four parallel nuttcp loopback tests using CPUs 0, 2, 4, and 6 on NUMA node 0,
and CPUs 1, 3, 5, and 7 on NUMA node 2:

[root@xeontest1 ~]# nuttcp -xc0/2 -p5101 192.168.1.14 & nuttcp -xc4/6 -p5102 192.168.1.14 & numactl --membind=2 nuttcp -xc1/3 -p5103 192.168.1.14 & numactl --membind=2 nuttcp -xc5/7 -p5104 192.168.1.14 &
36340.4375 MB / 10.04 sec = 30363.2672 Mbps 99 %TX 71 %RX 0 retrans 0.06 msRTT
36344.1250 MB / 10.04 sec = 30365.1838 Mbps 99 %TX 70 %RX 0 retrans 0.04 msRTT
34134.5625 MB / 10.04 sec = 28519.0180 Mbps 98 %TX 67 %RX 0 retrans 0.06 msRTT
34812.6875 MB / 10.04 sec = 29085.5312 Mbps 99 %TX 66 %RX 0 retrans 0.04 msRTT

Aggregate performance: 118.3330 Gbps

Overall the SuperMicro system outperforms the Asus system by 62.8 %.

Since a test between a pair of the i7 test systems achieved an
aggregate performance of ~70 Gbps, and could probably have achieved
80 Gbps except for a motherboard restriction, it would seem the
dual Xeon system should be able to achieve at least the same level
of aggregate performance. On the transmit side it excels, achieving
100 Gbps. But on the receive side, even with my hacked workaround,
it tops out at 56 Gbps.

I would welcome any further ideas on what might still be limiting
the aggregate receive side performance of the dual Xeon NUMA system.

> That's hard to say. If binding the app to a cpu on the same node doesn't help,
> that would suggest to me:
>
> 1) That the process binding isn't being honored
> 2) The cpu you're binding to isn't actually on the same node
> 3) The node which the skb's are allocated on is not the one you think it is
> 4) The cross numa chatter is improved, but another problem has taken its place
> (like cpu contention between the process and the interrupt handler on the same
> cpu)
> 5) The problem is something else entirely.
>
> Either way, I'd suggest applying and running the patch set that I referenced
> previously. It will give you a good table representation of how skbs for this
> process are being allocated and consumed, and let you confirm or eliminate items
> 1-4 above.

Unfortunately I haven't had a chance to try that yet, as I was away
for the weekend and then there was an emergency at work today. But
I will hopefully get a chance to try it out shortly. I had some
initial concerns about just how much trace data would be generated
for a 10-second 10-GigE (or 100-GigE) test, but after doing some
quick calculations for 9000 byte jumbo frames, I guess it's a
manageable amount of data.

-Thanks

-Bill
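For readers who want to re-check the arithmetic in the message above, here is a minimal Python sketch. The per-stream rates are copied from the nuttcp output quoted in the message; the helper names are invented for this illustration, and the frame-count estimate ignores all protocol overhead, so treat it as a rough upper bound rather than what the trace tooling would actually record.

```python
# Per-stream nuttcp rates in Mbps, copied from the loopback results above.
I7_SINGLE = 37554.1394
I7_TWO_STREAMS = [36423.4339, 36247.5115]
XEON_SINGLE = 32875.4865
XEON_ONE_SOCKET = [30245.0918, 31876.4556]
XEON_TWO_SOCKETS = [30363.2672, 30365.1838, 28519.0180, 29085.5312]


def aggregate_gbps(rates_mbps):
    """Sum per-stream rates given in Mbps and convert the total to Gbps."""
    return sum(rates_mbps) / 1000.0


def percent_slower(value, baseline):
    """How far `value` falls below `baseline`, as a percentage of baseline."""
    return (1.0 - value / baseline) * 100.0


def percent_faster(value, baseline):
    """How far `value` exceeds `baseline`, as a percentage of baseline."""
    return (value / baseline - 1.0) * 100.0


def frames_in_test(rate_gbps, seconds, frame_bytes=9000):
    """Rough number of frames moved, ignoring header overhead (assumption)."""
    return rate_gbps * 1e9 * seconds / (frame_bytes * 8)


if __name__ == "__main__":
    i7 = aggregate_gbps(I7_TWO_STREAMS)
    xeon1 = aggregate_gbps(XEON_ONE_SOCKET)
    xeon2 = aggregate_gbps(XEON_TWO_SOCKETS)
    print("i7 aggregate        : %.4f Gbps" % i7)
    print("Xeon, one socket    : %.4f Gbps" % xeon1)
    print("Xeon, both sockets  : %.4f Gbps" % xeon2)
    print("single stream slower: %.1f %%" % percent_slower(XEON_SINGLE, I7_SINGLE))
    print("two streams slower  : %.1f %%" % percent_slower(xeon1, i7))
    print("both sockets faster : %.1f %%" % percent_faster(xeon2, i7))
    print("frames, 100 Gbps/10s: %.0f" % frames_in_test(100, 10))
```

Under these assumptions a 10-second run at 100 Gbps with 9000 byte frames moves on the order of 14 million frames, which is consistent with the remark that the trace volume should be manageable.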
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-11 7:32 ` Bill Fink
@ 2009-08-11 11:02 ` Neil Horman
2009-08-11 19:15 ` Christoph Lameter
2009-08-11 22:27 ` Andi Kleen
2009-08-12 0:02 ` Brandeburg, Jesse
2 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-11 11:02 UTC (permalink / raw)
To: Bill Fink
Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Tue, Aug 11, 2009 at 03:32:10AM -0400, Bill Fink wrote:
> On Sat, 8 Aug 2009, Neil Horman wrote:
>
> > On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > > Neil Horman wrote:
> > >> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> > >>> Bill Fink wrote:
> > >>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > >>>>
> > >>>>> Bill Fink wrote:
> > >>>>>
> > >>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> > >>>>>> so yes they are also wrong.
> > >>>>> How were you handling IRQ binding? If local_cpus is wrong,
> > >>>>> the irqbalance will not be able to make good decisions about
> > >>>>> where to bind the NICs' IRQs. Did you try manually binding
> > >>>>> each NIC's interrupt to a separate CPU on the correct node?
> > >>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> > >>>> and the nuttcp application had its CPU affinity set to the same
> > >>>> CPU with its memory affinity bound to the same local NUMA node.
> > >>>> And the irqbalance daemon wasn't running.
> > >>> I must be misunderstanding something. I had thought that
> > >>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> > >>> would allocate based on default policy which (if not interleaved)
> > >>> should allocate from the current NUMA node. And since restocking the
> > >>> RX ring happens from the driver's NAPI softirq context, then it
> > >>> should always be restocking on the same node the memory is destined to
> > >>> be consumed on.
> > >>>
> > >>> Do I just not understand how alloc_pages() works on NUMA?
> > >>>
> > >>
> > >> That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill their rx
> > >> ring in their napi context. netdev_alloc_skb specifically allocates an skb from
> > >> memory in the node that the actual NIC is local to (rather than the cpu that
> > >> the interrupt is running on). That cuts out cross numa node chatter when the
> > >> device is dma-ing a frame from the hardware to the allocated skb. The offshoot
> > >> of that however (especially in 10G cards with lots of rx queues whose interrupts
> > >> are spread out through the system) is that the irq affinity for a given irq has
> > >> an increased risk of not being on the same node as the skb memory. The ftrace
> > >> module I referenced earlier will help illustrate this, as well as cases where
> > >> it's causing applications to run on processors that create lots of cross-node
> > >> chatter.
> > >
> > > One thing worth noting is that myri10ge is rather unusual in that
> > > it fills its RX rings with pages, then attaches them to skbs after
> > > the receive is done. Given how (I think) alloc_page() works, I
> > > don't understand why correct CPU binding does not have the same
> > > benefit as Bill's patch to assign the NUMA node manually.
> > >
> > > I'm certainly willing to change myri10ge to use alloc_pages_node()
> > > based on NIC locality, if that provides a benefit, but I'd really
> > > like to understand why CPU binding is not helping.
>
> I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> but it didn't help. As mentioned in an earlier e-mail, that seems to
> be because I discovered that doing:
>
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
>
> revealed that the NUMA node associated with _all_ the PCI devices was
> always 0, when at least some of them should have been associated with
> NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
>
> I discovered today that the NUMA node cpulist/cpumap is also wrong.
> A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> (with a cpumask of 00000000,00000000). The distance is correct,
> with "10 20" for node 0 and "20 10" for node2.
>
> Since there seems to be an underlying kernel issue here, what would
> be the proper place to address the apparently incorrect assignment
> of NUMA node information for this system?
>

Well, it's possible that there is a kernel bug, but those tables that you're
reading are parsed IIRC directly from the system's SRAT table in ACPI space. I'm
not sure of a way to read those directly from user space, but IIRC if you turn
on apic debugging they will get dumped out. It sounds as though perhaps your
SRAT table is incorrectly reporting the location of your devices. You may also
want to look at dumping out your smbios via dmidecode to see where that places
all your 10G nic cards.

Neil
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-11 11:02 ` Neil Horman
@ 2009-08-11 19:15 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2009-08-11 19:15 UTC (permalink / raw)
To: Neil Horman
Cc: Bill Fink, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Tue, 11 Aug 2009, Neil Horman wrote:

> Well, it's possible that there is a kernel bug, but those tables that you're
> reading are parsed IIRC directly from the system's SRAT table in ACPI space. I'm
> not sure of a way to read those directly from user space, but IIRC if you turn
> on apic debugging they will get dumped out. It sounds as though perhaps your
> SRAT table is incorrectly reporting the location of your devices. You may also
> want to look at dumping out your smbios via dmidecode to see where that places
> all your 10G nic cards.

Very likely. Talk to the manufacturer of the machine and make sure that
the ACPI information is correct. NUMA is new to many vendors because of
the recent introduction of newer processor architectures that support NUMA
for the first time in small SMP machines.
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-11 7:32 ` Bill Fink
2009-08-11 11:02 ` Neil Horman
@ 2009-08-11 22:27 ` Andi Kleen
2009-08-12 4:30 ` Bill Fink
2009-08-12 0:02 ` Brandeburg, Jesse
2 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2009-08-11 22:27 UTC (permalink / raw)
To: Bill Fink
Cc: Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink <billfink@mindspring.com> writes:
>
> I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> but it didn't help. As mentioned in an earlier e-mail, that seems to
> be because I discovered that doing:
>
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
>
> revealed that the NUMA node associated with _all_ the PCI devices was
> always 0, when at least some of them should have been associated with
> NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.

> I discovered today that the NUMA node cpulist/cpumap is also wrong.
> A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> (with a cpumask of 00000000,00000000). The distance is correct,
> with "10 20" for node 0 and "20 10" for node2.

When the CPU nodes are not correct the device nodes are unlikely
to be correct either. In fact your system likely has no node 1 configured,
right?

This information comes from the BIOS. So either your BIOS is broken
or you simply didn't enable NUMA mode in the BIOS, but configured
memory interleaving.

If you post dmesg output somewhere I can take a look.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
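The cpulist and cpumask strings discussed above ("0-7" and 00000000,000000ff) encode the same information in two forms. The following Python sketch shows one way to expand both; the function names and the `sysfs_root` parameter are made up for this example, and it assumes the cpumask groups are zero-padded 32-bit hex words with the most significant word first, as in the values quoted in the thread.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a sysfs cpulist such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (e.g. the old node2) yields no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_cpumask(text):
    """Expand a sysfs cpumask such as "00000000,000000ff" into CPU numbers."""
    value = int(text.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def node_cpus(sysfs_root="/sys/devices/system/node"):
    """Map each NUMA node number found under sysfs_root to its CPU list."""
    result = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "cpulist")
    for path in sorted(glob.glob(pattern)):
        node = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as handle:
            result[node] = parse_cpulist(handle.read())
    return result


if __name__ == "__main__":
    for node, cpus in sorted(node_cpus().items()):
        print("node%d: %s" % (node, cpus if cpus else "(no CPUs)"))
```

On the system as first described, a check like this would be expected to show all eight CPUs under node 0 and none under node 2, which is the symptom being discussed.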
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-11 22:27 ` Andi Kleen
@ 2009-08-12 4:30 ` Bill Fink
2009-08-12 7:21 ` Andi Kleen
[not found] ` <4A856781.2080301@myri.com>
0 siblings, 2 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-12 4:30 UTC (permalink / raw)
To: Andi Kleen
Cc: Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Wed, 12 Aug 2009, Andi Kleen wrote:

> Bill Fink <billfink@mindspring.com> writes:
> >
> > I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> > but it didn't help. As mentioned in an earlier e-mail, that seems to
> > be because I discovered that doing:
> >
> > 	find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > revealed that the NUMA node associated with _all_ the PCI devices was
> > always 0, when at least some of them should have been associated with
> > NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
>
> > I discovered today that the NUMA node cpulist/cpumap is also wrong.
> > A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> > cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> > (with a cpumask of 00000000,00000000). The distance is correct,
> > with "10 20" for node 0 and "20 10" for node2.
>
> When the CPU nodes are not correct the device nodes are unlikely
> to be correct either. In fact your system likely has no node 1 configured,
> right?

That was right. There was no node 1, only nodes 0 and 2.

> This information comes from the BIOS. So either your BIOS is broken
> or you simply didn't enable NUMA mode in the BIOS, but configured
> memory interleaving.
>
> If you post dmesg output somewhere I can take a look.

I did have NUMA enabled, and memory was configured as independent
rather than interleaved.

Based on all the discussions, it seemed a good possibility that the
BIOS was broken. Today a colleague checked the SuperMicro site, and
discovered and installed a newer version of the BIOS. Things seem
better now, but not totally correct.

There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
(previously the even CPUs were on the first Xeon 5580 processor while
the odd CPUs were on the second processor).

[root@xeontest1 ~]# numastat
                           node0           node1
numa_hit                28087735        27195340
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             12065           11978
local_node              28081559        27182572
other_node                  6176           12768

[root@xeontest1 ~]# grep 'physical id' /proc/cpuinfo
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1

[root@xeontest1 ~]# cat /sys/devices/system/node/node0/cpulist
0-3
[root@xeontest1 ~]# cat /sys/devices/system/node/node1/cpulist
4-7

But _all_ the PCI devices are still just on node 0.

[root@xeontest1 ~]# find /sys -name numa_node -exec grep . {} /dev/null \;

shows numa_node is always 0.

[root@xeontest1 ~]# find /sys -name local_cpulist -exec grep . {} /dev/null \;

shows local_cpulist is always 0-3.

I now can get basically the same level of aggregate receive side
performance (55 Gbps) without my patch that I could previously get
only with my hacked workaround in the myri10ge driver. But this
still seems significantly below what I believe it should be
capable of.

BTW when I first booted the test system after upgrading the BIOS,
I got a kernel oops because it was still using my hacked myri10ge
driver, and apparently it didn't like that I was specifying to use
a then nonexistent node 2 (I was checking for success of the
alloc_pages_node() call and falling back to the original
alloc_pages() call on failure). Or it could have been on the
__alloc_skb() call where I had a similar hack for the skb allocation.

Are you still interested in me posting the dmesg output?

-Thanks

-Bill
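As a rough scripted counterpart to the `find /sys -name numa_node` check used in the message above, the Python sketch below groups PCI devices by the node number in their `numa_node` attribute. It assumes the usual sysfs layout in which each entry under /sys/bus/pci/devices exposes a `numa_node` file; the function names and the `pci_root` parameter are invented here so the logic can be pointed at a test directory.

```python
import glob
import os
from collections import defaultdict


def pci_devices_by_node(pci_root="/sys/bus/pci/devices"):
    """Group PCI device names by the integer found in their numa_node file."""
    by_node = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(pci_root, "*", "numa_node"))):
        with open(path) as handle:
            node = int(handle.read().strip())
        by_node[node].append(os.path.basename(os.path.dirname(path)))
    return dict(by_node)


def all_on_one_node(by_node):
    """True when every device reports the same node (the symptom above)."""
    return len(by_node) == 1


if __name__ == "__main__":
    grouping = pci_devices_by_node()
    for node, devices in sorted(grouping.items()):
        print("node %d: %d device(s)" % (node, len(devices)))
    if all_on_one_node(grouping):
        print("every PCI device reports the same NUMA node")
```

A node value of -1 is also possible and is kept as its own group; the patch discussed later in the thread uses -1 to mean that a bus is not tied to any particular node.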
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-12 4:30 ` Bill Fink
@ 2009-08-12 7:21 ` Andi Kleen
[not found] ` <4A856781.2080301@myri.com>
1 sibling, 0 replies; 89+ messages in thread
From: Andi Kleen @ 2009-08-12 7:21 UTC (permalink / raw)
To: Bill Fink
Cc: Andi Kleen, Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu, jbarnes

> There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
> for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
> (previously the even CPUs were on the first Xeon 5580 processor while
> the odd CPUs were on the second processor).

That might be ok, depending on how the APICs are configured. Of course
you should have the same number of CPUs on the different nodes.
Anyways, it's gone now.

>
> [root@xeontest1 ~]# numastat
>                            node0           node1
> numa_hit                28087735        27195340
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit             12065           11978
> local_node              28081559        27182572
> other_node                  6176           12768
>
> [root@xeontest1 ~]# grep 'physical id' /proc/cpuinfo
> physical id : 0
> physical id : 0
> physical id : 0
> physical id : 0
> physical id : 1
> physical id : 1
> physical id : 1
> physical id : 1
>
> [root@xeontest1 ~]# cat /sys/devices/system/node/node0/cpulist
> 0-3
> [root@xeontest1 ~]# cat /sys/devices/system/node/node1/cpulist
> 4-7
>
> But _all_ the PCI devices are still just on node 0.

Most likely you need the appended patch from linux-next. It should
probably be in .31, but I can't see it in Linus' tree, only in -next.
Jesse?

Unfortunately the patch seems to combine code movement with fixes :-(

> Are you still interested in me posting the dmesg output?

No.

-Andi

commit eaf2f454cc9a76dbe1890af6269e60fe9978a3a5
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Jul 10 14:04:30 2009 -0700

    x86/PCI: initialize PCI bus node numbers early

    The current mp_bus_to_node array is initialized only by AMD specific
    code, since AMD platforms have registers that can be used for
    determining node numbers. On new Intel platforms it's necessary to
    initialize this array as well though, otherwise all PCI node numbers
    will be 0, when in fact they should be -1 (indicating that I/O isn't
    tied to any particular node).

    So move the mp_bus_to_node code into the common PCI code, and
    initialize it early with a default value of -1. This may be
    overridden later by arch code (e.g. the AMD code).

    With this change, PCI consistent memory and other node specific
    allocations (e.g. skbuff allocs) should occur on the "current" node.
    If, for performance reasons, applications want to be bound to
    specific nodes, they should open their devices only after being
    pinned to the CPU where they'll run, for maximum locality.

    Acked-by: Yinghai Lu <yinghai@kernel.org>
    Tested-by: Jesse Brandeburg <jesse.brandeburg@gmail.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

diff --git a/arch/x86/pci/amd_bus.c b/arch/x86/pci/amd_bus.c
index 3ffa10d..572ee97 100644
--- a/arch/x86/pci/amd_bus.c
+++ b/arch/x86/pci/amd_bus.c
@@ -15,63 +15,6 @@
  * also get peer root bus resource for io,mmio
  */
 
-#ifdef CONFIG_NUMA
-
-#define BUS_NR 256
-
-#ifdef CONFIG_X86_64
-
-static int mp_bus_to_node[BUS_NR];
-
-void set_mp_bus_to_node(int busnum, int node)
-{
-	if (busnum >= 0 && busnum < BUS_NR)
-		mp_bus_to_node[busnum] = node;
-}
-
-int get_mp_bus_to_node(int busnum)
-{
-	int node = -1;
-
-	if (busnum < 0 || busnum > (BUS_NR - 1))
-		return node;
-
-	node = mp_bus_to_node[busnum];
-
-	/*
-	 * let numa_node_id to decide it later in dma_alloc_pages
-	 * if there is no ram on that node
-	 */
-	if (node != -1 && !node_online(node))
-		node = -1;
-
-	return node;
-}
-
-#else /* CONFIG_X86_32 */
-
-static unsigned char mp_bus_to_node[BUS_NR];
-
-void set_mp_bus_to_node(int busnum, int node)
-{
-	if (busnum >= 0 && busnum < BUS_NR)
-		mp_bus_to_node[busnum] = (unsigned char) node;
-}
-
-int get_mp_bus_to_node(int busnum)
-{
-	int node;
-
-	if (busnum < 0 || busnum > (BUS_NR - 1))
-		return 0;
-	node = mp_bus_to_node[busnum];
-	return node;
-}
-
-#endif /* CONFIG_X86_32 */
-
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_X86_64
 
 /*
@@ -301,11 +244,6 @@ static int __init early_fill_mp_bus_info(void)
 	u64 val;
 	u32 address;
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < BUS_NR; i++)
-		mp_bus_to_node[i] = -1;
-#endif
-
 	if (!early_pci_allowed())
 		return -1;
 
@@ -346,7 +284,7 @@ static int __init early_fill_mp_bus_info(void)
 		node = (reg >> 4) & 0x07;
 #ifdef CONFIG_NUMA
 		for (j = min_bus; j <= max_bus; j++)
-			mp_bus_to_node[j] = (unsigned char) node;
+			set_mp_bus_to_node(j, node);
 #endif
 		link = (reg >> 8) & 0x03;
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 2202b62..5db96d4 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -600,3 +600,72 @@ struct pci_bus * __devinit pci_scan_bus_with_sysdata(int busno)
 {
 	return pci_scan_bus_on_node(busno, &pci_root_ops, -1);
 }
+
+/*
+ * NUMA info for PCI busses
+ *
+ * Early arch code is responsible for filling in reasonable values here.
+ * A node id of "-1" means "use current node". In other words, if a bus
+ * has a -1 node id, it's not tightly coupled to any particular chunk
+ * of memory (as is the case on some Nehalem systems).
+ */
+#ifdef CONFIG_NUMA
+
+#define BUS_NR 256
+
+#ifdef CONFIG_X86_64
+
+static int mp_bus_to_node[BUS_NR] = {
+	[0 ... BUS_NR - 1] = -1
+};
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 && busnum < BUS_NR)
+		mp_bus_to_node[busnum] = node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node = -1;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return node;
+
+	node = mp_bus_to_node[busnum];
+
+	/*
+	 * let numa_node_id to decide it later in dma_alloc_pages
+	 * if there is no ram on that node
+	 */
+	if (node != -1 && !node_online(node))
+		node = -1;
+
+	return node;
+}
+
+#else /* CONFIG_X86_32 */
+
+static unsigned char mp_bus_to_node[BUS_NR] = {
+	[0 ... BUS_NR - 1] = -1
+};
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 && busnum < BUS_NR)
+		mp_bus_to_node[busnum] = (unsigned char) node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return 0;
+	node = mp_bus_to_node[busnum];
+	return node;
+}
+
+#endif /* CONFIG_X86_32 */
+
+#endif /* CONFIG_NUMA */

-- 
ak@linux.intel.com -- Speaking for myself only.
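To make the lookup rule in the patch above easier to follow, here is a small user-space Python model of the 64-bit variant only. It is not kernel code: the class name is invented, and the set of online nodes is passed in by hand as a stand-in for node_online(). It mirrors three behaviours visible in the diff: every bus defaults to -1, out-of-range bus numbers are ignored on set and return -1 on get, and a stored node that is not online is reported as -1.

```python
BUS_NR = 256   # same table size as in the patch
NO_NODE = -1   # "use current node" in the patch's comment


class BusNodeTable:
    """Toy model of mp_bus_to_node with its set/get helpers (64-bit case)."""

    def __init__(self, online_nodes):
        # Stand-in for node_online(); supplied by the caller in this model.
        self._online = set(online_nodes)
        # The patch initialises every entry to -1 instead of leaving it 0.
        self._table = [NO_NODE] * BUS_NR

    def set(self, busnum, node):
        """Record a node for a bus; out-of-range bus numbers are ignored."""
        if 0 <= busnum < BUS_NR:
            self._table[busnum] = node

    def get(self, busnum):
        """Return the node for a bus, or -1 if unknown or not online."""
        if busnum < 0 or busnum > BUS_NR - 1:
            return NO_NODE
        node = self._table[busnum]
        if node != NO_NODE and node not in self._online:
            node = NO_NODE
        return node


if __name__ == "__main__":
    table = BusNodeTable(online_nodes={0, 1})
    print(table.get(5))    # nothing set yet for bus 5
    table.set(5, 1)
    print(table.get(5))    # arch code has now assigned node 1
    table.set(6, 2)
    print(table.get(6))    # node 2 is not online in this model
```

The point of the default is that a bus nobody has described reads as "no particular node" rather than silently claiming node 0, which matches the all-zero numa_node values reported earlier in the thread.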
[parent not found: <4A856781.2080301@myri.com>]
* Re: Receive side performance issue with multi-10-GigE and NUMA
[not found] ` <4A856781.2080301@myri.com>
@ 2009-08-14 16:38 ` Bill Fink
2009-08-14 16:55 ` Andrew Gallatin
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-14 16:38 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: netdev

Hi Drew,

On Fri, 14 Aug 2009, Andrew Gallatin wrote:

> Hi Bill,
>
> A few questions. I was looking at the manual for the
> X8DAH+-F, and it claims to support both I/OAT and DCA.
> Do you have either or both enabled?

I did not explicitly set either one, and the manual indicates they
are both enabled by default, which I also vaguely seem to recall
was the way they were set. I'm not in at the office today so I
can't physically check.

> If yes, then
> what happens if you disable ioatdma (by setting
> net.ipv4.tcp_dma_copybreak=2147483647 with sysctl)?
> How about if you disable myri10ge's use of dca (load driver
> with myri10ge_dca=0).
>
> Do you see any changes?

Good suggestions but unfortunately it didn't help (or hurt).
It may have helped a little bit on the transmit side (I saw one
test at 102 Gbps when the previous high I had seen was 101 Gbps),
but the receive side was still at 55 Gbps.

Would there be any difference between disabling I/OAT and DCA in
the BIOS versus the myri10ge module parameter and sysctl setting?
I can try any BIOS changes on Monday.

> I'm worried about ioatdma because I've seen problems with it
> before. At least on Linux, it tends to busywait for the DMA
> to complete, which is actually slower than a memory copy in
> most cases that I've seen.
>
> I'm worried about DCA because you've shown that the BIOS is buggy,
> so the tag table could be wrong (resulting in bad prefetching hints).

The new BIOS seems to be better at setting the NUMA node info.

> I'm also worried about DCA because I've never had the chance to
> use it on a 5520 based system, and there is always the chance
> that we may be doing something wrong ourselves in the NIC firmware
> (again resulting in bad prefetching hints). Bad prefetching hints
> can cause cross-CPU chatter, and kill performance by wasting
> memory bandwidth, and dirtying a cache on another CPU
> for no reason.

Is there any easy way to monitor active memory bandwidth usage?

> Drew

-Thanks

-Bill

P.S. I don't know if it's at all significant, but one time after
a reboot that required an fsck because of exceeding the number
of mounts without an fsck, thus incurring a significant delay
in the boot process, the transmit performance dropped from
its normal ~100 Gbps to 57 Gbps (similar to the receive side
performance). Another reboot restored the normal ~100 Gbps
transmit side performance. I have no idea why this might be,
but I saw it once before when an fsck was required on boot,
so it may not be a fluke.
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-14 16:38 ` Bill Fink
@ 2009-08-14 16:55 ` Andrew Gallatin
2009-08-14 21:13 ` Aviv Greenberg
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Gallatin @ 2009-08-14 16:55 UTC (permalink / raw)
To: Bill Fink; +Cc: netdev

Bill Fink wrote:
> Hi Drew,
>
> On Fri, 14 Aug 2009, Andrew Gallatin wrote:
>
>> Hi Bill,
>>
>> A few questions. I was looking at the manual for the
>> X8DAH+-F, and it claims to support both I/OAT and DCA.
>> Do you have either or both enabled?
>
> I did not explicitly set either one, and the manual indicates they
> are both enabled by default, which I also vaguely seem to recall
> was the way they were set. I'm not in at the office today so I
> can't physically check.
>
>> If yes, then
>> what happens if you disable ioatdma (by setting
>> net.ipv4.tcp_dma_copybreak=2147483647 with sysctl)?
>> How about if you disable myri10ge's use of dca (load driver
>> with myri10ge_dca=0).
>>
>> Do you see any changes?
>
> Good suggestions but unfortunately it didn't help (or hurt).
> It may have helped a little bit on the transmit side (I saw one
> test at 102 Gbps when the previous high I had seen was 101 Gbps),
> but the receive side was still at 55 Gbps.

Darn. But it shouldn't matter at all for the transmit side...

Speaking of the send side, have you tried using
netperf -tTCP_SENDFILE rather than nuttcp to make
the transmit side zero-copy?

> Would there be any difference between disabling I/OAT and DCA in
> the BIOS versus the myri10ge module parameter and sysctl setting?
> I can try any BIOS changes on Monday.

There should not be, no.

>> I'm worried about ioatdma because I've seen problems with it
>> before. At least on Linux, it tends to busywait for the DMA
>> to complete, which is actually slower than a memory copy in
>> most cases that I've seen.
>>
>> I'm worried about DCA because you've shown that the BIOS is buggy,
>> so the tag table could be wrong (resulting in bad prefetching hints).
>
> The new BIOS seems to be better at setting the NUMA node info.
>
>> I'm also worried about DCA because I've never had the chance to
>> use it on a 5520 based system, and there is always the chance
>> that we may be doing something wrong ourselves in the NIC firmware
>> (again resulting in bad prefetching hints). Bad prefetching hints
>> can cause cross-CPU chatter, and kill performance by wasting
>> memory bandwidth, and dirtying a cache on another CPU
>> for no reason.
>
> Is there any easy way to monitor active memory bandwidth usage?

There may be something in the chipset, and there may be
CPU counters (via oprofile), but I'm not aware of what they are.

It might be interesting to run just 1/2 your test (all to, say,
NUMA node 1) and then bind some lmbench memory copy (bw_mem)
processes to NUMA node 0, and see if the lmbench slows down
(and/or is slowed down) by the ongoing network traffic.

Drew
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-14 16:55 ` Andrew Gallatin
@ 2009-08-14 21:13 ` Aviv Greenberg
2009-08-20 7:26 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Aviv Greenberg @ 2009-08-14 21:13 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Bill Fink, netdev

> There may be something in the chipset

shooting in the dark: when you lspci -vvv and check the MaxPayload and
MaxReadReq values for the myri devices - what are the values and are
they equal? Are they the same on all your platforms?
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-14 21:13 ` Aviv Greenberg
@ 2009-08-20 7:26 ` Bill Fink
2009-08-20 13:14 ` Ben Hutchings
2009-08-20 13:17 ` Aviv Greenberg
0 siblings, 2 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-20 7:26 UTC (permalink / raw)
To: Aviv Greenberg; +Cc: Andrew Gallatin, netdev

On Sat, 15 Aug 2009, Aviv Greenberg wrote:

> > There may be something in the chipset
>
> shooting in the dark: when you lspci -vvv and check the MaxPayload and
> MaxReadReq values for the myri devices - what are the values and are
> they equal? Are they the same on all your platforms?

IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
and this was the same on both the Asus and SuperMicro systems. I will
double-check tomorrow at work. I am not clear on the meanings of
the different parameters. And is DevCtl for PCI control messages
and DevCap for actual data transfers or something else?

-Thanks

-Bill
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-20 7:26 ` Bill Fink
@ 2009-08-20 13:14 ` Ben Hutchings
2009-08-21 4:00 ` Bill Fink
2009-08-20 13:17 ` Aviv Greenberg
1 sibling, 1 reply; 89+ messages in thread
From: Ben Hutchings @ 2009-08-20 13:14 UTC (permalink / raw)
To: Bill Fink; +Cc: Aviv Greenberg, Andrew Gallatin, netdev

On Thu, 2009-08-20 at 03:26 -0400, Bill Fink wrote:
> On Sat, 15 Aug 2009, Aviv Greenberg wrote:
>
> > > There may be something in the chipset
> >
> > shooting in the dark: when you lspci -vvv and check the MaxPayload and
> > MaxReadReq values for the myri devices - what are the values and are
> > they equal? Are they the same on all your platforms?
>
> IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> and was the same on both the Asus and SuperMicro systems. I will
> doublecheck tomorrow at work. I am not clear on the meanings of
> the different parameters. And is DevCtl for PCI control messages
> and DevCap for actual data transfers or something else?

DevCap is the capability register, which is read-only; DevCtl is the
control register which holds the actual settings.

MaxPayload is the MTU and MRU for PCIe packets. Each sub-tree of
devices connected to a single PCIe root port needs to have MaxPayload
set consistently. MaxReadReq is the maximum size of any DMA read
request. It is a per-device setting (or possibly per-function; I
forget). It can be much larger than MaxPayload since read completions
can be fragmented.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply [flat|nested] 89+ messages in thread
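The DevCap/DevCtl distinction described above can be checked mechanically. What follows is only a minimal sketch: the function name pcie_limits and the SAMPLE text are hypothetical, and SAMPLE assumes the usual "lspci -vvv" layout in which the DevCtl MaxPayload/MaxReadReq values sit on a continuation line. The numbers in SAMPLE are the ones recalled earlier in the thread, not captured output.

```python
import re

# "MaxPayload 128 bytes" / "MaxReadReq 4096 bytes"
FIELD = re.compile(r"(MaxPayload|MaxReadReq) (\d+) bytes")
# A line such as "\t\tDevCtl:\t..." starts a new register section.
SECTION = re.compile(r"^\s*([A-Za-z0-9]+):")


def pcie_limits(lspci_text):
    """Collect MaxPayload/MaxReadReq per register from one device's lspci -vvv text.

    DevCap holds what the device is capable of (read-only); DevCtl holds
    the value actually in effect.
    """
    limits = {"DevCap": {}, "DevCtl": {}}
    section = None
    for line in lspci_text.splitlines():
        m = SECTION.match(line)
        if m:
            section = m.group(1)
        # Continuation lines keep the section of the last "Name:" line.
        if section in limits:
            for name, size in FIELD.findall(line):
                limits[section][name] = int(size)
    return limits


# Hypothetical excerpt, using the values recalled in the thread.
SAMPLE = """\
\t\tDevCap:\tMaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tRlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
\t\t\tMaxPayload 128 bytes, MaxReadReq 4096 bytes
\t\tDevSta:\tCorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
"""

limits = pcie_limits(SAMPLE)
# The configured payload can never exceed what the device advertises.
assert limits["DevCtl"]["MaxPayload"] <= limits["DevCap"]["MaxPayload"]
```

Running the same function over the text for every device under one root port would show whether MaxPayload is set consistently across that sub-tree.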
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-20 13:14 ` Ben Hutchings
@ 2009-08-21 4:00 ` Bill Fink
0 siblings, 0 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-21 4:00 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Aviv Greenberg, Andrew Gallatin, netdev

On Thu, 20 Aug 2009, Ben Hutchings wrote:

> On Thu, 2009-08-20 at 03:26 -0400, Bill Fink wrote:
> > On Sat, 15 Aug 2009, Aviv Greenberg wrote:
> >
> > > > There may be something in the chipset
> > >
> > > shooting in the dark: when you lspci -vvv and check the MaxPayload and
> > > MaxReadReq values for the myri devices - what are the values and are
> > > they equal? Are they the same on all your platforms?
> >
> > IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> > DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> > and was the same on both the Asus and SuperMicro systems. I will
> > doublecheck tomorrow at work. I am not clear on the meanings of
> > the different parameters. And is DevCtl for PCI control messages
> > and DevCap for actual data transfers or something else?
>
> DevCap is the capability register, which is read-only; DevCtl is the
> control register which holds the actual settings.
>
> MaxPayload is the MTU and MRU for PCIe packets. Each sub-tree of
> devices connected to a single PCIe root port needs to have MaxPayload
> set consistently. MaxReadReq is the maximum size of any DMA read
> request. It is a per-device setting (or possibly per-function; I
> forget). It can be much larger than MaxPayload since read completions
> can be fragmented.

Thanks for the explanation.

I saw a BIOS setting that allowed increasing the MaxPayload from
128 bytes to 256 bytes, and then verified that an "lspci -vvv" then
showed the DevCtl MaxPayload to be 256 bytes. But unfortunately
it didn't help improve the read side performance any.

-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-20 7:26 ` Bill Fink
2009-08-20 13:14 ` Ben Hutchings
@ 2009-08-20 13:17 ` Aviv Greenberg
1 sibling, 0 replies; 89+ messages in thread
From: Aviv Greenberg @ 2009-08-20 13:17 UTC (permalink / raw)
To: Bill Fink; +Cc: Andrew Gallatin, netdev

On Thu, Aug 20, 2009 at 10:26, Bill Fink<billfink@mindspring.com> wrote:
> IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> and was the same on both the Asus and SuperMicro systems. I will
> doublecheck tomorrow at work. I am not clear on the meanings of
> the different parameters. And is DevCtl for PCI control messages
> and DevCap for actual data transfers or something else?

IIRC DevCap is what the device is capable of, and DevCtl is a control
register that is used to limit the device's PCIe MTU if needed (e.g.
chipset limit). MaxPayload is the one used for RX DMA writes, and
128 bytes might be too low. I suggest you double check that.

You have to first figure out if your performance is limited by PCIe
bandwidth, or due to the NUMA stuff.

-- 
Stephen Leacock - "I detest life-insurance agents: they always argue
that I shall some day die, which is not so." -
http://www.brainyquote.com/quotes/authors/s/stephen_leacock.html

^ permalink raw reply [flat|nested] 89+ messages in thread
* RE: Receive side performance issue with multi-10-GigE and NUMA
2009-08-11 7:32 ` Bill Fink
2009-08-11 11:02 ` Neil Horman
2009-08-11 22:27 ` Andi Kleen
@ 2009-08-12 0:02 ` Brandeburg, Jesse
2009-08-12 4:38 ` Bill Fink
2 siblings, 1 reply; 89+ messages in thread
From: Brandeburg, Jesse @ 2009-08-12 0:02 UTC (permalink / raw)
To: Bill Fink, Neil Horman
Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu, jbarnes@virtuousgeek.org

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

Bill Fink wrote:
> On Sat, 8 Aug 2009, Neil Horman wrote:
>
>> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
>>> Neil Horman wrote:
>>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>>>>> Bill Fink wrote:
>>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>>>>
>>>>>>> Bill Fink wrote:
>>>>>>>
>>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>>>>> so yes they are also wrong.

bill, I recently helped Jesse Barnes push a patch that addresses this kind
of issue on CoreI7, the root cause was the numa_node variable was
initialized based on slot on AMD systems, but needed to be set to -1 by
default on systems with a uniform IOH to slot architecture.

here is the commit ID:
http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38d674be519109696746192943a6d524019f7f

I'm not sure it is in linus' tree yet, this link is to net-next

Maybe see if it helps?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6703 bytes --]

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-12 0:02 ` Brandeburg, Jesse @ 2009-08-12 4:38 ` Bill Fink 2009-08-12 16:00 ` Jesse Barnes 2009-08-14 20:31 ` Bill Fink 0 siblings, 2 replies; 89+ messages in thread From: Bill Fink @ 2009-08-12 4:38 UTC (permalink / raw) To: Brandeburg, Jesse Cc: Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu, jbarnes@virtuousgeek.org On Tue, 11 Aug 2009, Brandeburg, Jesse wrote: > Bill Fink wrote: > > On Sat, 8 Aug 2009, Neil Horman wrote: > > > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote: > >>> Neil Horman wrote: > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote: > >>>>> Bill Fink wrote: > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote: > >>>>>> > >>>>>>> Bill Fink wrote: > >>>>>>> > >>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff), > >>>>>>>> so yes they are also wrong. > > bill, I recently helped Jesse Barnes push a patch that addresses this kind > of issue on CoreI7, the root cause was the numa_node variable was > initialized based on slot on AMD systems, but needed to be set to -1 by > default on systems with a uniform IOH to slot architecture. > > here is the commit ID: > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38 > d674be519109696746192943a6d524019f7f > > I'm not sure it is in linus' tree yet, this link is to net-next > > Maybe see if it helps? It's worth a shot. Hopefully I can get a chance to build a new kernel tomorrow to check out some of the suggestions, like this one, the setting of ACPI_DEBUG, and the new ftrace module for checking NUMA affinity of skbs. -Thanks -Bill ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-12 4:38 ` Bill Fink @ 2009-08-12 16:00 ` Jesse Barnes 2009-08-14 20:31 ` Bill Fink 1 sibling, 0 replies; 89+ messages in thread From: Jesse Barnes @ 2009-08-12 16:00 UTC (permalink / raw) To: Bill Fink Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu On Wed, 12 Aug 2009 00:38:24 -0400 Bill Fink <billfink@mindspring.com> wrote: > On Tue, 11 Aug 2009, Brandeburg, Jesse wrote: > > > Bill Fink wrote: > > > On Sat, 8 Aug 2009, Neil Horman wrote: > > > > > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote: > > >>> Neil Horman wrote: > > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin > > >>>> wrote: > > >>>>> Bill Fink wrote: > > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote: > > >>>>>> > > >>>>>>> Bill Fink wrote: > > >>>>>>> > > >>>>>>>> All sysfs local_cpus values are the same > > >>>>>>>> (00000000,000000ff), so yes they are also wrong. > > > > bill, I recently helped Jesse Barnes push a patch that addresses > > this kind of issue on CoreI7, the root cause was the numa_node > > variable was initialized based on slot on AMD systems, but needed > > to be set to -1 by default on systems with a uniform IOH to slot > > architecture. > > > > here is the commit ID: > > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38 > > d674be519109696746192943a6d524019f7f > > > > I'm not sure it is in linus' tree yet, this link is to net-next > > > > Maybe see if it helps? > > It's worth a shot. > > Hopefully I can get a chance to build a new kernel tomorrow to check > out some of the suggestions, like this one, the setting of ACPI_DEBUG, > and the new ftrace module for checking NUMA affinity of skbs. It's a fairly significant change so I wasn't planning on sending it to Linus for 2.6.31. If you think it *should* go into 2.6.31 (and stable for that matter), please let me know soon. 
Thanks, -- Jesse Barnes, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-12 4:38 ` Bill Fink
2009-08-12 16:00 ` Jesse Barnes
@ 2009-08-14 20:31 ` Bill Fink
2009-08-17 16:53 ` Jesse Barnes
1 sibling, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-14 20:31 UTC (permalink / raw)
To: Bill Fink
Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu, jbarnes@virtuousgeek.org

On Wed, 12 Aug 2009, Bill Fink wrote:

> On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:
>
> > Bill Fink wrote:
> > > On Sat, 8 Aug 2009, Neil Horman wrote:
> > >
> > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > >>> Neil Horman wrote:
> > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> > >>>>> Bill Fink wrote:
> > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > >>>>>>
> > >>>>>>> Bill Fink wrote:
> > >>>>>>>
> > >>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> > >>>>>>>> so yes they are also wrong.
> >
> > bill, I recently helped Jesse Barnes push a patch that addresses this kind
> > of issue on CoreI7, the root cause was the numa_node variable was
> > initialized based on slot on AMD systems, but needed to be set to -1 by
> > default on systems with a uniform IOH to slot architecture.
> >
> > here is the commit ID:
> > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38d674be519109696746192943a6d524019f7f
> >
> > I'm not sure it is in linus' tree yet, this link is to net-next
> >
> > Maybe see if it helps?
>
> It's worth a shot.
>
> Hopefully I can get a chance to build a new kernel tomorrow to check
> out some of the suggestions, like this one, the setting of ACPI_DEBUG,
> and the new ftrace module for checking NUMA affinity of skbs.

I applied this patch to my 2.6.29.6 kernel (from Fedora 11).

Now when I do:

	find /sys -name numa_node -exec grep . {} /dev/null \;

the numa_node for _all_ PCI devices is -1.

When I do:

	find /sys -name local_cpus -exec grep . {} /dev/null \;

I find that local_cpus is always 00000000,00000000.

Is that OK or should it be 00000000,000000ff (for my dual quad-core
Xeon 5580 system with no hyperthreading)?

Also, is it just not possible on this type of Intel Xeon system to
properly associate the PCI devices with the nearest NUMA node?

In any event, the patch didn't help (or hurt). The transmit
performance remained at ~100 Gbps while the receive performance
remained at 55 Gbps.

-Thanks

-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
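The two find commands above can also be expressed as a small script that decodes the cpumask at the same time. This is a rough sketch rather than anything from the thread: parse_cpumask and pci_numa_report are hypothetical names, and the default path /sys/bus/pci/devices is an assumption about where the per-device numa_node and local_cpus attributes are exposed.

```python
import os


def parse_cpumask(mask):
    """Turn a sysfs cpumask such as '00000000,000000ff' into a list of CPU ids."""
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def pci_numa_report(root="/sys/bus/pci/devices"):
    """Map each PCI device under root to its numa_node and local_cpus strings.

    A missing or unreadable attribute is reported as None.
    """
    report = {}
    for dev in sorted(os.listdir(root)):
        entry = {}
        for attr in ("numa_node", "local_cpus"):
            try:
                with open(os.path.join(root, dev, attr)) as f:
                    entry[attr] = f.read().strip()
            except OSError:
                entry[attr] = None
        report[dev] = entry
    return report
```

With this, parse_cpumask("00000000,000000ff") gives CPUs 0 through 7, while the all-zero mask seen after the patch decodes to an empty list, i.e. no CPU is considered local to the device.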
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-14 20:31 ` Bill Fink
@ 2009-08-17 16:53 ` Jesse Barnes
2009-08-18 7:07 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Jesse Barnes @ 2009-08-17 16:53 UTC (permalink / raw)
To: Bill Fink
Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Fri, 14 Aug 2009 16:31:55 -0400
Bill Fink <billfink@mindspring.com> wrote:

> On Wed, 12 Aug 2009, Bill Fink wrote:
>
> > On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:
> >
> > > Bill Fink wrote:
> > > > On Sat, 8 Aug 2009, Neil Horman wrote:
> > > >
> > > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin
> > > >> wrote:
> > > >>> Neil Horman wrote:
> > > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin
> > > >>>> wrote:
> > > >>>>> Bill Fink wrote:
> > > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > > >>>>>>
> > > >>>>>>> Bill Fink wrote:
> > > >>>>>>>
> > > >>>>>>>> All sysfs local_cpus values are the same
> > > >>>>>>>> (00000000,000000ff), so yes they are also wrong.
> > >
> > > bill, I recently helped Jesse Barnes push a patch that addresses
> > > this kind of issue on CoreI7, the root cause was the numa_node
> > > variable was initialized based on slot on AMD systems, but needed
> > > to be set to -1 by default on systems with a uniform IOH to slot
> > > architecture.
> > >
> > > here is the commit ID:
> > > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38d674be519109696746192943a6d524019f7f
> > >
> > > I'm not sure it is in linus' tree yet, this link is to net-next
> > >
> > > Maybe see if it helps?
> >
> > It's worth a shot.
> >
> > Hopefully I can get a chance to build a new kernel tomorrow to check
> > out some of the suggestions, like this one, the setting of
> > ACPI_DEBUG, and the new ftrace module for checking NUMA affinity of
> > skbs.
>
> I applied this patch to my 2.6.29.6 kernel (from Fedora 11).
>
> Now when I do:
>
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
>
> the numa_node for _all_ PCI devices is -1.

Yeah, that sounds right (indicates they're not really tied to a specific
node).

> When I do:
>
> 	find /sys -name local_cpus -exec grep . {} /dev/null \;
>
> I find that local_cpus is always 00000000,00000000.
>
> Is that OK or should it be 00000000,000000ff (for my dual quad-core
> Xeon 5580 system with no hyperthreading)?

Hm, yeah it probably should have the full CPU mask...

> Also, is it just not possible on this type of Intel Xeon system to
> properly associate the PCI devices with the nearest NUMA node?

All the PCI devices hang off the root complex, which is the same
distance to each node of memory (at least that's my understanding for
current platforms).

> In any event, the patch didn't help (or hurt). The transmit
> performance remained at ~100 Gbps while the receive performance
> remained at 55 Gbps.

Maybe the other Jesse has some ideas here.

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-17 16:53 ` Jesse Barnes
@ 2009-08-18 7:07 ` Bill Fink
2009-08-18 11:54 ` Andrew Gallatin
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-18 7:07 UTC (permalink / raw)
To: Jesse Barnes
Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Mon, 17 Aug 2009 09:53:02 -0700, Jesse Barnes wrote:

> On Fri, 14 Aug 2009 16:31:55 -0400
> Bill Fink <billfink@mindspring.com> wrote:
>
> Hm, yeah it probably should have the full CPU mask...
>
> > Also, is it just not possible on this type of Intel Xeon system to
> > properly associate the PCI devices with the nearest NUMA node?
>
> All the PCI devices hang off the root complex, which is the same
> distance to each node of memory (at least that's my understanding for
> current platforms).

I admit to being confused then. The basic system architecture
of the SuperMicro system is:

        Memory----CPU1----QPI----CPU2----Memory
                   |              |
                   |              |
                  QPI            QPI
                   |              |
                   |              |
                  5520----QPI----5520
                  ||||           ||||
                  ||||           ||||
                  ||||           ||||
                  PCIe           PCIe

It doesn't appear that a given PCIe device is equidistant to the
two nodes of memory. It's one QPI hop to the "local" (same side)
node, and two QPI hops to the "remote" (far side) node. But then
I don't know what a root complex is, and how it fits into the
system architecture above.

> > In any event, the patch didn't help (or hurt). The transmit
> > performance remained at ~100 Gbps while the receive performance
> > remained at 55 Gbps.
>
> Maybe the other Jesse has some ideas here.

Any and all ideas welcome. I even considered the idea that
maybe instead of transferring 9000 bytes of payload, perhaps
it was transferring the next higher power of 2, namely 16384,
since bc told me that 9000/16384*100 was 54.9316. But I tried
a test today with an MTU of 8000 and it didn't make any difference.

BTW here's a diff of an "lspci -vvvxxxx" on the better receive side
performing Asus system (<) versus on the SuperMicro system (>) for
one of the Myricom 10-GigE interfaces:

[root@xeontest1 ~]# diff -bw /tmp/foo2 /tmp/foo3
1c1
< 06:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
---
> 04:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
3c3
< Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
---
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+

I don't know what the ParErr- versus ParErr+ means.

5,9c5,9
< Latency: 0, Cache Line Size: 64 bytes
< Interrupt: pin A routed to IRQ 2277
< Region 0: Memory at da000000 (64-bit, prefetchable) [size=16M]
< Region 2: Memory at fa900000 (64-bit, non-prefetchable) [size=1M]
< Expansion ROM at fa880000 [disabled] [size=512K]
---
> Latency: 0, Cache Line Size: 256 bytes
> Interrupt: pin A routed to IRQ 121
> Region 0: Memory at f3000000 (64-bit, prefetchable) [size=16M]
> Region 2: Memory at fa300000 (64-bit, non-prefetchable) [size=1M]
> Expansion ROM at fa280000 [disabled] [size=512K]
11c11
< Address: 00000000fee0400c Data: 4183
---
> Address: 00000000fee00000 Data: 40cc
45c45
< Capabilities: [1a8] Device Serial Number b6-be-46-ff-ff-dd-60-00
---
> Capabilities: [1a8] Device Serial Number 88-be-46-ff-ff-dd-60-00

I don't see much difference other than a larger Cache Line Size
on the SuperMicro system.

47,48c47,48
< 00: c1 14 08 00 06 05 10 00 01 00 00 02 10 00 00 00
< 10: 0c 00 00 da 00 00 00 00 04 00 90 fa 00 00 00 00
---
> 00: c1 14 08 00 46 05 10 00 01 00 00 02 40 00 00 00
> 10: 0c 00 00 f3 00 00 00 00 04 00 30 fa 00 00 00 00
50,52c50,52
< 30: 00 00 88 fa 44 00 00 00 00 00 00 00 0b 01 00 00
< 40: 00 00 00 00 05 54 81 00 0c 40 e0 fe 00 00 00 00
< 50: 83 41 00 00 01 5c 03 00 00 20 00 64 10 a0 02 00
---
> 30: 00 00 28 fa 44 00 00 00 00 00 00 00 0e 01 00 00
> 40: 00 00 00 00 05 54 81 00 00 00 e0 fe 00 00 00 00
> 50: cc 40 00 00 01 5c 03 00 00 20 00 64 10 a0 02 00
73c73
< 1a0: 00 00 00 00 00 00 00 00 03 00 01 00 b6 be 46 ff
---
> 1a0: 00 00 00 00 00 00 00 00 03 00 01 00 88 be 46 ff

And here's part of the dmesg output on the Asus system:

myri10ge: Version 1.4.3-1.358
myri10ge 0000:06:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
myri10ge 0000:06:00.0: setting latency timer to 64
mtrr: type mismatch for da000000,1000000 old: write-back new: write-combining
firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:06:00.0: Not enabling ECRC on non-root port 0000:05:02.0
firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:06:00.0: MSI IRQ 2282, tx bndry 4096, fw myri10ge_eth_z8e.dat, WC Disabled

And on the SuperMicro system:

myri10ge: Version 1.4.4-1.401
alloc irq_desc for 35 on cpu 0 node 0
alloc kstat_irqs on cpu 0 node 0
myri10ge 0000:04:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
myri10ge 0000:04:00.0: setting latency timer to 64
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:04:00.0: Not enabling ECRC on non-root port 0000:03:02.0
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
alloc irq_desc for 112 on cpu 0 node 0
alloc kstat_irqs on cpu 0 node 0
myri10ge 0000:04:00.0: irq 112 for MSI/MSI-X
myri10ge 0000:04:00.0: MSI IRQ 112, tx bndry 4096, fw myri10ge_eth_z8e.dat, WC Enabled
alloc irq_desc for 24 on cpu 0 node 0
alloc kstat_irqs on cpu 0 node 0

Interestingly, the "WC Enabled" is only indicated on the first two
10-GigE interfaces and disabled on the other ten. For the Asus system
it indicates "WC Disabled" on all the interfaces, but also has that
earlier bit about "old: write-back new: write-combining", which doesn't
appear on the SuperMicro system (although that is using a slightly
newer version of the myri10ge driver).

-Thanks

-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
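The power-of-2 idea in the message above predicts different outcomes for the two MTUs, which is why the MTU 8000 run was a useful test. The following is just an illustrative sketch of that arithmetic: padded_efficiency is a hypothetical helper, and the rounding-up behaviour it models is the hypothesis being tested, not a known property of the hardware.

```python
def padded_efficiency(payload_bytes):
    """Payload fraction if every transfer were rounded up to the next power of 2.

    This models the hypothesis only; nothing in the thread confirms that
    the hardware pads transfers this way.
    """
    padded = 1 << (payload_bytes - 1).bit_length()
    return payload_bytes / padded


for mtu in (9000, 8000):
    # 9000 would pad to 16384; 8000 would pad to only 8192.
    print(mtu, f"{100 * padded_efficiency(mtu):.2f}%")
```

Under the hypothesis an MTU of 9000 gives 9000/16384, the roughly 55% figure from bc, while an MTU of 8000 gives 8000/8192, which is close to 100%. Since the measured receive rate did not change at MTU 8000, the padding explanation does not fit.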
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-18 7:07 ` Bill Fink
@ 2009-08-18 11:54 ` Andrew Gallatin
2009-08-19 17:59 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Gallatin @ 2009-08-18 11:54 UTC (permalink / raw)
To: Bill Fink
Cc: Jesse Barnes, Brandeburg, Jesse, Neil Horman, Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink wrote:

> < Latency: 0, Cache Line Size: 64 bytes

<...>

>> Latency: 0, Cache Line Size: 256 bytes

A cache line size of 256 clearly seems wrong for a Xeon. I assume all
devices on the SuperMicro show the same value?

> Interestingly, the "WC Enabled" is only indicated on the first two

The WC is probably a red herring.

What does ethtool -S show for the DMA write bandwidth of the
NICs on the SuperMicro?

These values are obtained serially, as the driver resets
the NIC (reset happens at load time, and ifconfig up),
so they could easily sum to more than the memory bandwidth
of the system. But it would be good to check for any anomalies.

I can send you a pointer to a tool we use internally, which loads
some custom firmware on the NIC, and can exercise the DMA engines
on all the NICs in parallel. This would give an idea of the
aggregate DMA bandwidth available on the system. Let me know
if you're interested.

Drew

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-18 11:54 ` Andrew Gallatin @ 2009-08-19 17:59 ` Bill Fink 0 siblings, 0 replies; 89+ messages in thread From: Bill Fink @ 2009-08-19 17:59 UTC (permalink / raw) To: Andrew Gallatin Cc: Jesse Barnes, Brandeburg, Jesse, Neil Horman, Brice Goglin, Linux Network Developers, Yinghai Lu On Tue, 18 Aug 2009, Andrew Gallatin wrote: > Bill Fink wrote: > > > < Latency: 0, Cache Line Size: 64 bytes > > <...> > > >> Latency: 0, Cache Line Size: 256 bytes > > > A cache line size of 256 clearly seems wrong for a Xeon. I assume all > devices on the SuperMicro show the same value? I forgot to check that. > > Interestingly, the "WC Enabled" is only indicated on the first two > > The WC is probably a red herring. > > What does ethtool -S show for the DMA write bandwidth of the > NICs on the SuperMicro? I've attached the full "ethtool -S" output from both the Asus and SuperMicro systems. Here's just the bandwidth info: Asus eth2: [root@i7test1 ~]# ethtool -S eth2 NIC statistics: ... read_dma_bw_MBs: 1625 write_dma_bw_MBs: 1599 read_write_dma_bw_MBs: 3192 SuperMicro eth2 (on 5520 connected to NUMA node 1): [root@xeontest1 ~]# ethtool -S eth2 NIC statistics: ... read_dma_bw_MBs: 1624 write_dma_bw_MBs: 1605 read_write_dma_bw_MBs: 1323 SuperMicro eth8 (on 5520 connected to NUMA node 0): [root@xeontest1 ~]# ethtool -S eth8 NIC statistics: ... read_dma_bw_MBs: 1572 write_dma_bw_MBs: 1605 read_write_dma_bw_MBs: 2113 > These values are obtained serially, as the driver resets > the NIC (reset happens at load time, and ifconfig up), > so they could easily sum to more than the memory bandwidth > of the system. But it would be good to check for any anomalies. > > I can send you a pointer to a tool we use internally, which loads > some custom firmware on the NIC, and can exercise the DMA engines > on all the NICs in parallel. This would give an idea of the > aggregate DMA bandwidth available on the system. 
Let me know > if you're interested. Yes, I'd be interested. -Thanks -Bill Full ethtool output: -------------------------------------------------------------------------------- Asus eth2: [root@i7test1 ~]# ethtool -S eth2 NIC statistics: rx_packets: 4 tx_packets: 10 rx_bytes: 240 tx_bytes: 708 rx_errors: 0 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_boundary: 4096 WC: 0 irq: 2282 MSI: 1 MSIX: 0 read_dma_bw_MBs: 1625 write_dma_bw_MBs: 1599 read_write_dma_bw_MBs: 3192 serial_number: 356055 watchdog_resets: 0 link_changes: 6 link_up: 1 dropped_link_overflow: 0 dropped_link_error_or_filtered: 631516 dropped_pause: 631516 dropped_bad_phy: 0 dropped_bad_crc32: 0 dropped_unicast_filtered: 0 dropped_multicast_filtered: 11 dropped_runt: 0 dropped_overrun: 0 dropped_no_small_buffer: 0 dropped_no_big_buffer: 0 ----------- slice ---------: 0 tx_pkt_start: 421736 tx_pkt_done: 421736 tx_req: 2866189 tx_done: 2866189 rx_small_cnt: 257731 rx_big_cnt: 3830824 wake_queue: 5698 stop_queue: 5698 tx_linearized: 0 LRO aggregated: 1276950 LRO flushed: 264545 LRO avg aggr: 4 LRO no_desc: 0 SuperMicro eth2 (on 5520 connected to NUMA node 1): [root@xeontest1 ~]# ethtool -S eth2 NIC statistics: rx_packets: 0 tx_packets: 10 rx_bytes: 0 tx_bytes: 708 rx_errors: 0 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_boundary: 4096 WC: 0 irq: 112 MSI: 1 MSIX: 0 read_dma_bw_MBs: 1624 write_dma_bw_MBs: 1605 read_write_dma_bw_MBs: 1323 serial_number: 363134 watchdog_resets: 0 dca_capable_firmware: 1 dca_device_present: 0 
link_changes: 2 link_up: 1 dropped_link_overflow: 0 dropped_link_error_or_filtered: 200 dropped_pause: 200 dropped_bad_phy: 0 dropped_bad_crc32: 0 dropped_unicast_filtered: 0 dropped_multicast_filtered: 0 dropped_runt: 0 dropped_overrun: 0 dropped_no_small_buffer: 0 dropped_no_big_buffer: 0 ----------- slice ---------: 0 tx_pkt_start: 440223 tx_pkt_done: 440223 tx_req: 3412102 tx_done: 3412102 rx_small_cnt: 213976 rx_big_cnt: 3071854 wake_queue: 1846 stop_queue: 1846 tx_linearized: 0 LRO aggregated: 1024029 LRO flushed: 269709 LRO avg aggr: 3 LRO no_desc: 0 SuperMicro eth8 (on 5520 connected to NUMA node 0): [root@xeontest1 ~]# ethtool -S eth8 NIC statistics: rx_packets: 11 tx_packets: 16 rx_bytes: 864 tx_bytes: 1228 rx_errors: 0 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_boundary: 4096 WC: 0 irq: 118 MSI: 1 MSIX: 0 read_dma_bw_MBs: 1572 write_dma_bw_MBs: 1605 read_write_dma_bw_MBs: 2113 serial_number: 361233 watchdog_resets: 0 dca_capable_firmware: 1 dca_device_present: 0 link_changes: 4 link_up: 1 dropped_link_overflow: 0 dropped_link_error_or_filtered: 224 dropped_pause: 224 dropped_bad_phy: 0 dropped_bad_crc32: 0 dropped_unicast_filtered: 0 dropped_multicast_filtered: 0 dropped_runt: 0 dropped_overrun: 0 dropped_no_small_buffer: 0 dropped_no_big_buffer: 0 ----------- slice ---------: 0 tx_pkt_start: 575354 tx_pkt_done: 575354 tx_req: 3590761 tx_done: 3590761 rx_small_cnt: 227078 rx_big_cnt: 4733499 wake_queue: 2199 stop_queue: 2199 tx_linearized: 0 LRO aggregated: 1578229 LRO flushed: 404901 LRO avg aggr: 3 LRO no_desc: 0 ^ permalink raw reply [flat|nested] 89+ messages in thread
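One possible way to scan the "ethtool -S" dumps above for anomalies is sketched below. The helper names are hypothetical, and comparing read_write_dma_bw_MBs against the sum of the separate read and write figures is only an assumed yardstick, not something the driver documents. The three samples reuse the bandwidth numbers quoted in the message.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from ethtool -S output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so names with spaces stay intact.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def duplex_ratio(stats):
    """Concurrent read+write DMA bandwidth relative to the separate figures."""
    separate = stats["read_dma_bw_MBs"] + stats["write_dma_bw_MBs"]
    return stats["read_write_dma_bw_MBs"] / separate


# Bandwidth lines as reported in the message for each interface.
SAMPLES = {
    "Asus eth2": "read_dma_bw_MBs: 1625\nwrite_dma_bw_MBs: 1599\nread_write_dma_bw_MBs: 3192",
    "SuperMicro eth2": "read_dma_bw_MBs: 1624\nwrite_dma_bw_MBs: 1605\nread_write_dma_bw_MBs: 1323",
    "SuperMicro eth8": "read_dma_bw_MBs: 1572\nwrite_dma_bw_MBs: 1605\nread_write_dma_bw_MBs: 2113",
}

for host, text in SAMPLES.items():
    print(host, round(duplex_ratio(parse_ethtool_stats(text)), 2))
```

By this measure the Asus interface sustains nearly the sum of its separate read and write rates, whereas both SuperMicro interfaces fall well short, with eth2 the lowest of the three.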
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink 2009-08-07 21:18 ` Brice Goglin @ 2009-08-07 22:12 ` Neil Horman 2009-08-08 0:54 ` Bill Fink 2009-08-12 23:29 ` David Miller 2 siblings, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-07 22:12 UTC (permalink / raw) To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin On Fri, Aug 07, 2009 at 05:06:00PM -0400, Bill Fink wrote: > I've run into a major receive side performance issue with multi-10-GigE > on a NUMA system. The system is using a SuperMicro X8DAH+-F motherboard > with 2 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of > 1333 MHz DDR3 memory. It is a Fedora 10 system but using the latest > 2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel > from Fedora 10). > > The test setup is: > > i7test1----(6)----xeontest1----(6)----i7test2 > 10-GigE 10-GigE > > So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total > of 12 10-GigE interfaces. eth2 through eth7 (which are on the > second Intel 5520 I/O Hub) are connected to i7test1 while > eth8 through eth13 (which are on the first Intel 5520 I/O Hub) > are connected to i7test2. > > Previous direct testing between i7test1 and i7test2 (which use an > Asus P6T6 WS Revolution motherboard) demonstrated that they could > achieve ~70 Gbps performance for either transmit or receive using > 8 10-GigE interfaces. 
> > The transmit side performance of xeontest1 is fantastic: > > [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -xc6/3 -p5012 192.168.12.11 & > n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT > n9: 11130.5320 MB / 10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT > n11: 9418.1250 MB / 10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT > n10: 9279.4758 MB / 10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT > n8: 11142.6574 MB / 10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT > n13: 9422.1492 MB / 10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT > n3: 11471.2500 MB / 10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT > n6: 9339.6354 MB / 10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT > n4: 9093.2500 MB / 10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT > n5: 9121.8367 MB / 10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT > n7: 9292.2500 MB / 10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT > n2: 11487.1150 MB / 10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT > > Aggregate performance: 100.4637 Gbps > > The problem is with the receive side performance. 
> > [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 & > n11: 6983.6359 MB / 10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT > n10: 7000.1557 MB / 10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT > n9: 2451.7206 MB / 10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT > n13: 2453.0887 MB / 10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT > n12: 2446.5303 MB / 10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT > n8: 2462.5890 MB / 10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT > n4: 2763.5091 MB / 10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT > n5: 2770.0887 MB / 10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT > n2: 1777.7277 MB / 10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT > n6: 1772.7962 MB / 10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT > n3: 1779.4535 MB / 10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT > n7: 1770.8359 MB / 10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT > > Aggregate performance: 29.9492 Gbps > > I suspected that this was because the memory being allocated by the > myri10ge driver was not being allocated on the optimum NUMA node. 
> BTW the NUMA nodes on the system are 0 and 2 instead of 0 and 1 which > is what I would have expected, but this is my first experience with > a NUMA system. > > Based upon a patch by Peter Zijlstra that I discovered through Google > searching, I tried patching the myri10ge driver to change its memory > allocation of memory pages from alloc_pages() to alloc_pages_node() > and specifying the NUMA node of the parent device of the Myricom 10-GigE > device, which IIUC should be the PCIe switch. This didn't help. > > This could be because I discovered that if I did: > > find /sys -name numa_node -exec grep . {} /dev/null \; > > that the numa_node associated with all the PCI devices was always 0, > and if IIUC then I believe some of the PCI devices should have been > associated with NUMA node 2. Perhaps this is what is causing all > the memory pages allocated by the myri10ge driver to be on NUMA > node 0, and thus causing the major performance issue. > > To kludge around this, I made a different patch to the myri10ge driver. > This time I hardcoded the NUMA node in the call to alloc_pages_node() > to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7) > and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13). > This is of course very specific to our specific system (NUMA node ids > and Myricom 10-GigE device IRQs), and is not something that would be > generically applicable. But it was useful as a test, and it did > improve the receive side performance substantially! 
> > [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 & > n5: 8221.2911 MB / 10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT > n4: 8237.9524 MB / 10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT > n11: 7935.3750 MB / 10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT > n2: 4543.1621 MB / 10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT > n10: 7916.3925 MB / 10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT > n7: 4558.4817 MB / 10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT > n13: 4390.1875 MB / 10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT > n3: 4572.6478 MB / 10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT > n6: 4564.4776 MB / 10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT > n8: 4409.8551 MB / 10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT > n9: 4412.7836 MB / 10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT > n12: 4413.4061 MB / 10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT > > Aggregate performance: 56.4703 Gbps > > This was basically double the previous receive side performance > without the patch. > > I don't know if this is fundamentally a myri10ge driver issue or > some underlying Linux kernel issue, so it's not clear to me what > a proper fix would be. 
>
> Finally, while definitely a major improvement, I think it should be
> possible to do even better, since we achieved 70 Gbps in the i7 to i7
> tests, and probably could have done 80 Gbps except for an Asus
> motherboard restriction with the interconnect between the Intel X58
> and Nvidia NF200 chips. It's definitely a big step in the right
> direction though if this issue can be resolved.
>
> Any help greatly appreciated in advance.
>
> -Thanks
>
> -Bill
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Your timing is impeccable! I just posted a patch for an ftrace module to help
detect just these kinds of conditions:
http://marc.info/?l=linux-netdev&m=124967650218846&w=2

Hope that helps you out.
Neil

^ permalink raw reply [flat|nested] 89+ messages in thread
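The "Aggregate performance" figures quoted in this thread are consistent with simply summing the per-stream Mbps values and dividing by 1000 (the twelve transmit lines add up to 100.4637 Gbps). A minimal Python sketch of that arithmetic follows. The regular expression is an assumption based only on the nuttcp summary lines quoted here, and `aggregate_gbps` and `SAMPLE` are illustrative names, not part of nuttcp.

```python
import re

# Assumed shape of a nuttcp summary line when -I<id> is used, optionally
# preceded by mail quote markers, e.g.:
# "n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX ..."
LINE_RE = re.compile(
    r"^[>\s]*(\S+):\s+[\d.]+\s+MB\s+/\s+[\d.]+\s+sec\s+=\s+([\d.]+)\s+Mbps"
)


def aggregate_gbps(output):
    """Return ({stream_id: Mbps}, total_Gbps) for all matching lines."""
    per_stream = {}
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            per_stream[match.group(1)] = float(match.group(2))
    # 1 Gbps = 1000 Mbps, the convention the thread's aggregates follow.
    return per_stream, sum(per_stream.values()) / 1000.0


# Two of the transmit-side lines from the original report.
SAMPLE = """\
n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
n9: 11130.5320 MB / 10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
"""

if __name__ == "__main__":
    streams, total = aggregate_gbps(SAMPLE)
    for name, mbps in sorted(streams.items()):
        print("%-4s %10.4f Mbps" % (name, mbps))
    print("Aggregate: %.4f Gbps" % total)
```

Feeding it the saved output of a full twelve-stream run gives the per-interface breakdown alongside the total, which makes it easier to see which interfaces fall behind.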
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 22:12 ` Neil Horman
@ 2009-08-08 0:54 ` Bill Fink
2009-08-08 1:56 ` Neil Horman
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-08 0:54 UTC (permalink / raw)
To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 7 Aug 2009, Neil Horman wrote:

> Your timing is impeccable! I just posted a patch for an ftrace module to help
> detect just these kinds of conditions:
> http://marc.info/?l=linux-netdev&m=124967650218846&w=2
>
> Hope that helps you out
> Neil

Thanks! It could be helpful. Do you have a pointer to documentation
on how to use it? And does it require the latest GIT kernel or could
it possibly be used with a 2.6.29.6 kernel?

-Bill
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 0:54 ` Bill Fink
@ 2009-08-08 1:56 ` Neil Horman
2009-08-14 20:44 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-08 1:56 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> On Fri, 7 Aug 2009, Neil Horman wrote:
>
> > Your timing is impeccable! I just posted a patch for an ftrace module to help
> > detect just these kinds of conditions:
> > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> >
> > Hope that helps you out
> > Neil
>
> Thanks! It could be helpful. Do you have a pointer to documentation
> on how to use it? And does it require the latest GIT kernel or could
> it possibly be used with a 2.6.29.6 kernel?
>
> -Bill
>

It should apply to 2.6.29.6 no problem (might take a little massaging, but not
much). No docs I'm afraid (sorry, I'm horrible about that).

Using it is easy though:

1) Patch, build and boot the kernel (make sure to have
CONFIG_SKB_SOURCES_TRACER, along with the other FTRACE requisite options)

2) mount -t debugfs nodev /sys/kernel/debug

3) cd /sys/kernel/debug/tracing

4) echo skb_sources > ./current_tracer

5) echo 1 > trace

6) cat ./trace

Step 5 clears the trace buffer. Step 6 provides you a list like this:

PID ANID CNID RXQ CCPU LEN

Where:
PID - The process receiving an skb
ANID - The node which the skb being received was allocated on
CNID - The node which the process is running when it read this skb
RXQ - The NIC receive queue that received this skb
CCPU - The cpu the process was running on when it read the skb in question
LEN - The length of the skb being received

Each entry in the list denotes a unique skb (obviously), and with a clever awk
script you can identify which nodes each process in your system is receiving
frames from, so that you can use numactl or taskset to bias that process to run
on the same node's cpus.

Note that step (6) will show a larger list each time you cat that file (as trace
records aren't removed during a read). Step 5 is what actually clears the trace
buffer and resets the list length to zero.

Hope that helps. Please feel free to email me if you have any questions.

Regards
Neil
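The per-process summary that the message above suggests doing with awk can also be sketched in Python (a swapped-in language, not the awk script the author had in mind). The sketch assumes whitespace-separated records whose first three columns are PID, ANID and CNID, as described above, and `#`-prefixed header lines; it ignores all later columns. The names `summarize_trace`, `remote_fraction` and the `SAMPLE` records are hypothetical and not taken from the patch.

```python
from collections import Counter, defaultdict


def summarize_trace(text):
    """Map each PID to a Counter of (alloc_node, consumer_node) pairs."""
    summary = defaultdict(Counter)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or '#' header line
        try:
            pid, anid, cnid = (int(f) for f in fields[:3])
        except ValueError:
            continue  # too few columns or non-numeric: not a record
        summary[pid][(anid, cnid)] += 1
    return summary


def remote_fraction(counter):
    """Fraction of skbs allocated on a node other than the reader's node."""
    total = sum(counter.values())
    remote = sum(n for (anid, cnid), n in counter.items() if anid != cnid)
    return remote / total if total else 0.0


# Hypothetical records, for illustration only (not captured output).
SAMPLE = """\
# PID ANID CNID RXQ CCPU LEN
4217 1 1 0 4 1500
4217 1 1 0 4 1500
4217 0 1 0 4 1500
4300 0 0 0 2 9000
"""

if __name__ == "__main__":
    for pid, pairs in sorted(summarize_trace(SAMPLE).items()):
        print("pid %d: %d skbs, %.0f%% allocated on a remote node"
              % (pid, sum(pairs.values()), 100 * remote_fraction(pairs)))
```

A process with a high remote fraction is a candidate for rebinding with numactl or taskset to the node where its skbs are allocated.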
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-08 1:56 ` Neil Horman
@ 2009-08-14 20:44 ` Bill Fink
2009-08-14 23:25 ` Neil Horman
0 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-14 20:44 UTC (permalink / raw)
To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 7 Aug 2009, Neil Horman wrote:

> On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > On Fri, 7 Aug 2009, Neil Horman wrote:
> >
> > > Your timing is impeccable! I just posted a patch for an ftrace module to help
> > > detect just these kinds of conditions:
> > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > >
> > > Hope that helps you out
> > > Neil
> >
> > Thanks! It could be helpful. Do you have a pointer to documentation
> > on how to use it? And does it require the latest GIT kernel or could
> > it possibly be used with a 2.6.29.6 kernel?
> >
> > -Bill
>
> It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> much).

It doesn't look like I can apply your patches to my 2.6.29.6 kernel.

For starters, there's no include/trace/events directory, so there's
no include/trace/events/skb.h. There is an include/trace/skb.h file,
but there's no TRACE_EVENT defined anywhere in the kernel.

I don't suppose it's as simple as defining (from include/linux/tracepoint.h
from Linus's GIT tree):

    #define PARAMS(args...) args

    #define TRACE_EVENT(name, proto, args, struct, assign, print) \
            DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))

So do you still think it's reasonable to try applying your patches
to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
or 2.6.31-rc6?

-Thanks

-Bill

> No docs I'm afraid (sorry, I'm horrible about that)
>
> Using it is easy though:
>
> 1) Patch, build and boot the kernel (make sure to have
> CONFIG_SKB_SOURCES_TRACER, along with the other FTRACE requisite options)
>
> 2) mount -t debugfs nodev /sys/kernel/debug
>
> 3) cd /sys/kernel/debug/tracing
>
> 4) echo skb_sources > ./current_tracer
>
> 5) echo 1 > trace
>
> 6) cat ./trace
>
> Step 5 clears the trace buffer. Step 6 provides you a list like this:
>
>
> PID ANID CNID RXQ CCPU LEN
>
>
> Where:
> PID - The process receiving an skb
> ANID - The node which the skb being received was allocated on
> CNID - The node which the process is running when it read this skb
> RXQ - The NIC receive queue that received this skb
> CCPU - The cpu the process was running on when it read the skb in question
> LEN - The length of the skb being received
>
> Each entry in the list denotes a unique skb (obviously), and with a clever awk
> script you can identify which nodes each process in your system is receiving
> frames from, so that you can use numactl or taskset to bias that process to run
> on the same node's cpus.
>
> Note that step (6) will show a larger list each time you cat that file (as trace
> records aren't removed during a read). Step 5 is what actually clears the trace
> buffer and resets the list length to zero.
>
> Hope that helps. Please feel free to email me if you have any questions.
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-14 20:44 ` Bill Fink
@ 2009-08-14 23:25 ` Neil Horman
2009-08-20 7:50 ` Bill Fink
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-14 23:25 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote:
> On Fri, 7 Aug 2009, Neil Horman wrote:
>
> > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > > On Fri, 7 Aug 2009, Neil Horman wrote:
> > >
> > > > Your timing is impeccable! I just posted a patch for an ftrace module to help
> > > > detect just these kinds of conditions:
> > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > > >
> > > > Hope that helps you out
> > > > Neil
> > >
> > > Thanks! It could be helpful. Do you have a pointer to documentation
> > > on how to use it? And does it require the latest GIT kernel or could
> > > it possibly be used with a 2.6.29.6 kernel?
> > >
> > > -Bill
> >
> > It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> > much).
>
> It doesn't look like I can apply your patches to my 2.6.29.6 kernel.
>
> For starters, there's no include/trace/events directory, so there's
> no include/trace/events/skb.h. There is an include/trace/skb.h file,
> but there's no TRACE_EVENT defined anywhere in the kernel.
>
> I don't suppose it's as simple as defining (from include/linux/tracepoint.h
> from Linus's GIT tree):
>
>     #define PARAMS(args...) args
>
>     #define TRACE_EVENT(name, proto, args, struct, assign, print) \
>             DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>
> So do you still think it's reasonable to try applying your patches
> to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
> or 2.6.31-rc6?
>
> -Thanks
>
> -Bill
>
>

I thought the trace stuff went in around 2.6.29, but I might be mistaken.
The easiest thing to do would likely be to find where in the tree those were
introduced and just apply them prior to my patches, or move to the latest
kernel if you can (at least for the purposes of testing).

Neil
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-14 23:25 ` Neil Horman @ 2009-08-20 7:50 ` Bill Fink 2009-08-20 20:19 ` Neil Horman 0 siblings, 1 reply; 89+ messages in thread From: Bill Fink @ 2009-08-20 7:50 UTC (permalink / raw) To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin On Fri, 14 Aug 2009, Neil Horman wrote: > On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote: > > On Fri, 7 Aug 2009, Neil Horman wrote: > > > > > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote: > > > > On Fri, 7 Aug 2009, Neil Horman wrote: > > > > > > > > > You're timing is impeccable! I just posted a patch for an ftrace module to help > > > > > detect just these kind of conditions: > > > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2 > > > > > > > > > > Hope that helps you out > > > > > Neil > > > > > > > > Thanks! It could be helpful. Do you have a pointer to documentation > > > > on how to use it? And does it require the latest GIT kernel or could > > > > it possibly be used with a 2.6.29.6 kernel? > > > > > > > > -Bill > > > > > > It should apply to 2.6.29.6 no problem (might take a little massaging, but not > > > much). > > > > It doesn't look like I can apply your patches to my 2.6.29.6 kernel. > > > > For starters, there's no include/trace/events directory, so there's > > no include/trace/events/skb.h. There is an include/trace/skb.h file, > > but there's no TRACE_EVENT defined anywhere in the kernel. > > > > I don't suppose it's as simple as defining (from include/linux/tracepoint.h > > from Linus's GIT tree): > > > > #define PARAMS(args...) args > > > > #define TRACE_EVENT(name, proto, args, struct, assign, print) \ > > DECLARE_TRACE(name, PARAMS(proto), PARAMS(args)) > > > > So do you still think it's reasonable to try applying your patches > > to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4 > > or 2.6.31-rc6? 
> > > > -Thanks > > > > -Bill > > > > > > > I thought the trace stuff went it around 2.6.29 but I might be mistaken. > Easiest thing to do likely would be find where in the tree those were introduced > and just apply them prior to my patches, or move to the latest kernel if you > can (at least for the purposes of testing) I finally got a 2.6.31-rc6 kernel built and had some limited success with your ftrace patches. Doing some simple ping tests I was able to verify that everything was mostly as expected regarding CPU and NUMA memory affinity, with one weird exception. eth2 through eth7, which all connect to the 5520 I/O Hub that connects to NUMA node 1, all correctly showed their allocations and consumptions on NUMA node 1. eth8 through eth13 are all connected to the 5520 I/O Hub that connects to NUMA node 0, and eth9 through eth13 all correctly reflected that on the ping ftrace tests. But eth8 showed its allocations being done on NUMA node 1 instead of the expected NUMA node 0, which just doesn't make sense since eth8 and eth9 are part of a dual-port 10-GigE Myricom NIC (and I doublechecked that all the IRQ assignments were correct). When I tried an actual nuttcp performance test, even when rate limiting to just 1 Mbps, I immediately got a kernel oops. I tried to get a crashdump via kexec/kdump, but the kexec kernel, instead of just generating a crashdump, fully booted the new kernel, which was extremely sluggish until I rebooted it through a BIOS re-init, and never produced a crashdump. I tried this several times and an immediate kernel oops was always the result (with either a TCP or UDP test). A ping test of 1000 9000-byte packets with an interval of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand worked just fine. -Thanks -Bill ^ permalink raw reply [flat|nested] 89+ messages in thread
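The "72 Mbps for 1 second" figure for the ping test above follows from payload size and send interval. A small Python sketch of the arithmetic, counting only the 9000-byte payload and ignoring ICMP, IP and Ethernet header overhead (an assumption about how the figure was derived); the function names are illustrative.

```python
def offered_rate_mbps(packet_bytes, interval_s):
    """Offered load in Mbps for one packet of packet_bytes every interval_s."""
    return packet_bytes * 8 / interval_s / 1e6


def burst_duration_s(count, interval_s):
    """Approximate time needed to send count packets at a fixed interval."""
    return count * interval_s


if __name__ == "__main__":
    # 1000 packets of 9000 bytes, one every 0.001 seconds.
    print("rate: %.1f Mbps" % offered_rate_mbps(9000, 0.001))
    print("duration: %.1f s" % burst_duration_s(1000, 0.001))
```

So the ping run that worked pushed a far higher rate than the 1 Mbps nuttcp run that crashed, which suggests the oops is not simply a matter of load.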
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-20 7:50 ` Bill Fink @ 2009-08-20 20:19 ` Neil Horman 2009-08-21 4:14 ` Bill Fink 0 siblings, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-20 20:19 UTC (permalink / raw) To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: > On Fri, 14 Aug 2009, Neil Horman wrote: > > > On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote: > > > On Fri, 7 Aug 2009, Neil Horman wrote: > > > > > > > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote: > > > > > On Fri, 7 Aug 2009, Neil Horman wrote: > > > > > > > > > > > You're timing is impeccable! I just posted a patch for an ftrace module to help > > > > > > detect just these kind of conditions: > > > > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2 > > > > > > > > > > > > Hope that helps you out > > > > > > Neil > > > > > > > > > > Thanks! It could be helpful. Do you have a pointer to documentation > > > > > on how to use it? And does it require the latest GIT kernel or could > > > > > it possibly be used with a 2.6.29.6 kernel? > > > > > > > > > > -Bill > > > > > > > > It should apply to 2.6.29.6 no problem (might take a little massaging, but not > > > > much). > > > > > > It doesn't look like I can apply your patches to my 2.6.29.6 kernel. > > > > > > For starters, there's no include/trace/events directory, so there's > > > no include/trace/events/skb.h. There is an include/trace/skb.h file, > > > but there's no TRACE_EVENT defined anywhere in the kernel. > > > > > > I don't suppose it's as simple as defining (from include/linux/tracepoint.h > > > from Linus's GIT tree): > > > > > > #define PARAMS(args...) 
args > > > > > > #define TRACE_EVENT(name, proto, args, struct, assign, print) \ > > > DECLARE_TRACE(name, PARAMS(proto), PARAMS(args)) > > > > > > So do you still think it's reasonable to try applying your patches > > > to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4 > > > or 2.6.31-rc6? > > > > > > -Thanks > > > > > > -Bill > > > > > > > > > > > I thought the trace stuff went it around 2.6.29 but I might be mistaken. > > Easiest thing to do likely would be find where in the tree those were introduced > > and just apply them prior to my patches, or move to the latest kernel if you > > can (at least for the purposes of testing) > > I finally got a 2.6.31-rc6 kernel built and had some limited success > with your ftrace patches. Doing some simple ping tests I was able to > verify that everything was mostly as expected regarding CPU and NUMA > memory affinity, with one weird exception. eth2 through eth7, which > all connect to the 5520 I/O Hub that connects to NUMA node 1, all > correctly showed their allocations and consumptions on NUMA node 1. > eth8 through eth13 are all connected to the 5520 I/O Hub that connects > to NUMA node 0, and eth9 through eth13 all correctly reflected that > on the ping ftrace tests. But eth8 showed its allocations being > done on NUMA node 1 instead of the expected NUMA node 0, which just > doesn't make sense since eth8 and eth9 are part of a dual-port 10-GigE > Myricom NIC (and I doublechecked that all the IRQ assignments were > correct). > Hmm, memory pressure on node zero causing netdev_alloc_skb to allocate on a remote node perhaps? > When I tried an actual nuttcp performance test, even when rate limiting > to just 1 Mbps, I immediately got a kernel oops. I tried to get a > crashdump via kexec/kdump, but the kexec kernel, instead of just > generating a crashdump, fully booted the new kernel, which was > extremely sluggish until I rebooted it through a BIOS re-init, > and never produced a crashdump. 
I tried this several times and > an immediate kernel oops was always the result (with either a TCP > or UDP test). A ping test of 1000 9000-byte packets with an interval > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand > worked just fine. > The sluggishness is expected, since the kdump kernel operates out of such limited memory. don't know why you booted to a full system rather than did a crash recovery. Don't suppose you got a backtrace did you? Neil > -Thanks > > -Bill > ^ permalink raw reply [flat|nested] 89+ messages in thread
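For the eth8 question above, one cross-check is to read the NUMA node that sysfs reports for each interface's PCI device, which is what the `find /sys -name numa_node` command in the original report does for all devices. A minimal Python sketch, assuming the usual `/sys/class/net/<ifname>/device/numa_node` layout; `nic_numa_nodes` is an illustrative helper, not an existing tool.

```python
import os


def nic_numa_nodes(sysfs_net="/sys/class/net"):
    """Return {ifname: numa_node} for interfaces backed by a device."""
    nodes = {}
    if not os.path.isdir(sysfs_net):
        return nodes  # not a Linux system, or sysfs is not mounted
    for ifname in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, ifname, "device", "numa_node")
        try:
            with open(path) as handle:
                nodes[ifname] = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # virtual interfaces such as lo have no device entry
    return nodes


if __name__ == "__main__":
    for ifname, node in nic_numa_nodes().items():
        # -1 generally means the platform reported no node for the device.
        print("%-8s numa_node=%d" % (ifname, node))
```

If two ports of the same dual-port NIC report the same node here but the trace shows their skbs allocated on different nodes, the difference arises at allocation time rather than in the device-to-node mapping.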
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-20 20:19 ` Neil Horman @ 2009-08-21 4:14 ` Bill Fink 2009-08-21 15:23 ` Neil Horman 0 siblings, 1 reply; 89+ messages in thread From: Bill Fink @ 2009-08-21 4:14 UTC (permalink / raw) To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin On Thu, 20 Aug 2009, Neil Horman wrote: > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: > > > When I tried an actual nuttcp performance test, even when rate limiting > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a > > crashdump via kexec/kdump, but the kexec kernel, instead of just > > generating a crashdump, fully booted the new kernel, which was > > extremely sluggish until I rebooted it through a BIOS re-init, > > and never produced a crashdump. I tried this several times and > > an immediate kernel oops was always the result (with either a TCP > > or UDP test). A ping test of 1000 9000-byte packets with an interval > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand > > worked just fine. > > The sluggishness is expected, since the kdump kernel operates out of such > limited memory. don't know why you booted to a full system rather than did a > crash recovery. Don't suppose you got a backtrace did you? There was a backtrace on the screen but I didn't have a chance to record it. BTW did anyone ever think to print the backtrace in reverse (first to some reserved memory and then output to the display) so the more interesting parts wouldn't have scrolled off the top of the screen? -Bill ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-21 4:14 ` Bill Fink @ 2009-08-21 15:23 ` Neil Horman 2009-08-21 15:36 ` Andrew Gallatin 2009-08-26 7:10 ` Bill Fink 0 siblings, 2 replies; 89+ messages in thread From: Neil Horman @ 2009-08-21 15:23 UTC (permalink / raw) To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote: > On Thu, 20 Aug 2009, Neil Horman wrote: > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: > > > > > When I tried an actual nuttcp performance test, even when rate limiting > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a > > > crashdump via kexec/kdump, but the kexec kernel, instead of just > > > generating a crashdump, fully booted the new kernel, which was > > > extremely sluggish until I rebooted it through a BIOS re-init, > > > and never produced a crashdump. I tried this several times and > > > an immediate kernel oops was always the result (with either a TCP > > > or UDP test). A ping test of 1000 9000-byte packets with an interval > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand > > > worked just fine. > > > > The sluggishness is expected, since the kdump kernel operates out of such > > limited memory. don't know why you booted to a full system rather than did a > > crash recovery. Don't suppose you got a backtrace did you? > > There was a backtrace on the screen but I didn't have a chance to > record it. BTW did anyone ever think to print the backtrace in > reverse (first to some reserved memory and then output to the display) > so the more interesting parts wouldn't have scrolled off the top of > the screen? > The real solution is to use a console to which the output doesn't scroll off the screen. Normally people use a serial console they can log, or a RAC card that they can record. Even on a regular vga monitor in text mode, you can set up the vt iirc to allow for scrolling. 
Neil > -Bill > ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-21 15:23 ` Neil Horman @ 2009-08-21 15:36 ` Andrew Gallatin 2009-08-26 7:10 ` Bill Fink 1 sibling, 0 replies; 89+ messages in thread From: Andrew Gallatin @ 2009-08-21 15:36 UTC (permalink / raw) To: Neil Horman; +Cc: Bill Fink, Linux Network Developers, brice Neil Horman wrote: > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote: >> On Thu, 20 Aug 2009, Neil Horman wrote: >> >>> On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: >>> >>>> When I tried an actual nuttcp performance test, even when rate limiting >>>> to just 1 Mbps, I immediately got a kernel oops. I tried to get a >>>> crashdump via kexec/kdump, but the kexec kernel, instead of just >>>> generating a crashdump, fully booted the new kernel, which was >>>> extremely sluggish until I rebooted it through a BIOS re-init, >>>> and never produced a crashdump. I tried this several times and >>>> an immediate kernel oops was always the result (with either a TCP >>>> or UDP test). A ping test of 1000 9000-byte packets with an interval >>>> of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand >>>> worked just fine. >>> The sluggishness is expected, since the kdump kernel operates out of such >>> limited memory. don't know why you booted to a full system rather than did a >>> crash recovery. Don't suppose you got a backtrace did you? >> There was a backtrace on the screen but I didn't have a chance to >> record it. BTW did anyone ever think to print the backtrace in >> reverse (first to some reserved memory and then output to the display) >> so the more interesting parts wouldn't have scrolled off the top of >> the screen? >> > The real solution is to use a console to which the output doesn't scroll off the > screen. Normally people use a serial console they can log, or a RAC card that > they can record. Even on a regular vga monitor in text mode, you can set up the > vt iirc to allow for scrolling. Indeed. 
Another option when setting up a serial console is not practical is netconsole. I've captured a few panics this way on machines like macs, with no serial port support (at the time). Drew ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-21 15:23 ` Neil Horman
2009-08-21 15:36 ` Andrew Gallatin
@ 2009-08-26 7:10 ` Bill Fink
2009-08-26 11:00 ` Neil Horman
1 sibling, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-26 7:10 UTC (permalink / raw)
To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 21 Aug 2009, Neil Horman wrote:

> On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > On Thu, 20 Aug 2009, Neil Horman wrote:
> >
> > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > >
> > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a
> > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > generating a crashdump, fully booted the new kernel, which was
> > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > and never produced a crashdump. I tried this several times and
> > > > an immediate kernel oops was always the result (with either a TCP
> > > > or UDP test). A ping test of 1000 9000-byte packets with an interval
> > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > worked just fine.
> > >
> > > The sluggishness is expected, since the kdump kernel operates out of such
> > > limited memory. don't know why you booted to a full system rather than did a
> > > crash recovery. Don't suppose you got a backtrace did you?
> >
> > There was a backtrace on the screen but I didn't have a chance to
> > record it. BTW did anyone ever think to print the backtrace in
> > reverse (first to some reserved memory and then output to the display)
> > so the more interesting parts wouldn't have scrolled off the top of
> > the screen?
> >
> The real solution is to use a console to which the output doesn't scroll off the
> screen. Normally people use a serial console they can log, or a RAC card that
> they can record. Even on a regular vga monitor in text mode, you can set up the
> vt iirc to allow for scrolling.

None of our Asus P6T6 systems have serial consoles. I don't know of
any RAC cards for them either, nor are there spare PCI slots available
in many cases. I wouldn't think the Shift-PageUp trick would work
with a crashed kernel, but I admit I didn't try it. I haven't checked
out netconsole yet either, but I'm not sure it would help either in a
case like this that was a network related kernel crash.

In any case, a simple kernel command line that would provide a reversed
backtrace would be a simple thing to facilitate Linux users providing
useful info to Linux kernel developers in helping to debug kernel
problems. The most useful info would still be on the screen, so it
could be transcribed or a photo image of the screen could be taken.

Fortunately, in this specific case, the SuperMicro X8DAH+-F system
does have a serial console, and after a fair amount of effort I was
able to get it to work as desired, and was able to finally capture
a backtrace of the kernel oops. BTW I believe the reason the
kexec/kdump didn't work was probably because it couldn't find
a /proc/vmcore file, although I don't know why that would be,
and the Fedora 10 /etc/init.d/kdump script will then just boot
up normally if it fails to find the /proc/vmcore file (or it's
zero size).

The following shows a simple ping test usage of the skb_sources
tracing feature:

[root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms

--- 192.168.1.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms

[root@xeontest1 tracing]# cat trace
# tracer: skb_sources
#
#       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
#        |       |       |       |       |       |       |
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500

All is as was expected.

But if I try an actual nuttcp performance test (even rate limited
to 1 Mbps), I get the following kernel oops:

[root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
PGD 337d12067 PUD 337d11067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
CPU 4
Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213
RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
Stack:
 ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
<0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
<0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
Call Trace:
 [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
 [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
 [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
 [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
 [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
 [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
 [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
 [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
 [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
 [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
 [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
 [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
 [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
 [<ffffffff810f785c>] vfs_read+0xc0/0x107
 [<ffffffff810f7971>] sys_read+0x4c/0x75
 [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
 RSP <ffff8801a5811a88>
CR2: 0000000000000038

-Thanks

-Bill

^ permalink raw reply [flat|nested] 89+ messages in thread
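A note on reading the oops above. The kernel reports a NULL pointer
dereference at address 0x38 (also the value in CR2). A small non-zero
fault address like this usually means a structure member was read
through a NULL pointer, so the address is simply the member's offset
within the structure. The sketch below only illustrates that
arithmetic. struct fake_sock and both helper functions are
hypothetical, and the layout is invented so that one member lands at
offset 0x38; it is not the real kernel structure layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT a real kernel structure: seven 8-byte
 * fields precede `net`, so `net` sits at offset 7 * 8 = 56 = 0x38. */
struct fake_sock {
	uint64_t pad[7];
	void *net;
};

/* Address the CPU would access for base->net, computed without
 * dereferencing anything. */
uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct fake_sock, net);
}

/* Heuristic only (assumes 4 KiB pages): a fault address inside the
 * first page suggests "NULL base plus a small member offset". */
int looks_like_null_deref(uintptr_t fault_addr)
{
	return fault_addr < 4096;
}
```

With a NULL base, member_address(0) yields 0x38, which has the same
shape as the address in the BUG line. Which member of which structure
was actually involved has to be read from the faulting function's
source and disassembly, not from this sketch.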
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 7:10 ` Bill Fink
@ 2009-08-26 11:00 ` Neil Horman
2009-08-26 18:08 ` Neil Horman
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-26 11:00 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> On Fri, 21 Aug 2009, Neil Horman wrote:
>
> > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > >
> > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > >
> > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a
> > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > and never produced a crashdump. I tried this several times and
> > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > or UDP test). A ping test of 1000 9000-byte packets with an interval
> > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > worked just fine.
> > > >
> > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > limited memory. don't know why you booted to a full system rather than did a
> > > > crash recovery. Don't suppose you got a backtrace did you?
> > >
> > > There was a backtrace on the screen but I didn't have a chance to
> > > record it. BTW did anyone ever think to print the backtrace in
> > > reverse (first to some reserved memory and then output to the display)
> > > so the more interesting parts wouldn't have scrolled off the top of
> > > the screen?
> > >
> > The real solution is to use a console to which the output doesn't scroll off the
> > screen. Normally people use a serial console they can log, or a RAC card that
> > they can record. Even on a regular vga monitor in text mode, you can set up the
> > vt iirc to allow for scrolling.
>
> None of our Asus P6T6 systems have serial consoles. I don't know of
> any RAC cards for them either, nor are there spare PCI slots available
> in many cases. I wouldn't think the Shift-PageUp trick would work
> with a crashed kernel, but I admit I didn't try it. I haven't checked
> out netconsole yet either, but I'm not sure it would help either in a
> case like this that was a network related kernel crash.
>
Any USB ports that you can attach a serial dongle to? That would work as well,
or, as previously mentioned, netconsole also does the trick.

> In any case, a simple kernel command line that would provide a reversed
> backtrace would be a simple thing to facilitate Linux users providing
> useful info to Linux kernel developers in helping to debug kernel
> problems. The most useful info would still be on the screen, so it
> could be transcribed or a photo image of the screen could be taken.
>
I understand what you're saying, I'm just saying there are currently several
options for you that have already solved this problem in different ways.

> Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> does have a serial console, and after a fair amount of effort I was
> able to get it to work as desired, and was able to finally capture
> a backtrace of the kernel oops. BTW I believe the reason the
> kexec/kdump didn't work was probably because it couldn't find
> a /proc/vmcore file, although I don't know why that would be,
> and the Fedora 10 /etc/init.d/kdump script will then just boot
> up normally if it fails to find the /proc/vmcore file (or it's
> zero size).
>
I take care of kdump for Fedora and RHEL. If you file a bug on this, I'd be
happy to look into it further.

> The following shows a simple ping test usage of the skb_sources
> tracing feature:
>
> [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
>
> --- 192.168.1.10 ping statistics ---
> 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
>
> [root@xeontest1 tracing]# cat trace
> # tracer: skb_sources
> #
> #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> #        |       |       |       |       |       |       |
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>
> All is as was expected.
>
> But if I try an actual nuttcp performance test (even rate limited
> to 1 Mbps), I get the following kernel oops:
>
thank you, I think I see the problem, I'll have a patch for you in just a bit

Thanks
Neil

> [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> PGD 337d12067 PUD 337d11067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> CPU 4
> Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213
> RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> Stack:
>  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> Call Trace:
>  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
>  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
>  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
>  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
>  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
>  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
>  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
>  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
>  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
>  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
>  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
>  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
>  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
>  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
>  [<ffffffff810f785c>] vfs_read+0xc0/0x107
>  [<ffffffff810f7971>] sys_read+0x4c/0x75
>  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
>  RSP <ffff8801a5811a88>
> CR2: 0000000000000038
>
> -Thanks
>
> -Bill
>

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 11:00 ` Neil Horman
@ 2009-08-26 18:08 ` Neil Horman
2009-08-26 18:15 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Neil Horman @ 2009-08-26 18:08 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
[...]
> > But if I try an actual nuttcp performance test (even rate limited
> > to 1 Mbps), I get the following kernel oops:
> >
> thank you, I think I see the problem, I'll have a patch for you in just a bit
>
> Thanks
> Neil
>
> > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
[...]

Here you go, I think this will fix your oops.

Fix NULL pointer deref in skb sources ftracer

It's possible that skb->sk will be null in this path, so we shouldn't just assume
we can pass it to sock_net

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 trace_skb_sources.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
index 40eb071..8bf518f 100644
--- a/kernel/trace/trace_skb_sources.c
+++ b/kernel/trace/trace_skb_sources.c
@@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
 	struct ring_buffer_event *event;
 	struct trace_skb_event *entry;
 	struct trace_array *tr = skb_trace;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 
 	if (!trace_skb_source_enabled)
 		return;
@@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
 	entry->event_data.rx_queue = skb->queue_mapping;
 	entry->event_data.ccpu = smp_processor_id();
 
-	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
+	if (skb->sk)
+		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
+
 	if (dev) {
 		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
 		dev_put(dev);

^ permalink raw reply related [flat|nested] 89+ messages in thread
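For readers who want to try the guard from the patch above outside the
kernel, here is a minimal userspace sketch of the same pattern: the
device pointer starts out NULL, and the lookup is attempted only when
skb->sk is set. Everything in it is a stand-in invented for
illustration. The struct definitions are cut down to the fields the
patch touches, and lookup_dev() and fill_ifname() are hypothetical
names, with lookup_dev() playing the role of dev_get_by_index(). The
reference counting done by dev_put() is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16

/* Hypothetical stand-ins for the kernel types; only the fields the
 * patch touches are modelled. */
struct net { int id; };
struct sock { struct net *net; };
struct net_device { char name[IFNAMSIZ]; };
struct sk_buff { struct sock *sk; int iif; };

/* One fake device, reachable at interface index 2. */
static struct net_device fake_eth2 = { "eth2" };

/* Stand-in for dev_get_by_index(): returns NULL for unknown indexes. */
struct net_device *lookup_dev(struct net *net, int ifindex)
{
	(void)net;
	return ifindex == 2 ? &fake_eth2 : NULL;
}

/* Mirrors the patched logic: dev is initialised to NULL and the
 * lookup runs only when skb->sk is non-NULL.  Returns 1 if an
 * interface name was copied, 0 otherwise (ifname is then empty). */
int fill_ifname(const struct sk_buff *skb, char ifname[IFNAMSIZ])
{
	struct net_device *dev = NULL;

	assert(skb != NULL && ifname != NULL);
	memset(ifname, 0, IFNAMSIZ);

	if (skb->sk)
		dev = lookup_dev(skb->sk->net, skb->iif);

	if (!dev)
		return 0;

	memcpy(ifname, dev->name, IFNAMSIZ);
	return 1;
}
```

Calling fill_ifname() with an skb whose sk is NULL takes the early
return instead of dereferencing sk, which is the case the oops hit
before the patch. Initialising dev to NULL matters as much as the
check itself: without it, the later "if (dev)" test would read an
uninitialised pointer whenever the lookup is skipped.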
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 18:08 ` Neil Horman
@ 2009-08-26 18:15 ` Ingo Molnar
2009-08-26 19:04 ` Neil Horman
2009-08-27 17:32 ` Bill Fink
2009-08-27 17:44 ` Bill Fink
2 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 18:15 UTC (permalink / raw)
To: Neil Horman; +Cc: Bill Fink, Linux Network Developers, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

[...]
> Here you go, I think this will fix your oops.
>
>
> Fix NULL pointer deref in skb sources ftracer
>
> It's possible that skb->sk will be null in this path, so we shouldn't just assume
> we can pass it to sock_net
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>  trace_skb_sources.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

OK if this is just a temporary fix until TRACE_EVENT() is done, but
we'll get rid of this and do TRACE_EVENT() before net-next-2.6 is
pushed to .32, right?

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 18:15 ` Ingo Molnar
@ 2009-08-26 19:04 ` Neil Horman
2009-08-26 19:08 ` Ingo Molnar
0 siblings, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-26 19:04 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Bill Fink, Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <nhorman@tuxdriver.com> wrote:
>
> > Here you go, I think this will fix your oops.
> >
> >
> > Fix NULL pointer deref in skb sources ftracer
> >
> > Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > we can pass it to sock_net
> >
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >
> > trace_skb_sources.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
>
> ok if this is just a temporary fix until TRACE_EVENT() is done, but
> we'll get rid of this and do TRACE_EVENT() before net-next-2.6 it's
> pushed to .32, right?
>
Not sure that the two are related. I think you meant to send this to the other
thread, didn't you?

Neil

> Ingo
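The patch body is not included in this message, only its description and diffstat. The fault address in the oops quoted earlier in the thread (CR2 = 0x38) is the kind of address one would expect when a field at a small offset is read through a NULL pointer. As a rough illustration of the guard the description calls for, here is a minimal user-space sketch. It is not the actual fix: the mock_* types, the 0x38 padding, mock_dev_name() and record_ifname() are all hypothetical stand-ins, and only the "Unknown" fallback string is taken from the tracer source posted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16

/* Mock stand-ins for kernel types; the layouts are illustrative only. */
struct mock_net {
	int id;
};

struct mock_sock {
	char pad[0x38];			/* padding picked so sk_net sits at offset 0x38 */
	struct mock_net *sk_net;
};

struct mock_skb {
	struct mock_sock *sk;		/* may be NULL, per the patch description */
	int iif;
};

/* Hypothetical device lookup: only ifindex 2 exists and is named "eth2". */
static const char *mock_dev_name(const struct mock_net *net, int ifindex)
{
	(void)net;
	return ifindex == 2 ? "eth2" : NULL;
}

/* Fill ifname without ever dereferencing a NULL skb->sk. */
static void record_ifname(const struct mock_skb *skb, char ifname[IFNAMSIZ])
{
	const char *name = NULL;

	assert(skb != NULL);
	if (skb->sk != NULL)
		name = mock_dev_name(skb->sk->sk_net, skb->iif);

	/* Fall back to "Unknown" when there is no socket or no device. */
	strncpy(ifname, name ? name : "Unknown", IFNAMSIZ - 1);
	ifname[IFNAMSIZ - 1] = '\0';
}
```

Calling record_ifname() with an skb whose sk is NULL yields "Unknown" instead of faulting, which is the behaviour the existing else branch of the tracer already provides when no device is found.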
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 19:04 ` Neil Horman
@ 2009-08-26 19:08 ` Ingo Molnar
2009-08-26 19:36 ` David Miller
2009-08-26 20:01 ` Neil Horman
0 siblings, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 19:08 UTC (permalink / raw)
To: Neil Horman, David S. Miller, Steven Rostedt, Frédéric Weisbecker
Cc: Bill Fink, Linux Network Developers, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > > Here you go, I think this will fix your oops.
> > >
> > >
> > > Fix NULL pointer deref in skb sources ftracer
> > >
> > > Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > we can pass it to sock_net
> > >
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > >
> > > trace_skb_sources.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > ok if this is just a temporary fix until TRACE_EVENT() is done, but
> > we'll get rid of this and do TRACE_EVENT() before net-next-2.6 it's
> > pushed to .32, right?
>
> Not sure that the two are related. I think you meant to send this
> to the other thread, didn't you?

Sigh, no. Please re-read the past discussions about this.
trace_skb_sources.c is a hack and should be converted to generic
tracepoints. Is there anything in it that cannot be expressed in
terms of TRACE_EVENT()?

Ingo
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 19:08 ` Ingo Molnar
@ 2009-08-26 19:36 ` David Miller
2009-08-26 19:48 ` Ingo Molnar
2009-08-26 20:01 ` Neil Horman
1 sibling, 1 reply; 89+ messages in thread
From: David Miller @ 2009-08-26 19:36 UTC (permalink / raw)
To: mingo; +Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 26 Aug 2009 21:08:30 +0200

> Sigh, no. Please re-read the past discussions about this.
> trace_skb_sources.c is a hack and should be converted to generic
> tracepoints. Is there anything in it that cannot be expressed in
> terms of TRACE_EVENT()?

Neil explained why he needed to implement it this way in his reply to
Steven Rostedt. I attach it here for your convenience.

Subject: Re: [PATCH 3/3] net: skb ftracer - Add actual ftrace code to kernel (v3)
From: Neil Horman <nhorman@tuxdriver.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Tue, 18 Aug 2009 12:39:58 -0400
User-Agent: Mutt/1.5.18 (2008-05-17)
X-Mew: tab/spc characters on Subject: are simplified.

On Mon, Aug 17, 2009 at 04:55:38PM -0400, Steven Rostedt wrote:
>
> Hi Neil!
>
> Sorry for the late reply, I've been on vacation for the last week.
>
> On Thu, 13 Aug 2009, Neil Horman wrote:
>
> > skb allocation / consumption correlator
> >
> > Add ftracer module to kernel to print out a list that correlates a process id,
> > an skb it read, and the numa nodes on wich the process was running when it was
> > read along with the numa node the skbuff was allocated on.
> >
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >
> >
> >  Makefile            |    1
> >  trace.h             |   19 ++++++
> >  trace_skb_sources.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 174 insertions(+)
> >
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index 844164d..ee5e5b1 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -49,6 +49,7 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> >  ifeq ($(CONFIG_BLOCK),y)
> >  obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> >  endif
> > +obj-$(CONFIG_SKB_SOURCES_TRACER) += trace_skb_sources.o
> >  obj-$(CONFIG_EVENT_TRACING) += trace_events.o
> >  obj-$(CONFIG_EVENT_TRACING) += trace_export.o
> >  obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 8b9f4f6..8a6281b 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -11,6 +11,7 @@
> >  #include <trace/boot.h>
> >  #include <linux/kmemtrace.h>
> >  #include <trace/power.h>
> > +#include <trace/events/skb.h>
> >
> >  #include <linux/trace_seq.h>
> >  #include <linux/ftrace_event.h>
> > @@ -40,6 +41,7 @@ enum trace_type {
> >  	TRACE_KMEM_FREE,
> >  	TRACE_POWER,
> >  	TRACE_BLK,
> > +	TRACE_SKB_SOURCE,
> >
> >  	__TRACE_LAST_TYPE,
> >  };
> > @@ -171,6 +173,21 @@ struct trace_power {
> >  	struct power_trace state_data;
> >  };
> >
> > +struct skb_record {
> > +	pid_t pid;		/* pid of the copying process */
> > +	int anid;		/* node where skb was allocated */
> > +	int cnid;		/* node to which skb was copied in userspace */
> > +	char ifname[IFNAMSIZ];	/* Name of the receiving interface */
> > +	int rx_queue;		/* The rx queue the skb was received on */
> > +	int ccpu;		/* Cpu the application got this frame from */
> > +	int len;		/* length of the data copied */
> > +};
> > +
> > +struct trace_skb_event {
> > +	struct trace_entry ent;
> > +	struct skb_record event_data;
> > +};
> > +
> >  enum kmemtrace_type_id {
> >  	KMEMTRACE_TYPE_KMALLOC = 0,	/* kmalloc() or kfree(). */
> >  	KMEMTRACE_TYPE_CACHE,		/* kmem_cache_*(). */
> > @@ -323,6 +340,8 @@ extern void __ftrace_bad_type(void);
> >  			  TRACE_SYSCALL_ENTER);		\
> >  		IF_ASSIGN(var, ent, struct syscall_trace_exit,	\
> >  			  TRACE_SYSCALL_EXIT);		\
> > +		IF_ASSIGN(var, ent, struct trace_skb_event,	\
> > +			  TRACE_SKB_SOURCE);		\
> >  		__ftrace_bad_type();			\
> >  	} while (0)
> >
> > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > new file mode 100644
> > index 0000000..4ba3671
> > --- /dev/null
> > +++ b/kernel/trace/trace_skb_sources.c
> > @@ -0,0 +1,154 @@
> > +/*
> > + * ring buffer based tracer for analyzing per-socket skb sources
> > + *
> > + * Neil Horman <nhorman@tuxdriver.com>
> > + * Copyright (C) 2009
> > + *
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/debugfs.h>
> > +#include <trace/events/skb.h>
> > +#include <linux/kallsyms.h>
> > +#include <linux/module.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/netdevice.h>
> > +#include <net/sock.h>
> > +
> > +#include "trace.h"
> > +#include "trace_output.h"
> > +
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(skb_copy_datagram_iovec);
> > +
> > +static struct trace_array *skb_trace;
> > +static int __read_mostly trace_skb_source_enabled;
> > +
> > +static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > +{
> > +	struct ring_buffer_event *event;
> > +	struct trace_skb_event *entry;
> > +	struct trace_array *tr = skb_trace;
> > +	struct net_device *dev;
> > +
> > +	if (!trace_skb_source_enabled)
> > +		return;
> > +
> > +	if (in_interrupt())
> > +		return;
>
> Is there a reason for not doing this in an interrupt?
>
Because the idea is to correlate skb consumption to a process. If we get in this
tracepoint in an interrupt, it doesn't make sense to record.

> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SKB_SOURCE,
> > +					  sizeof(*entry), 0, 0);
> > +	if (!event)
> > +		return;
> > +	entry = ring_buffer_event_data(event);
> > +
> > +	entry->event_data.pid = current->pid;
>
> Note, the trace_buffer_lock_reserve will record the current pid, thus you
> do not need to record it here.
>
> > +	entry->event_data.anid = page_to_nid(virt_to_page(skb->data));
> > +	entry->event_data.cnid = cpu_to_node(smp_processor_id());
> > +	entry->event_data.len = len;
> > +	entry->event_data.rx_queue = skb->queue_mapping;
> > +	entry->event_data.ccpu = smp_processor_id();
>
> Also, the cpu is recorded in the ring buffer. They are per cpu ring
> buffers and that determines the cpu it was recorded on.
>
> > +
> > +	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > +	if (dev) {
> > +		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > +		dev_put(dev);
> > +	} else {
> > +		strcpy(entry->event_data.ifname, "Unknown");
> > +	}
> > +
> > +	trace_buffer_unlock_commit(tr, event, 0, 0);
> > +}
> > +
> > +static int tracing_skb_source_register(void)
> > +{
> > +	int ret;
> > +
> > +	ret = register_trace_skb_copy_datagram_iovec(probe_skb_dequeue);
> > +	if (ret)
> > +		pr_info("skb source trace: Couldn't activate dequeue tracepoint");
> > +
> > +	return ret;
> > +}
> > +
> > +static void start_skb_source_trace(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 1;
> > +}
> > +
> > +static void stop_skb_source_trace(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 0;
> > +}
> > +
> > +static void skb_source_trace_reset(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 0;
> > +	unregister_trace_skb_copy_datagram_iovec(probe_skb_dequeue);
> > +}
> > +
> > +
> > +static int skb_source_trace_init(struct trace_array *tr)
> > +{
> > +	int cpu;
> > +	skb_trace = tr;
> > +
> > +	trace_skb_source_enabled = 1;
> > +	tracing_skb_source_register();
> > +
> > +	for_each_cpu(cpu, cpu_possible_mask)
> > +		tracing_reset(tr, cpu);
> > +	return 0;
> > +}
> > +
> > +static enum print_line_t skb_source_print_line(struct trace_iterator *iter)
> > +{
> > +	int ret = 0;
> > +	struct trace_entry *entry = iter->ent;
>
> iter->cpu has the cpu that trace was recorded on.
> entry->pid has the pid of the process that did the recording.
>
ok, I'll clean this up in a subsequent patch, since davem has already rolled
them in.

> > +	struct trace_skb_event *event;
> > +	struct skb_record *record;
> > +	struct trace_seq *s = &iter->seq;
> > +
> > +	trace_assign_type(event, entry);
> > +	record = &event->event_data;
> > +	if (entry->type != TRACE_SKB_SOURCE)
> > +		return TRACE_TYPE_UNHANDLED;
> > +
> > +	ret = trace_seq_printf(s, " %d %d %d %s %d %d %d\n",
> > +			record->pid,
> > +			record->anid,
> > +			record->cnid,
> > +			record->ifname,
> > +			record->rx_queue,
> > +			record->ccpu,
> > +			record->len);
> > +
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +static void skb_source_print_header(struct seq_file *s)
> > +{
> > +	seq_puts(s, "# PID ANID CNID IFC RXQ CCPU LEN\n");
> > +	seq_puts(s, "# | | | | | | |\n");
> > +}
> > +
> > +static struct tracer skb_source_tracer __read_mostly =
> > +{
> > +	.name = "skb_sources",
> > +	.init = skb_source_trace_init,
> > +	.start = start_skb_source_trace,
> > +	.stop = stop_skb_source_trace,
> > +	.reset = skb_source_trace_reset,
> > +	.print_line = skb_source_print_line,
> > +	.print_header = skb_source_print_header,
> > +};
> > +
> > +static int init_skb_source_trace(void)
> > +{
> > +	return register_tracer(&skb_source_tracer);
> > +}
> > +device_initcall(init_skb_source_trace);
> >
>
> BTW, why not just do this as events? Or was this just a easy way to
> communicate with the user space tools?
>
That's exactly why I did it. The idea is for me to now write a user space tool
that lets me analyze the events and adjust process scheduling to optimize the
rx path.

Neil

> -- Steve
>
>
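The tracer above emits one record per line in the order PID ANID CNID IFC RXQ CCPU LEN, which lends itself to simple post-processing. As a rough sketch of the sort of user-space analysis Neil mentions, and not his actual tool, the following parses one such line and reports whether the skb was allocated on a different NUMA node (ANID) than the one it was consumed on (CNID). The struct and function names are hypothetical; only the field order is taken from the trace_seq_printf() call in the patch.

```c
#include <assert.h>
#include <stdio.h>

#define IFNAMSIZ 16

/* One parsed record, mirroring the column order of the tracer output. */
struct skb_trace_line {
	int pid;
	int anid;		/* node where the skb was allocated */
	int cnid;		/* node where the consuming process ran */
	char ifname[IFNAMSIZ];
	int rx_queue;
	int ccpu;
	int len;
};

/*
 * Parse one line of trace output.
 * Returns 1 on success, 0 for header/comment lines or malformed input.
 */
static int parse_trace_line(const char *line, struct skb_trace_line *out)
{
	assert(line != NULL && out != NULL);

	/* %15s bounds the interface name to IFNAMSIZ - 1 characters. */
	return sscanf(line, "%d %d %d %15s %d %d %d",
		      &out->pid, &out->anid, &out->cnid, out->ifname,
		      &out->rx_queue, &out->ccpu, &out->len) == 7;
}

/* A record is "cross-node" when allocation and consumption nodes differ. */
static int is_cross_node(const struct skb_trace_line *rec)
{
	return rec->anid != rec->cnid;
}
```

Fed the ping trace shown earlier in the thread (" 4217 1 1 eth2 0 4 1500"), parse_trace_line() succeeds and is_cross_node() returns 0, since ANID and CNID are both 1. Header lines beginning with '#' are rejected because they do not start with an integer.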
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 19:36 ` David Miller
@ 2009-08-26 19:48 ` Ingo Molnar
2009-08-26 20:23 ` Neil Horman
2009-08-26 20:28 ` Ingo Molnar
0 siblings, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 19:48 UTC (permalink / raw)
To: David Miller
Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Wed, 26 Aug 2009 21:08:30 +0200
>
> > Sigh, no. Please re-read the past discussions about this.
> > trace_skb_sources.c is a hack and should be converted to generic
> > tracepoints. Is there anything in it that cannot be expressed in
> > terms of TRACE_EVENT()?
>
> Neil explained why he needed to implement it this way in his reply
> to Steven Rostedt. I attach it here for your convenience.

thanks. The argument is invalid:

> > BTW, why not just do this as events? Or was this just a easy way
> > to communicate with the user space tools?
>
> That's exactly why I did it. The idea is for me to now write a
> user space tool that lets me analyze the events and adjust process
> scheduling to optimize the rx path. Neil

All tooling (in fact _more_ tooling) can be done based on generic,
TRACE_EVENT() based tracepoints. Generic tracepoints are far more
available, have a generalized format with format parsers and user
tooling implemented, etc. etc.

Ingo
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 19:48 ` Ingo Molnar
@ 2009-08-26 20:23 ` Neil Horman
2009-08-26 20:40 ` Ingo Molnar
2009-08-26 23:46 ` Frederic Weisbecker
2009-08-26 20:28 ` Ingo Molnar
1 sibling, 2 replies; 89+ messages in thread
From: Neil Horman @ 2009-08-26 20:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
>
> * David Miller <davem@davemloft.net> wrote:
>
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Wed, 26 Aug 2009 21:08:30 +0200
> >
> > > Sigh, no. Please re-read the past discussions about this.
> > > trace_skb_sources.c is a hack and should be converted to generic
> > > tracepoints. Is there anything in it that cannot be expressed in
> > > terms of TRACE_EVENT()?
> >
> > Neil explained why he needed to implement it this way in his reply
> > to Steven Rostedt. I attach it here for your convenience.
>
> thanks. The argument is invalid:
>
Just because you assert that doesn't make it so, Ingo.

> > > BTW, why not just do this as events? Or was this just a easy way
> > > to communicate with the user space tools?
> >
> > That's exactly why I did it. The idea is for me to now write a
> > user space tool that lets me analyze the events and adjust process
> > scheduling to optimize the rx path. Neil
>
> All tooling (in fact _more_ tooling) can be done based on generic,
> TRACE_EVENT() based tracepoints. Generic tracepoints are far more
> available, have a generalized format with format parsers and user
> tooling implemented, etc. etc.
>
Then why allow for ftrace modules at all? I grant that the skb ftracer is a bit
trivial at the moment for an ftrace module, but I really prefer to leave it as
is so that I can expand it with additional tracepoints. And looking at them,
anything you've said above applies to any of the currently implemented ftrace
modules. If you're so adamant that we should just do everything with
TRACE_EVENT log messages, then let's get rid of the ftrace infrastructure
altogether. Until we do that, however, I like my skb tracer just as it is.

Neil

> Ingo
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 20:23 ` Neil Horman
@ 2009-08-26 20:40 ` Ingo Molnar
2009-08-26 22:39 ` Neil Horman
2009-08-26 23:46 ` Frederic Weisbecker
1 sibling, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 20:40 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> >
> > * David Miller <davem@davemloft.net> wrote:
> >
> > > From: Ingo Molnar <mingo@elte.hu>
> > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > >
> > > > Sigh, no. Please re-read the past discussions about this.
> > > > trace_skb_sources.c is a hack and should be converted to generic
> > > > tracepoints. Is there anything in it that cannot be expressed in
> > > > terms of TRACE_EVENT()?
> > >
> > > Neil explained why he needed to implement it this way in his reply
> > > to Steven Rostedt. I attach it here for your convenience.
> >
> > thanks. The argument is invalid:
>
> Just because you assert that doesn't make it so, Ingo.

I stand by that statement, the argument is invalid, for the many
reasons i outlined in my previous mails. (you'd have gotten those
same arguments had you submitted that patch to the folks who
maintain kernel/trace/)

> > > > BTW, why not just do this as events? Or was this just an easy way
> > > > to communicate with the user space tools?
> > >
> > > That's exactly why I did it. The idea is for me to now write a
> > > user space tool that lets me analyze the events and adjust process
> > > scheduling to optimize the rx path. Neil
> >
> > All tooling (in fact _more_ tooling) can be done based on generic,
> > TRACE_EVENT() based tracepoints. Generic tracepoints are far more
> > available, have a generalized format with format parsers and user
> > tooling implemented, etc. etc.
>
> Then why allow for ftrace modules at all? [...]

We routinely reject trivial plugins like yours and ask people to use
the proper mechanism: TRACE_EVENT().

We are also converting non-trivial plugins to generic tracepoints. A
recent example are the system call tracepoints, but we also
converted blktrace and kmemtrace to generic tracepoints.

But trace_skb_sources.c got committed to the networking tree,
without review and acks from the tracing folks. Now you are
unwilling to fix it and that's not very constructive.

> [...] I grant that the skb ftracer is a bit trivial at the moment
> for an ftrace module, but I really prefer to leave it as is so that I
> can expand it with additional tracepoints. And looking at them,
> anything you've said above applies to any of the currently
> implemented ftrace modules. If you're so adamant that we should
> just do everything with TRACE_EVENT log messages, then let's get
> rid of the ftrace infrastructure altogether. Until we do that,
> however, I like my skb tracer just as it is.

You don't seem to be aware of the breadth of features and capabilities
that TRACE_EVENT() based tooling allows us to do. Please see my
previous mail about an (incomplete) list.

( One item i forgot to mention there: using them you can for example
  trace full workloads, such as a kernel build - without other
  workloads mixed into that trace. Etc. etc. - the list goes on. )

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 20:40 ` Ingo Molnar
@ 2009-08-26 22:39 ` Neil Horman
2009-08-26 22:44 ` David Miller
` (3 more replies)
0 siblings, 4 replies; 89+ messages in thread
From: Neil Horman @ 2009-08-26 22:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
>
> * Neil Horman <nhorman@tuxdriver.com> wrote:
>
> > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > >
> > > * David Miller <davem@davemloft.net> wrote:
> > >
> > > > From: Ingo Molnar <mingo@elte.hu>
> > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > >
> > > > > Sigh, no. Please re-read the past discussions about this.
> > > > > trace_skb_sources.c is a hack and should be converted to generic
> > > > > tracepoints. Is there anything in it that cannot be expressed in
> > > > > terms of TRACE_EVENT()?
> > > >
> > > > Neil explained why he needed to implement it this way in his reply
> > > > to Steven Rostedt. I attach it here for your convenience.
> > >
> > > thanks. The argument is invalid:
> >
> > Just because you assert that doesn't make it so, Ingo.
>
> I stand by that statement, the argument is invalid, for the many
> reasons i outlined in my previous mails. (you'd have gotten those
> same arguments had you submitted that patch to the folks who
> maintain kernel/trace/)
>

Steven specifically told me to submit the patch to the subsystem
maintainer that I'm adding tracepoints for, and the only feedback I got
on it was his one question, the answer to which I assume satisfied him,
since there was no subsequent discussion.

I'm going to ignore your previous emails, because, despite the various
advantages of just using plain TRACE_EVENTs, you provide the ftrace
interface, and I found it useful. Your observation is correct, I like
it, and that's what I wanted to use, so I used it. If you don't want
people to use it, don't provide it.

> > > > > BTW, why not just do this as events? Or was this just an easy way
> > > > > to communicate with the user space tools?
> > > >
> > > > That's exactly why I did it. The idea is for me to now write a
> > > > user space tool that lets me analyze the events and adjust process
> > > > scheduling to optimize the rx path. Neil
> > >
> > > All tooling (in fact _more_ tooling) can be done based on generic,
> > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more
> > > available, have a generalized format with format parsers and user
> > > tooling implemented, etc. etc.
> >
> > Then why allow for ftrace modules at all? [...]
>
> We routinely reject trivial plugins like yours and ask people to use
> the proper mechanism: TRACE_EVENT().
>

Again, if you consider there to only be one proper mechanism here,
don't provide others.

> We are also converting non-trivial plugins to generic tracepoints. A
> recent example are the system call tracepoints, but we also
> converted blktrace and kmemtrace to generic tracepoints.
>

If you're getting rid of ftrace, then fine, just say so. If the
interface I chose is getting removed, I'll change it. But I'm not going
to change it just because you're going around saying my previous work
sucks. There's nothing wrong with it, it works quite well right now as
it is.

> But trace_skb_sources.c got committed to the networking tree,
> without review and acks from the tracing folks. Now you are
> unwilling to fix it and that's not very constructive.
>

I'm not willing to fix it because it's not broken. I submitted it where
Steven suggested that I submit it, and the reviews that I got were
positive. All you've told me is that you think there's a better way.
It's fine if there's a better way, but the way I have currently is
sufficient. I have actual bugs to fix. Rewriting this to suit your
opinions after the fact really isn't productive for me.

> > [...] I grant that the skb ftracer is a bit trivial at the moment
> > for an ftrace module, but I really prefer to leave it as is so that I
> > can expand it with additional tracepoints. And looking at them,
> > anything you've said above applies to any of the currently
> > implemented ftrace modules. If you're so adamant that we should
> > just do everything with TRACE_EVENT log messages, then let's get
> > rid of the ftrace infrastructure altogether. Until we do that,
> > however, I like my skb tracer just as it is.
>
> You don't seem to be aware of the breadth of features and capabilities
> that TRACE_EVENT() based tooling allows us to do. Please see my
> previous mail about an (incomplete) list.
>

Fine, I grant you that TRACE_EVENT might provide great advantages over
an ftrace module. What you seem to be missing is that an ftrace module
is sufficient for the needs of what I was tracing.

Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands.
The code I wrote works fairly well in my view, and I feel like the
review on it was both positive and sufficient for inclusion. But that's
not my call, it's yours. I can meet my own need with a raw TRACE_EVENT
for now just as easily. If you feel like the skb plugin should be
pulled, please do so, and let me know. All I ask is that you keep the
skb_copy_datagram_iovec TRACE_EVENT in place. If you pull the ftrace
plugin, I'll submit a subsequent patch to augment the printing format
so that I can gather the NUMA allocation and consumption data directly
there.

Regards
Neil

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:39 ` Neil Horman
@ 2009-08-26 22:44 ` David Miller
2009-08-26 23:05 ` Ingo Molnar
` (2 more replies)
2009-08-26 23:14 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 3 replies; 89+ messages in thread
From: David Miller @ 2009-08-26 22:44 UTC (permalink / raw)
To: nhorman; +Cc: mingo, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 26 Aug 2009 18:39:22 -0400

> Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands.

I've gotten this kind of urging, both in private and in public, from
both of you now. And I'm sorry, that's not how this works.

It is not my job to somehow force you turkeys how to work effectively
together. :-)

What we can do is ask Mr. Rostedt to assess the situation and give
his feedback.

So if Steven could give some feedback about this specific situation
that would be great and might help us move forward.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:44 ` David Miller
@ 2009-08-26 23:05 ` Ingo Molnar
2009-08-26 23:08 ` David Miller
2009-08-26 23:05 ` Steven Rostedt
2009-08-26 23:19 ` Neil Horman
2 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:05 UTC (permalink / raw)
To: David Miller
Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

* David Miller <davem@davemloft.net> wrote:

> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
>
> > Ok, I'm rather tired of arguing. Dave, I'll leave this in your
> > hands.
>
> I've gotten this kind of urging, both in private and in public,
> from both of you now. And I'm sorry, that's not how this works.
>
> It is not my job to somehow force you turkeys how to work
> effectively together. :-)

It is definitely your job to ensure that you do not commit deficient
patches to kernel/trace/ via the networking tree. You created this
situation to begin with so you might as well take some
responsibility and help resolve it.

And the thing is, we are not rigid about these things in the
tracing tree and if this was a good change we would not mind and
you'd have my Ack. The problem is that it's a crappy change and
that Neil is refusing to fix it. So please fix it,

Thanks,

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:05 ` Ingo Molnar
@ 2009-08-26 23:08 ` David Miller
2009-08-26 23:58 ` Ingo Molnar
0 siblings, 1 reply; 89+ messages in thread
From: David Miller @ 2009-08-26 23:08 UTC (permalink / raw)
To: mingo; +Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 27 Aug 2009 01:05:14 +0200

> The problem is that it's a crappy change and that Neil is
> refusing to fix it. So please fix it,

Thankfully, Steven Rostedt gave a much more useful and reasonable
response than you.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:08 ` David Miller
@ 2009-08-26 23:58 ` Ingo Molnar
2009-08-27 0:05 ` Steven Rostedt
2009-08-27 0:35 ` Christoph Hellwig
0 siblings, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:58 UTC (permalink / raw)
To: David Miller
Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Thu, 27 Aug 2009 01:05:14 +0200
>
> > And the thing is, we are not rigid about these things in the
> > tracing tree and if this was a good change we would not mind and
> > you'd have my Ack. The problem is that it's a crappy change and
> > that Neil is refusing to fix it. So please fix it,
>
> Thankfully, Steven Rostedt gave a much more useful and reasonable
> response than you.

I'm sorry you got that impression, but you are a maintainer yourself
so you might perhaps understand why sooner or later, if a
maintainer's review does not get acted upon, one has to insist on
clean patches in stronger terms.

Unfortunately you took away the "do not apply the patch" option from
me that could have avoided the stronger words and could have kept
this discussion more polite.

Thanks,

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:58 ` Ingo Molnar
@ 2009-08-27 0:05 ` Steven Rostedt
2009-08-27 0:35 ` Christoph Hellwig
1 sibling, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-08-27 0:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Miller, nhorman, fweisbec, billfink, netdev, brice, gallatin

On Thu, 27 Aug 2009, Ingo Molnar wrote:

>
> * David Miller <davem@davemloft.net> wrote:
>
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Thu, 27 Aug 2009 01:05:14 +0200
> >
> > > And the thing is, we are not rigid about these things in the
> > > tracing tree and if this was a good change we would not mind and
> > > you'd have my Ack. The problem is that it's a crappy change and
> > > that Neil is refusing to fix it. So please fix it,
> >
> > Thankfully, Steven Rostedt gave a much more useful and reasonable
> > response than you.
>
> I'm sorry you got that impression, but you are a maintainer yourself
> so you might perhaps understand why sooner or later, if a
> maintainer's review does not get acted upon, one has to insist on
> clean patches in stronger terms.
>
> Unfortunately you took away the "do not apply the patch" option from
> me that could have avoided the stronger words and could have kept
> this discussion more polite.

I feel somewhat at fault here. Neil did give me a heads up on his
project, but his patches went out when I was getting ready for vacation
and had other priorities at the time. I could have brought up these
issues before Dave took them, and he may have taken them because I did
not.

But this is all water under the bridge. Time to be more productive.

-- Steve

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:58 ` Ingo Molnar
2009-08-27 0:05 ` Steven Rostedt
@ 2009-08-27 0:35 ` Christoph Hellwig
2009-08-27 9:28 ` Ingo Molnar
1 sibling, 1 reply; 89+ messages in thread
From: Christoph Hellwig @ 2009-08-27 0:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: David Miller, nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Thu, Aug 27, 2009 at 01:58:26AM +0200, Ingo Molnar wrote:
> I'm sorry you got that impression, but you are a maintainer yourself
> so you might perhaps understand why sooner or later, if a
> maintainer's review does not get acted upon, one has to insist on
> clean patches in stronger terms.

Cool down a bit :) While I totally agree with you on all the technical
bits here I think a slightly nicer attitude towards Neil and Dave would
help the cause a lot..

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-27 0:35 ` Christoph Hellwig
@ 2009-08-27 9:28 ` Ingo Molnar
0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-27 9:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David Miller, nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

* Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Aug 27, 2009 at 01:58:26AM +0200, Ingo Molnar wrote:
>
> > I'm sorry you got that impression, but you are a maintainer
> > yourself so you might perhaps understand why sooner or later,
> > if a maintainer's review does not get acted upon, one has to
> > insist on clean patches in stronger terms.
>
> Cool down a bit :) [...]

Hello Pot, Kettle here ;-)

I guess i'll have to test the limits of your patience by queueing up
some bad commit into fs/libfs.c via say the iommu tree, without acks
and with commit log damage, which patch then triggers a build
failure and a crash in linux-next (like this one did), and refuse to
revert and not do anything substantial about your (initially polite)
review feedback for 2 weeks (like it happened here), and see how
measured your response will be after the 12th mail that gets faced
with such passive-aggressive inaction ;-)

At that point, will your wall of patience finally start to crumble a
tiny bit and will you resort to using the taboo term 'crappy patch'
perhaps, like i did here? ;-)

Anyway, as Steve said it's now finally water under the bridge, time
to move on.

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:44 ` David Miller
2009-08-26 23:05 ` Ingo Molnar
@ 2009-08-26 23:05 ` Steven Rostedt
2009-08-26 23:09 ` David Miller
2009-08-26 23:23 ` Neil Horman
2009-08-26 23:19 ` Neil Horman
2 siblings, 2 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-08-26 23:05 UTC (permalink / raw)
To: David Miller
Cc: nhorman, Ingo Molnar, Frederic Weisbecker, billfink, netdev, brice, gallatin

On Wed, 26 Aug 2009, David Miller wrote:

> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
>
> > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands.
>
> I've gotten this kind of urging, both in private and in public, from
> both of you now. And I'm sorry, that's not how this works.
>
> It is not my job to somehow force you turkeys how to work effectively
> together. :-)
>
> What we can do is ask Mr. Rostedt to assess the situation and give
> his feedback.
>
> So if Steven could give some feedback about this specific situation
> that would be great and might help us move forward.

OK, here's my thought on the matter.

How about Neil try out doing all he can with the existing TRACE_EVENT
work. I'm sure Ingo and myself would be fine with helping him with any
issues he comes up with. If there is something that he hates about it,
that really makes his user space code messy, then he can put the ball back
in our court, and Ingo and I will need to come up with a solution.

If we truly hit a show stopper, then we can always fall back to the ftrace
plugin. But until we find out for sure that TRACE_EVENT is not good
enough, then we should try that out.

How's that sound?

-- Steve

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:05 ` Steven Rostedt
@ 2009-08-26 23:09 ` David Miller
2009-08-26 23:30 ` Ingo Molnar
2009-08-26 23:23 ` Neil Horman
1 sibling, 1 reply; 89+ messages in thread
From: David Miller @ 2009-08-26 23:09 UTC (permalink / raw)
To: rostedt; +Cc: nhorman, mingo, fweisbec, billfink, netdev, brice, gallatin

From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 26 Aug 2009 19:05:20 -0400 (EDT)

> How about Neil try out doing all he can with the existing TRACE_EVENT
> work. I'm sure Ingo and myself would be fine with helping him with any
> issues he comes up with. If there is something that he hates about it,
> that really makes his user space code messy, then he can put the ball back
> in our court, and Ingo and I will need to come up with a solution.
>
> If we truly hit a show stopper, then we can always fall back to the ftrace
> plugin. But until we find out for sure that TRACE_EVENT is not good
> enough, then we should try that out.
>
> How's that sound?

That works for me, thanks Steven!

I'll revert Neil's change from net-next-2.6 and we can work on a
usable solution, long-term.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:09 ` David Miller
@ 2009-08-26 23:30 ` Ingo Molnar
0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:30 UTC (permalink / raw)
To: David Miller
Cc: rostedt, nhorman, fweisbec, billfink, netdev, brice, gallatin

* David Miller <davem@davemloft.net> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Wed, 26 Aug 2009 19:05:20 -0400 (EDT)
>
> > How about Neil try out doing all he can with the existing TRACE_EVENT
> > work. I'm sure Ingo and myself would be fine with helping him with any
> > issues he comes up with. If there is something that he hates about it,
> > that really makes his user space code messy, then he can put the ball back
> > in our court, and Ingo and I will need to come up with a solution.
> >
> > If we truly hit a show stopper, then we can always fall back to the ftrace
> > plugin. But until we find out for sure that TRACE_EVENT is not good
> > enough, then we should try that out.
> >
> > How's that sound?
>
> That works for me, thanks Steven!
>
> I'll revert Neil's change from net-next-2.6 and we can work on a
> usable solution, long-term.

thanks David!

Also, my prior offer to help out with the TRACE_EVENT conversion
stands as well, plus more TRACE_EVENT() tracepoints would be welcome
too in the networking code. It's a very useful feature and they are
a lot easier (and more decentralized) to add than new tracing
plugins.

Thanks,

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:05 ` Steven Rostedt
2009-08-26 23:09 ` David Miller
@ 2009-08-26 23:23 ` Neil Horman
2009-08-26 23:29 ` David Miller
1 sibling, 1 reply; 89+ messages in thread
From: Neil Horman @ 2009-08-26 23:23 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Miller, Ingo Molnar, Frederic Weisbecker, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 07:05:20PM -0400, Steven Rostedt wrote:
>
> On Wed, 26 Aug 2009, David Miller wrote:
>
> > From: Neil Horman <nhorman@tuxdriver.com>
> > Date: Wed, 26 Aug 2009 18:39:22 -0400
> >
> > > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands.
> >
> > I've gotten this kind of urging, both in private and in public, from
> > both of you now. And I'm sorry, that's not how this works.
> >
> > It is not my job to somehow force you turkeys how to work effectively
> > together. :-)
> >
> > What we can do is ask Mr. Rostedt to assess the situation and give
> > his feedback.
> >
> > So if Steven could give some feedback about this specific situation
> > that would be great and might help us move forward.
>
> OK, here's my thought on the matter.
>
> How about Neil try out doing all he can with the existing TRACE_EVENT
> work. I'm sure Ingo and myself would be fine with helping him with any
> issues he comes up with. If there is something that he hates about it,
> that really makes his user space code messy, then he can put the ball back
> in our court, and Ingo and I will need to come up with a solution.
>
> If we truly hit a show stopper, then we can always fall back to the ftrace
> plugin. But until we find out for sure that TRACE_EVENT is not good
> enough, then we should try that out.
>
> How's that sound?
>

Ok, that's fine by me. I really don't have any opposition to just using
raw TRACE_EVENTS for my current purposes, but as Ingo's previous mail
shows, there's _a lot_ to it. Using the ftrace interface was really, in
the end, just simpler for me. But if just using TRACE_EVENT is the way
it needs to be, so be it.

Dave, would you please revert commit 9ec04da7489d2c9ae01ea6e9b5fa313ccf3d35fb
and 5a165657bef7c47e5ff4cd138f7758ef6278e87b? That should remove the ftrace
code, and leave the TRACE_EVENT tracepoint for skb_copy_datagram_iovec in
place. I'll submit a patch in the next few days to augment the TRACE_EVENT
format to export all the data that I need.

Thanks!
Neil

> -- Steve

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 23:23 ` Neil Horman
@ 2009-08-26 23:29 ` David Miller
0 siblings, 0 replies; 89+ messages in thread
From: David Miller @ 2009-08-26 23:29 UTC (permalink / raw)
To: nhorman; +Cc: rostedt, mingo, fweisbec, billfink, netdev, brice, gallatin

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 26 Aug 2009 19:23:04 -0400

> Dave, would you please revert commit 9ec04da7489d2c9ae01ea6e9b5fa313ccf3d35fb
> and 5a165657bef7c47e5ff4cd138f7758ef6278e87b? That should remove the ftrace
> code, and leave the TRACE_EVENT tracepoint for skb_copy_datagram_iovec in
> place. I'll submit a patch in the next few days to augment the TRACE_EVENT
> format to export all the data that I need.

Ok.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:44 ` David Miller
2009-08-26 23:05 ` Ingo Molnar
2009-08-26 23:05 ` Steven Rostedt
@ 2009-08-26 23:19 ` Neil Horman
2 siblings, 0 replies; 89+ messages in thread
From: Neil Horman @ 2009-08-26 23:19 UTC (permalink / raw)
To: David Miller; +Cc: mingo, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 03:44:10PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
>
> > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands.
>
> I've gotten this kind of urging, both in private and in public, from
> both of you now. And I'm sorry, that's not how this works.
>
> It is not my job to somehow force you turkeys how to work effectively
> together. :-)
>
> What we can do is ask Mr. Rostedt to assess the situation and give
> his feedback.
>
> So if Steven could give some feedback about this specific situation
> that would be great and might help us move forward.
>

That's a fine solution by me. I'll go with Steven's decision.

Neil

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:39 ` Neil Horman
2009-08-26 22:44 ` David Miller
@ 2009-08-26 23:14 ` Ingo Molnar
2009-08-26 23:33 ` Steven Rostedt
2009-08-27 0:34 ` Christoph Hellwig
3 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:14 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > > >
> > > > * David Miller <davem@davemloft.net> wrote:
> > > >
> > > > > From: Ingo Molnar <mingo@elte.hu>
> > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > > >
> > > > > > Sigh, no. Please re-read the past discussions about this.
> > > > > > trace_skb_sources.c is a hack and should be converted to generic
> > > > > > tracepoints. Is there anything in it that cannot be expressed in
> > > > > > terms of TRACE_EVENT()?
> > > > >
> > > > > Neil explained why he needed to implement it this way in his reply
> > > > > to Steven Rostedt. I attach it here for your convenience.
> > > >
> > > > thanks. The argument is invalid:
> > >
> > > Just because you assert that doesn't make it so, Ingo.
> >
> > I stand by that statement, the argument is invalid, for the many
> > reasons i outlined in my previous mails. (you'd have gotten those
> > same arguments had you submitted that patch to the folks who
> > maintain kernel/trace/)
>
> Steven specifically told me to submit the patch to the subsystem
> maintainer that I'm adding tracepoints for, and the only feedback
> I got on it was his one question, the answer to which I assume
> satisfied him, since there was no subsequent discussion.

I don't speak for Steve but i cannot imagine him suggesting to you to
add a new plugin to kernel/trace/. 'adding tracepoints' is a
shortcut for TRACE_EVENT() these days. Those are fundamentally
decentralized indeed - but that's not what you used.

> I'm going to ignore your previous emails, because, despite the
> various advantages of just using plain TRACE_EVENTs, you
> provide the ftrace interface, and I found it useful. Your
> observation is correct, I like it, and that's what I wanted to use,
> so I used it. If you don't want people to use it, don't provide
> it.

This might be convenient to you, but that's not how kernel
maintenance works. By your argument it would be fine for me to add a
new networking protocol to net/ and ignore the objections from
networking maintainers, with the argument that 'you provided
protocol interfaces and i just made use of it and like it'?

> > > > > > BTW, why not just do this as events? Or was this just an easy way
> > > > > > to communicate with the user space tools?
> > > > >
> > > > > That's exactly why I did it. The idea is for me to now write a
> > > > > user space tool that lets me analyze the events and adjust process
> > > > > scheduling to optimize the rx path. Neil
> > > >
> > > > All tooling (in fact _more_ tooling) can be done based on generic,
> > > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more
> > > > available, have a generalized format with format parsers and user
> > > > tooling implemented, etc. etc.
> > >
> > > Then why allow for ftrace modules at all? [...]
> >
> > We routinely reject trivial plugins like yours and ask people to
> > use the proper mechanism: TRACE_EVENT().
>
> Again, if you consider there to only be one proper mechanism here,
> don't provide others.

We don't provide them. kernel/trace/ is an internal directory to the
tracing subsystem.

> > We are also converting non-trivial plugins to generic tracepoints. A
> > recent example are the system call tracepoints, but we also
> > converted blktrace and kmemtrace to generic tracepoints.
>
> If you're getting rid of ftrace, then fine, just say so. If the
> interface I chose is getting removed, I'll change it. But I'm not
> going to change it just because you're going around saying my
> previous work sucks. There's nothing wrong with it, it works quite
> well right now as it is.

where did i say that we are getting rid of ftrace? We are not
getting rid of it.

> > But trace_skb_sources.c got committed to the networking tree,
> > without review and acks from the tracing folks. Now you are
> > unwilling to fix it and that's not very constructive.
>
> I'm not willing to fix it because it's not broken. I submitted it
> where Steven suggested that I submit it, and the reviews that I
> got were positive. All you've told me is that you think there's a
> better way. It's fine if there's a better way, but the way I have
> currently is sufficient. I have actual bugs to fix. Rewriting
> this to suit your opinions after the fact really isn't productive
> for me.

No, you should do it differently because 1) it's in the wrong tree
2) the maintainers of this code asked you to do that.

We'd never have committed your patch to the tracing tree - it was
David's mistake to commit it.

Btw., commit 9ec04da74 lacks Steve's ack and has an ugly diffstat
mixed into the commit log:

    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

     Makefile            |    1
     trace.h             |   19 ++++++
     trace_skb_sources.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
     3 files changed, 174 insertions(+)

    Signed-off-by: David S. Miller <davem@davemloft.net>

> > > [...] I grant that the skb ftracer is a bit trivial at the moment
> > > for an ftrace module, but I really prefer to leave it as is so that I
> > > can expand it with additional tracepoints. And looking at them,
> > > anything you've said above applies to any of the currently
> > > implemented ftrace modules. If you're so adamant that we should
> > > just do everything with TRACE_EVENT log messages, then let's get
> > > rid of the ftrace infrastructure altogether. Until we do that,
> > > however, I like my skb tracer just as it is.
> >
> > You don't seem to be aware of the breadth of features and capabilities
> > that TRACE_EVENT() based tooling allows us to do. Please see my
> > previous mail about an (incomplete) list.
>
> Fine, I grant you that TRACE_EVENT might provide great advantages
> over an ftrace module. What you seem to be missing is that an
> ftrace module is sufficient for the needs of what I was tracing.
>
> Ok, I'm rather tired of arguing. Dave, I'll leave this in your
> hands. The code I wrote works fairly well in my view, and I feel
> like the review on it was both positive and sufficient for
> inclusion. But that's not my call, it's yours. I can meet my own
> need with a raw TRACE_EVENT for now just as easily. If you feel
> like the skb plugin should be pulled, please do so, and let me
> know. All I ask is that you keep the skb_copy_datagram_iovec
> TRACE_EVENT in place. If you pull the ftrace plugin, I'll submit
> a subsequent patch to augment the printing format so that I can
> gather the NUMA allocation and consumption data directly there.

Well, David is not maintaining kernel/trace/ last i checked, so i'm
puzzled why you leave it 'in the hands' of him.

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-26 22:39 ` Neil Horman 2009-08-26 22:44 ` David Miller 2009-08-26 23:14 ` Ingo Molnar @ 2009-08-26 23:33 ` Steven Rostedt 2009-08-27 0:14 ` Neil Horman 2009-08-27 0:34 ` Christoph Hellwig 3 siblings, 1 reply; 89+ messages in thread From: Steven Rostedt @ 2009-08-26 23:33 UTC (permalink / raw) To: Neil Horman Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin On Wed, 26 Aug 2009, Neil Horman wrote: > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote: > > > > * Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote: > > > > > > > > * David Miller <davem@davemloft.net> wrote: > > > > > > > > > From: Ingo Molnar <mingo@elte.hu> > > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200 > > > > > > > > > > > Sigh, no. Please re-read the past discussions about this. > > > > > > trace_skb_sources.c is a hack and should be converted to generic > > > > > > tracepoints. Is there anything in it that cannot be expressed in > > > > > > terms of TRACE_EVENT()? > > > > > > > > > > Neil explained why he needed to implement it this way in his reply > > > > > to Steven Rostedt. I attach it here for your convenience. > > > > > > > > thanks. The argument is invalid: > > > > > > Just because you assert that doesn't make it so, Ingo. > > > > I stand by that statement, the argument is invalid, for the many > > reasons i outlined in my previous mails. (you'd have gotten those > > same arguments had you submitted that patch to the folks who > > maintain kernel/trace/) > > > Steven specifically told me to submit the patch to the subsystem maintainer that > I'm adding tracepoints for, and the only feedback I got on it was his one > question, the answer to which I assume satisfied him, due to that there was no > subseuqent discussion. 
I'm going to ignore your previous emails, because, > despite the various advantages of just using plain TRACE_EVENTs because you > provide the ftrace interface, and I found it useful. Your observation is > correct, I like it, and thats what I wanted to use, so I used it. If you don't > want people to use it, don't provide it. Actually, I suggested to submit it to the subsystem maintainer if there was no changes to the tracing infrastructure. We may have just had a misunderstanding there. No biggy. Yes, the plugins are there for the tracers that are not really events. Those are the latency tracers (they have a double trace buffer for recording maxes), the function tracers (they are a separate beast themselves). The other tracers are on their way to being obsoleted. Ideally the only plugins we should have are: function, function_graph, mmiotrace, wakeup_rt, wakeup, irqsoff, preemptoff, preemptirqsoff. The mmiotrace is a neat thing that traps calls of binary drivers to their devices, and traces what is written and read. Thus, the plugins are reserved for the off the wall type of tracing. Not something that can easily be accomplished with tracepoints. Currently sched_switch is still there, because the recording of task->comm's is associated with that tracer, and until we remove that binding, it will stay. But expect it to eventually disappear too. > > > > > > > BTW, why not just do this as events? Or was this just a easy way > > > > > > to communicate with the user space tools? > > > > > > > > > > Thats exactly why I did it. the idea is for me to now write a > > > > > user space tool that lets me analyze the events and ajust process > > > > > scheduling to optimize the rx path. Neil > > > > > > > > All tooling (in fact _more_ tooling) can be done based on generic, > > > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more > > > > available, have a generalized format with format parsers and user > > > > tooling implemented, etc. etc. 
> > > > > > Then why allow for ftrace modules at all? [...] > > > > We routinely reject trivial plugins like yours and ask people to use > > the proper mechanism: TRACE_EVENT(). > > > Again, if you consider there to only be one proper mechanism here, don't provide > others. I guess the issue is that the plugins were there first, and that we did what trace events do today with the plugins. When TRACE_EVENT became mature, it obsoleted a lot of the plugins. Thus we are trying to get rid of them. But for those tracers that do not do events, then we still need the plugin facility. > > > We are also converting non-trivial plugins to generic tracepoints. A > > recent example are the system call tracepoints, but we also > > converted blktrace and kmemtrace to generic tracepoints. > > > If you're getting rid of ftrace, then fine, just say so. If the interface I > chose is getting removed, I'll change it. But I'm not going to change it just > because you're going around saying my previous work sucks. Theres nothing wrong > with it, it works quite well right now as it is. It does not suck, but it's "old school" ;-) > > > But trace_skb_sources.c got committed to the networking tree, > > without review and acks from the tracing folks. Now you are > > unwilling to fix it and that's not very constructive. > > > I'm not willing to fix it because its not broken. I submitted it where steven > suggested that I submitted it, and the reviews that I got were positive. All > you've told me is that you think theres a better way. Its fine if theres a > better way, but the way I have currently is sufficient. I have acutal bugs to > fix. Rewriting this to suit your opinions after the fact really isn't > productive for me. I feel guilty here. I misunderstood the scope of your changes, and did not realize you were adding a plugin. > > > > [...] 
I grant that the skb ftracer is a bit trivial at the moment > > > for an ftrace module, but I really prefer to leave it is so that I > > > can expand it with additional tracepoints. And looking at them, > > > anything you've said above applies to any of the currently > > > implemented ftrace modules. If you're so adamant that we should > > > just do everything with TRACE_EVENT log messages, then lets get > > > rid of the ftrace infrastructure all together. Until we do that, > > > however, I like my skb tracer just as it is. > > > > You dont seem to be aware of the breath of features and capabilities > > that TRACE_EVENT() based tooling allows us to do. Please see my > > previous mail about an (incomplete) list. > > > Fine, I grant you that TRACE_EVENT might provide great advantages over an ftrace > module. What you seem to be missing is that an ftrace module is sufficnet for > the needs of what I was tracing. > > > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands. The code > I wrote works fairly well in my view, and I feel like the review on it was both > positive and sufficent for inclusion. But thats not my call, its yours. I can > meet my own need with a raw TRACE_EVENT for now just as easily. IF you feel > like the skb plugin should be pulled, please do so, and let me know. All I ask > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place. If you pull > the ftrace plugin, I'll submit a subsequent patch to agument the printing format > so that I can gather the numa allocation and consumption data directly there. Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-). You probably already read my previous email on the matter. Don't delete your plugin patch until we get everything you need with TRACE_EVENT alone. Thanks, -- Steve ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-26 23:33 ` Steven Rostedt @ 2009-08-27 0:14 ` Neil Horman 2009-08-27 0:29 ` Steven Rostedt 0 siblings, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-27 0:14 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin On Wed, Aug 26, 2009 at 07:33:55PM -0400, Steven Rostedt wrote: > > On Wed, 26 Aug 2009, Neil Horman wrote: > > > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote: > > > > > > * Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote: > > > > > > > > > > * David Miller <davem@davemloft.net> wrote: > > > > > > > > > > > From: Ingo Molnar <mingo@elte.hu> > > > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200 > > > > > > > > > > > > > Sigh, no. Please re-read the past discussions about this. > > > > > > > trace_skb_sources.c is a hack and should be converted to generic > > > > > > > tracepoints. Is there anything in it that cannot be expressed in > > > > > > > terms of TRACE_EVENT()? > > > > > > > > > > > > Neil explained why he needed to implement it this way in his reply > > > > > > to Steven Rostedt. I attach it here for your convenience. > > > > > > > > > > thanks. The argument is invalid: > > > > > > > > Just because you assert that doesn't make it so, Ingo. > > > > > > I stand by that statement, the argument is invalid, for the many > > > reasons i outlined in my previous mails. (you'd have gotten those > > > same arguments had you submitted that patch to the folks who > > > maintain kernel/trace/) > > > > > > Steven specifically told me to submit the patch to the subsystem maintainer that > > I'm adding tracepoints for, and the only feedback I got on it was his one > > question, the answer to which I assume satisfied him, due to that there was no > > subseuqent discussion. 
I'm going to ignore your previous emails, because, > > despite the various advantages of just using plain TRACE_EVENTs because you > > provide the ftrace interface, and I found it useful. Your observation is > > correct, I like it, and thats what I wanted to use, so I used it. If you don't > > want people to use it, don't provide it. > > Actually, I suggested to submit it to the subsystem maintainer if there > was no changes to the tracing infrastructure. We may have just had a > misunderstanding there. No biggy. > I'm not sure how the addition of an ftrace module constitutes a change to the tracing infrastructure, but whatever, yes, no biggy. I've bugun modifying the TRACE_EVENT that I added to export the data I need directly. Should be pretty straightforward. Dave I'll have a patch up on netdev in a day or two after I test it. Steven, should this still just go to netdev with a cc to you? I'd like to avoid repeating the same confusion here a second time around if I can > > > > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands. The code > > I wrote works fairly well in my view, and I feel like the review on it was both > > positive and sufficent for inclusion. But thats not my call, its yours. I can > > meet my own need with a raw TRACE_EVENT for now just as easily. IF you feel > > like the skb plugin should be pulled, please do so, and let me know. All I ask > > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place. If you pull > > the ftrace plugin, I'll submit a subsequent patch to agument the printing format > > so that I can gather the numa allocation and consumption data directly there. > > Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-). > Yes, that is rather centeral to what I'm monitoring :) > You probably already read my previous email on the matter. Don't delete > your plugin patch until we get everything you need with TRACE_EVENT alone. 
>
Its ok, I should have the TRACE_EVENT modified to export this stuff directly by
tomorrow or friday anyway. I really honestly just liked the ftrace interface
better, I found it a bit less confusing :)

Best
Neil

> Thanks,
>
> -- Steve
>
>

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-27 0:14 ` Neil Horman @ 2009-08-27 0:29 ` Steven Rostedt 2009-08-27 1:17 ` Neil Horman 2009-08-27 9:34 ` Ingo Molnar 0 siblings, 2 replies; 89+ messages in thread From: Steven Rostedt @ 2009-08-27 0:29 UTC (permalink / raw) To: Neil Horman Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin On Wed, 26 Aug 2009, Neil Horman wrote: > > > I'm not sure how the addition of an ftrace module constitutes a change to the > tracing infrastructure, but whatever, yes, no biggy. I've bugun modifying the > TRACE_EVENT that I added to export the data I need directly. Should be pretty > straightforward. Dave I'll have a patch up on netdev in a day or two after I > test it. Steven, should this still just go to netdev with a cc to you? I'd > like to avoid repeating the same confusion here a second time around if I can Yes, please Cc myself, and Ingo on those changes. I see where the confusion came. It is where the code changes. The code in kernel/trace is considered ftrace internals (there's internal tracing upkeep that is needed for all plugins). But with TRACE_EVENT, those can happen totally inside a subsystem without touching any tracing directory. Those are yours, and the TRACE_EVENT is just an API to the rest of the kernel. We don't even care if you add a header to include/trace/events/ (if it follows the standard format). But by adding a plugin, it causes more work for us. The plugin types do not get automated like TRACE_EVENTs and for binary readers like perf and trace-cmd, we need to hand export the binary format for them. > > > > > > > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands. The code > > > I wrote works fairly well in my view, and I feel like the review on it was both > > > positive and sufficent for inclusion. But thats not my call, its yours. I can > > > meet my own need with a raw TRACE_EVENT for now just as easily. 
IF you feel > > > like the skb plugin should be pulled, please do so, and let me know. All I ask > > > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place. If you pull > > > the ftrace plugin, I'll submit a subsequent patch to agument the printing format > > > so that I can gather the numa allocation and consumption data directly there. > > > > Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-). > > > Yes, that is rather centeral to what I'm monitoring :) > > > You probably already read my previous email on the matter. Don't delete > > your plugin patch until we get everything you need with TRACE_EVENT alone. > > > Its ok, I should have the TRACE_EVENT modified to export this stuff directly by > tomorrow or friday anyway. I really honestly just liked the ftrace interface > better, I found it a bit less confusing :) Heh, because it was just a bit of cut and paste. But as Frederic said, very much prone to errors. And it breaks the binary userspace readers. Your new type did not get exported via trace_export.c. TRACE_EVENT can be a little harder to learn, because it is all MACRO magic, but once you understand them, you'll find that they are very easy. Thanks! -- Steve ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-27 0:29 ` Steven Rostedt @ 2009-08-27 1:17 ` Neil Horman 2009-08-27 9:06 ` Ingo Molnar 2009-08-27 9:34 ` Ingo Molnar 1 sibling, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-27 1:17 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin On Wed, Aug 26, 2009 at 08:29:59PM -0400, Steven Rostedt wrote: > > On Wed, 26 Aug 2009, Neil Horman wrote: > > > > > I'm not sure how the addition of an ftrace module constitutes a change to the > > tracing infrastructure, but whatever, yes, no biggy. I've bugun modifying the > > TRACE_EVENT that I added to export the data I need directly. Should be pretty > > straightforward. Dave I'll have a patch up on netdev in a day or two after I > > test it. Steven, should this still just go to netdev with a cc to you? I'd > > like to avoid repeating the same confusion here a second time around if I can > > Yes, please Cc myself, and Ingo on those changes. I see where the > confusion came. It is where the code changes. The code in kernel/trace is > considered ftrace internals (there's internal tracing upkeep that is > needed for all plugins). But with TRACE_EVENT, those can happen totally > inside a subsystem without touching any tracing directory. Those are > yours, and the TRACE_EVENT is just an API to the rest of the kernel. We > don't even care if you add a header to include/trace/events/ (if it > follows the standard format). > > But by adding a plugin, it causes more work for us. The plugin types do > not get automated like TRACE_EVENTs and for binary readers like perf and > trace-cmd, we need to hand export the binary format for them. > Understood, I'll keep that in mind in the future. > > > > > > > > > > Ok, I'm rather tired of arguing. Dave, I'll leave this in your hands. 
The code > > > > I wrote works fairly well in my view, and I feel like the review on it was both > > > > positive and sufficent for inclusion. But thats not my call, its yours. I can > > > > meet my own need with a raw TRACE_EVENT for now just as easily. IF you feel > > > > like the skb plugin should be pulled, please do so, and let me know. All I ask > > > > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place. If you pull > > > > the ftrace plugin, I'll submit a subsequent patch to agument the printing format > > > > so that I can gather the numa allocation and consumption data directly there. > > > > > > Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-). > > > > > Yes, that is rather centeral to what I'm monitoring :) > > > > > You probably already read my previous email on the matter. Don't delete > > > your plugin patch until we get everything you need with TRACE_EVENT alone. > > > > > Its ok, I should have the TRACE_EVENT modified to export this stuff directly by > > tomorrow or friday anyway. I really honestly just liked the ftrace interface > > better, I found it a bit less confusing :) > > Heh, because it was just a bit of cut and paste. But as Frederic said, > very much prone to errors. And it breaks the binary userspace readers. > Your new type did not get exported via trace_export.c. > > TRACE_EVENT can be a little harder to learn, because it is all MACRO > magic, but once you understand them, you'll find that they are very easy. > Yeah, macro magic is an understatement. But I'll have the conversions done in the next few days, no worries. Thanks! Neil > Thanks! > > -- Steve > > ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-27 1:17 ` Neil Horman
@ 2009-08-27 9:06 ` Ingo Molnar
0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-27 9:06 UTC (permalink / raw)
To: Neil Horman
Cc: Steven Rostedt, David Miller, fweisbec, billfink, netdev, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

> > TRACE_EVENT can be a little harder to learn, because it is all
> > MACRO magic, but once you understand them, you'll find that they
> > are very easy.
>
> Yeah, macro magic is an understatement. But I'll have the
> conversions done in the next few days, no worries.

Cool, thanks Neil!

[ And we tracing folks are rather fond of that macro abuse, so if you
  can think of ways it could be made even more abusively C-alike, we
  are all ears ;-) ]

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-27 0:29 ` Steven Rostedt
2009-08-27 1:17 ` Neil Horman
@ 2009-08-27 9:34 ` Ingo Molnar
1 sibling, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-27 9:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: Neil Horman, David Miller, fweisbec, billfink, netdev, brice, gallatin

* Steven Rostedt <rostedt@goodmis.org> wrote:

>
> On Wed, 26 Aug 2009, Neil Horman wrote:
> > >
> > I'm not sure how the addition of an ftrace module constitutes a change to the
> > tracing infrastructure, but whatever, yes, no biggy. I've bugun modifying the
> > TRACE_EVENT that I added to export the data I need directly. Should be pretty
> > straightforward. Dave I'll have a patch up on netdev in a day or two after I
> > test it. Steven, should this still just go to netdev with a cc to you? I'd
> > like to avoid repeating the same confusion here a second time around if I can
>
> Yes, please Cc myself, and Ingo on those changes. I see where the
> confusion came. It is where the code changes. The code in
> kernel/trace is considered ftrace internals (there's internal
> tracing upkeep that is needed for all plugins). [...]

yeah - i pointed that out in the very first mail to David 9 days ago
when this patch broke the build in linux-next: kernel/trace/ is like
net/core/. It would be nice and important if the networking tree
treated it as such in the future.

See the:

  [PATCH -next] trace_skb: fix build when CONFIG_NET is not enabled

discussion on lkml:

  http://lkml.org/lkml/2009/8/17/378

Thanks,

Ingo

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 22:39 ` Neil Horman
` (2 preceding siblings ...)
2009-08-26 23:33 ` Steven Rostedt
@ 2009-08-27 0:34 ` Christoph Hellwig
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Hellwig @ 2009-08-27 0:34 UTC (permalink / raw)
To: Neil Horman
Cc: Ingo Molnar, David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 06:39:22PM -0400, Neil Horman wrote:
> Steven specifically told me to submit the patch to the subsystem maintainer that
> I'm adding tracepoints for, and the only feedback I got on it was his one
> question, the answer to which I assume satisfied him, due to that there was no
> subseuqent discussion. I'm going to ignore your previous emails, because,
> despite the various advantages of just using plain TRACE_EVENTs because you
> provide the ftrace interface, and I found it useful. Your observation is
> correct, I like it, and thats what I wanted to use, so I used it. If you don't
> want people to use it, don't provide it.

Neil, this attitude is a perfect way to end up on a shitlist. I think
there is a fair case to be made that you didn't know the ftrace plugin
was the wrong approach when you wrote it, but now you do.

And btw, I completely agree with Ingo here - the TRACE_EVENT stuff is
extremely useful for getting a broader picture of what's going on. E.g.
the combination of my unfortunately not yet included xfs tracer and
blktrace allowed debugging quite a lot of interesting issues.

So instead of playing jackass here, listen to what the right approach
is and fix it up. It'll help us all in the end.

Looking forward to the day when plain DECLARE_TRACE goes away so people
can't "accidentally" use it.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-26 20:23 ` Neil Horman 2009-08-26 20:40 ` Ingo Molnar @ 2009-08-26 23:46 ` Frederic Weisbecker 1 sibling, 0 replies; 89+ messages in thread From: Frederic Weisbecker @ 2009-08-26 23:46 UTC (permalink / raw) To: Neil Horman Cc: Ingo Molnar, David Miller, rostedt, billfink, netdev, brice, gallatin On Wed, Aug 26, 2009 at 04:23:44PM -0400, Neil Horman wrote: > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote: > > > > * David Miller <davem@davemloft.net> wrote: > > > > > From: Ingo Molnar <mingo@elte.hu> > > > Date: Wed, 26 Aug 2009 21:08:30 +0200 > > > > > > > Sigh, no. Please re-read the past discussions about this. > > > > trace_skb_sources.c is a hack and should be converted to generic > > > > tracepoints. Is there anything in it that cannot be expressed in > > > > terms of TRACE_EVENT()? > > > > > > Neil explained why he needed to implement it this way in his reply > > > to Steven Rostedt. I attach it here for your convenience. > > > > thanks. The argument is invalid: > > > Just because you assert that doesn't make it so, Ingo. > > > > > BTW, why not just do this as events? Or was this just a easy way > > > > to communicate with the user space tools? > > > > > > Thats exactly why I did it. the idea is for me to now write a > > > user space tool that lets me analyze the events and ajust process > > > scheduling to optimize the rx path. Neil > > > > All tooling (in fact _more_ tooling) can be done based on generic, > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more > > available, have a generalized format with format parsers and user > > tooling implemented, etc. etc. > > > Then why allow for ftrace modules at all? Well, the old way to implement a tracer was done as you did: create a whole ftrace plugin (ie: a tracer). 
But it's a bit of a burden to implement a tracer: you have to deal with
the ring buffer directly, using code that is pretty much the same from
one trivial tracer to another; you have to deal with output formatting;
and you have to define your fields, their types and their formats
explicitly and separately if you want the filters to be supported.

Oh, and you also need to handle your tracepoints by hand and check
their registration results. You also need to implement your own start
and stop callbacks, which enable and disable your tracepoints.

So that's a lot of repetitive and error-prone work. Also, kernel/trace
hosts a lot of such error-prone code, and it becomes a maintenance
burden not only for you but also for us.

The goal of the TRACE_EVENTs is to reduce the impact of everything I
explained above. You only need to care about the strictly necessary
things for your traces:

- field name
- field type
- field formats

And that's pretty much all. All the burden of copying into the ring
buffer, filtering, tracepoints, formats and output is handled in the
background.

Also your tracer no longer depends on a fixed ABI, because the formats
of your fields are dynamically described in dedicated debugfs files.
Plugin tracer fields, even though we have workarounds to describe their
format, have many more constraints: their format is largely constrained
to stay fixed.

Also, a lot of things are being developed in userspace that every
TRACE_EVENT can benefit from, as Ingo has shown with perf. Steve's
trace-cmd tool also handles them.

The ftrace tracer plugins are still used for non-trivial cases where
tracing based on tracepoints is not sufficient. For example the
function/function graph tracers, which require hot patching and a gcc
feature plus a lot of subtle background things, or the
preemptoff/irqsoff/preemptirqsoff tracers, which require a snapshot of
a maximum latency trace, etc...

That's why the ftrace tracer plugins still exist: to cover the
non-trivial cases. But using them for tracing based on simple static
tracepoints like yours is pure legacy.
Frederic.

^ permalink raw reply [flat|nested] 89+ messages in thread
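The declare-once idea described in the message above (supply a field name, a field type and a field format, and let the infrastructure derive the rest) can be illustrated outside the kernel. What follows is a minimal user-space sketch in Python, not the TRACE_EVENT() macro itself: `Field`, `EventSpec` and the `skb_demo` event are hypothetical names invented for the illustration, and the packed little-endian record layout is an assumption that differs from the compiler-laid-out records the kernel writes.

```python
import struct
from dataclasses import dataclass


# Hypothetical illustration only: none of these names exist in the kernel.
@dataclass(frozen=True)
class Field:
    ctype: str  # C type name shown in the description, e.g. "unsigned short"
    name: str   # field name
    code: str   # struct-module format code used to pack the value, e.g. "H"
    fmt: str    # printf-style conversion used when printing, e.g. "%u"


class EventSpec:
    """One declaration from which description, packing and printing are derived."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = list(fields)
        # "<" means little-endian with no padding (an assumption of this sketch).
        self._struct = struct.Struct("<" + "".join(f.code for f in self.fields))

    def describe(self):
        """Build a textual field description with computed offsets and sizes."""
        lines = ["name: " + self.name, "format:"]
        offset = 0
        for f in self.fields:
            size = struct.calcsize("<" + f.code)
            lines.append("\tfield:%s %s;\toffset:%d;\tsize:%d;"
                         % (f.ctype, f.name, offset, size))
            offset += size
        return "\n".join(lines)

    def record(self, **values):
        """Pack one event into bytes, in declaration order."""
        return self._struct.pack(*(values[f.name] for f in self.fields))

    def render(self, raw):
        """Unpack a record and print it using each field's own format."""
        values = self._struct.unpack(raw)
        return " ".join("%s=%s" % (f.name, f.fmt % v)
                        for f, v in zip(self.fields, values))


spec = EventSpec("skb_demo", [
    Field("unsigned long", "skbaddr", "Q", "%#x"),
    Field("unsigned short", "protocol", "H", "%u"),
])
raw = spec.record(skbaddr=0xffff8801bcc15300, protocol=2048)
print(spec.describe())
print(spec.render(raw))
```

The point of the design is that adding a field touches only the declaration list; the description, the record layout and the printed form all follow from it, which is the property the thread attributes to TRACE_EVENT() and says hand-written plugins lack.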
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 19:48 ` Ingo Molnar
2009-08-26 20:23 ` Neil Horman
@ 2009-08-26 20:28 ` Ingo Molnar
1 sibling, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 20:28 UTC (permalink / raw)
To: David Miller
Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

* Ingo Molnar <mingo@elte.hu> wrote:

> * David Miller <davem@davemloft.net> wrote:
>
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Wed, 26 Aug 2009 21:08:30 +0200
> >
> > > Sigh, no. Please re-read the past discussions about this.
> > > trace_skb_sources.c is a hack and should be converted to generic
> > > tracepoints. Is there anything in it that cannot be expressed in
> > > terms of TRACE_EVENT()?
> >
> > Neil explained why he needed to implement it this way in his reply
> > to Steven Rostedt. I attach it here for your convenience.
>
> thanks. The argument is invalid:
>
> > > BTW, why not just do this as events? Or was this just a easy way
> > > to communicate with the user space tools?
> >
> > Thats exactly why I did it. the idea is for me to now write a
> > user space tool that lets me analyze the events and ajust process
> > scheduling to optimize the rx path. Neil
>
> All tooling (in fact _more_ tooling) can be done based on generic,
> TRACE_EVENT() based tracepoints. Generic tracepoints are far more
> available, have a generalized format with format parsers and user
> tooling implemented, etc. etc.

To expand on the 'etc. etc.'. Right now we already have one
TRACE_EVENT() based generic tracepoint for skbs - the kfree_skb one in
include/trace/events/skb.h.
Here's a list of examples of what that single generic tracepoint
allows us to do, which Neil's kernel/trace/trace_skb_sources.c code
cannot do:

 - structured format/field description:

    aldebaran:~> cat /debug/tracing/events/skb/kfree_skb/format
    name: kfree_skb
    ID: 603
    format:
        field:unsigned short common_type;          offset:0;   size:2;
        field:unsigned char common_flags;          offset:2;   size:1;
        field:unsigned char common_preempt_count;  offset:3;   size:1;
        field:int common_pid;                      offset:4;   size:4;
        field:int common_tgid;                     offset:8;   size:4;

        field:void * skbaddr;                      offset:16;  size:8;
        field:unsigned short protocol;             offset:24;  size:2;
        field:void * location;                     offset:32;  size:8;

    print fmt: "skbaddr=%p protocol=%u location=%p", REC->skbaddr, REC->protocol, REC->location

   The advantages of that are numerous: we have a user-space parser
   for that, so new tracepoints or changes to tracepoints can be
   propagated across the tooling automatically. (see below examples
   about how this works in practice)

 - perfcounters integration:

    - it's enumerated and visible in the list of tracepoints:

       aldebaran:~> perf list 2>&1 | grep skb
         skb:kfree_skb                            [Tracepoint event]

    - the tracepoint can be used for statistics (perf stat):

       aldebaran:~> perf stat -e skb:kfree_skb -a sleep 1

        Performance counter stats for 'sleep 1':

    - noise analysis:

       aldebaran:~> perf stat --repeat 10 -e skb:kfree_skb -a sleep 1

        Performance counter stats for 'sleep 1' (10 runs):

                 25  skb:kfree_skb            ( +-   7.692% )

    - the tracepoint can be used for profiling:

       aldebaran:~> perf top -e skb:kfree_skb -c 1
       ------------------------------------------------------------------------------
          PerfTop:     334 irqs/sec  kernel: 0.3% [1 skb:kfree_skb],  (all, 16 CPUs)
       ------------------------------------------------------------------------------

                  samples    pcnt         RIP          kernel function
         ______   _______   _____   ________________   _______________

                    23.00 - 100.0% - ffffffff81266828 : store_bind

    - can be used to do call-graph profiling that captures kernel and
      user-space call-graphs as well:
aldebaran:~> perf record --call-graph -e skb:kfree_skb -c 1 -f -a sleep 1 [ perf record: Captured and wrote 0.035 MB perf.data (~1547 samples) ] aldebaran:~> perf report ... # Samples: 4102 # # Overhead Command Shared Object Symbol # ........ ............... ........................................................................................................ ...... # 88.44% distccd 3641efb1d0 [.] 0x00003641efb1d0 3.07% Xorg 3641ed6590 [.] 0x00003641ed6590 2.51% at-spi-registry 3642a0db50 [.] 0x00003642a0db50 2.24% sshd /lib64/libc-2.8.so [.] __libc_read 0.73% sshd 7f71d4e69590 [.] 0x007f71d4e69590 0.63% init [kernel] [k] store_bind 0.56% sshd /lib64/libc-2.8.so [.] __recvmsg 0.49% gnome-settings- 3642a0db8b [.] 0x00003642a0db8b 0.39% sshd /lib64/libc-2.8.so [.] __GI___libc_connect 0.39% sshd /lib64/libc-2.8.so [.] __sendto_nocancel 0.15% id /lib64/libc-2.8.so [.] __GI___libc_connect | |--50.00%-- get_mapping | __nscd_get_map_ref | --50.00%-- __nscd_open_socket 0.10% metacity 3641ed6590 [.] 0x00003641ed6590 0.07% gdm-simple-gree 3642a0db8b [.] 0x00003642a0db8b | |--66.67%-- 0x3641ed65cb | --33.33%-- 0x3642a0db8b 0.05% bash /lib64/libc-2.8.so [.] __GI___libc_connect | |--50.00%-- get_mapping | __nscd_get_map_ref | --50.00%-- __nscd_open_socket 0.05% :3129 /lib64/libc-2.8.so [.] __GI___libc_connect | |--50.00%-- get_mapping | __nscd_get_map_ref | --50.00%-- __nscd_open_socket 0.05% :3098 /lib64/libc-2.8.so [.] __GI___libc_connect | |--50.00%-- get_mapping | __nscd_get_map_ref | --50.00%-- __nscd_open_socket 0.02% init [kernel] [k] bind_con_driver 0.02% gnome-power-man 3642a0db50 [.] 0x00003642a0db50 0.02% cc1 /opt/crosstool/gcc-4.2.2-glibc-2.3.6/i686-unknown-linux-gnu/libexec/gcc/i686-unknown-linux-gnu/4.2.2/cc1 [.] 
num_positive

    - can be used to capture traces to user-space and analyze them
      there:

      aldebaran:/home/mingo> perf record -e skb:kfree_skb:r -c 1 -R -f -a sleep 10
      [ perf record: Captured and wrote 4.426 MB perf.data (~193365 samples) ]

      aldebaran:/home/mingo> perf trace
      version = 0.5B6
                  init-0     [000]     0.000000: kfree_skb: skbaddr=0xffff8801bcc15300 protocol=2048 location=0xffffffff81461c94
               Xorg-4411     [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
      at-spi-registry-4948   [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
      ...

 - generic tracepoints can be active alongside lots of other
   tracepoints at once - while the skb_sources plugin is exclusive.
   (no other plugin can be active at the same time) Generic
   tracepoints have separate toggles - any sub-set of tracepoints can
   be active at any time.

 - per tracepoint filter expressions support, such as:

   aldebaran:/debug/tracing/events/skb/kfree_skb> echo 'protocol == 0 && common_pid == 123' > filter
   aldebaran:/debug/tracing/events/skb/kfree_skb> cat filter
   protocol == 0 && common_pid == 123
   protocol == 0 && common_pid == 123

   When this filter is modified, the kernel creates a (safe) list of
   (atomically evaluatable) predicates from the expression and the
   data is filtered before it's traced. The filter engine works in
   process, softirq, IRQ, NMI and any other context and is very fast
   as well. (no parsing overhead in the fastpath - we pre-parse the
   expression and break it down.)

In other words, generic tracepoints are _vastly_ superior to the
skb_sources plugin, and this fact is obvious to all tracing
developers - that's why every tracing developer who commented on this
thread asked (in a rather befuddled way) "why not TRACE_EVENT()?".

And note that the above examples were based on a _single_ existing
generic tracepoint of very limited utility - and still it already
allowed a lot of interesting data to be captured.
If we had a more comprehensive set of skb tracepoints, a whole lot
of interesting possibilities would open up ...

All in all, we don't do new ftrace plugins that can be done via
generic tracepoints - we limit ftrace plugins to vastly different
things like the function tracer or the latency tracer.

That's why we have things like a tracing tree and a review process,
to address such issues before patches get committed.

David, please sort this out before sending any bits in this area to
Linus. Neil's response is basically "I want it this way", which is not
really acceptable - the maintainers of kernel/trace/* don't want it
this way, for very good technical reasons. The skb_sources hack should
be converted to a proper TRACE_EVENT(skb_dequeue) tracepoint.

Also, as we offered at the outset, we'd be glad to help out with the
conversion. I can do a patch if nobody volunteers. Plus we'd like to
encourage more TRACE_EVENT() networking tracepoints like the existing
kfree_skb. They are a great tool.

	Ingo
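As a rough illustration of two of the points made in the message above (a machine-readable format description, and filter expressions that are parsed once rather than per event), here is a minimal user-space sketch in Python. It is neither the kernel's filter engine nor the perf tooling's parser. The helper names parse_format(), compile_filter() and matches() are hypothetical, only `==` clauses joined by `&&` are handled, and array-typed fields are not handled. The format text is copied from the kfree_skb format file quoted earlier in the thread.

```python
import re

# Field list copied from the kfree_skb `format` file quoted in the thread.
FORMAT_TEXT = """
name: kfree_skb
ID: 603
format:
    field:unsigned short common_type; offset:0; size:2;
    field:unsigned char common_flags; offset:2; size:1;
    field:unsigned char common_preempt_count; offset:3; size:1;
    field:int common_pid; offset:4; size:4;
    field:int common_tgid; offset:8; size:4;
    field:void * skbaddr; offset:16; size:8;
    field:unsigned short protocol; offset:24; size:2;
    field:void * location; offset:32; size:8;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_format(text):
    """Return {field_name: (c_type, offset, size)} for each 'field:' entry."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        # The declaration is "<C type> <name>"; the name is the last word.
        c_type, _, name = m.group("decl").strip().rpartition(" ")
        fields[name] = (c_type, int(m.group("offset")), int(m.group("size")))
    return fields


def compile_filter(expr, fields):
    """Pre-parse 'a == 1 && b == 2' into a list of (field, value) predicates.

    Hypothetical simplification: only integer equality and '&&' are supported.
    Field names are validated against the parsed format description.
    """
    predicates = []
    for clause in expr.split("&&"):
        m = re.fullmatch(r"\s*(\w+)\s*==\s*(\d+)\s*", clause)
        if m is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        name, value = m.group(1), int(m.group(2))
        if name not in fields:
            raise ValueError(f"unknown field: {name}")
        predicates.append((name, value))
    return predicates


def matches(predicates, record):
    """Evaluate the pre-parsed predicates against one event record (a dict)."""
    return all(record.get(name) == value for name, value in predicates)


if __name__ == "__main__":
    fields = parse_format(FORMAT_TEXT)
    preds = compile_filter("protocol == 0 && common_pid == 123", fields)
    print(preds)
    print(matches(preds, {"protocol": 0, "common_pid": 123}))
    print(matches(preds, {"protocol": 2048, "common_pid": 123}))
```

The point the sketch tries to mirror is the one made in the message: the expression is broken down once, when the filter is set, so the per-event check is just a loop over simple predicates.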
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-26 19:08 ` Ingo Molnar 2009-08-26 19:36 ` David Miller @ 2009-08-26 20:01 ` Neil Horman 2009-08-26 22:57 ` Ingo Molnar 1 sibling, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-26 20:01 UTC (permalink / raw) To: Ingo Molnar Cc: David S. Miller, Steven Rostedt, Frédéric Weisbecker, Bill Fink, Linux Network Developers, brice, gallatin On Wed, Aug 26, 2009 at 09:08:30PM +0200, Ingo Molnar wrote: > > * Neil Horman <nhorman@tuxdriver.com> wrote: > > > On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote: > > > > > > * Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote: > > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > > > > > On Fri, 21 Aug 2009, Neil Horman wrote: > > > > > > > > > > > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote: > > > > > > > > On Thu, 20 Aug 2009, Neil Horman wrote: > > > > > > > > > > > > > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: > > > > > > > > > > > > > > > > > > > When I tried an actual nuttcp performance test, even when rate limiting > > > > > > > > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a > > > > > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just > > > > > > > > > > generating a crashdump, fully booted the new kernel, which was > > > > > > > > > > extremely sluggish until I rebooted it through a BIOS re-init, > > > > > > > > > > and never produced a crashdump. I tried this several times and > > > > > > > > > > an immediate kernel oops was always the result (with either a TCP > > > > > > > > > > or UDP test). A ping test of 1000 9000-byte packets with an interval > > > > > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand > > > > > > > > > > worked just fine. 
> > > > > > > > > > > > > > > > > > The sluggishness is expected, since the kdump kernel operates out of such > > > > > > > > > limited memory. don't know why you booted to a full system rather than did a > > > > > > > > > crash recovery. Don't suppose you got a backtrace did you? > > > > > > > > > > > > > > > > There was a backtrace on the screen but I didn't have a chance to > > > > > > > > record it. BTW did anyone ever think to print the backtrace in > > > > > > > > reverse (first to some reserved memory and then output to the display) > > > > > > > > so the more interesting parts wouldn't have scrolled off the top of > > > > > > > > the screen? > > > > > > > > > > > > > > > The real solution is to use a console to which the output doesn't scroll off the > > > > > > > screen. Normally people use a serial console they can log, or a RAC card that > > > > > > > they can record. Even on a regular vga monitor in text mode, you can set up the > > > > > > > vt iirc to allow for scrolling. > > > > > > > > > > > > None of our Asus P6T6 systems have serial consoles. I don't know of > > > > > > any RAC cards for them either, nor are there spare PCI slots available > > > > > > in many cases. I wouldn't think the Shift-PageUp trick would work > > > > > > with a crashed kernel, but I admit I didn't try it. I haven't checked > > > > > > out netconsole yet either, but I'm not sure it would help either in a > > > > > > case like this that was a network related kernel crash. > > > > > > > > > > > Any USB ports that you can attach a serial dongle to? That would work as well, > > > > > or, as previously mentioned, netconsole also does the trick. > > > > > > > > > > > In any case, a simple kernel command line that would provide a reversed > > > > > > backtrace would be a simple thing to facilitate Linux users providing > > > > > > useful info to Linux kernel developers in helping to debug kernel > > > > > > problems. 
The most useful info would still be on the screen, so it > > > > > > could be transcribed or a photo image of the screen could be taken. > > > > > > > > > > > I understand what your saying, I'm just saying there are currently several > > > > > options for you that have already solved this problem in differnt ways. > > > > > > > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > > > > > does have a serial console, and after a fair amount of effort I was > > > > > > able to get it to work as desired, and was able to finally capture > > > > > > a backtrace of the kernel oops. BTW I believe the reason the > > > > > > kexec/kdump didn't work was probably because it couldn't find > > > > > > a /proc/vmcore file, although I don't know why that would be, > > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > > > > > up normally if it fails to find the /proc/vmcore file (or it's > > > > > > zero size). > > > > > > > > > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > > > > > happy to look into it further. > > > > > > > > > > > The following shows a simple ping test usage of the skb_sources > > > > > > tracing feature: > > > > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10 > > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data. 
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms > > > > > > > > > > > > --- 192.168.1.10 ping statistics --- > > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms > > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms > > > > > > > > > > > > [root@xeontest1 tracing]# cat trace > > > > > > # tracer: skb_sources > > > > > > # > > > > > > # PID ANID CNID IFC RXQ CCPU LEN > > > > > > # | | | | | | | > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > > > > > > > All is as was expected. > > > > > > > > > > > > But if I try an actual nuttcp performance test (even rate limited > > > > > > to 1 Mbps), I get the following kernel oops: > > > > > > > > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit > > > > > > > > > > Thanks > > > > > Neil > > > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10 > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > > PGD 337d12067 PUD 337d11067 PMD 0 > > > > > > Oops: 0000 [#1] SMP > > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e > > > > > > CPU 4 > > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ] > > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH > > > > > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12 > > > > > > RSP: 
0018:ffff8801a5811a88 EFLAGS: 00010213 > > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d > > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044 > > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00 > > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400 > > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890 > > > > > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000 > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0 > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00) > > > > > > Stack: > > > > > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000 > > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8 > > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef > > > > > > Call Trace: > > > > > > [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5 > > > > > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db > > > > > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366 > > > > > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4 > > > > > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6 > > > > > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f > > > > > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0 > > > > > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c > > > > > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f > > > > > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda > > > > > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318 > > > > > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132 > > > > > > [<ffffffff81067ddc>] ? 
autoremove_wake_function+0x0/0x3d > > > > > > [<ffffffff811b646c>] ? security_file_permission+0x16/0x18 > > > > > > [<ffffffff810f785c>] vfs_read+0xc0/0x107 > > > > > > [<ffffffff810f7971>] sys_read+0x4c/0x75 > > > > > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b > > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e > > > > > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > > RSP <ffff8801a5811a88> > > > > > > CR2: 0000000000000038 > > > > > > > > > > > > -Thanks > > > > > > > > > > > > -Bill > > > > > > > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > > > > > > > > > Here you go, I think this will fix your oops. > > > > > > > > > > > > Fix NULL pointer deref in skb sources ftracer > > > > > > > > Its possible that skb->sk will be null in this path, so we shouldn't just assume > > > > we can pass it to sock_net > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > > trace_skb_sources.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > ok if this is just a temporary fix until TRACE_EVENT() is done, but > > > we'll get rid of this and do TRACE_EVENT() before net-next-2.6 it's > > > pushed to .32, right? > > > > Not sure that the two are related. I think you meant to send this > > to the other thread, didnt you? > > Sigh, no. Please re-read the past discussions about this. > trace_skb_sources.c is a hack and should be converted to generic > tracepoints. Is there anything in it that cannot be expressed in > terms of TRACE_EVENT()? > As David noted in my previous posting, no, I don't intend to change this. 
It would certainly be possible to express this in terms of just a
TRACE_EVENT, but it would be much more complex and messy for any user
space tool to do so, IMHO. So I'd like to leave it as it is. To call
it a hack as it stands would really be to say that any of the current
ftrace modules are a hack, as all of them could just as easily be
expressed as a series of trace events that are later parsed by a user
space tool. I thought your comments were related to the conversion of
the napi_poll tracepoint to a TRACE_EVENT structure, which is
currently in progress.

Best
Neil

> 	Ingo
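For a sense of what "trace events later parsed by a user space tool" involves, here is a small Python sketch that post-processes text in the shape of the `perf trace` output quoted earlier in the thread. It is only an illustration under stated assumptions: the line layout is inferred from those three sample lines, parse_line() and frees_by_protocol() are hypothetical names, and real tooling would read the binary perf.data together with the tracepoint's format description rather than scrape text.

```python
import re
from collections import Counter

# Sample lines copied from the `perf trace` output quoted in the thread.
SAMPLE = """\
init-0 [000] 0.000000: kfree_skb: skbaddr=0xffff8801bcc15300 protocol=2048 location=0xffffffff81461c94
Xorg-4411 [000] 0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
at-spi-registry-4948 [000] 0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
"""

# "<comm>-<pid> [<cpu>] <timestamp>: <event>: key=value ..."
# The command name may itself contain '-', so the greedy match on <comm>
# leaves only the last "-<digits>" before the CPU column as the PID.
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+(?P<body>.*)$"
)


def parse_line(line):
    """Parse one trace line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "event": m.group("event"),
    }
    # Event payload: whitespace-separated key=value pairs, kept as strings.
    for pair in m.group("body").split():
        key, _, value = pair.partition("=")
        rec[key] = value
    return rec


def frees_by_protocol(text):
    """Count kfree_skb events per value of the 'protocol' field."""
    counts = Counter()
    for line in text.splitlines():
        rec = parse_line(line)
        if rec and rec["event"] == "kfree_skb":
            counts[int(rec["protocol"])] += 1
    return counts


if __name__ == "__main__":
    for proto, n in sorted(frees_by_protocol(SAMPLE).items()):
        print(f"protocol={proto} count={n}")
```

Whether this amount of post-processing counts as "complex and messy" is the judgment call being argued in the thread; the sketch takes no side on that.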
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 20:01 ` Neil Horman
@ 2009-08-26 22:57 ` Ingo Molnar
0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-08-26 22:57 UTC (permalink / raw)
To: Neil Horman
Cc: David S. Miller, Steven Rostedt, Frédéric Weisbecker, Bill Fink, Linux Network Developers, brice, gallatin

* Neil Horman <nhorman@tuxdriver.com> wrote:

> > Is there anything in it that cannot be expressed in terms of
> > TRACE_EVENT()?
>
> As David noted in my previous posting, no, I don't intend to
> change this. [...]

Well, this change lacks the ack of the maintainers of kernel/trace/*
for the technical reasons outlined in the (many...) mails sent on
this topic, so for the .32 networking tree to be properly pushable
to Linus you'll have to come up with a better answer than "I don't
intend to change this".

David, I tried to help but I really don't have time to deal with an
inefficient workflow like this. Two weeks ago you committed a clearly
broken patch to the tracing code: it had bugs, it was objected to
because it does the wrong thing altogether, Neil refuses to fix it,
and I'm supposed to convince Neil what the right solution is?

That's not how maintenance is supposed to work, it's utterly not
scalable. Please deal with it one way or another.

Thanks,

	Ingo
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-26 18:08 ` Neil Horman 2009-08-26 18:15 ` Ingo Molnar @ 2009-08-27 17:32 ` Bill Fink 2009-09-02 5:28 ` Bill Fink 2009-08-27 17:44 ` Bill Fink 2 siblings, 1 reply; 89+ messages in thread From: Bill Fink @ 2009-08-27 17:32 UTC (permalink / raw) To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin On Wed, 26 Aug 2009, Neil Horman wrote: > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > On Fri, 21 Aug 2009, Neil Horman wrote: > > > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote: > > > > On Thu, 20 Aug 2009, Neil Horman wrote: > > > > > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote: > > > > > > > > > > > When I tried an actual nuttcp performance test, even when rate limiting > > > > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just > > > > > > generating a crashdump, fully booted the new kernel, which was > > > > > > extremely sluggish until I rebooted it through a BIOS re-init, > > > > > > and never produced a crashdump. I tried this several times and > > > > > > an immediate kernel oops was always the result (with either a TCP > > > > > > or UDP test). A ping test of 1000 9000-byte packets with an interval > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand > > > > > > worked just fine. > > > > > > > > > > The sluggishness is expected, since the kdump kernel operates out of such > > > > > limited memory. don't know why you booted to a full system rather than did a > > > > > crash recovery. Don't suppose you got a backtrace did you? > > > > > > > > There was a backtrace on the screen but I didn't have a chance to > > > > record it. 
BTW did anyone ever think to print the backtrace in > > > > reverse (first to some reserved memory and then output to the display) > > > > so the more interesting parts wouldn't have scrolled off the top of > > > > the screen? > > > > > > > The real solution is to use a console to which the output doesn't scroll off the > > > screen. Normally people use a serial console they can log, or a RAC card that > > > they can record. Even on a regular vga monitor in text mode, you can set up the > > > vt iirc to allow for scrolling. > > > > None of our Asus P6T6 systems have serial consoles. I don't know of > > any RAC cards for them either, nor are there spare PCI slots available > > in many cases. I wouldn't think the Shift-PageUp trick would work > > with a crashed kernel, but I admit I didn't try it. I haven't checked > > out netconsole yet either, but I'm not sure it would help either in a > > case like this that was a network related kernel crash. > > > Any USB ports that you can attach a serial dongle to? That would work as well, > or, as previously mentioned, netconsole also does the trick. I didn't know you could use a USB serial port as a serial console. And after wasting several hours yesterday trying to get a USB serial console to work without any success, I'm giving up on that idea. Also since it requires building the required usb modules into the kernel, it wouldn't be practical, since I'd have to rebuild the kernel quite frequently given the frequency of Fedora kernel updates. I still need to check into netconsole. > > In any case, a simple kernel command line that would provide a reversed > > backtrace would be a simple thing to facilitate Linux users providing > > useful info to Linux kernel developers in helping to debug kernel > > problems. The most useful info would still be on the screen, so it > > could be transcribed or a photo image of the screen could be taken. 
> > > I understand what your saying, I'm just saying there are currently several > options for you that have already solved this problem in differnt ways. I would have been with you if the USB serial console idea had panned out. But I've just about eliminated all the proposed alternatives as viable, except for netconsole which I haven't investigated yet. Sometimes the additional low tech option of a reversed traceroute would be quite convenient and not require lots of extra effort from the user. BTW ISTR that someone else suggested the same idea a while back, but it didn't get any traction then either (can't find it in the archives though from a quick search). > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > does have a serial console, and after a fair amount of effort I was > > able to get it to work as desired, and was able to finally capture > > a backtrace of the kernel oops. BTW I believe the reason the > > kexec/kdump didn't work was probably because it couldn't find > > a /proc/vmcore file, although I don't know why that would be, > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > up normally if it fails to find the /proc/vmcore file (or it's > > zero size). > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > happy to look into it further. It's odd. kexec/kdump works fine with the 2.6.29.6-217.2.3.fc11.x86_64 kernel from Fedora 11 (running on the Fedora 10 system). I will try again with the kernel-2.6.31-0.174.rc7.git2.fc12.src.rpm from Fedora 12, in case it has some secret sauce in one of the Fedora patches to make the Fedora /etc/init.d/kdump script happy. kexec/kdump is my preferred method of dealing with kernel oopses if I can get it to work. 
Also, to get the /sbin/mkdumprd to work right, I had to make the following change to it: --- .orig/mkdumprd 2009-04-07 10:03:58.000000000 -0400 +++ .mod/mkdumprd 2009-08-19 19:04:38.000000000 -0400 @@ -384,7 +384,7 @@ vg_list="$vg_list $vg" for device in `vgdisplay -v $vg 2>/dev/null | sed -n 's/PV Name//p'`; do IS_UUID=`echo $device | grep UUID` - IS_LABEL=`echo $device | grep UUID` + IS_LABEL=`echo $device | grep LABEL` if [ -n "$IS_UUID" -o -n "$IS_LABEL" ] then devname=`findfs $device` @@ -398,7 +398,7 @@ esac else IS_UUID=`echo $1 | grep UUID` - IS_LABEL=`echo $1 | grep UUID` + IS_LABEL=`echo $1 | grep LABEL` if [ -n "$IS_UUID" -o -n "$IS_LABEL" ] then devname=`findfs $1` Without the patch to the /sbin/mkdumprd script, it couldn't find my root filesystem on LABEL=root. > > The following shows a simple ping test usage of the skb_sources > > tracing feature: > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10 > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data. > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms > > > > --- 192.168.1.10 ping statistics --- > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms > > > > [root@xeontest1 tracing]# cat trace > > # tracer: skb_sources > > # > > # PID ANID CNID IFC RXQ CCPU LEN > > # | | | | | | | > > 4217 1 1 eth2 0 4 1500 > > 4217 1 1 eth2 0 4 1500 > > 4217 1 1 eth2 0 4 1500 > > 4217 1 1 eth2 0 4 1500 > > 4217 1 1 eth2 0 4 1500 > > > > All is as was expected. 
> > > > But if I try an actual nuttcp performance test (even rate limited > > to 1 Mbps), I get the following kernel oops: > > > thank you, I think I see the problem, I'll have a patch for you in just a bit Thanks for the patch. I'll address the results of using the patch in a separate e-mail. -Bill > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10 > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > PGD 337d12067 PUD 337d11067 PMD 0 > > Oops: 0000 [#1] SMP > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e > > CPU 4 > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ] > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12 > > RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213 > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044 > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00 > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400 > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890 > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00) > > Stack: > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000 > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8 > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef > > Call Trace: > > 
[<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5 > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366 > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4 > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6 > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0 > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318 > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132 > > [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d > > [<ffffffff811b646c>] ? security_file_permission+0x16/0x18 > > [<ffffffff810f785c>] vfs_read+0xc0/0x107 > > [<ffffffff810f7971>] sys_read+0x4c/0x75 > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > RSP <ffff8801a5811a88> > > CR2: 0000000000000038 ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-27 17:32 ` Bill Fink @ 2009-09-02 5:28 ` Bill Fink 0 siblings, 0 replies; 89+ messages in thread From: Bill Fink @ 2009-09-02 5:28 UTC (permalink / raw) To: Bill Fink; +Cc: Neil Horman, Linux Network Developers, brice, gallatin On Thu, 27 Aug 2009, Bill Fink wrote: > On Wed, 26 Aug 2009, Neil Horman wrote: > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > > does have a serial console, and after a fair amount of effort I was > > > able to get it to work as desired, and was able to finally capture > > > a backtrace of the kernel oops. BTW I believe the reason the > > > kexec/kdump didn't work was probably because it couldn't find > > > a /proc/vmcore file, although I don't know why that would be, > > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > > up normally if it fails to find the /proc/vmcore file (or it's > > > zero size). > > > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > > happy to look into it further. > > It's odd. kexec/kdump works fine with the 2.6.29.6-217.2.3.fc11.x86_64 > kernel from Fedora 11 (running on the Fedora 10 system). I will try > again with the kernel-2.6.31-0.174.rc7.git2.fc12.src.rpm from Fedora 12, > in case it has some secret sauce in one of the Fedora patches to make > the Fedora /etc/init.d/kdump script happy. kexec/kdump is my preferred > method of dealing with kernel oopses if I can get it to work. The Fedora 12 kernel-2.6.31-0.174.rc7.git2 kernel didn't help with the kexec/kdump issue, so I may file a bug if I can't figure anything out. Also that kernel had a huge performance hit on my tests. Where I usually get ~100 Gbps of aggregate transmit performance, I was instead getting a mere 3 Gbps, with individual streams only getting about 200 to 400 Mbps. 
If I get a chance, I'll have to try the vanilla version to see if it
has the same issue (a vanilla 2.6.31-rc6 is fine).

						-Thanks

						-Bill
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-26 18:08 ` Neil Horman
2009-08-26 18:15 ` Ingo Molnar
2009-08-27 17:32 ` Bill Fink
@ 2009-08-27 17:44 ` Bill Fink
2009-08-27 17:51 ` Neil Horman
2 siblings, 1 reply; 89+ messages in thread
From: Bill Fink @ 2009-08-27 17:44 UTC (permalink / raw)
To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Wed, 26 Aug 2009, Neil Horman wrote:

> On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> >
> > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > does have a serial console, and after a fair amount of effort I was
> > > able to get it to work as desired, and was able to finally capture
> > > a backtrace of the kernel oops. BTW I believe the reason the
> > > kexec/kdump didn't work was probably because it couldn't find
> > > a /proc/vmcore file, although I don't know why that would be,
> > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > up normally if it fails to find the /proc/vmcore file (or it's
> > > zero size).
> > >
> > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be
> > happy to look into it further.
> >
> > > The following shows a simple ping test usage of the skb_sources
> > > tracing feature:
> > >
> > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > >
> > > --- 192.168.1.10 ping statistics ---
> > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > >
> > > [root@xeontest1 tracing]# cat trace
> > > # tracer: skb_sources
> > > #
> > > #  PID  ANID  CNID  IFC   RXQ  CCPU  LEN
> > > #   |    |     |     |     |    |     |
> > >   4217   1     1    eth2   0    4    1500
> > >   4217   1     1    eth2   0    4    1500
> > >   4217   1     1    eth2   0    4    1500
> > >   4217   1     1    eth2   0    4    1500
> > >   4217   1     1    eth2   0    4    1500
> > >
> > > All is as was expected.
> > >
> > > But if I try an actual nuttcp performance test (even rate limited
> > > to 1 Mbps), I get the following kernel oops:
> > >
> > thank you, I think I see the problem, I'll have a patch for you in just a bit
> >
> > Thanks
> > Neil
> >
> > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > PGD 337d12067 PUD 337d11067 PMD 0
> > > Oops: 0000 [#1] SMP
> > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > CPU 4
> > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213
> > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > Stack:
> > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > Call Trace:
> > > [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > RSP <ffff8801a5811a88>
> > > CR2: 0000000000000038
> > >
> Here you go, I think this will fix your oops.
>
>
> Fix NULL pointer deref in skb sources ftracer
>
> Its possible that skb->sk will be null in this path, so we shouldn't just assume
> we can pass it to sock_net
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
> trace_skb_sources.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> index 40eb071..8bf518f 100644
> --- a/kernel/trace/trace_skb_sources.c
> +++ b/kernel/trace/trace_skb_sources.c
> @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
>          struct ring_buffer_event *event;
>          struct trace_skb_event *entry;
>          struct trace_array *tr = skb_trace;
> -        struct net_device *dev;
> +        struct net_device *dev = NULL;
>
>          if (!trace_skb_source_enabled)
>                  return;
> @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
>          entry->event_data.rx_queue = skb->queue_mapping;
>          entry->event_data.ccpu = smp_processor_id();
>
> -        dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> +        if (skb->sk)
> +                dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> +
>          if (dev) {
>                  memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
>                  dev_put(dev);

On the positive side, it did fix the oops. But the results of the
skb_sources tracing were not that useful.

[root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
 5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
n2: 11819.0786 MB / 10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT

First off, only 10 trace entries were made:

[root@xeontest1 tracing]# wc trace
 14  90 334 trace

And here they are:

[root@xeontest1 tracing]# cat trace
# tracer: skb_sources
#
#  PID  ANID  CNID  IFC      RXQ  CCPU  LEN
#   |    |     |     |        |    |     |
  5521   0     0    Unknown   0    3    888
  5521   0     0    Unknown   0    3    896
  5521   0     0    Unknown   0    3    20
  5521   0     0    Unknown   0    3    888
  5521   0     0    Unknown   0    3    896
  5521   0     0    Unknown   0    3    20
  5521   1     1    Unknown   0    4    20
  5521   1     1    Unknown   0    4    11
  5521   1     1    Unknown   0    4    540
  5521   1     1    Unknown   0    4    0

Even for these 10 entries, why is the IFC Unknown? The LENs
seem to be wrong too.

-Bill
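For a longer trace, the entries can be tallied per node and CPU rather than read by eye. The helper below is only a sketch and was not posted in this thread: the name summarize_skb_trace is made up, and it assumes the seven-column layout given in the trace header (PID ANID CNID IFC RXQ CCPU LEN). It skips the four '#' header lines, which is also why wc reports 14 lines for 10 entries.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): tally skb_sources trace entries.
# Assumed column order, taken from the trace header:
#   PID ANID CNID IFC RXQ CCPU LEN
summarize_skb_trace() {
    # Drop '#' header lines and anything with fewer than 7 fields, keep the
    # "ANID CNID IFC CCPU" tuple, then count how often each tuple occurs.
    awk '!/^[ \t]*#/ && NF >= 7 { print $2, $3, $4, $6 }' "$1" |
        sort | uniq -c
}
```

It would be called as `summarize_skb_trace /sys/kernel/debug/tracing/trace`. For the ten entries shown above, six fall under node 0 / CPU 3 and four under node 1 / CPU 4.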
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-27 17:44 ` Bill Fink @ 2009-08-27 17:51 ` Neil Horman 2009-09-02 5:11 ` Bill Fink 0 siblings, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-08-27 17:51 UTC (permalink / raw) To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote: > On Wed, 26 Aug 2009, Neil Horman wrote: > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote: > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > > > does have a serial console, and after a fair amount of effort I was > > > > able to get it to work as desired, and was able to finally capture > > > > a backtrace of the kernel oops. BTW I believe the reason the > > > > kexec/kdump didn't work was probably because it couldn't find > > > > a /proc/vmcore file, although I don't know why that would be, > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > > > up normally if it fails to find the /proc/vmcore file (or it's > > > > zero size). > > > > > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > > > happy to look into it further. > > > > > > > The following shows a simple ping test usage of the skb_sources > > > > tracing feature: > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10 > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data. 
> > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms > > > > > > > > --- 192.168.1.10 ping statistics --- > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms > > > > > > > > [root@xeontest1 tracing]# cat trace > > > > # tracer: skb_sources > > > > # > > > > # PID ANID CNID IFC RXQ CCPU LEN > > > > # | | | | | | | > > > > 4217 1 1 eth2 0 4 1500 > > > > 4217 1 1 eth2 0 4 1500 > > > > 4217 1 1 eth2 0 4 1500 > > > > 4217 1 1 eth2 0 4 1500 > > > > 4217 1 1 eth2 0 4 1500 > > > > > > > > All is as was expected. > > > > > > > > But if I try an actual nuttcp performance test (even rate limited > > > > to 1 Mbps), I get the following kernel oops: > > > > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit > > > > > > Thanks > > > Neil > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10 > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > PGD 337d12067 PUD 337d11067 PMD 0 > > > > Oops: 0000 [#1] SMP > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e > > > > CPU 4 > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ] > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH > > > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12 > > > > RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213 > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044 
> > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00 > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400 > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890 > > > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00) > > > > Stack: > > > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000 > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8 > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef > > > > Call Trace: > > > > [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5 > > > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db > > > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366 > > > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4 > > > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6 > > > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f > > > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0 > > > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c > > > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f > > > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda > > > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318 > > > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132 > > > > [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d > > > > [<ffffffff811b646c>] ? 
security_file_permission+0x16/0x18 > > > > [<ffffffff810f785c>] vfs_read+0xc0/0x107 > > > > [<ffffffff810f7971>] sys_read+0x4c/0x75 > > > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e > > > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > RSP <ffff8801a5811a88> > > > > CR2: 0000000000000038 > > > > > > > > Here you go, I think this will fix your oops. > > > > > > Fix NULL pointer deref in skb sources ftracer > > > > Its possible that skb->sk will be null in this path, so we shouldn't just assume > > we can pass it to sock_net > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > trace_skb_sources.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c > > index 40eb071..8bf518f 100644 > > --- a/kernel/trace/trace_skb_sources.c > > +++ b/kernel/trace/trace_skb_sources.c > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > struct ring_buffer_event *event; > > struct trace_skb_event *entry; > > struct trace_array *tr = skb_trace; > > - struct net_device *dev; > > + struct net_device *dev = NULL; > > > > if (!trace_skb_source_enabled) > > return; > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > entry->event_data.rx_queue = skb->queue_mapping; > > entry->event_data.ccpu = smp_processor_id(); > > > > - dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > + if (skb->sk) > > + dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > + > > if (dev) { > > memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ); > > dev_put(dev); > > > > On the positive side, it did fix the oops. But the results of the > skb_sources tracing was not that useful. 
>
> [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
>  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> n2: 11819.0786 MB / 10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
>
> First off, only 10 trace entries were made:
>
> [root@xeontest1 tracing]# wc trace
>  14  90 334 trace
>
> And here they are:
>
> [root@xeontest1 tracing]# cat trace
> # tracer: skb_sources
> #
> #  PID  ANID  CNID  IFC      RXQ  CCPU  LEN
> #   |    |     |     |        |    |     |
>   5521   0     0    Unknown   0    3    888
>   5521   0     0    Unknown   0    3    896
>   5521   0     0    Unknown   0    3    20
>   5521   0     0    Unknown   0    3    888
>   5521   0     0    Unknown   0    3    896
>   5521   0     0    Unknown   0    3    20
>   5521   1     1    Unknown   0    4    20
>   5521   1     1    Unknown   0    4    11
>   5521   1     1    Unknown   0    4    540
>   5521   1     1    Unknown   0    4    0
>
> Even for these 10 entries, why is the IFC Unknown, and the LENs
> seem to be wrong too.
>
> -Bill
>
I'm not sure why you're getting Unknown interface names. Nominally that
indicates that the skb->iif value in the skb was incorrect or otherwise not set,
which shouldn't be the case. As for the lengths, that just seems wrong. That
length value is taken directly from skb->len, so if it's not right, it seems like
it's not getting set correctly someplace.

As you may have seen, we're removing the ftrace module and replacing it with the
use of raw trace events. When I have that working, I'll see if I get similar
results. I never did in my local testing of the ftrace module, but perhaps it's
related to load or something.

Neil
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-08-27 17:51 ` Neil Horman @ 2009-09-02 5:11 ` Bill Fink 2009-09-02 10:49 ` Neil Horman 0 siblings, 1 reply; 89+ messages in thread From: Bill Fink @ 2009-09-02 5:11 UTC (permalink / raw) To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin On Thu, 27 Aug 2009, Neil Horman wrote: > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote: > > On Wed, 26 Aug 2009, Neil Horman wrote: > > > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote: > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > > > > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > > > > does have a serial console, and after a fair amount of effort I was > > > > > able to get it to work as desired, and was able to finally capture > > > > > a backtrace of the kernel oops. BTW I believe the reason the > > > > > kexec/kdump didn't work was probably because it couldn't find > > > > > a /proc/vmcore file, although I don't know why that would be, > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > > > > up normally if it fails to find the /proc/vmcore file (or it's > > > > > zero size). > > > > > > > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > > > > happy to look into it further. > > > > > > > > > The following shows a simple ping test usage of the skb_sources > > > > > tracing feature: > > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10 > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data. 
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms > > > > > > > > > > --- 192.168.1.10 ping statistics --- > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms > > > > > > > > > > [root@xeontest1 tracing]# cat trace > > > > > # tracer: skb_sources > > > > > # > > > > > # PID ANID CNID IFC RXQ CCPU LEN > > > > > # | | | | | | | > > > > > 4217 1 1 eth2 0 4 1500 > > > > > 4217 1 1 eth2 0 4 1500 > > > > > 4217 1 1 eth2 0 4 1500 > > > > > 4217 1 1 eth2 0 4 1500 > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > > > > > All is as was expected. > > > > > > > > > > But if I try an actual nuttcp performance test (even rate limited > > > > > to 1 Mbps), I get the following kernel oops: > > > > > > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit > > > > > > > > Thanks > > > > Neil > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10 > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > PGD 337d12067 PUD 337d11067 PMD 0 > > > > > Oops: 0000 [#1] SMP > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e > > > > > CPU 4 > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ] > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH > > > > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12 > > > > > RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213 > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 
000000000000000d > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044 > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00 > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400 > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890 > > > > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0 > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00) > > > > > Stack: > > > > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000 > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8 > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef > > > > > Call Trace: > > > > > [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5 > > > > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db > > > > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366 > > > > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4 > > > > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6 > > > > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f > > > > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0 > > > > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c > > > > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f > > > > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda > > > > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318 > > > > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132 > > > > > [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d > > > > > [<ffffffff811b646c>] ? 
security_file_permission+0x16/0x18 > > > > > [<ffffffff810f785c>] vfs_read+0xc0/0x107 > > > > > [<ffffffff810f7971>] sys_read+0x4c/0x75 > > > > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e > > > > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > RSP <ffff8801a5811a88> > > > > > CR2: 0000000000000038 > > > > > > > > > > > > Here you go, I think this will fix your oops. > > > > > > > > > Fix NULL pointer deref in skb sources ftracer > > > > > > Its possible that skb->sk will be null in this path, so we shouldn't just assume > > > we can pass it to sock_net > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > trace_skb_sources.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c > > > index 40eb071..8bf518f 100644 > > > --- a/kernel/trace/trace_skb_sources.c > > > +++ b/kernel/trace/trace_skb_sources.c > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > > struct ring_buffer_event *event; > > > struct trace_skb_event *entry; > > > struct trace_array *tr = skb_trace; > > > - struct net_device *dev; > > > + struct net_device *dev = NULL; > > > > > > if (!trace_skb_source_enabled) > > > return; > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > > entry->event_data.rx_queue = skb->queue_mapping; > > > entry->event_data.ccpu = smp_processor_id(); > > > > > > - dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > > + if (skb->sk) > > > + dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > > + > > > if (dev) { > > > memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ); > > > dev_put(dev); > > > > > > > > On the positive side, it did fix the oops. But the results of the > > skb_sources tracing was not that useful. 
> >
> > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > n2: 11819.0786 MB / 10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> >
> > First off, only 10 trace entries were made:
> >
> > [root@xeontest1 tracing]# wc trace
> >  14  90 334 trace
> >
> > And here they are:
> >
> > [root@xeontest1 tracing]# cat trace
> > # tracer: skb_sources
> > #
> > #  PID  ANID  CNID  IFC      RXQ  CCPU  LEN
> > #   |    |     |     |        |    |     |
> >   5521   0     0    Unknown   0    3    888
> >   5521   0     0    Unknown   0    3    896
> >   5521   0     0    Unknown   0    3    20
> >   5521   0     0    Unknown   0    3    888
> >   5521   0     0    Unknown   0    3    896
> >   5521   0     0    Unknown   0    3    20
> >   5521   1     1    Unknown   0    4    20
> >   5521   1     1    Unknown   0    4    11
> >   5521   1     1    Unknown   0    4    540
> >   5521   1     1    Unknown   0    4    0
> >
> > Even for these 10 entries, why is the IFC Unknown, and the LENs
> > seem to be wrong too.
> >
> > -Bill
> >
> I'm not sure why you're getting Unknown Interface names. Nominally that
> indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> which shouldn't be the case. As for the lengths that just seems wrong. That
> length value is taken directly from skb->len, so if its not right, it seems like
> its not getting set correctly someplace.
>
> As you may have seen we're removing the ftrace module, and replacing it with the
> use of raw trace events. When I have that working, I'll see if I get simmilar
> results. I never did in my local testing of the ftrace module, but perhaps its
> related to load or something.

IIUC I should keep the first of your original three ftrace patches,
revert all the rest, and then apply your very latest patch that
augments the skb_copy_datagram_iovec TRACE_EVENT. Do I have that
basically correct?

Then I just need to ask how do I use this new method?

-Thanks

-Bill
* Re: Receive side performance issue with multi-10-GigE and NUMA 2009-09-02 5:11 ` Bill Fink @ 2009-09-02 10:49 ` Neil Horman 2009-09-02 15:38 ` Bill Fink 0 siblings, 1 reply; 89+ messages in thread From: Neil Horman @ 2009-09-02 10:49 UTC (permalink / raw) To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote: > On Thu, 27 Aug 2009, Neil Horman wrote: > > > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote: > > > On Wed, 26 Aug 2009, Neil Horman wrote: > > > > > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote: > > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote: > > > > > > > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system > > > > > > does have a serial console, and after a fair amount of effort I was > > > > > > able to get it to work as desired, and was able to finally capture > > > > > > a backtrace of the kernel oops. BTW I believe the reason the > > > > > > kexec/kdump didn't work was probably because it couldn't find > > > > > > a /proc/vmcore file, although I don't know why that would be, > > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot > > > > > > up normally if it fails to find the /proc/vmcore file (or it's > > > > > > zero size). > > > > > > > > > > > I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be > > > > > happy to look into it further. > > > > > > > > > > > The following shows a simple ping test usage of the skb_sources > > > > > > tracing feature: > > > > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10 > > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data. 
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms > > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms > > > > > > > > > > > > --- 192.168.1.10 ping statistics --- > > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms > > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms > > > > > > > > > > > > [root@xeontest1 tracing]# cat trace > > > > > > # tracer: skb_sources > > > > > > # > > > > > > # PID ANID CNID IFC RXQ CCPU LEN > > > > > > # | | | | | | | > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > 4217 1 1 eth2 0 4 1500 > > > > > > > > > > > > All is as was expected. > > > > > > > > > > > > But if I try an actual nuttcp performance test (even rate limited > > > > > > to 1 Mbps), I get the following kernel oops: > > > > > > > > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit > > > > > > > > > > Thanks > > > > > Neil > > > > > > > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10 > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 > > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > > PGD 337d12067 PUD 337d11067 PMD 0 > > > > > > Oops: 0000 [#1] SMP > > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e > > > > > > CPU 4 > > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ] > > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH > > > > > > RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12 > > > > > > RSP: 
0018:ffff8801a5811a88 EFLAGS: 00010213 > > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d > > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044 > > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00 > > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400 > > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890 > > > > > > FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000 > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0 > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00) > > > > > > Stack: > > > > > > ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000 > > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8 > > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef > > > > > > Call Trace: > > > > > > [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5 > > > > > > [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db > > > > > > [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366 > > > > > > [<ffffffff8135f99e>] ? release_sock+0xab/0xb4 > > > > > > [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6 > > > > > > [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f > > > > > > [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0 > > > > > > [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c > > > > > > [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f > > > > > > [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda > > > > > > [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318 > > > > > > [<ffffffff810f6d4f>] do_sync_read+0xec/0x132 > > > > > > [<ffffffff81067ddc>] ? 
autoremove_wake_function+0x0/0x3d > > > > > > [<ffffffff811b646c>] ? security_file_permission+0x16/0x18 > > > > > > [<ffffffff810f785c>] vfs_read+0xc0/0x107 > > > > > > [<ffffffff810f7971>] sys_read+0x4c/0x75 > > > > > > [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b > > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e > > > > > > RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152 > > > > > > RSP <ffff8801a5811a88> > > > > > > CR2: 0000000000000038 > > > > > > > > > > > > > > > > Here you go, I think this will fix your oops. > > > > > > > > > > > > Fix NULL pointer deref in skb sources ftracer > > > > > > > > Its possible that skb->sk will be null in this path, so we shouldn't just assume > > > > we can pass it to sock_net > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > > trace_skb_sources.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c > > > > index 40eb071..8bf518f 100644 > > > > --- a/kernel/trace/trace_skb_sources.c > > > > +++ b/kernel/trace/trace_skb_sources.c > > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > > > struct ring_buffer_event *event; > > > > struct trace_skb_event *entry; > > > > struct trace_array *tr = skb_trace; > > > > - struct net_device *dev; > > > > + struct net_device *dev = NULL; > > > > > > > > if (!trace_skb_source_enabled) > > > > return; > > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len) > > > > entry->event_data.rx_queue = skb->queue_mapping; > > > > entry->event_data.ccpu = smp_processor_id(); > > > > > > > > - dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > > > + if (skb->sk) > > > > + dev = dev_get_by_index(sock_net(skb->sk), skb->iif); > > > > + > > > > if (dev) { > > > > memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ); > > > > 
dev_put(dev); > > > > > > > > > > > > On the positive side, it did fix the oops. But the results of the > > > skb_sources tracing was not that useful. > > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp > > > 5521 ttyS0 S 0:00 nuttcp -In2 -xc4/0 192.168.1.10 > > > n2: 11819.0786 MB / 10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT > > > > > > First off, only 10 trace entries were made: > > > > > > [root@xeontest1 tracing]# wc trace > > > 14 90 334 trace > > > > > > And here they are: > > > > > > [root@xeontest1 tracing]# cat trace > > > # tracer: skb_sources > > > # > > > # PID ANID CNID IFC RXQ CCPU LEN > > > # | | | | | | | > > > 5521 0 0 Unknown 0 3 888 > > > 5521 0 0 Unknown 0 3 896 > > > 5521 0 0 Unknown 0 3 20 > > > 5521 0 0 Unknown 0 3 888 > > > 5521 0 0 Unknown 0 3 896 > > > 5521 0 0 Unknown 0 3 20 > > > 5521 1 1 Unknown 0 4 20 > > > 5521 1 1 Unknown 0 4 11 > > > 5521 1 1 Unknown 0 4 540 > > > 5521 1 1 Unknown 0 4 0 > > > > > > Even for these 10 entries, why is the IFC Unknown, and the LENs > > > seem to be wrong too. > > > > > > -Bill > > > > > I'm not sure why you're getting Unknown Interface names. Nominally that > > indicates that the skb->iif value in the skb was incorrect or otherwise not set, > > which shouldn't be the case. As for the lengths that just seems wrong. That > > length value is taken directly from skb->len, so if its not right, it seems like > > its not getting set correctly someplace. > > > > As you may have seen we're removing the ftrace module, and replacing it with the > > use of raw trace events. When I have that working, I'll see if I get simmilar > > results. I never did in my local testing of the ftrace module, but perhaps its > > related to load or something. > > IIUC I should keep the first of your original three ftrace patches, > revert all the rest, and then apply your very latest patch that > augments the skb_copy_datagram_iovec TRACE_EVENT. 
Do I have that
> basically correct?
>
That's exactly correct, yes.

> Then I just need to ask how do I use this new method?
>
It works in basically the same way. Except instead of doing this:

echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer

you do this:

echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable

Then the events should show up in /sys/kernel/debug/tracing/trace[_pipe]

Best
Neil

> -Thanks
>
> -Bill
>
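The switch to the trace event can be wrapped in a small guard, so that a missing event directory (for instance when the TRACE_EVENT patch is not applied or debugfs is not mounted) is reported instead of failing quietly. This is a sketch under assumptions, not something from the thread: the function names and the TRACING_DIR and EVENT variables are invented here; only the debugfs path and the event name come from the message above. Writing 0 to the same enable file turns the event off again.

```shell
#!/bin/sh
# Sketch only: enable_skb_event/disable_skb_event, TRACING_DIR and EVENT are
# illustrative names. The default path and event name are the ones quoted
# in the message above.
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}
EVENT=events/skb/skb_copy_datagram_iovec

enable_skb_event() {
    enable_file="$TRACING_DIR/$EVENT/enable"
    # A missing or read-only file usually means: not root, debugfs not
    # mounted, or a kernel without this trace event.
    if [ ! -w "$enable_file" ]; then
        echo "cannot write $enable_file" >&2
        return 1
    fi
    echo 1 > "$enable_file"
}

disable_skb_event() {
    echo 0 > "$TRACING_DIR/$EVENT/enable"
}
```

A typical session would be `enable_skb_event && cat "$TRACING_DIR/trace_pipe"`, followed by `disable_skb_event` once the test run is finished.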
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-09-02 10:49 ` Neil Horman
@ 2009-09-02 15:38 ` Bill Fink
0 siblings, 0 replies; 89+ messages in thread
From: Bill Fink @ 2009-09-02 15:38 UTC (permalink / raw)
To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Wed, 2 Sep 2009, Neil Horman wrote:

> On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote:
> > On Thu, 27 Aug 2009, Neil Horman wrote:
> >
> > > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > > > On Wed, 26 Aug 2009, Neil Horman wrote:
> > > >
> > > > > Here you go, I think this will fix your oops.
> > > > >
> > > > >
> > > > > Fix NULL pointer deref in skb sources ftracer
> > > > >
> > > > > Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > > > we can pass it to sock_net
> > > > >
> > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > >
> > > > > trace_skb_sources.c | 6 ++++--
> > > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > > > index 40eb071..8bf518f 100644
> > > > > --- a/kernel/trace/trace_skb_sources.c
> > > > > +++ b/kernel/trace/trace_skb_sources.c
> > > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >      struct ring_buffer_event *event;
> > > > >      struct trace_skb_event *entry;
> > > > >      struct trace_array *tr = skb_trace;
> > > > > -    struct net_device *dev;
> > > > > +    struct net_device *dev = NULL;
> > > > >
> > > > >      if (!trace_skb_source_enabled)
> > > > >          return;
> > > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >      entry->event_data.rx_queue = skb->queue_mapping;
> > > > >      entry->event_data.ccpu = smp_processor_id();
> > > > >
> > > > > -    dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +    if (skb->sk)
> > > > > +        dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +
> > > > >      if (dev) {
> > > > >          memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > > > >          dev_put(dev);
> > > >
> > > > On the positive side, it did fix the oops. But the results of the
> > > > skb_sources tracing was not that useful.
> > > >
> > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> > > > 5521 ttyS0 S 0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > > > n2: 11819.0786 MB / 10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > > >
> > > > First off, only 10 trace entries were made:
> > > >
> > > > [root@xeontest1 tracing]# wc trace
> > > > 14 90 334 trace
> > > >
> > > > And here they are:
> > > >
> > > > [root@xeontest1 tracing]# cat trace
> > > > # tracer: skb_sources
> > > > #
> > > > #  PID   ANID  CNID  IFC      RXQ  CCPU  LEN
> > > > #   |     |     |     |        |    |     |
> > > >    5521   0     0    Unknown   0    3     888
> > > >    5521   0     0    Unknown   0    3     896
> > > >    5521   0     0    Unknown   0    3     20
> > > >    5521   0     0    Unknown   0    3     888
> > > >    5521   0     0    Unknown   0    3     896
> > > >    5521   0     0    Unknown   0    3     20
> > > >    5521   1     1    Unknown   0    4     20
> > > >    5521   1     1    Unknown   0    4     11
> > > >    5521   1     1    Unknown   0    4     540
> > > >    5521   1     1    Unknown   0    4     0
> > > >
> > > > Even for these 10 entries, why is the IFC Unknown, and the LENs
> > > > seem to be wrong too.
> > > >
> > > > -Bill
> > > >
> > > I'm not sure why you're getting Unknown Interface names. Nominally that
> > > indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> > > which shouldn't be the case. As for the lengths that just seems wrong. That
> > > length value is taken directly from skb->len, so if it's not right, it seems like
> > > it's not getting set correctly someplace.
> > >
> > > As you may have seen we're removing the ftrace module, and replacing it with the
> > > use of raw trace events. When I have that working, I'll see if I get similar
> > > results. I never did in my local testing of the ftrace module, but perhaps it's
> > > related to load or something.
> >
> > IIUC I should keep the first of your original three ftrace patches,
> > revert all the rest, and then apply your very latest patch that
> > augments the skb_copy_datagram_iovec TRACE_EVENT. Do I have that
> > basically correct?
> >
> That's exactly correct, yes.
>
> > Then I just need to ask how do I use this new method?
> >
> It works in basically the same way. Except instead of doing this:
> echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer
> you do this:
> echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable
>
> Then the events should show up in /sys/kernel/debug/tracing/trace[_pipe]

Thanks! I'll probably give this a try later today and report back.

-Bill
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
2009-08-07 21:18 ` Brice Goglin
2009-08-07 22:12 ` Neil Horman
@ 2009-08-12 23:29 ` David Miller
2009-08-13 2:35 ` Bill Fink
2 siblings, 1 reply; 89+ messages in thread
From: David Miller @ 2009-08-12 23:29 UTC (permalink / raw)
To: billfink; +Cc: netdev, brice, gallatin

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 7 Aug 2009 17:06:00 -0400

> To kludge around this, I made a different patch to the myri10ge driver.
> This time I hardcoded the NUMA node in the call to alloc_pages_node()
> to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
> and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
> This is of course very specific to our specific system (NUMA node ids
> and Myricom 10-GigE device IRQs), and is not something that would be
> generically applicable. But it was useful as a test, and it did
> improve the receive side performance substantially!

This, unfortunately, won't be comprehensive. You'd also need to
kludge the NUMA node used for allocation of the skb->data buffer via
the netdev_alloc_skb() calls in myri10ge_rx_done() and friends.

This could possibly account for why, with your kludge, you still
were only getting 56.4703 Gbps
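The kludge quoted in this message is a fixed IRQ-to-node table: IRQs 113 through 118 (eth2 through eth7) map to NUMA node 2, and IRQs 119 through 124 (eth8 through eth13) map to node 0. Bill's actual driver patch is not shown in this part of the thread, so the following is only a sketch of that mapping as a standalone C function. The name kludge_node_for_irq and the KLUDGE_NODE_UNKNOWN fallback are hypothetical, and the numbers are specific to the xeontest1 system described above.

```c
/*
 * Sketch of the hardcoded mapping described in the message above.
 * The IRQ ranges and node ids are taken from the text and are valid
 * only for that one test system; the names are made up for illustration.
 */
#define KLUDGE_NODE_UNKNOWN (-1)   /* assumption: caller treats -1 as "no preference" */

int kludge_node_for_irq(int irq)
{
    if (irq >= 113 && irq <= 118)   /* eth2 through eth7 */
        return 2;
    if (irq >= 119 && irq <= 124)   /* eth8 through eth13 */
        return 0;
    return KLUDGE_NODE_UNKNOWN;     /* any other device: no hardcoded node */
}
```

In a driver, the returned value would presumably be passed as the node argument of alloc_pages_node() and, following David's point, also to whatever call allocates the skb->data buffer, so that both allocations land on the same node.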
* Re: Receive side performance issue with multi-10-GigE and NUMA
2009-08-12 23:29 ` David Miller
@ 2009-08-13 2:35 ` Bill Fink
0 siblings, 0 replies; 89+ messages in thread
From: Bill Fink @ 2009-08-13 2:35 UTC (permalink / raw)
To: David Miller; +Cc: netdev, brice, gallatin

On Wed, 12 Aug 2009, David Miller wrote:

> From: Bill Fink <billfink@mindspring.com>
> Date: Fri, 7 Aug 2009 17:06:00 -0400
>
> > To kludge around this, I made a different patch to the myri10ge driver.
> > This time I hardcoded the NUMA node in the call to alloc_pages_node()
> > to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
> > and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
> > This is of course very specific to our specific system (NUMA node ids
> > and Myricom 10-GigE device IRQs), and is not something that would be
> > generically applicable. But it was useful as a test, and it did
> > improve the receive side performance substantially!
>
> This, unfortunately, won't be comprehensive. You'd also need to
> kludge the NUMA node used for allocation of the skb->data buffer via
> the netdev_alloc_skb() calls in myri10ge_rx_done() and friends.
>
> This could possibly account for why, with your kludge, you still
> were only getting 56.4703 Gbps

I actually did try this. I changed the netdev_alloc_skb() call in
the myri10ge driver to an __alloc_skb() call and explicitly specified
the correct NUMA node (plus all the necessary extra code that gets
done under the covers by netdev_alloc_skb()). It didn't help.

Not being a kernel developer, one thing I didn't know though was
if the skb was initially allocated on NUMA node A, as the skb got
expanded during its processing, would it always stay on NUMA node A,
or could it possibly be migrated subsequently to a different NUMA
node B.

-Thanks

-Bill
end of thread, other threads:[~2009-09-02 15:38 UTC | newest]
Thread overview: 89+ messages
2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
2009-08-07 21:18 ` Brice Goglin
2009-08-07 21:51 ` Bill Fink
2009-08-07 21:53 ` Brice Goglin
2009-08-07 22:08 ` Bill Fink
2009-08-07 22:17 ` Brice Goglin
2009-08-07 22:55 ` Bill Fink
2009-08-08 1:03 ` Andrew Gallatin
2009-08-08 1:35 ` Bill Fink
2009-08-08 11:08 ` Andrew Gallatin
2009-08-08 11:26 ` Neil Horman
2009-08-08 18:21 ` Andrew Gallatin
2009-08-08 18:32 ` Neil Horman
2009-08-11 7:32 ` Bill Fink
2009-08-11 11:02 ` Neil Horman
2009-08-11 19:15 ` Christoph Lameter
2009-08-11 22:27 ` Andi Kleen
2009-08-12 4:30 ` Bill Fink
2009-08-12 7:21 ` Andi Kleen
[not found] ` <4A856781.2080301@myri.com>
2009-08-14 16:38 ` Bill Fink
2009-08-14 16:55 ` Andrew Gallatin
2009-08-14 21:13 ` Aviv Greenberg
2009-08-20 7:26 ` Bill Fink
2009-08-20 13:14 ` Ben Hutchings
2009-08-21 4:00 ` Bill Fink
2009-08-20 13:17 ` Aviv Greenberg
2009-08-12 0:02 ` Brandeburg, Jesse
2009-08-12 4:38 ` Bill Fink
2009-08-12 16:00 ` Jesse Barnes
2009-08-14 20:31 ` Bill Fink
2009-08-17 16:53 ` Jesse Barnes
2009-08-18 7:07 ` Bill Fink
2009-08-18 11:54 ` Andrew Gallatin
2009-08-19 17:59 ` Bill Fink
2009-08-07 22:12 ` Neil Horman
2009-08-08 0:54 ` Bill Fink
2009-08-08 1:56 ` Neil Horman
2009-08-14 20:44 ` Bill Fink
2009-08-14 23:25 ` Neil Horman
2009-08-20 7:50 ` Bill Fink
2009-08-20 20:19 ` Neil Horman
2009-08-21 4:14 ` Bill Fink
2009-08-21 15:23 ` Neil Horman
2009-08-21 15:36 ` Andrew Gallatin
2009-08-26 7:10 ` Bill Fink
2009-08-26 11:00 ` Neil Horman
2009-08-26 18:08 ` Neil Horman
2009-08-26 18:15 ` Ingo Molnar
2009-08-26 19:04 ` Neil Horman
2009-08-26 19:08 ` Ingo Molnar
2009-08-26 19:36 ` David Miller
2009-08-26 19:48 ` Ingo Molnar
2009-08-26 20:23 ` Neil Horman
2009-08-26 20:40 ` Ingo Molnar
2009-08-26 22:39 ` Neil Horman
2009-08-26 22:44 ` David Miller
2009-08-26 23:05 ` Ingo Molnar
2009-08-26 23:08 ` David Miller
2009-08-26 23:58 ` Ingo Molnar
2009-08-27 0:05 ` Steven Rostedt
2009-08-27 0:35 ` Christoph Hellwig
2009-08-27 9:28 ` Ingo Molnar
2009-08-26 23:05 ` Steven Rostedt
2009-08-26 23:09 ` David Miller
2009-08-26 23:30 ` Ingo Molnar
2009-08-26 23:23 ` Neil Horman
2009-08-26 23:29 ` David Miller
2009-08-26 23:19 ` Neil Horman
2009-08-26 23:14 ` Ingo Molnar
2009-08-26 23:33 ` Steven Rostedt
2009-08-27 0:14 ` Neil Horman
2009-08-27 0:29 ` Steven Rostedt
2009-08-27 1:17 ` Neil Horman
2009-08-27 9:06 ` Ingo Molnar
2009-08-27 9:34 ` Ingo Molnar
2009-08-27 0:34 ` Christoph Hellwig
2009-08-26 23:46 ` Frederic Weisbecker
2009-08-26 20:28 ` Ingo Molnar
2009-08-26 20:01 ` Neil Horman
2009-08-26 22:57 ` Ingo Molnar
2009-08-27 17:32 ` Bill Fink
2009-09-02 5:28 ` Bill Fink
2009-08-27 17:44 ` Bill Fink
2009-08-27 17:51 ` Neil Horman
2009-09-02 5:11 ` Bill Fink
2009-09-02 10:49 ` Neil Horman
2009-09-02 15:38 ` Bill Fink
2009-08-12 23:29 ` David Miller
2009-08-13 2:35 ` Bill Fink