* CPU utilization increased in 2.6.27rc
@ 2008-08-13  0:56 Andrew Gallatin
  2008-08-13  1:05 ` David Miller
  2008-08-13  1:15 ` David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Gallatin @ 2008-08-13  0:56 UTC
  To: netdev


I noticed a performance degradation in the 2.6.27rc series having to
do with TCP transmits.  The problem is most noticeable when using a
fast (10GbE) network and a pitifully slow (2.0GHz Athlon64) host with
a small (1500-byte) MTU, with TSO and sendpage, but I also see it
with 1GbE hardware, without TSO and sendpage.

I used git-bisect to track down where the problem seems
to have been introduced in Linus' tree:

37437bb2e1ae8af470dfcd5b4ff454110894ccaf is first bad commit
commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Jul 16 02:15:04 2008 -0700

     pkt_sched: Schedule qdiscs instead of netdev_queue.


Something about this commit is maxing out the CPU on my very low-end
test machines.  Just prior to it, I see the same good performance as
2.6.26.2 and the rest of the 2.6 series.  Here is output from
netperf -t TCP_SENDFILE -C -c between two of my low-end hosts:
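(For reference, the netperf -C -c columns here are: receive socket
size, send socket size, send message size, elapsed time in seconds,
throughput in 10^6 bits/sec, local and remote CPU utilization in %,
and local and remote service demand in usec/KB.)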

Forcedeth (1GbE):
  87380  65536  65536    10.05       949.03   14.54    20.01    2.510   3.455
Myri10ge (10GbE):
  87380  65536  65536    10.01      9466.27   19.00    73.43    0.329   1.271

Just after the above commit, the CPU utilization increases
dramatically.  Note the large difference in CPU utilization
for both 1GbE (14.5% -> 46.5%) and 10GbE (19% -> 49.8%):

Forcedeth (1GbE):
  87380  65536  65536    10.01       947.04   46.48    20.05    8.042   3.468
Myri10ge (10GbE):
  87380  65536  65536    10.00      7693.19   49.81    60.03    1.061   1.278


For 1GbE, I see a similar increase in CPU utilization when using
normal socket writes (netperf -t TCP_STREAM):


  87380  65536  65536    10.05       948.92   19.89    18.65    3.434   3.220
	vs
  87380  65536  65536    10.07       949.35   49.38    20.77    8.523   3.584

Without TSO enabled, the difference is less evident, but still
there (~30% -> 49%).

For 10GbE, this only seems to happen for sendpage.  Normal socket
write (netperf -t TCP_STREAM) tests do not seem to show this
degradation, perhaps because a CPU is already maxed out copying data...

According to oprofile, the system is spending a lot of
time in __qdisc_run() when sending on the 1GbE forcedeth
interface:

17978    17.5929  vmlinux                  __qdisc_run
9828      9.6175  vmlinux                  net_tx_action
8306      8.1281  vmlinux                  _raw_spin_lock
5762      5.6386  oprofiled                (no symbols)
5443      5.3264  vmlinux                  __netif_schedule
5352      5.2374  vmlinux                  _raw_spin_unlock
4921      4.8156  vmlinux                  __do_softirq
3366      3.2939  vmlinux                  raise_softirq_irqoff
1730      1.6929  vmlinux                  pfifo_fast_requeue
1689      1.6528  vmlinux                  pfifo_fast_dequeue
1406      1.3759  oprofile                 (no symbols)
1346      1.3172  vmlinux                  _raw_spin_trylock
1194      1.1684  vmlinux                  nv_start_xmit_optimized
1114      1.0901  vmlinux                  handle_IRQ_event
1031      1.0089  vmlinux                  tcp_ack
<....>

Does anybody understand what's happening?

Thanks,

Drew


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13  0:56 CPU utilization increased in 2.6.27rc Andrew Gallatin
@ 2008-08-13  1:05 ` David Miller
  2008-08-13 15:06   ` Andrew Gallatin
  2008-08-13  1:15 ` David Miller
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2008-08-13  1:05 UTC
  To: gallatin; +Cc: netdev

From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 12 Aug 2008 20:56:23 -0400

> According to oprofile, the system is spending a lot of
> time in __qdisc_run() when sending on the 1GbE forcedeth
> interface:

What does the profile look like beforehand?


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13  0:56 CPU utilization increased in 2.6.27rc Andrew Gallatin
  2008-08-13  1:05 ` David Miller
@ 2008-08-13  1:15 ` David Miller
  2008-08-13 16:13   ` Andrew Gallatin
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2008-08-13  1:15 UTC
  To: gallatin; +Cc: netdev, robert

From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 12 Aug 2008 20:56:23 -0400

>      pkt_sched: Schedule qdiscs instead of netdev_queue.

While I'm waiting for your "before" profile data, here is a
stab-in-the-dark patch which might fix the problem.

Robert, this could explain some of the things in the
multiqueue testing profile you sent me a week or so
ago.

Let me know how well it works:

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 6affcfa..720cae6 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -89,7 +89,10 @@ extern void __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+	struct netdev_queue *txq = q->dev_queue;
+
+	if (!netif_tx_queue_stopped(txq) &&
+	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
 		__qdisc_run(q);
 }
 

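For context, a minimal sketch of the transmit loop this check
short-circuits (assumed from the 2.6.27-rc qdisc_restart() path; not
verbatim kernel code):

static inline int qdisc_restart_sketch(struct Qdisc *q)
{
	struct netdev_queue *txq = q->dev_queue;
	struct sk_buff *skb = q->dequeue(q);

	if (!skb)
		return 0;

	if (!netif_tx_queue_stopped(txq) &&
	    dev_hard_start_xmit(skb, txq->dev, txq) == NETDEV_TX_OK)
		return 1;	/* sent one skb; caller loops for more */

	/* Driver ring full: put the skb back and reschedule ourselves.
	 * Without the stopped-queue test in qdisc_run(), the TX softirq
	 * can spin through this requeue/reschedule cycle until the
	 * driver wakes the queue -- which matches the time spent in
	 * __qdisc_run() and net_tx_action in the profile above. */
	q->ops->requeue(skb, q);
	__netif_schedule(q);
	return 0;
}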

* Re: CPU utilization increased in 2.6.27rc
  2008-08-13  1:05 ` David Miller
@ 2008-08-13 15:06   ` Andrew Gallatin
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Gallatin @ 2008-08-13 15:06 UTC
  To: David Miller; +Cc: netdev

David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Tue, 12 Aug 2008 20:56:23 -0400
> 
>> According to oprofile, the system is spending a lot of
>> time in __qdisc_run() when sending on the 1GbE forcedeth
>> interface:
> 
> What does the profile look like beforehand?

The qdisc stuff is gone, and nearly everything is in the noise.
Beforehand, we're at ~15% CPU.  Here is the first page or so of
opreport -l output from just prior to the commit:

7566      6.4373  vmlinux                  _raw_spin_lock
5894      5.0147  oprofiled                (no symbols)
4136      3.5190  ehci_hcd                 (no symbols)
3965      3.3735  vmlinux                  handle_IRQ_event
3333      2.8358  vmlinux                  tcp_ack
2952      2.5116  vmlinux                  __copy_skb_header
2869      2.4410  vmlinux                  default_idle
2702      2.2989  vmlinux                  nv_rx_process_optimized
2511      2.1364  vmlinux                  nv_start_xmit_optimized
2310      1.9654  vmlinux                  sk_run_filter
2157      1.8352  vmlinux                  kmem_cache_alloc
2139      1.8199  vmlinux                  IRQ0x69_interrupt
1797      1.5289  vmlinux                  nv_nic_irq_optimized
1796      1.5281  vmlinux                  kmem_cache_free
1784      1.5179  vmlinux                  kfree
1690      1.4379  vmlinux                  _raw_spin_unlock
1594      1.3562  vmlinux                  tcp_sendpage
1578      1.3426  vmlinux                  __tcp_push_pending_frames
1576      1.3409  vmlinux                  packet_rcv_spkt
1560      1.3273  vmlinux                  __inet_lookup_established
1558      1.3256  vmlinux                  nv_tx_done_optimized

[On this system, forcedeth shares an IRQ with ehci_hcd, which is why
ehci_hcd ranks so high in the profile.]

Drew


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13  1:15 ` David Miller
@ 2008-08-13 16:13   ` Andrew Gallatin
  2008-08-13 19:52     ` Robert Olsson
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Andrew Gallatin @ 2008-08-13 16:13 UTC
  To: David Miller; +Cc: netdev, robert

David Miller wrote:
> From: Andrew Gallatin <gallatin@myri.com>
> Date: Tue, 12 Aug 2008 20:56:23 -0400
> 
>>      pkt_sched: Schedule qdiscs instead of netdev_queue.
> 
> While I'm waiting for your beforehand profile data,
> here is a stab in the dark patch which might fix
> the problem.
> 
> Robert, this could explain some of the things in the
> multiqueue testing profile you sent me a week or so
> ago.
> 
> Let me know how well it works:

Excellent!  This completely fixes the increased CPU
utilization I observed on both 10GbE and 1GbE interfaces,
and CPU utilization is now reduced back to 2.6.26 levels.

Oprofile now is nearly identical to what it was prior to
37437bb2e1ae8af470dfcd5b4ff454110894ccaf:

8363      6.5081  vmlinux                  _raw_spin_lock
5612      4.3672  oprofiled                (no symbols)
4420      3.4396  ehci_hcd                 (no symbols)
4325      3.3657  vmlinux                  handle_IRQ_event
3688      2.8700  vmlinux                  default_idle
3164      2.4622  vmlinux                  nv_start_xmit_optimized
3092      2.4062  vmlinux                  sk_run_filter
3072      2.3906  vmlinux                  tcp_ack
2969      2.3105  vmlinux                  __copy_skb_header
2453      1.9089  vmlinux                  kmem_cache_free
2400      1.8677  vmlinux                  IRQ0x69_interrupt
2295      1.7860  vmlinux                  nv_rx_process_optimized
2092      1.6280  vmlinux                  kmem_cache_alloc
2072      1.6124  vmlinux                  kfree
2049      1.5945  vmlinux                  packet_rcv_spkt
1984      1.5439  vmlinux                  __tcp_push_pending_frames
1942      1.5113  vmlinux                  nv_nic_irq_optimized
1933      1.5043  vmlinux                  _raw_spin_unlock
1637      1.2739  vmlinux                  nv_tx_done_optimized
1630      1.2685  vmlinux                  eth_type_trans
1517      1.1805  vmlinux                  __qdisc_run



Thank you,

Drew


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 16:13   ` Andrew Gallatin
@ 2008-08-13 19:52     ` Robert Olsson
  2008-08-13 21:34       ` Stephen Hemminger
  2008-08-13 20:03     ` Andi Kleen
  2008-08-13 20:27     ` David Miller
  2 siblings, 1 reply; 14+ messages in thread
From: Robert Olsson @ 2008-08-13 19:52 UTC
  To: Andrew Gallatin; +Cc: David Miller, netdev, Robert.Olsson


Andrew Gallatin writes:
 > 
 > Excellent!  This completely fixes the increased CPU
 > utilization I observed on both 10GbE and 1GbE interfaces,
 > and CPU utilization is now reduced back to 2.6.26 levels.


 > > Robert, this could explain some of the things in the
 > > multiqueue testing profile you sent me a week or so
 > > ago.

I've just rerun the virtual 10G router experiment with the current
git tree, including the pkt_sched patch.  The full experiment is below.
In this case the profile looks the same as before: no improvement due
to this patch here.

In this case we don't have any old numbers to compare with, as we're
testing new functionality.  I'm not too unhappy about the performance,
and some functions have to be at the top of the profile...

Virtual IP forwarding experiment: we're splitting an incoming flow
load (10G) among 4 CPUs and keeping the flows per-CPU all the way
through TX, including skb cleanup.


Network flow load into eth0 (10G 82598).  Total 295+293+293+220 kpps,
4 * (4096 concurrent flows at 30 pkts each).

Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 3996889      0   1280      0      19      0      0      0 BMRU
eth1   1500   0       1      0      0      0 3998236      0      0      0 BMRU

I've configured RSS with ixgbe so all 4 CPUs are used, and hacked the
driver so each skb gets tagged with its incoming CPU.  The 2nd column
in softnet_stat is reused to verify that tagging and affinity stay
correct all the way to hard_xmit, and even for TX-skb cleaning, to
avoid cache misses and get true per-CPU forwarding.  The ixgbe driver
1.3.31.5 from Intel's site is needed for RSS etc. and was modified a
bit for this test.
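In a nutshell, a sketch of the tagging idea (the actual patches are
at the end of this mail):

	/* RX path (driver): remember which CPU received the skb. */
	skb->queue_mapping = smp_processor_id();

	/* TX path (dev_pick_tx): pick the TX queue matching that CPU,
	 * so a forwarded flow stays on one CPU from RX through TX
	 * completion. */
	txq = netdev_get_tx_queue(dev, skb->queue_mapping);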

softnet_stat
000f3236 001e63f8 00000872 00000000 00000000 00000000 00000000 00000000 00000000
000f52df 001ea58c 000008b8 00000000 00000000 00000000 00000000 00000000 00000000
000f3d90 001e7af8 00000a3b 00000000 00000000 00000000 00000000 00000000 00000000
000f4174 001e82c2 00000a17 00000000 00000000 00000000 00000000 00000000 00000000

eth0 (incoming)
214:          4          0          0       6623   PCI-MSI-edge      eth0:v3-Rx
215:          0          5       6635          0   PCI-MSI-edge      eth0:v2-Rx
216:          0       7152          5          0   PCI-MSI-edge      eth0:v1-Rx
217:       7115          0          0          5   PCI-MSI-edge      eth0:v0-Rx

eth1 (outgoing)
201:          3          0          0       3738   PCI-MSI-edge      eth1:v7-Tx
202:          0          4       3743          0   PCI-MSI-edge      eth1:v6-Tx
203:          0       3743          4          0   PCI-MSI-edge      eth1:v5-Tx
204:       3746          0          0          6   PCI-MSI-edge      eth1:v4-Tx

CPU: AMD64 processors, speed 3000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 3000
samples  %        image name               app name                 symbol name
407896    8.7211  vmlinux                  vmlinux                  cache_alloc_refill
339524    7.2592  vmlinux                  vmlinux                  __qdisc_run
243352    5.2030  vmlinux                  vmlinux                  dev_queue_xmit
227855    4.8717  vmlinux                  vmlinux                  kfree
214975    4.5963  vmlinux                  vmlinux                  __alloc_skb
172008    3.6776  vmlinux                  vmlinux                  cache_flusharray
168307    3.5985  vmlinux                  vmlinux                  ip_route_input
160995    3.4422  vmlinux                  vmlinux                  dev_kfree_skb_irq
146116    3.1240  vmlinux                  vmlinux                  netif_receive_skb
137763    2.9455  vmlinux                  vmlinux                  free_block
133732    2.8593  vmlinux                  vmlinux                  eth_type_trans
124262    2.6568  vmlinux                  vmlinux                  ip_rcv
110170    2.3555  vmlinux                  vmlinux                  list_del
100508    2.1489  vmlinux                  vmlinux                  ip_finish_output
96777     2.0691  vmlinux                  vmlinux                  ip_forward
89212     1.9074  vmlinux                  vmlinux                  check_addr


diff --git a/net/core/dev.c b/net/core/dev.c
index 8d13a9b..6fdf427 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1714,6 +1714,9 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 {
 	u16 queue_index = 0;
 
+	if (dev->real_num_tx_queues > 1) 
+		return netdev_get_tx_queue(dev, skb->queue_mapping);
+
 	if (dev->select_queue)
 		queue_index = dev->select_queue(dev, skb);
 	else if (dev->real_num_tx_queues > 1)
@@ -4872,3 +4875,4 @@ EXPORT_SYMBOL(dev_load);
 #endif
 
 EXPORT_PER_CPU_SYMBOL(softnet_data);
+EXPORT_PER_CPU_SYMBOL(netdev_rx_stat);

--- ixgbe.h.orig	2008-07-30 13:11:46.000000000 +0200
+++ ixgbe.h	2008-07-30 17:42:59.000000000 +0200
@@ -28,6 +28,8 @@
 #ifndef _IXGBE_H_
 #define _IXGBE_H_
 
+#define CONFIG_NETDEVICES_MULTIQUEUE
+
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/vmalloc.h>
@@ -106,6 +108,10 @@
 #define IXGBE_TX_FLAGS_VLAN_PRIO_MASK	0x0000e000
 #define IXGBE_TX_FLAGS_VLAN_SHIFT	16
 
+#define IXGBE_NO_LRO
+#define IXGBE_NAPI
+#define CONFIG_IXGBE_NAPI
+
 #ifndef IXGBE_NO_LRO
 #define IXGBE_LRO_MAX 32	/*Maximum number of LRO descriptors*/
 #define IXGBE_LRO_GLOBAL 10
--- ixgbe_main.c.orig	2008-07-30 13:12:02.000000000 +0200
+++ ixgbe_main.c	2008-07-30 19:26:07.000000000 +0200
@@ -71,7 +71,7 @@
 #endif
 
 
-#define BASE_VERSION "1.3.31.5"
+#define BASE_VERSION "1.3.31.5-080730"
 #define DRV_VERSION BASE_VERSION LRO DRIVERNAPI DRV_HW_PERF
 
 char ixgbe_driver_version[] = DRV_VERSION;
@@ -257,6 +257,9 @@
 				total_packets++;
 				total_bytes += skb->len;
 #endif
+				if(skb->queue_mapping == smp_processor_id()) 
+					__get_cpu_var(netdev_rx_stat).dropped++;
+
 			}
 
 			ixgbe_unmap_and_free_tx_resource(adapter,
@@ -426,6 +429,9 @@
 			      struct sk_buff *skb, bool is_vlan, u16 tag)
 {
 	int ret;
+
+	skb->queue_mapping = smp_processor_id();
+	
 #ifdef CONFIG_IXGBE_NAPI
 	if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL)) {
 #ifdef NETIF_F_HW_VLAN_TX
@@ -2875,7 +2881,11 @@
 			rss_i = min(4, rss_i);
 			rss_m = 0x3;
 			nrq = dcb_i * vmdq_i * rss_i;
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+			ntq = nrq;
+#else
 			ntq = dcb_i * vmdq_i;
+#endif
 			break;
 		case (IXGBE_FLAG_VMDQ_ENABLED | IXGBE_FLAG_DCB_ENABLED):
 			dcb_m = 0x7 << 3;
@@ -3242,7 +3252,7 @@
 out:
 #ifdef CONFIG_NETDEVICES_MULTIQUEUE
 	/* Notify the stack of the (possibly) reduced Tx Queue count. */
-	adapter->netdev->egress_subqueue_count = adapter->num_tx_queues;
+	// adapter->netdev->egress_subqueue_count = adapter->num_tx_queues;
 #endif
 
 	return err;
@@ -3794,6 +3804,8 @@
 }
 #endif /* CONFIG_PM */
 
+extern DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
+
 static int ixgbe_suspend(struct pci_dev *pdev, pm_message_t state)
 {
 	struct net_device *netdev = pci_get_drvdata(pdev);
@@ -4402,6 +4414,9 @@
 
 #ifdef CONFIG_NETDEVICES_MULTIQUEUE
 	r_idx = (adapter->num_tx_queues - 1) & skb->queue_mapping;
+
+	if(skb->queue_mapping == smp_processor_id()) 
+		__get_cpu_var(netdev_rx_stat).dropped++;
 #endif
 	tx_ring = &adapter->tx_ring[r_idx];
 
 
Cheers.
					--ro


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 16:13   ` Andrew Gallatin
  2008-08-13 19:52     ` Robert Olsson
@ 2008-08-13 20:03     ` Andi Kleen
  2008-08-13 20:36       ` Andrew Gallatin
  2008-08-13 20:27     ` David Miller
  2 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2008-08-13 20:03 UTC
  To: Andrew Gallatin; +Cc: David Miller, netdev, robert

Andrew Gallatin <gallatin@myri.com> writes:
>
> 8363      6.5081  vmlinux                  _raw_spin_lock
> 5612      4.3672  oprofiled                (no symbols)
> 4420      3.4396  ehci_hcd                 (no symbols)
> 4325      3.3657  vmlinux                  handle_IRQ_event
> 3688      2.8700  vmlinux                  default_idle
> 3164      2.4622  vmlinux                  nv_start_xmit_optimized
> 3092      2.4062  vmlinux                  sk_run_filter
                                             ^^^^^^^^^^^^^

Looks like you have one of those nasty dhcpcds running that always
opens a raw socket and intercepts everything using a filter.  I always
hoped those would eventually disappear and just bind to the proper
protocol, but they seem to refuse to die.
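For reference, a hypothetical illustration (not dhcpcd source) of what
such a daemon does: open an old-style SOCK_PACKET socket and attach a
filter, after which every received packet traverses packet_rcv_spkt()
and sk_run_filter() -- which is why both show up in the profile.

#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int open_filtered_packet_socket(void)
{
	/* Trivial accept-all classic-BPF program, for illustration. */
	static struct sock_filter insns[] = {
		{ 0x06, 0, 0, 0x00000400 },	/* BPF_RET|BPF_K: accept 1024 bytes */
	};
	struct sock_fprog prog = { .len = 1, .filter = insns };
	int fd = socket(AF_PACKET, SOCK_PACKET, htons(ETH_P_ALL));

	if (fd >= 0)
		setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
			   &prog, sizeof(prog));
	return fd;
}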

-Andi


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 16:13   ` Andrew Gallatin
  2008-08-13 19:52     ` Robert Olsson
  2008-08-13 20:03     ` Andi Kleen
@ 2008-08-13 20:27     ` David Miller
  2008-08-13 20:58       ` Andrew Gallatin
  2 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2008-08-13 20:27 UTC
  To: gallatin; +Cc: netdev, robert

From: Andrew Gallatin <gallatin@myri.com>
Date: Wed, 13 Aug 2008 12:13:40 -0400

> David Miller wrote:
> > From: Andrew Gallatin <gallatin@myri.com>
> > Date: Tue, 12 Aug 2008 20:56:23 -0400
> > 
> >>      pkt_sched: Schedule qdiscs instead of netdev_queue.
> > 
> > While I'm waiting for your beforehand profile data,
> > here is a stab in the dark patch which might fix
> > the problem.
> > 
> > Robert, this could explain some of the things in the
> > multiqueue testing profile you sent me a week or so
> > ago.
> > 
> > Let me know how well it works:
> 
> Excellent!  This completely fixes the increased CPU
> utilization I observed on both 10GbE and 1GbE interfaces,
> and CPU utilization is now reduced back to 2.6.26 levels.
> 
> Oprofile now is nearly identical to what it was prior to
> 37437bb2e1ae8af470dfcd5b4ff454110894ccaf:

Thanks for testing and providing those profiles.

I'll get this fix to Linus soon.


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 20:03     ` Andi Kleen
@ 2008-08-13 20:36       ` Andrew Gallatin
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Gallatin @ 2008-08-13 20:36 UTC
  To: Andi Kleen; +Cc: David Miller, netdev, robert

Andi Kleen wrote:
> Andrew Gallatin <gallatin@myri.com> writes:
>> 8363      6.5081  vmlinux                  _raw_spin_lock
>> 5612      4.3672  oprofiled                (no symbols)
>> 4420      3.4396  ehci_hcd                 (no symbols)
>> 4325      3.3657  vmlinux                  handle_IRQ_event
>> 3688      2.8700  vmlinux                  default_idle
>> 3164      2.4622  vmlinux                  nv_start_xmit_optimized
>> 3092      2.4062  vmlinux                  sk_run_filter
>                                              ^^^^^^^^^^^^^
> 
> Looks like you have one of those nasty dhcpcds running that
> always open a raw socket and intercept everything using a filter?
> I always hoped those would disappear eventually and just
> bind to the proper protocol, but they seem to refuse dying.

Yeah, the box is running an ancient CentOS 4, so the dhcpcd is
pretty old.

Drew


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 20:27     ` David Miller
@ 2008-08-13 20:58       ` Andrew Gallatin
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Gallatin @ 2008-08-13 20:58 UTC
  To: David Miller; +Cc: netdev, robert

David Miller wrote:

> 
> Thanks for testing and providing those profiles.
> 
> I'll get this fix to Linus soon.

No problem.  Thanks for the excellent multi-queue tx work.

FWIW, I tripped over this when testing a myri10ge patch
for multi-queue tx...

Drew


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 19:52     ` Robert Olsson
@ 2008-08-13 21:34       ` Stephen Hemminger
  2008-08-13 21:56         ` Robert Olsson
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2008-08-13 21:34 UTC
  To: Robert Olsson; +Cc: Andrew Gallatin, David Miller, netdev, Robert.Olsson

On Wed, 13 Aug 2008 21:52:08 +0200
Robert Olsson <robert@robur.slu.se> wrote:

> [...]
> CPU: AMD64 processors, speed 3000 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 3000
> samples  %        image name               app name                 symbol name
> 407896    8.7211  vmlinux                  vmlinux                  cache_alloc_refill
> 339524    7.2592  vmlinux                  vmlinux                  __qdisc_run
> 243352    5.2030  vmlinux                  vmlinux                  dev_queue_xmit
> 227855    4.8717  vmlinux                  vmlinux                  kfree
> 214975    4.5963  vmlinux                  vmlinux                  __alloc_skb
> 172008    3.6776  vmlinux                  vmlinux                  cache_flusharray

I see you are still using the SLAB allocator.  Does SLUB change the numbers?


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 21:34       ` Stephen Hemminger
@ 2008-08-13 21:56         ` Robert Olsson
  2008-08-13 22:06           ` Stephen Hemminger
  0 siblings, 1 reply; 14+ messages in thread
From: Robert Olsson @ 2008-08-13 21:56 UTC
  To: Stephen Hemminger; +Cc: Robert Olsson, Andrew Gallatin, David Miller, netdev


Stephen Hemminger writes:
 > 
 > I see you are still using the SLAB allocator.  Does SLUB change the numbers?

 Correct. I did try SLUB a couple of months ago but got lower
 performance; there have been some SLUB patches since, though.
 Have you experimented?

 Cheers
					--ro



* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 21:56         ` Robert Olsson
@ 2008-08-13 22:06           ` Stephen Hemminger
  2008-08-13 22:21             ` Robert Olsson
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2008-08-13 22:06 UTC
  To: Robert Olsson; +Cc: Robert Olsson, Andrew Gallatin, David Miller, netdev

On Wed, 13 Aug 2008 23:56:59 +0200
Robert Olsson <robert@robur.slu.se> wrote:

> 
> Stephen Hemminger writes:
>  > 
>  > I see you are still using the SLAB allocator.  Does SLUB change the numbers?
> 
>  Correct. I did try SLUB a couple of months ago but got lower
>  performance; there have been some SLUB patches since, though.
>  Have you experimented?
> 
Not yet, but there has been a movement to kill SLAB.  If SLAB is
still faster, then Christoph probably wants to know (and fix the
problem).  The issue is that one-way flows might still be moving
memory between CPUs.


* Re: CPU utilization increased in 2.6.27rc
  2008-08-13 22:06           ` Stephen Hemminger
@ 2008-08-13 22:21             ` Robert Olsson
  0 siblings, 0 replies; 14+ messages in thread
From: Robert Olsson @ 2008-08-13 22:21 UTC
  To: Stephen Hemminger; +Cc: Robert Olsson, Andrew Gallatin, David Miller, netdev



 > Not yet, but there has been a movement to kill SLAB.  If SLAB is
 > still faster, then Christoph probably wants to know (and fix the
 > problem).  The issue is that one-way flows might still be moving
 > memory between CPUs.

 How does this happen?  (Assuming affinity is set up correctly, of course.)

 Cheers
					--ro

