netdev.vger.kernel.org archive mirror
* Perf data with recent tg3 patches
@ 2005-05-13  2:49 Arthur Kepner
       [not found] ` <20050512.211935.67881321.davem@davemloft.net>
  0 siblings, 1 reply; 9+ messages in thread
From: Arthur Kepner @ 2005-05-13  2:49 UTC (permalink / raw)
  To: davem, mchan; +Cc: netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2940 bytes --]



Several tg3 patches have been posted recently, and I've 
collected a bit of performance data with them. Mostly 
I'm concerned with reducing per-interrupt overhead 
(due to PIOs) and CPU utilization, so the following data 
shows how these patches change the number of received 
packets/interrupt and CPU utilization.

(Let me know if I've missed any relevant patches.) 

The data is labelled as follows:
-------------------------------
base                : "vanilla" 2.6.12-rc2 kernel
base+mchan1-3       : 2.6.12-rc2 kernel + 3 patches 
                      from mchan [1]
base+mchan1-3+tagged: 2.6.12-rc2 + 3 patches
                      from mchan [1] + tagged status 
                      patch from davem [2]
base+mchan1-3+coal  : 2.6.12-rc2 + 3 patches
                      from mchan [1] + tg3 interrupt 
                      coalescence patch [3] 

[1] http://marc.theaimsgroup.com/?l=linux-netdev&m=111446723510962&w=2
(This is one of a series of 3 patches - the others can't be 
found in the archive. But they're all in 2.6.12-rc4.)
[2] http://marc.theaimsgroup.com/?l=linux-netdev&m=111567944730302&w=2
[3] http://marc.theaimsgroup.com/?l=linux-netdev&m=111586526522981&w=2

The system I used was an Altix with 1300MHz CPUs and 
a Broadcom 5704 NIC. The workload was bulk data receive 
via TCP, with a 1500 byte MTU.

The following tables summarize the data in the attached 
graphs. I had to (grossly) interpolate in some cases, 
see the graphs for the real data. 

                            CPU Util[%]
               -------------------------------------
               base      base+      base+      base+
                       mchan1-3  mchan1-3+  mchan1-3+
Link Util[%]                      tagged       coal
=====================================================
    40         36        34        41          27
    60         48        45        50          35
    80         58        56        58          40
    90         59        58        63          42
    95         -         57        -           42


                            Packets/Intr
               -------------------------------------
               base      base+      base+      base+
                       mchan1-3  mchan1-3+  mchan1-3+
Link Util[%]                      tagged       coal
=====================================================
    40         2.2       2.3        1.4       3.4
    60         2.7       2.9        1.8       4.1
    80         3.0       3.2        2.3       5.2
    90         3.1       3.4        2.4       6.2
    95          -        3.5         -        6.6


"mchan1-3" gets us up to ~0.5 more packets/interrupt, 
and adding the "coal" patch ~3.5 more. The "tagged" 
patch made things a bit worse, though I haven't pinned 
down exactly why that is. Processing more packets per 
interrupt results in lower CPU utilization, largely 
because we spend less time waiting for PIOs to flush.


--
Arthur

[-- Attachment #2: CPU utilization --]
[-- Type: IMAGE/PNG, Size: 3803 bytes --]

[-- Attachment #3: Received Packets/Interrupt --]
[-- Type: IMAGE/PNG, Size: 3622 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Perf data with recent tg3 patches
       [not found] ` <20050512.211935.67881321.davem@davemloft.net>
@ 2005-05-13 23:57   ` Arthur Kepner
  2005-05-14  0:50     ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Arthur Kepner @ 2005-05-13 23:57 UTC (permalink / raw)
  To: David S.Miller; +Cc: mchan, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1932 bytes --]


On Thu, 12 May 2005, David S. Miller wrote:

> ......
> The tg3_interrupt coalesce patch from me is supposed
> to be applied on top of my tagged status patch.
> 

OK, but the "tagged status" patch and the "hw 
coalescing infrastructure" patch aren't really 
dependent, right? 

In any case, I've repeated the measurements with 
both patches. Same experiment as yesterday. 

The data in the attached graphs is labelled as follows:
------------------------------------------------------
base              : a "vanilla" 2.6.12-rc3 kernel
base+[1]+...+[i]  : "vanilla" 2.6.12-rc3 kernel + patches 
                    as labelled below

[1] http://marc.theaimsgroup.com/?l=linux-netdev&m=111446723510962&w=2
    (This is one of a series of 3 patches - the others can't be 
     found in the archive. But they're all in 2.6.12-rc4.)
[2] http://marc.theaimsgroup.com/?l=linux-netdev&m=111567944730302&w=2
    ("tagged status" patch)
[3] http://marc.theaimsgroup.com/?l=linux-netdev&m=111586526522981&w=2
    ("hw coalescing infrastructure" patch)

The data I sent yesterday was also with 2.6.12-rc3 (I 
mistakenly said it was rc2). Also, I should have said 
that I only used the default coalescence parameters 
with [3] - didn't tune them with ethtool. Same is true 
of today's data.

The graphs show that the tagged status patch ([2]) is 
associated with fewer received packets/interrupt and 
higher CPU utilization. I found that the reason is that, 
under high receive load, most of the time (~80%) the 
tag in the status block changes between the time that 
it's read (and saved as last_tag) in tg3_poll(), and when 
it's written back to MAILBOX_INTERRUPT_0 in 
tg3_restart_ints(). If I understand the way the status 
tag works, that means that the card will immediately 
generate another interrupt. That's consistent with 
what I'm seeing - a much higher interrupt rate when the 
tagged status patch is used.

--
Arthur






[-- Attachment #2: CPU utilization --]
[-- Type: IMAGE/png, Size: 2991 bytes --]

[-- Attachment #3: Received Packets/Interrupt --]
[-- Type: IMAGE/png, Size: 2617 bytes --]


* Re: Perf data with recent tg3 patches
  2005-05-14  0:50     ` David S. Miller
@ 2005-05-14  0:39       ` Michael Chan
  2005-05-14  5:20         ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2005-05-14  0:39 UTC (permalink / raw)
  To: David S.Miller; +Cc: akepner, netdev

On Fri, 2005-05-13 at 17:50 -0700, David S. Miller wrote:

> Perhaps we can make the logic in tg3_poll() smarter about
> this.  Something like:
> 
> 	tg3_process_phy_events();
> 	tg3_tx();
> 	tg3_rx();
> 
> 	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS)
> 		tp->last_tag = sblk->status_tag;
> 	rmb();
> 	done = !tg3_has_work(tp);
> 	if (done) {
> 		spin_lock_irqsave(&tp->lock, flags);
> 		__netif_rx_complete(netdev);
> 		tg3_restart_ints(tp);
> 		spin_unlock_irqrestore(&tp->lock, flags);
> 	}
> 	return (done ? 0 : 1);
> 
> Basically, move the last_tag sample to after we do the
> work, then recheck the RX/TX producer/consumer indexes.
> 

I like this. I think it will work well.


* Re: Perf data with recent tg3 patches
  2005-05-13 23:57   ` Arthur Kepner
@ 2005-05-14  0:50     ` David S. Miller
  2005-05-14  0:39       ` Michael Chan
  0 siblings, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-14  0:50 UTC (permalink / raw)
  To: akepner; +Cc: mchan, netdev

From: Arthur Kepner <akepner@sgi.com>
Subject: Re: Perf data with recent tg3 patches
Date: Fri, 13 May 2005 16:57:51 -0700 (PDT)

> I found that the reason is that, 
> under high receive load, most of the time (~80%) the 
> tag in the status block changes between the time that 
> it's read (and saved as last_tag) in tg3_poll(), and when 
> it's written back to MAILBOX_INTERRUPT_0 in 
> tg3_restart_ints(). If I understand the way the status 
> tag works, that means that the card will immediately 
> generate another interrupt. That's consistent with 
> what I'm seeing - a much higher interrupt rate when the 
> tagged status patch is used.

Thanks for tracking this down.

Perhaps we can make the logic in tg3_poll() smarter about
this.  Something like:

	tg3_process_phy_events();
	tg3_tx();
	tg3_rx();

	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS)
		tp->last_tag = sblk->status_tag;
	rmb();
	done = !tg3_has_work(tp);
	if (done) {
		spin_lock_irqsave(&tp->lock, flags);
		__netif_rx_complete(netdev);
		tg3_restart_ints(tp);
		spin_unlock_irqrestore(&tp->lock, flags);
	}
	return (done ? 0 : 1);

Basically, move the last_tag sample to after we do the
work, then recheck the RX/TX producer/consumer indexes.


* Re: Perf data with recent tg3 patches
  2005-05-14  0:39       ` Michael Chan
@ 2005-05-14  5:20         ` David S. Miller
  2005-05-20 21:52           ` Arthur Kepner
  0 siblings, 1 reply; 9+ messages in thread
From: David S. Miller @ 2005-05-14  5:20 UTC (permalink / raw)
  To: mchan; +Cc: akepner, netdev

From: "Michael Chan" <mchan@broadcom.com>
Subject: Re: Perf data with recent tg3 patches
Date: Fri, 13 May 2005 17:39:19 -0700

> I like this. I think it will work well.

Here is a quick patch which implements this.

--- 1/drivers/net/tg3.c.~1~	2005-05-13 22:13:02.000000000 -0700
+++ 2/drivers/net/tg3.c	2005-05-13 22:18:03.000000000 -0700
@@ -2869,9 +2869,6 @@ static int tg3_poll(struct net_device *n
 
 	spin_lock_irqsave(&tp->lock, flags);
 
-	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS)
-		tp->last_tag = sblk->status_tag;
-
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
 	      (TG3_FLAG_USE_LINKCHG_REG |
@@ -2896,7 +2893,6 @@ static int tg3_poll(struct net_device *n
 	 * All RX "locking" is done by ensuring outside
 	 * code synchronizes with dev->poll()
 	 */
-	done = 1;
 	if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr) {
 		int orig_budget = *budget;
 		int work_done;
@@ -2908,12 +2904,14 @@ static int tg3_poll(struct net_device *n
 
 		*budget -= work_done;
 		netdev->quota -= work_done;
-
-		if (work_done >= orig_budget)
-			done = 0;
 	}
 
+	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS)
+		tp->last_tag = sblk->status_tag;
+	rmb();
+
 	/* if no more work, tell net stack and NIC we're done */
+	done = !tg3_has_work(tp);
 	if (done) {
 		spin_lock_irqsave(&tp->lock, flags);
 		__netif_rx_complete(netdev);


* Re: Perf data with recent tg3 patches
  2005-05-14  5:20         ` David S. Miller
@ 2005-05-20 21:52           ` Arthur Kepner
  2005-05-20 22:33             ` David S. Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Arthur Kepner @ 2005-05-20 21:52 UTC (permalink / raw)
  To: David S.Miller; +Cc: mchan, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4383 bytes --]


Here are a couple more data points with the recent 
interrupt coalescing patches for the tg3 driver. 
Please see the attached graphs, which show how CPU 
utilization, and number of received packets per 
interrupt vary with link utilization. The workload 
here is receive-as-fast-as-you-can-over-TCP, with 
a single sending and a single receiving process.

The data in the graphs is labelled as follows:
----------------------------------------------
2.6.12-rc3: unmodified 2.6.12-rc3 

dflt coal : 2.6.12-rc3 + [1] + [2] + [3] + [4]
            using the default intr coalescence 
            values (rx-frames = rx-usecs-irq = 5)

4x coal   : 2.6.12-rc3 + [1] + [2] + [3] + [4] + [5]
            using 4 times the default values 
            for rx-frames and rx-frames-irq

[1] http://marc.theaimsgroup.com/?l=linux-netdev&m=111446723510962&w=2
    (This is one of a series of 3 patches - the others can't be
     found in the archive. But they're all in 2.6.12-rc4.)
[2] http://marc.theaimsgroup.com/?l=linux-netdev&m=111567944730302&w=2
    ("tagged status" patch)
[3] http://marc.theaimsgroup.com/?l=linux-netdev&m=111586526522981&w=2
    ("hw coalescing infrastructure" patch)
[4] http://marc.theaimsgroup.com/?l=linux-netdev&m=111604846510646&w=2
    ("tagged status update")
[5] the patches below which allow "ethtool -[cC]" to work


Patch [4] almost entirely eliminates updates to the tag in 
the status block between when it's been saved in tg3_poll() 
and when it's written back to the NIC in tg3_restart_ints(). 
It still happens, but only a few times in a 
thousand, so it doesn't significantly affect the 
interrupt rate. 

I had to make a couple of changes to allow setting/
retrieving the coalescence parameters with ethtool. Those 
patches are at the end. 
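With those changes applied, the coalescence parameters can be adjusted from userspace; for example (the interface name here is just illustrative):

```shell
# Query the current coalescing settings
ethtool -c eth0

# Raise rx-frames and rx-frames-irq from the default of 5 to 20,
# as in the "4x coal" runs above
ethtool -C eth0 rx-frames 20 rx-frames-irq 20
```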

When the default coalescence parameters are used (rx-frames, 
rx-frames-irq both set to 5) the maximum number of received 
packets per interrupt is ~4.2. Setting rx-frames and 
rx-frames-irq to 20 caused the maximum number of received
packets per interrupt to rise to ~19.6. Maximum CPU 
utilization went down from ~52% to ~35%. Very nice.


Fix typo in ethtool_set_coalesce()

Signed-off-by: Arthur Kepner <akepner@sgi.com>

--- linux.save/net/core/ethtool.c	2005-05-20 12:40:04.426385446 -0700
+++ linux/net/core/ethtool.c	2005-05-20 12:49:34.087515306 -0700
@@ -347,7 +347,7 @@ static int ethtool_set_coalesce(struct n
 {
 	struct ethtool_coalesce coalesce;
 
-	if (!dev->ethtool_ops->get_coalesce)
+	if (!dev->ethtool_ops->set_coalesce)
 		return -EOPNOTSUPP;
 
 	if (copy_from_user(&coalesce, useraddr, sizeof(coalesce)))


Changes to allow setting/getting coalescence parameters 
with tg3.

Signed-off-by: Arthur Kepner <akepner@sgi.com>

--- linux.save/drivers/net/tg3.c	2005-05-20 13:02:41.610865448 -0700
+++ linux/drivers/net/tg3.c	2005-05-20 13:11:36.467011288 -0700
@@ -5094,8 +5094,11 @@ static void tg3_set_bdinfo(struct tg3 *t
 }
 
 static void __tg3_set_rx_mode(struct net_device *);
-static void tg3_set_coalesce(struct tg3 *tp, struct ethtool_coalesce *ec)
+static int tg3_set_coalesce(struct net_device *dev, 
+				struct ethtool_coalesce *ec)
 {
+	struct tg3 *tp = netdev_priv(dev);
+
 	tw32(HOSTCC_RXCOL_TICKS, ec->rx_coalesce_usecs);
 	tw32(HOSTCC_TXCOL_TICKS, ec->tx_coalesce_usecs);
 	tw32(HOSTCC_RXMAX_FRAMES, ec->rx_max_coalesced_frames);
@@ -5114,6 +5117,9 @@ static void tg3_set_coalesce(struct tg3 
 
 		tw32(HOSTCC_STAT_COAL_TICKS, val);
 	}
+
+	memcpy(&tp->coal, ec, sizeof(tp->coal));
+	return 0;
 }
 
 /* tp->lock is held. */
@@ -5437,7 +5443,7 @@ static int tg3_reset_hw(struct tg3 *tp)
 		udelay(10);
 	}
 
-	tg3_set_coalesce(tp, &tp->coal);
+	tg3_set_coalesce(tp->dev, &tp->coal);
 
 	/* set status block DMA address */
 	tw32(HOSTCC_STATUS_BLK_HOST_ADDR + TG3_64BIT_REG_HIGH,
@@ -7302,6 +7308,8 @@ static int tg3_get_coalesce(struct net_d
 	return 0;
 }
 
+static int tg3_set_coalesce(struct net_device *, struct ethtool_coalesce *);
+
 static struct ethtool_ops tg3_ethtool_ops = {
 	.get_settings		= tg3_get_settings,
 	.set_settings		= tg3_set_settings,
@@ -7335,6 +7343,7 @@ static struct ethtool_ops tg3_ethtool_op
 	.get_stats_count	= tg3_get_stats_count,
 	.get_ethtool_stats	= tg3_get_ethtool_stats,
 	.get_coalesce		= tg3_get_coalesce,
+	.set_coalesce		= tg3_set_coalesce,
 };
 
 static void __devinit tg3_get_eeprom_size(struct tg3 *tp)

--
Arthur

[-- Attachment #2: CPU utilization --]
[-- Type: IMAGE/png, Size: 3303 bytes --]

[-- Attachment #3: Received Pkts/Intr --]
[-- Type: IMAGE/png, Size: 3195 bytes --]


* Re: Perf data with recent tg3 patches
  2005-05-20 21:52           ` Arthur Kepner
@ 2005-05-20 22:33             ` David S. Miller
  2005-05-20 22:52               ` Rick Jones
  2005-05-20 22:54               ` Arthur Kepner
  0 siblings, 2 replies; 9+ messages in thread
From: David S. Miller @ 2005-05-20 22:33 UTC (permalink / raw)
  To: akepner; +Cc: mchan, netdev

From: Arthur Kepner <akepner@sgi.com>
Date: Fri, 20 May 2005 14:52:35 -0700 (PDT)

> When the default coalescence parameters are used (rx-frames, 
> rx-frames-irq both set to 5) the maximum number of received 
> packets per interrupt is ~4.2. Setting rx-frames and 
> rx-frames-irq to 20 caused the maximum number of received
> packets per interrupt to rise to ~19.6. Maximum CPU 
> utilization went down from ~52% to ~35%. Very nice.

Yes, but using such a high value makes latency go into the
toilet. :-)

I'd much rather see dynamic settings based upon packet rate.
It's easy to resurrect the ancient code from the early
tg3 days which does this.

Thanks for all your testing, it is very informative and
useful.


* Re: Perf data with recent tg3 patches
  2005-05-20 22:33             ` David S. Miller
@ 2005-05-20 22:52               ` Rick Jones
  2005-05-20 22:54               ` Arthur Kepner
  1 sibling, 0 replies; 9+ messages in thread
From: Rick Jones @ 2005-05-20 22:52 UTC (permalink / raw)
  To: netdev

> Yes, but using such a high value makes latency go into the
> toilet. :-)

For low packet rates.

> 
> I'd much rather see dynamic settings based upon packet rate.
> It's easy to resurrect the ancient code from the early
> tg3 days which does this.

If that is the stuff I think it was, it was giving me _fits_ trying to run 
TCP_RR tests.  Results bounced all over the place.  I think it was trying to 
kick in at pps rates that were below the limits of what a pair of systems could 
do on a single, synchronous request/response stream.


Now, modulo an OS that I cannot mention because its EULA forbids discussing 
results with third parties, where the netperf TCP_RR perf is 8000 transactions 
per second no matter how powerful the CPU...   if folks are simply free to set 
high coalescing parms on their own, presumably with some knowledge of their 
workloads, wouldn't that be enough?  That has been "good enough" for one OS I 
can discuss - HP-UX - with its bcm570X-based GbE NICs and, before that, its 
Tigon2-based NICs.

rick jones


* Re: Perf data with recent tg3 patches
  2005-05-20 22:33             ` David S. Miller
  2005-05-20 22:52               ` Rick Jones
@ 2005-05-20 22:54               ` Arthur Kepner
  1 sibling, 0 replies; 9+ messages in thread
From: Arthur Kepner @ 2005-05-20 22:54 UTC (permalink / raw)
  To: David S.Miller; +Cc: mchan, netdev

On Fri, 20 May 2005, David S. Miller wrote:

> .....
> Yes, but using such a high value makes latency go into the
> toilet. :-)
> .....

Latency measurements are on my to-do list.

--
Arthur

