netdev.vger.kernel.org archive mirror
* [RFC] e1000 performance patch
@ 2006-04-26 22:13 Robin Humble
  2006-04-26 22:26 ` Rick Jones
  0 siblings, 1 reply; 5+ messages in thread
From: Robin Humble @ 2006-04-26 22:13 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 2298 bytes --]


[I sent this to the e1000-devel folks, and they suggested netdev might
 have opinions too. the text below has changed a little to reflect
 feedback from Auke Kok]

attached is a small patch for e1000 that dynamically changes the
Interrupt Throttle Rate (ITR) for best performance - both latency and
bandwidth. it makes e1000 look really good on netpipe, with ~28 us
latency and 890 Mbit/s bandwidth.

the basic idea is that a high InterruptThrottleRate (~200k) is best for
small messages, whilst a low ITR (~15k) is best for large messages.
leaving the ITR high for large messages burns outrageous amounts of
cpu, and anything less than ~15k ITR is bad for bandwidth.

so this patch creates a new "performance dynamic" mode,
  InterruptThrottleRate=2   (2,2 for dual NICs)
which changes the ITR on the fly. the patch is based on the existing
"dynamic" mode (ITR=1), which seems to be optimised for low cpu usage
with little concern for performance.

hopefully the thresholds chosen for the ITR changeovers will be ok on
other people's hardware too, but I really have no idea how universal
they'll be. we've been running it for a few months on our cluster and
it appears stable.

the 10M/20M/100M thresholds for changing between the 200k/90k/30k/15k
ITRs were set pretty much by eye - by doing a bunch of netpipe runs and
trying to minimise cpu usage (i.e. ITR) for a target latency/bandwidth.
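
for reference, the ITR register on these MACs takes the inter-interrupt
interval in 256 ns units - that's what the 1000000000 / (itr * 256)
expression in the patch computes - so the four rates map to roughly:

  target rate (ints/s)   interval    ITR register value
  200000                  5.0 us     1000000000/(200000*256) ~= 19
   90000                 11.1 us     1000000000/( 90000*256) ~= 43
   30000                 33.3 us     1000000000/( 30000*256) ~= 130
   15000                 66.7 us     1000000000/( 15000*256) ~= 260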

I've done an analysis of performance on this page:
  http://www.cita.utoronto.ca/mediawiki/index.php/E1000_performance_patch
our hardware details are there too.
there's also a link to another analysis of how the patch affects
routing performance and cpu usage (surprisingly, both improve).

despite the netpipe improvements, I haven't seen much in the way of
real-world code differences (either +ve or -ve) compared with a regular
fixed 15k ITR. I've seen an improvement in one code, and a slight
degradation (~1%) in HPL (the top500.org benchmark). it should probably
make the most difference for codes that consistently send small (< 1k)
messages.

one possible improvement would be to call the watchdog routine more
often than once every 2 seconds - that would let the ITR adapt more
frequently.
ideally (I think) for traffic with mixed packet sizes the ITR would be
adapted 100s of times a second, but I'm not sure how practical that is.
a rough sketch of the idea is below.
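
as a rough, untested sketch of that idea (the itr_timer field and the
100 ms period are made up for illustration; E1000_READ_REG,
E1000_WRITE_REG and the GORCL/GOTCL octet counters are the real driver
and hardware names), the adaptation could move out of the watchdog into
its own, faster timer:

/* hypothetical sketch: re-evaluate the ITR ~10 times/s instead of
 * every 2 s.  GORCL/GOTCL are read-to-clear, so each read returns the
 * octet counts since the last sample; note a real version would have
 * to coordinate with e1000_update_stats(), which also reads (and
 * thereby clears) these counters. */
static void e1000_itr_timer(unsigned long data)
{
	struct e1000_adapter *adapter = (struct e1000_adapter *)data;
	uint32_t gorcl = E1000_READ_REG(&adapter->hw, GORCL);
	uint32_t gotcl = E1000_READ_REG(&adapter->hw, GOTCL);
	/* scale a 100 ms sample up to a per-2s count in millions, so the
	 * same 10M/20M/100M thresholds apply: x * 20 / 1000000 = x / 50000 */
	uint32_t goc = max(gotcl, gorcl) / 50000;
	uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ? 15000 : 30000)
	                                    : 90000) : 200000;

	E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
	mod_timer(&adapter->itr_timer, jiffies + HZ / 10);
}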

cheers,
robin

[-- Attachment #2: rjh-performance-e1000-7.0.33.patch --]
[-- Type: text/plain, Size: 3074 bytes --]

diff -ru e1000-7.0.33/src/e1000_main.c e1000-7.0.33-rjh-performance/src/e1000_main.c
--- e1000-7.0.33/src/e1000_main.c	2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_main.c	2006-04-01 21:44:21.000000000 -0500
@@ -1732,7 +1732,7 @@
 
 	if (hw->mac_type >= e1000_82540) {
 		E1000_WRITE_REG(hw, RADV, adapter->rx_abs_int_delay);
-		if (adapter->itr > 1)
+		if (adapter->itr > 2)
 			E1000_WRITE_REG(hw, ITR,
 				1000000000 / (adapter->itr * 256));
 	}
@@ -2394,17 +2394,30 @@
 		}
 	}
 
-	/* Dynamic mode for Interrupt Throttle Rate (ITR) */
-	if (adapter->hw.mac_type >= e1000_82540 && adapter->itr == 1) {
-		/* Symmetric Tx/Rx gets a reduced ITR=2000; Total
-		 * asymmetrical Tx or Rx gets ITR=8000; everyone
-		 * else is between 2000-8000. */
-		uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
-		uint32_t dif = (adapter->gotcl > adapter->gorcl ?
-			adapter->gotcl - adapter->gorcl :
-			adapter->gorcl - adapter->gotcl) / 10000;
-		uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
-		E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+	/* Dynamic modes for Interrupt Throttle Rate (ITR) */
+	if (adapter->hw.mac_type >= e1000_82540) {
+		if (adapter->itr == 1) {
+			/* Symmetric Tx/Rx gets a reduced ITR=2000; Total
+			 * asymmetrical Tx or Rx gets ITR=8000; everyone
+			 * else is between 2000-8000. */
+			uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
+			uint32_t dif = (adapter->gotcl > adapter->gorcl ?
+				adapter->gotcl - adapter->gorcl :
+				adapter->gorcl - adapter->gotcl) / 10000;
+			uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
+			E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+		}
+		else if (adapter->itr == 2) {  /* low latency, high bandwidth, moderate cpu usage */
+			/* range from high itr at low cl, to low itr at high cl
+			 *   < 10M      =>  200k itr
+			 * 10M to 20M   =>  90k itr
+			 * 20M to 100M  =>  30k itr
+			 *   > 100M     =>  15k itr    */
+			uint32_t goc = max(adapter->gotcl, adapter->gorcl) / 1000000;
+			uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ? 15000: 30000): 90000): 200000;
+			/* DPRINTK(PROBE, INFO, "e1000 ITR %d - [tr]cl min/ave/max %dm / %dm/ %dm\n", itr, min(adapter->gotcl, adapter->gorcl) / 1000000, (adapter->gotcl + adapter->gorcl) / 2000000, max(adapter->gotcl, adapter->gorcl) / 1000000 ); */
+			E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+		}
 	}
 
 	/* Cause software interrupt to ensure rx ring is cleaned */
diff -ru e1000-7.0.33/src/e1000_param.c e1000-7.0.33-rjh-performance/src/e1000_param.c
--- e1000-7.0.33/src/e1000_param.c	2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_param.c	2006-03-29 21:42:00.000000000 -0500
@@ -538,6 +538,10 @@
 				DPRINTK(PROBE, INFO, "%s set to dynamic mode\n",
 					opt.name);
 				break;
+			case 2:
+				DPRINTK(PROBE, INFO, "%s set to performance dynamic mode\n",
+					opt.name);
+				break;
 			default:
 				e1000_validate_option(&adapter->itr, &opt,
 					adapter);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] e1000 performance patch
  2006-04-26 22:13 [RFC] e1000 performance patch Robin Humble
@ 2006-04-26 22:26 ` Rick Jones
  2006-04-27  2:43   ` Robin Humble
  0 siblings, 1 reply; 5+ messages in thread
From: Rick Jones @ 2006-04-26 22:26 UTC (permalink / raw)
  To: Robin Humble; +Cc: netdev

Robin Humble wrote:
> [I sent this to the e1000-devel folks, and they suggested netdev might
>  have opinions too. the below text has changed a little bit to reflect
>  feedback from Auke Kok]
> 
> attached is a small patch for e1000 that dynamically changes Interrupt
> Throttle Rate for best performance - both latency and bandwidth.
> it makes e1000 look really good on netpipe with a ~28 us latency and
> 890 Mbit/s bandwidth.
> 
> the basic idea is that high InterruptThrottleRate (~200k) is best for
> small messages, 

Best for small numbers of small messages?  If one is looking to have 
high aggregate small packet rates, the higher throttle rate may degrade 
the peak PPS one can achieve.

> I've done an analysis of performance on this page:
>   http://www.cita.utoronto.ca/mediawiki/index.php/E1000_performance_patch
> our hardware details are there too.
> there's also a link to another analysis of how the patch affects routing
> performance and cpu usage (surprisingly better).
> 
> despite the netpipe improvements, I haven't seen much in the way of real
> world code differences (either +ve or -ve) from a regular 15k ITR. I've
> seen an improvement in one code, and a slight degradation (~1%) in HPL
> (top500.org benchmark). it should probably make the most difference for
> codes that consistantly send small (< 1k) messages.

Tweaking interrupt coalescing parameters was rather common in SPECweb 
benchmarking. If you examine some of the results on www.spec.org you may 
see examples.  IIRC the last ones I submitted used an interrupt throttle 
rate of something like 700.  It was a small but non-trivial percentage 
difference in the SPECweb result.

rick jones

It is a bit rough/messy as a writeup, but here is what I've seen wrt the 
latency vs throughput tradeoffs:

ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] e1000 performance patch
  2006-04-26 22:26 ` Rick Jones
@ 2006-04-27  2:43   ` Robin Humble
  2006-04-27 16:07     ` Rick Jones
  0 siblings, 1 reply; 5+ messages in thread
From: Robin Humble @ 2006-04-27  2:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev


Hi Rick,

thanks for your comments.

On Wed, Apr 26, 2006 at 03:26:17PM -0700, Rick Jones wrote:
>Robin Humble wrote:
>>attached is a small patch for e1000 that dynamically changes Interrupt
>>Throttle Rate for best performance - both latency and bandwidth.
>>it makes e1000 look really good on netpipe with a ~28 us latency and
>>890 Mbit/s bandwidth.
>>
>>the basic idea is that high InterruptThrottleRate (~200k) is best for
>>small messages, 
>Best for small numbers of small messages?  If one is looking to have 
>high aggregate small packet rates, the higher throttle rate may degrade 
>the peak PPS one can achieve.

if small is <1kB, and there's a single client, then it looks to me
like the higher the ITR the better.
for a single netpipe client (running 10k repetitions and from 0 byte to
1kB messages), the driver chooses 200k ITR until it gets close to 1kB
messages, when it drops to its next level of 90k ITR. about 15-20% cpu
is used.

<short delay whilst I run some tests>

for 3 netpipe clients (again running 10k repetitions and from 0 byte
to 1kB messages, all with the patched e1000 driver), the server stays
at 200k ITR until the 3 clients get to ~96 byte messages, then it
drops to 90k ITR, and at ~512 byte messages it drops the ITR once
more, to 30k.

so I think the patched driver is doing the right thing there, lowering
the ITR more rapidly as it gets more clients.

but clearly I should be using netperf to get more accurate cpu numbers
and a more convincing aggregate table :-)

>It is a bit rough/messy as a writeup, but here is what I've seen wrt the 
>latency vs throughput tradeoffs:
>ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

from a quick read, it looks like only the case with 32kB messages,
multiple simultaneous clients, and the driver set to unlimited ITR
sees reduced throughput. is that right?

if so, then I'm not surprised.
this graph
  http://www.cita.utoronto.ca/mediawiki/index.php/Image:Cpu.100k.png
shows that (for our hardware etc. etc.) cpu usage at 32kB messages
with a 100k ITR is already excessive, and unlimited ITR would be worse
still... :-/
so for 32kB messages and a single client (never mind multiple clients)
I'd agree with your study that unlimited ITR is probably not a good
idea.

with a single client doing 32kB messages, my patched driver is
probably doing the right thing, as it's at 30k ITR (and at its minimum
of 15k ITR with multiple clients doing 32kB messages).

>> uint32_t goc = max(adapter->gotcl, adapter->gorcl) / 1000000;
>> uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ? 15000: 30000): 90000): 200000;

Hmmmm... I've just noticed that the gotcl/gorcl count is >200M on the
server when 3 clients are doing 32kB netpipes... so I could probably
use a goc of > 150 or 200 as a threshold to switch to a lower ITR
again. maybe 3k or 6k...
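
a hedged sketch of what that extra rung might look like (the 200
threshold and the 6000 ITR are the untested guesses above):

	uint32_t goc = max(adapter->gotcl, adapter->gorcl) / 1000000;
	uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ?
			(goc > 200 ? 6000 : 15000)	/* new rung for >200M */
			: 30000) : 90000) : 200000;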


but overall I'm actually more worried about a mix of small and large
messages than about multiple clients.

a large/small mix might well occur in 'the real world', and it'll be
up to 2s before the watchdog routine can adapt the ITR. potentially
those 2s will be spent at 200k ITR, which is too high for large
messages, and up to 2s of cpu will be burnt needlessly.

can netperf (or some other tool) mix up big and small message sizes
like 'the real world' perhaps does?
that might help me find a good frequency at which to try to adapt the
ITR... (eg. 1, 10, 100 or 1000 times a second)

cheers,
robin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] e1000 performance patch
  2006-04-27  2:43   ` Robin Humble
@ 2006-04-27 16:07     ` Rick Jones
  2006-04-27 20:49       ` Robin Humble
  0 siblings, 1 reply; 5+ messages in thread
From: Rick Jones @ 2006-04-27 16:07 UTC (permalink / raw)
  To: Robin Humble; +Cc: netdev


> 
> but clearly I should be using netperf to get more accurate cpu numbers
> and a more convincing aggregate table :-)

Well, I'll not stop you  :)

> 
> 
>>It is a bit rough/messy as a writeup, but here is what I've seen wrt the 
>>latency vs throughput tradeoffs:
>>ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt
> 
> 
> from a quick read it looks like just the case with 32kB messages,
> multiple simultaneous clients, and driver set to unlimited ITR sees
> reduced throughput. is that right?
> 
> if so, then I'm not surprised.

There should be three basic measures there - one is the single-instance 
request-response test. The idea is to see minimum latency.  That test 
likes to see the interrupt throttle rate made very high, or disabled 
completely.

The aggregate TCP_RR and TCP_STREAM tests are there to show what
effect that has on the ability to do aggregate request/response and
bulk transfer.
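
For concreteness, with classic netperf2 syntax the three measures
might look something like this (hostname and sizes are illustrative):

	# single-instance request/response: minimum latency
	netperf -H server -t TCP_RR -c -C -- -r 1,1
	# aggregate request/response: several concurrent instances
	for i in 1 2 3 4; do netperf -H server -t TCP_RR -- -r 1,1 & done; wait
	# bulk transfer
	netperf -H server -t TCP_STREAM -c -C -- -m 32768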

> but overall I'm actually more worried about a mix of small and large
> messages than multiple clients.


> 
> a large/small mix might well occur in 'the real world' and it'll be 2s
> until the watchdog routine can adapt the ITR. potentially that 2s will
> be at 200k ITR which is too high for large messages, and up to 2s of
> cpu will be burnt needlessly.
> 
> can netperf (or some other tool) mix up big and small message sizes
> like 'the real world' perhaps does?
> that might help me find a good frequency at which to try to adapt the
> ITR... (eg. 1, 10, 100 or 1000 times a second)

There is the "vst" (variable size test IIRC) in netperf4:

http://www.netperf.org/svn/netperf4/branches/glib_migration

The docs for netperf4 are presently pathetic.  Feel free to email me for 
bootstrapping information.  Basically, you'll need pkg-config, libxml2 
and glib-2.0 on the system.

> 
> cheers,
> robin


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] e1000 performance patch
  2006-04-27 16:07     ` Rick Jones
@ 2006-04-27 20:49       ` Robin Humble
  0 siblings, 0 replies; 5+ messages in thread
From: Robin Humble @ 2006-04-27 20:49 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On Thu, Apr 27, 2006 at 09:07:36AM -0700, Rick Jones wrote:
>There should be three basic measures there - one is the single-instance 
>request-response test. The idea is to see minimum latency.  That test 
>likes to see the interrupt throttle rate made very high, or disabled 
>completely.
>
>The aggregate TCP_RR's and the TCP_STREAM tests are there to show what 
>effect that has on the ability to do aggregate request/response and a 
>bulk transfer.

I guess the whole point of my patch is to try to handle all these
cases efficiently, without user intervention, by making the driver
self-tune the InterruptThrottleRate depending on what traffic it's
seeing.

I think it's doing the right things so far - at least it's a good
compromise - though it probably won't ever be ideal for all workloads.

>>can netperf (or some other tool) mix up big and small message sizes
>>like 'the real world' perhaps does?
>>that might help me find a good frequency at which to try to adapt the
>>ITR... (eg. 1, 10, 100 or 1000 times a second)
>
>There is the "vst" (variable size test IIRC) in netperf4:
>
>http://www.netperf.org/svn/netperf4/branches/glib_migration
>
>The docs for netperf4 are presently pathetic.  Feel free to email me for 
>bootstrapping information.  Basically, you'll need pkg-config, libxml2 
>and glib-2.0 on the system.

thanks. I'll check it out.

actually, thinking about it more, the worst case for the patched
driver is a quiescent system (where the ITR will be at its maximum)
which then sees a stream of large messages - say 500MB (~=5s at
1Gbit). so until the watchdog kicks in (every 2s at the moment) and
lowers the ITR, the cpu load will be needlessly high.
The only real solution for this is to run the ITR-adapting watchdog as
often as possible, ie. several times per second.

And then to test it on real codes that do some sort of bursty
large-message communication, and see if they run faster or slower.

The patched driver will actually deal fairly well with a mixed-size
workload, as some of the workload will be large messages, which will
(on average) lower the ITR automatically.

cheers,
robin

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2006-04-27 20:50 UTC | newest]

Thread overview: 5+ messages
2006-04-26 22:13 [RFC] e1000 performance patch Robin Humble
2006-04-26 22:26 ` Rick Jones
2006-04-27  2:43   ` Robin Humble
2006-04-27 16:07     ` Rick Jones
2006-04-27 20:49       ` Robin Humble
