RE: [2.5] IRQ distribution in the 2.5.52 kernel

* RE: [2.5] IRQ distribution in the 2.5.52  kernel
@ 2003-01-09 19:52 Kamble, Nitin A
  2003-01-09 22:08 ` Andrew Theurer
  0 siblings, 1 reply; 4+ messages in thread
From: Kamble, Nitin A @ 2003-01-09 19:52 UTC (permalink / raw)
  To: Andrew Theurer, linux-kernel, mbligh
  Cc: Saxena, Sunil, Mallick, Asit K, Nakajima, Jun

Hi Andrew,
  Your benchmark results are very impressive. Thanks for trying it out.
I have some thoughts after seeing the results. 
 
> Nitin,
>
> I got a chance to run the NetBench benchmark with your patch on
> 2.5.54-mjb2 kernel.  NetBench measures SMB/CIFS performance by using
> several SMB clients (in this case 44 Windows 2000 systems), sending SMB
> requests to a Linux server running Samba 2.2.3a+sendfile.  Result is in
> throughput, Mbps.  Generally the network traffic on the server is 60%
> recv, 40% tx.
>
> I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3
> P4 Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is
> "Summit".  I also have more than one Intel e1000 adapter.
>
> I decided to run a few configurations, first with just one adapter, with
> and without HT support in the kernel (acpi=off), then add another
> adapter and test again with/without HT.
>
> Here are the results:
>
> 4P, no HT, 1 x e1000, no kirq:  1214 Mbps, 4% idle
> 4P, no HT, 1 x e1000, kirq:     1223 Mbps, 4% idle, +0.74%
[NK] It is surprising to see a single e1000 delivering more than
1 Gbps. What can be the reason for this extra bandwidth? ... Maybe
compression is happening somewhere.

> 
> I suppose we didn't see much of an improvement here because we never run
> into the situation where more than one interrupt with a high rate is
> routed to a single CPU on irq_balance.
>
> 4P, HT, 1 x e1000, no kirq:     1214 Mbps, 25% idle
> 4P, HT, 1 x e1000, kirq:        1220 Mbps, 30% idle, +0.49%
> 
> Again, not much of a difference just yet, but lots of idle time.  We may
> have reached the limit at which one logical CPU can process interrupts
> for an e1000 adapter.  There are other things I can probably do to help
> this, like int delay, and NAPI, which I will get to eventually.
>
> 4P, HT, 2 x e1000, no kirq:     1269 Mbps, 23% idle
> 4P, HT, 2 x e1000, kirq:        1329 Mbps, 18% idle, +4.7%
[NK] It could be that throughput is being limited by the network
infrastructure or by the total load the clients can generate. If we knew
the theoretical maximum throughput, we would have a better idea of where
the bottleneck is. It would be interesting to see the results after
adding one more e1000 card to the server.

> 
> OK, almost 5% better!
[NK] It's a pretty good number!

> Probably has to do with a couple of things; the fact that your code does
> not route two different interrupts to the same core/different logical
> cpus (quite obvious by looking at /proc/interrupts), and that more than
> one interrupt does not go to the same cpu if possible.  I suspect
> irq_balance did some of those [bad] things some of the time, and we
> observed a bottleneck in int processing that was lower than with kirq.
>
> I don't think all of the idle time is because of an int processing
> bottleneck.  I'm just not sure what it is yet :)  Hopefully something
> will become obvious to me...
>
> Overall I like the way it works, and I believe it can be tweaked to work
> with NUMA when necessary.
[NK] I also believe so.

> I hope to have access to a specweb system on a NUMA box soon, so we can
> verify that.
>
> -Andrew Theurer

Thanks & regards,
Nitin

* Re: [2.5] IRQ distribution in the 2.5.52  kernel
  2003-01-09 19:52 [2.5] IRQ distribution in the 2.5.52 kernel Kamble, Nitin A
@ 2003-01-09 22:08 ` Andrew Theurer
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Theurer @ 2003-01-09 22:08 UTC (permalink / raw)
  To: Kamble, Nitin A, linux-kernel, mbligh
  Cc: Saxena, Sunil, Mallick, Asit K, Nakajima, Jun

<snip>
> > test again with/without HT.
> >
> > Here are the results:
> >
> > 4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
> > 4P, no HT, 1 x e1000, kirq:     1223 Mbps, 4% idle, +0.74%
>
> [NK] It is surprising to see a single e1000 delivering more than
> 1 Gbps. What can be the reason for this extra bandwidth? ... Maybe
> compression is happening somewhere.

Full duplex.  I suppose the theoretical full throughput is 2 Gbps.  Sar 
reported about 1174 Mb/sec with one adapter on one of the results above, 
and it was 454 Recv/720 Tx (I had the percentages incorrectly swapped in 
my previous email).  This is still with an MTU of 1500!
  
> > I suppose we didn't see much of an improvement here because we never
<snip>
> > 4P, HT, 1 x e1000, no kirq:     1214 Mbps, 25% idle
> > 4P, HT, 1 x e1000, kirq:        1220 Mbps, 30% idle,
> >
> > 4P, HT, 2 x e1000, no kirq:     1269 Mbps, 23% idle
> > 4P, HT, 2 x e1000, kirq:        1329 Mbps, 18% idle, +4.7%
>
> [NK] It could be that throughput is being limited by the network
> infrastructure or by the total load the clients can generate. If we knew
> the theoretical maximum throughput, we would have a better idea of where
> the bottleneck is. It would be interesting to see the results after
> adding one more e1000 card to the server.

It occurred to me later that the answer was the obvious one you mentioned: 
clients.  I originally had enough clients to accomplish 1000 Mbps, but I'm 
pretty sure 44 clients will not cut it for NetBench at around 1500 Mbps 
(where this hopefully will end up).  NetBench throttles the clients, so I 
really can't drive them much harder.  There is an option to simulate more 
than one client per computer.  I have had trouble with that in the past, 
but I am going to give it one more try. 
>
> > OK, almost 5% better!
>
> [NK] It's a pretty good number!


* [2.5] IRQ distribution in the 2.5.52  kernel
@ 2003-01-08  2:50 Kamble, Nitin A
  2003-01-09 16:10 ` Andrew Theurer
  0 siblings, 1 reply; 4+ messages in thread
From: Kamble, Nitin A @ 2003-01-08  2:50 UTC (permalink / raw)
  To: linux-kernel, Kamble, Nitin A
  Cc: Saxena, Sunil, Mallick, Asit K, Nakajima, Jun

Hello All,

  We have been looking at the performance impact of IRQ routing in
the 2.5.52 Linux kernel. This email includes some of our findings 
about the way interrupts are moved in the 2.5.52 kernel, along with 
a discussion of, and a patch for, a new implementation. Let 
me know what you think at nitin.a.kamble@intel.com

Current implementation:
======================
We have found that the existing implementation works well on IA32 
SMP systems under a light interrupt load, but not as well under 
heavy interrupt load on the same systems. Our observations are:

* The interrupt load of each IRQ is balanced across CPUs independently 
of the load of other IRQs, and the current implementation moves the 
IRQs randomly. This works well when the interrupt load is light, but 
we start seeing an imbalance once there are multiple heavy interrupt 
sources. Frequently, multiple heavily loaded IRQs get moved to a 
single CPU while other CPUs stay very lightly loaded. To achieve a 
good interrupt load balance, it is important to consider the load of 
all the interrupts together.
    This can be illustrated with an example of 4 CPUs and 4 heavy 
interrupt sources. With the existing random movement approach, the 
chance of all of these heavy interrupt sources landing on separate 
CPUs is (4/4)*(3/4)*(2/4)*(1/4) = 3/32. That means that 29/32 = 
90.625% of the time some CPUs are very lightly loaded while others 
carry multiple heavy interrupts. This interrupt load imbalance 
reduces performance. With 2 CPUs and 2 heavily loaded interrupt 
sources, the imbalance occurs 1/2 = 50% of the time. The issue 
becomes more and more severe as the number of heavy interrupt 
sources increases.

* Another interesting observation: the imbalance of the interrupt 
load cannot be seen from /proc/interrupts, because /proc/interrupts 
shows the cumulative count of interrupts on each CPU. If the interrupt 
load is imbalanced but the imbalance is continuously rotated among 
the CPUs, /proc/interrupts will still show the interrupts going to 
the processors very evenly. At the frequency (HZ/50) at which IRQs 
are currently moved across CPUs, it is not possible to see any 
interrupt load imbalance this way.

* We have also found that in certain cases static IRQ binding 
performs better than the existing kernel distribution of interrupt 
load. The reason is that when the interrupt load is already well 
balanced, the interrupts are still moved across CPUs frequently and 
unnecessarily. This adds overhead and also loses the benefit of warm 
CPU caches.
  This came out of performance measurements done on a 4-way HT 
(8 logical processors) Pentium 4 Xeon system running 8 copies of 
netperf. The 4 NICs in the system, each on a different IRQ, generated 
a sizable interrupt load with the help of connected clients.

Here the netperf transactions/sec throughput numbers observed are:

IRQs nicely manually bound to CPUs: 56.20K 
The current kernel implementation of IRQ movement: 50.05K
 -----------------------
 The static binding of IRQs has performed 12.28% better than the 
current IRQ movement implemented in the kernel.

* The current implementation does not distinguish between sibling 
logical CPUs on HT (Hyper-Threading(tm)) enabled processors. It would 
be beneficial to balance the interrupt load across processor packages 
first, and then among the logical CPUs inside each package. 
  For example, with 2 heavy interrupt sources and 2 processor 
packages (4 logical CPUs), assigning the two heavy interrupt sources 
to different processor packages is better, because they then use the 
execution resources of different packages.
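
As a quick check of the random-placement arithmetic in the first 
observation above, here is a minimal Python sketch. It assumes each 
heavy IRQ lands on a CPU uniformly and independently, which is a 
simplification of the kernel's actual movement logic, and the function 
names are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def p_all_separate(n_cpus: int, n_irqs: int) -> Fraction:
    """Probability that n_irqs land on n_irqs distinct CPUs.

    Assumption: every IRQ picks one of n_cpus uniformly and independently.
    """
    p = Fraction(1)
    for k in range(n_irqs):
        # The (k+1)-th IRQ must avoid the k CPUs that are already taken.
        p *= Fraction(n_cpus - k, n_cpus)
    return p


def p_all_separate_bruteforce(n_cpus: int, n_irqs: int) -> Fraction:
    """Same probability, by enumerating every possible placement."""
    placements = list(product(range(n_cpus), repeat=n_irqs))
    good = sum(1 for p in placements if len(set(p)) == n_irqs)
    return Fraction(good, len(placements))


for n in (2, 4):
    p = p_all_separate(n, n)
    print(f"{n} CPUs, {n} heavy IRQs: separate {p}, imbalanced {float(1 - p):.3%}")
```

The closed form is n!/n^n when the number of heavy sources equals the 
number of CPUs, so the chance of a clean spread shrinks quickly as both 
grow.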

New revised implementation:
==========================
We have also been working on a new implementation. Its main points 
are:

* At any moment heavily loaded IRQs are distributed to different 
CPUs to achieve as much balance as possible. 

* Lightly loaded interrupt sources are excluded from the load 
balancing, as they do not cause considerable imbalance.

* When the heavy interrupt sources are balanced, they are not moved 
around. This also helps in keeping the CPU caches warm.

* It has been made HT aware. While distributing the load, the load 
on the processor package to which a logical CPU belongs is also 
considered.

* When there are only a few (fewer than num_cpus) heavy interrupt 
sources, it is not possible to balance them evenly. In such cases 
the existing code has been reused to move the interrupts, with the 
randomness of the original code removed.

* The time interval for redistribution has been made flexible. It 
varies as the system interrupt load changes.

* A new kernel_thread is introduced to do the load balancing 
calculations for all the interrupt sources. It keeps the balance_maps 
ready for the interrupt handlers, keeping the overhead in interrupt 
handling to a minimum.

* It allows the disabling of the IRQ distribution from the boot loader 
command line, if anybody wants to do it for any reason. 

* The algorithm also takes into account the static binding of 
interrupts to CPUs that the user imposes through the 
/proc/irq/{n}/smp_affinity interface.
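
The HT-aware placement described in the points above can be sketched 
roughly as follows. This is not the patch's code: it is a simplified 
greedy model in Python, and the names place_heavy_irqs, heavy_threshold, 
cpu_package and allowed are hypothetical. It only covers choosing a CPU 
for each heavy IRQ, not the timing of redistribution or the kernel 
thread.

```python
from collections import defaultdict


def place_heavy_irqs(irq_load, cpu_package, allowed=None, heavy_threshold=1000):
    """Greedy sketch: map each heavy IRQ to a logical CPU.

    irq_load:        {irq: interrupts seen in the last interval}
    cpu_package:     {logical cpu: physical package id}
    allowed:         optional {irq: set of cpus}, modelling a static binding
    heavy_threshold: hypothetical cut-off below which an IRQ is ignored
    """
    cpu_load = {cpu: 0 for cpu in cpu_package}
    pkg_load = defaultdict(int)
    placement = {}

    # Heaviest sources first, so they get the emptiest packages.
    heavy = sorted(
        (irq for irq, load in irq_load.items() if load >= heavy_threshold),
        key=lambda irq: (-irq_load[irq], irq),
    )
    for irq in heavy:
        candidates = set(cpu_load)
        if allowed and irq in allowed:
            # Respect the user's binding; fall back to all CPUs if it is empty.
            candidates = (allowed[irq] & candidates) or candidates
        # Prefer the least loaded package, then the least loaded logical CPU.
        cpu = min(
            candidates,
            key=lambda c: (pkg_load[cpu_package[c]], cpu_load[c], c),
        )
        placement[irq] = cpu
        cpu_load[cpu] += irq_load[irq]
        pkg_load[cpu_package[cpu]] += irq_load[irq]
    return placement


# 2 packages with 2 logical CPUs each, 2 heavy IRQs and 1 light one.
packages = {0: 0, 1: 0, 2: 1, 3: 1}
loads = {16: 5000, 17: 4000, 18: 10}
print(place_heavy_irqs(loads, packages))
```

In this toy run the two heavy sources end up on different packages and 
the light one is left alone, which matches the intent of the 
2-source/2-package example given earlier.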

Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
 ---------------------
  A performance improvement of 11.9% is observed on the P4 Xeon.

The new IRQ balance implementation also shows a small performance 
improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
 ---------------------
Here a performance improvement of about 2% is observed on the P6 system.

Thanks & Regards,
Nitin

* Re: [2.5] IRQ distribution in the 2.5.52  kernel
  2003-01-08  2:50 Kamble, Nitin A
@ 2003-01-09 16:10 ` Andrew Theurer
  0 siblings, 0 replies; 4+ messages in thread
From: Andrew Theurer @ 2003-01-09 16:10 UTC (permalink / raw)
  To: Kamble, Nitin A, linux-kernel
  Cc: Saxena, Sunil, Mallick, Asit K, Nakajima, Jun

On Tuesday 07 January 2003 20:50, Kamble, Nitin A wrote:
> Hello All,
>
>   We were looking at the performance impact of the IRQ routing from
> the 2.5.52 Linux kernel. This email includes some of our findings
> about the way the interrupts are getting moved in the 2.5.52 kernel.
> Also there is discussion and a patch for a new implementation. Let
> me know what you think at nitin.a.kamble@intel.com

Nitin,

I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2 
kernel.  NetBench measures SMB/CIFS performance by using several SMB clients 
(in this case 44 Windows 2000 systems), sending SMB requests to a Linux 
server running Samba 2.2.3a+sendfile.  Result is in throughput, Mbps.  
Generally the network traffic on the server is 60% recv, 40% tx.  

I believe we have very similar systems.  Mine is a 4 x 1.6 GHz, 1 MB L3 P4 
Xeon with 4 GB DDR memory (3.2 GB/sec I believe).  The chipset is "Summit".  
I also have more than one Intel e1000 adapter.  

I decided to run a few configurations, first with just one adapter, with and 
without HT support in the kernel (acpi=off), then add another adapter and 
test again with/without HT. 

Here are the results:

4P, no HT, 1 x e1000, no kirq:	1214 Mbps, 4% idle
4P, no HT, 1 x e1000, kirq:		1223 Mbps, 4% idle,		+0.74%

I suppose we didn't see much of an improvement here because we never run into 
the situation where more than one interrupt with a high rate is routed to a 
single CPU on irq_balance.  

4P, HT, 1 x e1000, no kirq:	1214 Mbps, 25% idle
4P, HT, 1 x e1000, kirq:	1220 Mbps, 30% idle,			+0.49%

Again, not much of a difference just yet, but lots of idle time.  We may have 
reached the limit at which one logical CPU can process interrupts for an 
e1000 adapter.  There are other things I can probably do to help this, like 
int delay, and NAPI, which I will get to eventually.  

4P, HT, 2 x e1000, no kirq:	1269 Mbps, 23% idle
4P, HT, 2 x e1000, kirq:	1329 Mbps, 18% idle			+4.7%

OK, almost 5% better!  Probably has to do with a couple of things; the fact 
that your code does not route two different interrupts to the same 
core/different logical cpus (quite obvious by looking at /proc/interrupts), 
and that more than one interrupt does not go to the same cpu if possible.  I 
suspect irq_balance did some of those [bad] things some of the time, and we 
observed a bottleneck in int processing that was lower than with kirq. 
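
/proc/interrupts reports cumulative counts, so where interrupts land 
over a short window is easier to see from the difference between two 
snapshots. A minimal Python sketch, assuming the usual layout (a header 
row of CPU names, then one row per IRQ with one count per CPU); the 
helper names are hypothetical:

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq label: [count per CPU]}."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:n_cpus]
        # Skip rows that do not carry one numeric count per CPU (e.g. ERR).
        if len(fields) == n_cpus and all(f.isdigit() for f in fields):
            counts[label.strip()] = [int(f) for f in fields]
    return counts


def per_cpu_delta(before, after):
    """Interrupts received per CPU between two parsed snapshots."""
    return {
        irq: [b - a for a, b in zip(before[irq], after[irq])]
        for irq in after
        if irq in before
    }


def sample(interval=1.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return per_cpu_delta(before, after)
```

Calling sample() with an interval shorter than the rebalancing period 
would be needed to catch an imbalance that rotates between CPUs; longer 
windows average it away, just as the cumulative counters do.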

I don't think all of the idle time is because of an int processing bottleneck.  
I'm just not sure what it is yet :)  Hopefully something will become obvious 
to me...

Overall I like the way it works, and I believe it can be tweaked to work with 
NUMA when necessary.  I hope to have access to a specweb system on a NUMA box 
soon, so we can verify that.  

-Andrew Theurer
