netdev.vger.kernel.org archive mirror
* [Question] SMP for Linux
@ 2002-10-17  2:29 Hyochang Nam
  2002-10-17 10:02 ` bert hubert
  0 siblings, 1 reply; 10+ messages in thread
From: Hyochang Nam @ 2002-10-17  2:29 UTC (permalink / raw)
  To: niv; +Cc: netdev

Many people helped me to solve the interrupt distribution problem.
We tested the throughput of Layer 3 forwarding on an SMP machine
equipped with two Xeon processors (2 GHz). These are our results:
  -------------------------
       SMP    |  No SMP
  -------------------------
    230 Mbps  | 330 Mbps
  -------------------------
We used Red Hat Linux 8.0 (Linux kernel 2.4.18-14) and two Intel
PRO/1000 Server Adapters. In the table, "SMP" means we used the
kernel built for an SMP machine, and "No SMP" means the kernel
built for single-CPU mode.

I expected that SMP would show better throughput than the "No SMP"
environment. But, contrary to my expectation, SMP showed worse results.
I don't know what is wrong with my experiments.

  - Hyochang Nam


* Re: [Question] SMP for Linux
  2002-10-17  2:29 [Question] SMP for Linux Hyochang Nam
@ 2002-10-17 10:02 ` bert hubert
  2002-10-17 12:00   ` Robert Olsson
  0 siblings, 1 reply; 10+ messages in thread
From: bert hubert @ 2002-10-17 10:02 UTC (permalink / raw)
  To: Hyochang Nam; +Cc: niv, netdev

On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
> Many people helped me to solve the interrupt distribution problem.
> We tested the throughput of Layer 3 forwarding on an SMP machine
> equipped with two Xeon processors (2 GHz). These are our results:
>   -------------------------
>        SMP    |  No SMP
>   -------------------------
>     230 Mbps  | 330 Mbps
>   -------------------------

There is something called 'irq affinity' which may be interesting for you.
See here: http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm

/proc/irq/?/smp_affinity
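
For example, here is a minimal helper that does the binding from
userspace (the IRQ numbers are made up; check /proc/interrupts for
the ones your NICs actually use). The mask is hexadecimal and bit n
selects CPU n:

#include <stdio.h>
#include <stdlib.h>

/* Write a CPU mask to /proc/irq/<irq>/smp_affinity (run as root). */
static int set_irq_affinity(int irq, unsigned int mask)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%x\n", mask);
        return fclose(f);
}

int main(void)
{
        /* hypothetical IRQs: bind IRQ 25 to CPU0 and IRQ 24 to CPU1 */
        if (set_irq_affinity(25, 0x1) || set_irq_affinity(24, 0x2))
                return EXIT_FAILURE;
        return EXIT_SUCCESS;
}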

> I expected that SMP would show better throughput than the "No SMP"
> environment. But, contrary to my expectation, SMP showed worse results.
> I don't know what is wrong with my experiments.

SMP is not always a win.

Regards,

bert hubert

-- 
http://www.PowerDNS.com          Versatile DNS Software & Services
http://www.tk                              the dot in .tk
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: [Question] SMP for Linux
  2002-10-17 10:02 ` bert hubert
@ 2002-10-17 12:00   ` Robert Olsson
  2002-10-17 17:28     ` Jon Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-17 12:00 UTC (permalink / raw)
  To: bert hubert; +Cc: Hyochang Nam, niv, netdev


bert hubert writes:
 > On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
 > > Many people helped me to solve the interrupt distribution problem.
 > > We tested the throughput of Layer 3 forwarding on an SMP machine
 > > equipped with two Xeon processors (2 GHz). These are our results:
 > >   -------------------------
 > >        SMP    |  No SMP
 > >   -------------------------
 > >     230 Mbps  | 330 Mbps
 > >   -------------------------
 > 
 > There is something called 'irq affinity' which may be interesting for you.
 > See here: http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm
 > 
 > /proc/irq/?/smp_affinity

 Hello!

 Not always good for routing... You still get the problem where one
 interface is the output device for traffic from devices bound to
 different CPUs.

 The TX ring can hold skb's from many CPUs, so a lot of cache bouncing
 happens when kfree and skb_headerinit are run.

 I've played with some code to re-route the skb freeing to the CPU
 where it was processed, to minimize cache bouncing, and I've seen
 some good effects from this.
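
 Roughly, the idea looks like this (a sketch only, not the actual
 patch; the kfree_q and kfree_skb_routed names are invented here):

#include <linux/skbuff.h>
#include <linux/smp.h>

/* Per-CPU queues of skbs waiting to be freed on their "home" CPU,
 * i.e. the CPU that allocated and processed them on RX.  Each queue
 * is set up with skb_queue_head_init() at boot. */
static struct sk_buff_head kfree_q[NR_CPUS];

/* Called from TX-clean instead of kfree_skb(); home_cpu would be
 * stamped into the skb (e.g. in skb->cb[]) at RX time. */
static void kfree_skb_routed(struct sk_buff *skb, int home_cpu)
{
        if (home_cpu == smp_processor_id()) {
                kfree_skb(skb);         /* already local: free directly */
                return;
        }
        /* queue to the home CPU, which drains its queue with plain
         * kfree_skb() from a softirq; the skb's cachelines are then
         * only written by the CPU that already owns them */
        skb_queue_tail(&kfree_q[home_cpu], skb);
        /* ...and kick home_cpu, e.g. by raising a softirq there */
}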

 And to be fair with SMP you should compare multiple flows to see if you 
 can get any aggregated performance from SMP.

 An experiment...
 
 Single flow eth0->eth1 w. e1000 NAPI. 2.4.20-pre5. PIII @ 2x933 MHz

 Bound = eth0 and eth1 bound to the same CPU.
 Split = eth0 and eth1 bound to different CPUs.
 Free  = unbound.

 SMP routing performance
 =======================
 
 Bound   Free   Split   "kfree-route"
 -------------------------------------
  421     354     331                   kpps
  491     348     317        437        kpps w. skb recycling


 UP routing performance
 ======================
 494 kpps
 593 kpps w. skb recycling


 With the SMP test "kfree-route" the interfaces are not bound to any
 CPU, yet we now get closer to "bound" (where both eth0 and eth1 are
 bound to the same CPU).

 But yes, UP gives higher numbers in these single-stream tests.
 Aggregated throughput tests are still to be done.

 Cheers.

						--ro


* RE: [Question] SMP for Linux
  2002-10-17 12:00   ` Robert Olsson
@ 2002-10-17 17:28     ` Jon Fraser
  2002-10-18 16:16       ` Robert Olsson
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Fraser @ 2002-10-17 17:28 UTC (permalink / raw)
  To: netdev


What was your cpu utilization like in the bound vs split scenarios?
Does your e1000 driver have transmit interrupts enabled or disabled?

I'd be really interested to see the results with two flows in opposite
directions.

	Jon

> -----Original Message-----
> From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com]On
> Behalf Of Robert Olsson
> Sent: Thursday, October 17, 2002 8:01 AM
> To: bert hubert
> Cc: Hyochang Nam; niv@us.ibm.com; netdev@oss.sgi.com
> Subject: Re: [Question] SMP for Linux
> 
> 
> bert hubert writes:
>  > On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
>  > > Many people helped me to solve the interrupt distribution problem.
>  > > We tested the throughput of Layer 3 forwarding on an SMP machine
>  > > equipped with two Xeon processors (2 GHz). These are our results:
>  > >   -------------------------
>  > >        SMP    |  No SMP
>  > >   -------------------------
>  > >     230 Mbps  | 330 Mbps
>  > >   -------------------------
>  > 
>  > There is something called 'irq affinity' which may be interesting for you.
>  > See here: http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm
>  > 
>  > /proc/irq/?/smp_affinity
> 
>  Hello!
> 
>  Not always good for routing... You still get the problem where one
>  interface is the output device for traffic from devices bound to
>  different CPUs.
> 
>  The TX ring can hold skb's from many CPUs, so a lot of cache bouncing
>  happens when kfree and skb_headerinit are run.
> 
>  I've played with some code to re-route the skb freeing to the CPU
>  where it was processed, to minimize cache bouncing, and I've seen
>  some good effects from this.
> 
>  And to be fair with SMP you should compare multiple flows to see if you
>  can get any aggregated performance from SMP.
> 
>  An experiment...
> 
>  Single flow eth0->eth1 w. e1000 NAPI. 2.4.20-pre5. PIII @ 2x933 MHz
> 
>  Bound = eth0 and eth1 bound to the same CPU.
>  Split = eth0 and eth1 bound to different CPUs.
>  Free  = unbound.
> 
>  SMP routing performance
>  =======================
> 
>  Bound   Free   Split   "kfree-route"
>  -------------------------------------
>   421     354     331                   kpps
>   491     348     317        437        kpps w. skb recycling
> 
>  UP routing performance
>  ======================
>  494 kpps
>  593 kpps w. skb recycling
> 
>  With the SMP test "kfree-route" the interfaces are not bound to any
>  CPU, yet we now get closer to "bound" (where both eth0 and eth1 are
>  bound to the same CPU).
> 
>  But yes, UP gives higher numbers in these single-stream tests.
>  Aggregated throughput tests are still to be done.
> 
>  Cheers.
> 
> 						--ro


* RE: [Question] SMP for Linux
  2002-10-17 17:28     ` Jon Fraser
@ 2002-10-18 16:16       ` Robert Olsson
  2002-10-18 23:29         ` Jon Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-18 16:16 UTC (permalink / raw)
  To: J_Fraser; +Cc: netdev


Jon Fraser writes:
 > 
 > What was your cpu utilization like in the bound vs split scenarios?
 
 Not measured. Gonna take a look w. a variant of Manfred's loadtest when
 possible. But measuring the CPU this way also affects throughput.
 Other softirqs are allowed to run as well now. :-)

 Over 1 Mpps was injected into eth0, so a good assumption is that for
 UP all of the CPU is used, but with SMP we might have some...
 
 > Does your e1000 driver have transmit interrupts enabled or disabled?
 
 transmit?  

 > I'd be really interested to see the results with two flows in opposite
 > directions.

 Me too.

 Cheers.
						--ro


* RE: [Question] SMP for Linux
  2002-10-18 16:16       ` Robert Olsson
@ 2002-10-18 23:29         ` Jon Fraser
  2002-10-19  0:20           ` Nivedita Singhvi
  2002-10-19  7:53           ` Robert Olsson
  0 siblings, 2 replies; 10+ messages in thread
From: Jon Fraser @ 2002-10-18 23:29 UTC (permalink / raw)
  To: 'Robert Olsson'; +Cc: netdev


I ran some tests this afternoon.
The setup is:

	2 x 1 GHz PIII CPUs w/ 256k cache
	2 Intel 82542 gig-e cards

Linux 2.4.20-pre11 kernel.
I don't have the NAPI e1000 driver.  I actually have
to ship a 2.4.18-based kernel, but decided to run some
tests on the 2.4.20 kernel.

The e1000 driver has been modified in a couple of ways.
The interrupts have been limited to 5k/second per card.  This
mimics the actual hardware being shipped, which uses an
Intel 82543 chip but has an FPGA used to do some
control functions and generate the interrupts.

We also don't use any transmit interrupts.  The TX ring
is not cleaned at interrupt time; it's cleaned when
we transmit frames and the number of free TX descriptors
drops below a threshold.  I also have some code that
directs the freed skb back to the CPU it was allocated on,
but it's not in this driver version.
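
In rough outline, the transmit path then does something like this (a
sketch only, not the shipped driver; the my_* names and the threshold
value are invented):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define TX_CLEAN_THRESHOLD 32   /* reclaim when this few descs are left */

struct my_adapter {
        int free_tx_descs;      /* free descriptors in the TX ring */
};

static void my_clean_tx_ring(struct my_adapter *ap);
static void my_post_tx_desc(struct my_adapter *ap, struct sk_buff *skb);

static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct my_adapter *ap = dev->priv;

        /* no TX interrupts: the xmit path is the only place where
         * the ring gets reclaimed */
        if (ap->free_tx_descs < TX_CLEAN_THRESHOLD)
                my_clean_tx_ring(ap);   /* free skbs of completed descs */

        if (ap->free_tx_descs == 0) {
                netif_stop_queue(dev);  /* ring still full after cleaning */
                return 1;               /* 2.4 convention: requeue the skb */
        }

        my_post_tx_desc(ap, skb);       /* hand the frame to the NIC */
        ap->free_tx_descs--;
        return 0;
}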

I used an ixia traffic generator to create the two udp flows.
Using the same terminology:
	bound = interrupts for both cards bound to one CPU
	float = no SMP affinity
	split = card 0 bound to CPU 0, card 1 bound to CPU 1

The results, in kpps, with CPU utilization below each:

                  bound    float    split
                 --------------------------
 1 flow            290      270      290
                   99%x1    65%x2    99%x1

 2 flows           270      380      450
                   99%x1    82%x2    96%x2

Previously, I've used the CPU performance monitoring counters
to find that cache invalidates tend to be a big problem when
the interrupts are not bound to a particular CPU.  Binding the
card's interrupt to a particular CPU effectively binds the flow
to that CPU.


I'll repeat the same tests on Monday with 82543-based cards.
I would expect similar results.
Oh, I used top and vmstat to collect CPU percentages, interrupts/second,
etc., so they contribute a bit to the load.

	Jon


> 
> 
> Jon Fraser writes:
>  > 
>  > What was your cpu utilization like in the bound vs split scenarios?
> 
>  Not measured. Gonna take a look w. a variant of Manfred's loadtest when
>  possible. But measuring the CPU this way also affects throughput.
>  Other softirqs are allowed to run as well now. :-)
> 
>  Over 1 Mpps was injected into eth0, so a good assumption is that for
>  UP all of the CPU is used, but with SMP we might have some...
> 
>  > Does your e1000 driver have transmit interrupts enabled or disabled?
> 
>  transmit?
> 
>  > I'd be really interested to see the results with two flows in
>  > opposite directions.
> 
>  Me too.
> 
>  Cheers.
> 						--ro


* Re: [Question] SMP for Linux
  2002-10-18 23:29         ` Jon Fraser
@ 2002-10-19  0:20           ` Nivedita Singhvi
  2002-10-19  7:53           ` Robert Olsson
  1 sibling, 0 replies; 10+ messages in thread
From: Nivedita Singhvi @ 2002-10-19  0:20 UTC (permalink / raw)
  To: J_Fraser; +Cc: 'Robert Olsson', netdev

Jon Fraser wrote:

>         bound = interrupts for both cards bound to one CPU
>         float = no SMP affinity
>         split = card 0 bound to CPU 0, card 1 bound to CPU 1
> 
> The results, in kpps, with CPU utilization below each:
> 
>                   bound    float    split
>                  --------------------------
>  1 flow            290      270      290
>                    99%x1    65%x2    99%x1
> 
>  2 flows           270      380      450
>                    99%x1    82%x2    96%x2


This is approximately what one should expect, correct? If
you have only one task (flow), then the float case will be
slower than the bound/split case (as long as the CPU isn't the
bottleneck), because you'll have an increasing number of
cacheline misses. When you have two or more flows, the
general order in which things improve would be bound,
then float, then split. The fact that the float case is
halfway between the other two is indicative of how expensive
it is to have the cacheline on the other CPU.

In the float case, it would be nice if we could see the interrupt
distribution between the CPUs. Would you happen to have the
/proc/interrupts info?


thanks,
Nivedita


* RE: [Question] SMP for Linux
  2002-10-18 23:29         ` Jon Fraser
  2002-10-19  0:20           ` Nivedita Singhvi
@ 2002-10-19  7:53           ` Robert Olsson
  2002-10-24 21:00             ` Jon Fraser
  1 sibling, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-19  7:53 UTC (permalink / raw)
  To: J_Fraser; +Cc: 'Robert Olsson', netdev


Jon Fraser writes:

 > The e1000 driver has been modified in a couple of ways.
 > The interrupts have been limited to 5k/second per card.  This
 > mimics the actual hardware being shipped, which uses an
 > Intel 82543 chip but has an FPGA used to do some
 > control functions and generate the interrupts.
 > 
 > We also don't use any transmit interrupts.  The TX ring
 > is not cleaned at interrupt time; it's cleaned when
 > we transmit frames and the number of free TX descriptors
 > drops below a threshold.  I also have some code that
 > directs the freed skb back to the CPU it was allocated on,
 > but it's not in this driver version.

 NAPI stuff will do interrupt mitigation for you. You probably
 get a lot fewer RX interrupts at your loads.
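
 The 2.4 NAPI receive path looks roughly like this (a minimal sketch;
 the my_* names are invented, but the e1000 NAPI patch has the same
 shape):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

static void my_disable_rx_irq(struct net_device *dev);
static void my_enable_rx_irq(struct net_device *dev);
static int  my_rx_ring_drain(struct net_device *dev, int limit);
static int  my_rx_ring_empty(struct net_device *dev);

static void my_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;

        my_disable_rx_irq(dev);         /* mask RX until polling is done */
        netif_rx_schedule(dev);         /* queue dev for the softirq poll */
}

static int my_poll(struct net_device *dev, int *budget)
{
        int limit = min(*budget, dev->quota);
        int done  = my_rx_ring_drain(dev, limit);  /* receive <= limit */

        *budget    -= done;
        dev->quota -= done;

        if (my_rx_ring_empty(dev)) {
                netif_rx_complete(dev); /* back to interrupt-driven mode */
                my_enable_rx_irq(dev);
                return 0;               /* done: leave the poll list */
        }
        return 1;                       /* more work: poll again */
}

 Under load the poll loop keeps draining the ring with RX interrupts
 masked, which is where the mitigation comes from.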

 I did the old trick of cleaning TX-buffers at hard_xmit as well
 but don't see any particular win from it.

Input 1.14 Mpps. Both eth0, eth1 bound to CPU0

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 4313221 8316434 8316434 5658887     28      0      0      0 BRU
eth1   1500   0     23      0      0      0 4313217      0      0      0 BRU

e1000 messes up RX-ERR and RX-DRP as seen, but you see TX-OK on eth1: 491 kpps.

           CPU0       CPU1       
 24:      53005          1   IO-APIC-level  eth1
 25:         19          0   IO-APIC-level  eth0

 Altogether 19 RX and 53k TX interrupts were used.

0041d09c 00000000 000038d9 00000000 00000000 00000000 00000000 00000000 00000001
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RC hit/miss = 4313099/384

And this run used "bound" w. private skb recycling.


 >                   bound    float    split
 >                  --------------------------
 >  1 flow            290      270      290
 >                    99%x1    65%x2    99%x1
 > 
 >  2 flows           270      380      450
 >                    99%x1    82%x2    96%x2

 Looks promising that you get aggregated performance from SMP.
 But "float" is the number to look for... It's almost impossible
 to use any device binding with forwarding, at least as a general
 solution.

 > Previously, I've used the CPU performance monitoring counters
 > to find that cache invalidates tend to be a big problem when
 > the interrupts are not bound to a particular CPU.  Binding the
 > card's interrupt to a particular CPU effectively binds the flow
 > to that CPU.

 Hope to verify this for the "kfree-route" test. Did you use oprofile
 for the performance counters?

 > I'll repeat the same tests on Monday with 82543-based cards.
 > I would expect similar results.
 > Oh, I used top and vmstat to collect CPU percentages, interrupts/second,

 Hmm, top and vmstat don't give you time spent in irqs/softirqs, except
 for softirqs run via ksoftirqd?

 Cheers.

						--ro


* RE: [Question] SMP for Linux
  2002-10-19  7:53           ` Robert Olsson
@ 2002-10-24 21:00             ` Jon Fraser
  2002-11-13  6:53               ` Eric Lemoine
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Fraser @ 2002-10-24 21:00 UTC (permalink / raw)
  To: netdev

>  Hope to verify this for the "kfree-route" test. Did you use oprofile
>  for the performance counters?

I used perfctr-2.4.0-pre2 to grab the CPU performance counters.
I was interested in cache invalidates, etc.

I guess I'll try using oprofile to see where in the code I'm really 
spending the time.

> 
>  > I'll repeat the same tests on Monday with 82543-based cards.
>  > I would expect similar results.
>  > Oh, I used top and vmstat to collect CPU percentages,
>  > interrupts/second,
> 
>  Hmm, top and vmstat don't give you time spent in irqs/softirqs,
>  except for softirqs run via ksoftirqd?

Right.


* Re: [Question] SMP for Linux
  2002-10-24 21:00             ` Jon Fraser
@ 2002-11-13  6:53               ` Eric Lemoine
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Lemoine @ 2002-11-13  6:53 UTC (permalink / raw)
  To: Jon Fraser; +Cc: netdev

> >  > I'll repeat the same tests on Monday with 82543-based cards.
> >  > I would expect similar results.
> >  > Oh, I used top and vmstat to collect CPU percentages,
> >  > interrupts/second,
> > 
> >  Hmm, top and vmstat don't give you time spent in irqs/softirqs,
> >  except for softirqs run via ksoftirqd?
> 
> Right.

Are you guys sure about this?

vmstat uses stats gathered by the kernel in the variable kstat
(of type struct kernel_stat). Here is the function that updates this
variable; it seems that time spent in irq/softirq is accounted for,
isn't it?

void update_process_times(int user_tick)
{
        struct task_struct *p = current;
        int cpu = smp_processor_id(), system = user_tick ^ 1;

        update_one_process(p, user_tick, system, cpu);
        if (p->pid) {
                if (--p->counter <= 0) {
                        p->counter = 0;
                        p->need_resched = 1;
                }
                if (p->nice > 0)
                        kstat.per_cpu_nice[cpu] += user_tick;
                else
                        kstat.per_cpu_user[cpu] += user_tick;
                kstat.per_cpu_system[cpu] += system;
        } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
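                /* idle task (p->pid == 0): a tick taken in bottom-half
                 * context, or in an interrupt nested inside another
                 * one, is still charged to system time */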
                kstat.per_cpu_system[cpu] += system;
}

PS: I agree this has nothing to do with netdev discussions, but I just
want to make sure...

Thx.
-- 
Eric


Thread overview: 10+ messages
2002-10-17  2:29 [Question] SMP for Linux Hyochang Nam
2002-10-17 10:02 ` bert hubert
2002-10-17 12:00   ` Robert Olsson
2002-10-17 17:28     ` Jon Fraser
2002-10-18 16:16       ` Robert Olsson
2002-10-18 23:29         ` Jon Fraser
2002-10-19  0:20           ` Nivedita Singhvi
2002-10-19  7:53           ` Robert Olsson
2002-10-24 21:00             ` Jon Fraser
2002-11-13  6:53               ` Eric Lemoine
