* [Question] SMP for Linux
@ 2002-10-17 2:29 Hyochang Nam
2002-10-17 10:02 ` bert hubert
0 siblings, 1 reply; 10+ messages in thread
From: Hyochang Nam @ 2002-10-17 2:29 UTC (permalink / raw)
To: niv; +Cc: netdev
Many people helped me solve the interrupt distribution problem.
We tested Layer 3 forwarding throughput on an SMP machine
with two Xeon processors (2 GHz). These are our results:
-------------------------
SMP | No SMP
-------------------------
230 Mbps | 330 Mbps
-------------------------
We used Red Hat Linux 8.0 (kernel 2.4.18-14) and
two Intel PRO/1000 Server Adapters. In the table, "SMP" means we
used the kernel built for an SMP machine, and "No SMP" means
the kernel built for single-CPU mode.
I expected that SMP might show better throughput than the "No SMP"
environment. But, contrary to my expectation, SMP showed poorer results.
I don't know what I did wrong in the experiments.
- Hyochang Nam
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] SMP for Linux
2002-10-17 2:29 [Question] SMP for Linux Hyochang Nam
@ 2002-10-17 10:02 ` bert hubert
2002-10-17 12:00 ` Robert Olsson
0 siblings, 1 reply; 10+ messages in thread
From: bert hubert @ 2002-10-17 10:02 UTC (permalink / raw)
To: Hyochang Nam; +Cc: niv, netdev
On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
> Many people helped me to solve the interrupt distribution problem.
> We tested the throughput of Layer 3 forwarding on a SMP machine
> which equips two Zero proessor(2Ghz). This is our results:
> -------------------------
> SMP | No SMP
> -------------------------
> 230 Mbps | 330 Mbps
> -------------------------
There is something called 'irq affinity' which may be interesting for you.
See here: http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm
/proc/irq/?/smp_affinity
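That file takes a hexadecimal CPU bitmask. A minimal sketch of how the mask is built and applied (the IRQ number 24 below is hypothetical; check /proc/interrupts on your own machine):

```shell
# Compute the hex affinity bitmask for a given CPU number:
# CPU0 -> 1, CPU1 -> 2, CPU2 -> 4, and so on.
mask_for_cpu() {
    printf '%x\n' $((1 << $1))
}

mask_for_cpu 0
mask_for_cpu 1

# Then, as root, pin a NIC's IRQ to one CPU, e.g.:
#   echo "$(mask_for_cpu 0)" > /proc/irq/24/smp_affinity
```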
> I expected that the SMP might show better throughput than "NO SMP"
> environments. But, contrary to my expectation, SMP showed poor results.
> I don't know what is my fault in the experiments.
SMP is not always a win.
Regards,
bert hubert
--
http://www.PowerDNS.com Versatile DNS Software & Services
http://www.tk the dot in .tk
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] SMP for Linux
2002-10-17 10:02 ` bert hubert
@ 2002-10-17 12:00 ` Robert Olsson
2002-10-17 17:28 ` Jon Fraser
0 siblings, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-17 12:00 UTC (permalink / raw)
To: bert hubert; +Cc: Hyochang Nam, niv, netdev
bert hubert writes:
> On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
> > Many people helped me to solve the interrupt distribution problem.
> > We tested the throughput of Layer 3 forwarding on a SMP machine
> > which equips two Zero proessor(2Ghz). This is our results:
> > -------------------------
> > SMP | No SMP
> > -------------------------
> > 230 Mbps | 330 Mbps
> > -------------------------
>
> There is something called 'irq affinity' which may be interesting for you.
> See here: http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm
>
> /proc/irq/?/smp_affinity
Hello!
Not always good for routing... You still get the problem where one
interface is the output device for traffic from devices bound to different CPUs.
The TX ring can hold skb's from many CPUs, so a lot of cache bouncing happens
when kfree and skb_headerinit are run.
I've played with some code to re-route the skb freeing back to the CPU
where the skb was processed, to minimize cache bouncing, and I've seen
some good effects from this.
And to be fair to SMP, you should compare multiple flows to see whether you
can get any aggregate performance gain from SMP.
An experiment...
Single flow eth0->eth1 w. e1000 NAPI. 2.4.20-pre5. PIII @ 2x933 MHz.
Bound = eth0 and eth1 bound to the same CPU.
Split = eth0 and eth1 bound to different CPUs.
Free = unbound.
SMP routing performance
=======================
Bound Free Split "kfree-route"
---------------------------------
421 354 331 kpps
491 348 317 437 kpps w. skb recycling
UP routing performance
======================
494 kpps
593 kpps w. skb recycling
In the SMP test "kfree-route" the interfaces are not bound to any CPU, yet
we are now getting closer to "bound" (where both eth0 and eth1 are bound to
the same CPU).
But yes, UP gives higher numbers in these single-stream tests. Aggregate
throughput tests are still to be done.
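To compare these kpps figures with the Mbps numbers at the top of the thread, a rough conversion can be sketched, assuming minimum-size 64-byte Ethernet frames and counting only frame bits (preamble and inter-frame gap ignored, so real link utilization is somewhat higher):

```shell
# Rough conversion: kpps of 64-byte frames -> Mbps of frame data.
# 64 bytes * 8 = 512 bits per packet; 1 kpps of such frames = 0.512 Mbps.
kpps_to_mbps() {
    echo $(( $1 * 512 / 1000 ))
}

kpps_to_mbps 421   # the "bound" SMP result, in Mbps of frame data
```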
Cheers.
--ro
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [Question] SMP for Linux
2002-10-17 12:00 ` Robert Olsson
@ 2002-10-17 17:28 ` Jon Fraser
2002-10-18 16:16 ` Robert Olsson
0 siblings, 1 reply; 10+ messages in thread
From: Jon Fraser @ 2002-10-17 17:28 UTC (permalink / raw)
To: netdev
What was your cpu utilization like in the bound vs split scenarios?
Does your e1000 driver have transmit interrupts enabled or disabled?
I'd be really interested to see the results with two flows in opposite
directions.
Jon
> -----Original Message-----
> From: netdev-bounce@oss.sgi.com [mailto:netdev-bounce@oss.sgi.com]On
> Behalf Of Robert Olsson
> Sent: Thursday, October 17, 2002 8:01 AM
> To: bert hubert
> Cc: Hyochang Nam; niv@us.ibm.com; netdev@oss.sgi.com
> Subject: Re: [Question] SMP for Linux
>
>
>
> bert hubert writes:
> > On Thu, Oct 17, 2002 at 11:29:28AM +0900, Hyochang Nam wrote:
> > > Many people helped me to solve the interrupt
> distribution problem.
> > > We tested the throughput of Layer 3 forwarding on a SMP machine
> > > which equips two Zero proessor(2Ghz). This is our results:
> > > -------------------------
> > > SMP | No SMP
> > > -------------------------
> > > 230 Mbps | 330 Mbps
> > > -------------------------
> >
> > There is something called 'irq affinity' which may be
> interesting for you.
> > See here:
> http://www.dell.com/us/en/esg/topics/power_ps1q02-morse.htm
> >
> > /proc/irq/?/smp_affinity
>
> Hello!
>
> Not always good for routing... Were you still get the
> problem were one
> interface is the output device from devices bound to different CPU's.
>
> TX-ring can hold skb's from many CPU's so a lot of cache
> bouncing happens
> when kfree and skb_headerinit is run.
>
> I've played with some to code to re-route the skb freeing to the CPU
> where it was processed this to minimize cache bouncing and I've seen
> some good effects of this.
>
> And to be fair with SMP you should compare multiple flows to
> see if you
> can get any aggregated performance from SMP.
>
> An experiment...
>
> Single flow eth0->eth1 w. e1000 NAPI. 2.4.20-pre5. PIII @ 2x933 MHz
>
> Bound = eth0, eth1 is bound to same CPU.
> Split = eth0, eth1 is bound to differnt CPU's.
> Free = unbound.
>
> SMP routing performance
> =======================
>
> Bound Free Split "kfree-route"
> ---------------------------------
> 421 354 331 kpps
> 491 348 317 437 kpps w. skb recycling
>
>
> UP routing performance
> ======================
> 494 kpps
> 593 kpps w. skb recycling
>
>
> With SMP test "kfree-route" the interfaces are not bound to
> any CPU still
> we now getting closer to "bound" (where both eth0, eth1 is
> bond to the same
> CPU).
>
> But yes UP is gives higher numbers in this single stream
> tests. Aggregated
> throughput tests are to be done.
>
> Cheers.
>
> --ro
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [Question] SMP for Linux
2002-10-17 17:28 ` Jon Fraser
@ 2002-10-18 16:16 ` Robert Olsson
2002-10-18 23:29 ` Jon Fraser
0 siblings, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-18 16:16 UTC (permalink / raw)
To: J_Fraser; +Cc: netdev
Jon Fraser writes:
>
> What was your cpu utilization like in the bound vs split scenarios?
Not measured. Gonna take a look w. a variant of Manfred's loadtest when
possible. But measuring the CPU this way also affects throughput.
Other softirqs are allowed to run as well now. :-)
Over 1 Mpps was injected into eth0, so a good assumption is that for UP
all of the CPU is used, but with SMP we might have some...
> Does your e1000 driver have transmit interrupts enabled or disabled?
transmit?
> I'd be really interested to see the results with two flows in opposite
> directions.
Me too.
Cheers.
--ro
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [Question] SMP for Linux
2002-10-18 16:16 ` Robert Olsson
@ 2002-10-18 23:29 ` Jon Fraser
2002-10-19 0:20 ` Nivedita Singhvi
2002-10-19 7:53 ` Robert Olsson
0 siblings, 2 replies; 10+ messages in thread
From: Jon Fraser @ 2002-10-18 23:29 UTC (permalink / raw)
To: 'Robert Olsson'; +Cc: netdev
I ran some tests this afternoon.
The setup is:
2 x 1 GHz PIII CPUs w/ 256 KB cache
2 Intel 82542 gig-e cards
Linux 2.4.20-pre11 kernel.
I don't have the NAPI e1000 driver. I actually have
to ship a 2.4.18 based kernel, but decided to run some
tests on the 2.4.20 kernel.
The e1000 driver has been modified in a couple of ways.
The interrupts have been limited to 5k/second per card. This
mimics the actual hardware being shipped, which uses an
Intel 82543 chip but has an FPGA to do some
control functions and generate the interrupts.
We also don't use any transmit interrupts. The TX ring
is not cleaned at interrupt time. It's cleaned when
we transmit frames and the number of free TX descriptors
drops below a threshold. I also have some code which
directs the freed skb back to the CPU it was allocated on,
but it's not in this driver version.
I used an Ixia traffic generator to create the two UDP flows.
Using the same terminology:
bound = interrupts for both cards bound to one cpu
float = no smp affinity
split = card 0 bound to cpu 0, card 1 bound to cpu 1
The results, in kpps:
bound float split
cpu % cpu% cpu%
-----------------------
1 flow 290 270 290
99%x1 65%x2 99%x1
2 flows 270 380 450
99%x1 82%x2 96%x2
Previously, I've used the CPU performance monitoring counters
to find that cache invalidates tend to be a big problem when
the interrupts are not bound to a particular CPU. Binding the
card's interrupt to a particular CPU effectively binds the flow to
that CPU.
I'll repeat the same tests on Monday with 82543 based cards.
I would expect similar results.
Oh, I used top and vmstat to collect cpu percentages, interrupts/second,
etc., so they contribute a bit to the load.
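A lighter-weight alternative to vmstat for the interrupts/second figure is to sample /proc/interrupts twice and divide the counter delta by the interval. A minimal sketch (the awk field position assumes the usual "IRQ: count0 count1 type name" layout; on SMP there is one counter column per CPU):

```shell
# Extract the CPU0 interrupt counter for a named /proc/interrupts line.
# Reads the snapshot on stdin; $1 is the device name to match.
irq_count() {
    awk -v n="$1" '$0 ~ n { print $2 }'
}

# Interrupts/second from two counter samples taken $3 seconds apart.
rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Usage on a live box (eth1 and the 5 s interval are illustrative):
#   a=$(irq_count eth1 < /proc/interrupts); sleep 5
#   b=$(irq_count eth1 < /proc/interrupts)
#   rate "$a" "$b" 5
```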
Jon
>
>
> Jon Fraser writes:
> >
> > What was your cpu utilization like in the bound vs split scenarios?
>
> Not measured. Gonna take a look w. varient of Manfred's
> loadtest when
> possible. But measuring the CPU this way also gives affects
> throughput.
> Other softirq's are allowed to run as well now. :-)
>
> Over 1 Mpps was injected into eth0 so a good assumption is
> that for UP
> all CPU is used but with SMP we might have some...
>
> > Does your e1000 driver have transmit interrupts enabled or
> disabled?
>
> transmit?
>
> > I'd be really interested to see the results with two flows
> in opposite
> > directions.
>
> Me too.
>
> Cheers.
> --ro
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] SMP for Linux
2002-10-18 23:29 ` Jon Fraser
@ 2002-10-19 0:20 ` Nivedita Singhvi
2002-10-19 7:53 ` Robert Olsson
1 sibling, 0 replies; 10+ messages in thread
From: Nivedita Singhvi @ 2002-10-19 0:20 UTC (permalink / raw)
To: J_Fraser; +Cc: 'Robert Olsson', netdev
Jon Fraser wrote:
> bound = interrupts for both cards bound to one cpu
> float = no smp affinity
> split = card 0 bound to cpu 0, card 1 bound to cpu 1
>
> The results, in kpps:
>
> bound float split
> cpu % cpu% cpu%
> -----------------------
> 1 flow 290 270 290
> 99%x1 65%x2 99%x1
>
> 2 flows 270 380 450
> 99%x1 82%x2 96%x2
This is approximately what one should expect, correct? If
you have only one task (flow), then the float case will be
slower than the bound/split cases (as long as the CPU isn't the
bottleneck), because you'll have an increasing number of
cacheline misses. When you have two or more flows, the
general order in which things improve would be the bound,
float, and then split cases. The fact that the float case is
halfway between the other two is indicative of how expensive
it is when the cacheline is on the other CPU.
In the float case, it would be nice if we could see the interrupt
distribution between the CPUs. Would you happen to have the
/proc/interrupts info?
thanks,
Nivedita
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [Question] SMP for Linux
2002-10-18 23:29 ` Jon Fraser
2002-10-19 0:20 ` Nivedita Singhvi
@ 2002-10-19 7:53 ` Robert Olsson
2002-10-24 21:00 ` Jon Fraser
1 sibling, 1 reply; 10+ messages in thread
From: Robert Olsson @ 2002-10-19 7:53 UTC (permalink / raw)
To: J_Fraser; +Cc: 'Robert Olsson', netdev
Jon Fraser writes:
> The e1000 driver has been modified in a couple of ways.
> The interrupts have been limited to 5k/second per card. This
> mimics the actual hardware being shipped which uses an
> intel 82543 chip but has an fpga used to do some
> control functions and generate the interrupts.
>
> We also don't use any transmit interrupts. The Tx ring
> is not cleaned at interrupt time. It's cleaned when
> we transmit frames and the number of free tx descriptors
> drops below a threshold. I also have some code which
> directs the freed skb back to the cpu it was allocated on,
> but it's not in this driver version.
The NAPI stuff will do interrupt mitigation for you. You probably
get a lot fewer RX interrupts at your loads.
I did the old trick of cleaning TX buffers at hard_xmit as well,
but I don't see any particular win from this.
Input 1.14 Mpps. Both eth0, eth1 bound to CPU0
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0 1500 0 4313221 8316434 8316434 5658887 28 0 0 0 BRU
eth1 1500 0 23 0 0 0 4313217 0 0 0 BRU
e1000 messes up RX-ERR and RX-DRP, as seen, but note TX-OK on eth1: 491 kpps.
CPU0 CPU1
24: 53005 1 IO-APIC-level eth1
25: 19 0 IO-APIC-level eth0
Altogether, 19 RX and 53k TX interrupts were used.
0041d09c 00000000 000038d9 00000000 00000000 00000000 00000000 00000000 00000001
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RC hit/miss = 4313099/384
And this run used "bound" w. private skb recycling.
> bound float split
> cpu % cpu% cpu%
> -----------------------
> 1 flow 290 270 290
> 99%x1 65%x2 99%x1
>
> 2 flows 270 380 450
> 99%x1 82%x2 96%x2
It looks promising that you get aggregate performance from SMP.
But "float" is the number to look at... It's almost impossible
to use any device binding with forwarding, at least as a general
solution.
> Previously, I've used the CPU performance monitoring counters
> to find that cache invalidates tends to be a big problem when
> the interrupts are not bound to a pariticular cpu. Bind the
> card to a particular interrupt effectively binds the flow to
> a particular cpu.
I hope to verify this for the "kfree-route" test. Did you use oprofile
for the performance counters?
> I'll repeat the same tests on Monday with 82543 based cards.
> I would expect similar results.
> Oh, I used top and vmstat to collect cpu percentages, interrupts/second,
Hmm, top and vmstat don't give you time spent in irqs/softirqs, except for
softirqs run via ksoftirqd?
Cheers.
--ro
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [Question] SMP for Linux
2002-10-19 7:53 ` Robert Olsson
@ 2002-10-24 21:00 ` Jon Fraser
2002-11-13 6:53 ` Eric Lemoine
0 siblings, 1 reply; 10+ messages in thread
From: Jon Fraser @ 2002-10-24 21:00 UTC (permalink / raw)
To: netdev
> Hope to verify this for "kfree-route" test. Did you use oprofile
> for performance counters?
I used perfctr-2.4.0-pre2 to grab the CPU performance counters.
I was interested in cache invalidates, etc.
I guess I'll try using oprofile to see where in the code I'm really
spending the time.
>
> > I'll repeat the same tests on Monday with 82543 based cards.
> > I would expect similar results.
> > Oh, I used top and vmstat to collect cpu percentages,
> interrupts/second,
>
> Hmm top and vmstat doesn't give you time spent in
> irq/softirq's except for
> softirq's run via ksoftird?
Right.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] SMP for Linux
2002-10-24 21:00 ` Jon Fraser
@ 2002-11-13 6:53 ` Eric Lemoine
0 siblings, 0 replies; 10+ messages in thread
From: Eric Lemoine @ 2002-11-13 6:53 UTC (permalink / raw)
To: Jon Fraser; +Cc: netdev
> > > I'll repeat the same tests on Monday with 82543 based cards.
> > > I would expect similar results.
> > > Oh, I used top and vmstat to collect cpu percentages,
> > interrupts/second,
> >
> > Hmm top and vmstat doesn't give you time spent in
> > irq/softirq's except for
> > softirq's run via ksoftird?
>
> Right.
You guys sure about this?
vmstat uses stats gathered by the kernel in the structure variable kstat
(of type struct kernel_stat). Below is the function that updates this
variable; it seems that time spent in irq/softirq is accounted, isn't
it?
void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id(), system = user_tick ^ 1;

	update_one_process(p, user_tick, system, cpu);
	if (p->pid) {
		if (--p->counter <= 0) {
			p->counter = 0;
			p->need_resched = 1;
		}
		if (p->nice > 0)
			kstat.per_cpu_nice[cpu] += user_tick;
		else
			kstat.per_cpu_user[cpu] += user_tick;
		kstat.per_cpu_system[cpu] += system;
	} else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
		kstat.per_cpu_system[cpu] += system;
}
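Those kstat per-CPU counters are what /proc/stat (and hence vmstat) exposes, so the "system" share that swallows the irq/softirq ticks can be read back out. A sketch assuming the 2.4-era aggregate line layout "cpu user nice system idle" (values in jiffies):

```shell
# Percentage of time booked as "system" from a 2.4-style /proc/stat
# "cpu" line on stdin. Per the accounting code above, this bucket
# includes ticks taken in irq/softirq context.
system_pct() {
    awk '/^cpu / { total = $2 + $3 + $4 + $5;
                   printf "%d\n", 100 * $4 / total }'
}

# Usage on a live box:
#   system_pct < /proc/stat
```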
PS: I agree this has nothing to do with netdev discussions but I just want
to make sure...
Thx.
--
Eric
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2002-11-13 6:53 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-17 2:29 [Question] SMP for Linux Hyochang Nam
2002-10-17 10:02 ` bert hubert
2002-10-17 12:00 ` Robert Olsson
2002-10-17 17:28 ` Jon Fraser
2002-10-18 16:16 ` Robert Olsson
2002-10-18 23:29 ` Jon Fraser
2002-10-19 0:20 ` Nivedita Singhvi
2002-10-19 7:53 ` Robert Olsson
2002-10-24 21:00 ` Jon Fraser
2002-11-13 6:53 ` Eric Lemoine