* htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-17 10:40 UTC
To: netdev
Hi,
I'm using htb on a dedicated shaping machine. Under heavy traffic (high
packet rate) all htb work is done on a single cpu - only one ksoftirqd
is consuming cpu power.
I have limited network stack knowledge, but I guess all htb work for a
particular interface is done in the same softirq context. Of course this
does not scale across multiple cpus, since only one of them is ever used.
Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?
Thanks,
Radu Rendec
* Re: htb parallelism on multi-core platforms
From: David Miller @ 2009-04-17 11:31 UTC
To: radu.rendec; +Cc: netdev

From: Radu Rendec <radu.rendec@ines.ro>
Date: Fri, 17 Apr 2009 13:40:44 +0300

> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.

There really isn't any way around this.

* Re: htb parallelism on multi-core platforms
From: Badalian Vyacheslav @ 2009-04-17 11:33 UTC
To: Radu Rendec; +Cc: netdev

Hello,

100% SI in ksoftirqd on one CPU means the PC can't forward that many
packets (with NAPI off, if I understand correctly). As an example from
our setup: a 2-CPU Xeon 2.4 GHz can forward about 400-500 Mbit/s full
duplex with about 20-30k htb rules; if we try to do more, we hit 100%
SI. We now use multiple PCs for this and will try to buy Intel 10G
NICs with A/IO that can use multiqueue.

Can anyone say how many CPUs we would need for about 5-7 Gbit/s in/out
with 2 x Intel 10G + A/IO (1x10G to LAN + 1x10G to WAN)? Is there any
statistic or formula to calculate this, in pps or Mbit/s? tc + iptables
(+ipset) currently use 10-30%; all the other CPU time goes to the
e1000e driver.

Thanks

> I'm using htb on a dedicated shaping machine. Under heavy traffic (high
> packet rate) all htb work is done on a single cpu - only one ksoftirqd
> is consuming cpu power.
> [...]
> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

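A minimal sketch of the interrupt spreading that multiqueue buys, as
Vyacheslav alludes to above (the IRQ numbers are assumptions; check
/proc/interrupts for the real ones - smp_affinity takes a hex CPU
bitmask):

  echo 1 > /proc/irq/48/smp_affinity   # queue 0 -> CPU0
  echo 2 > /proc/irq/49/smp_affinity   # queue 1 -> CPU1
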
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-17 22:41 UTC
To: Radu Rendec; +Cc: netdev

Radu Rendec wrote, On 04/17/2009 12:40 PM:
> I'm using htb on a dedicated shaping machine. Under heavy traffic (high
> packet rate) all htb work is done on a single cpu - only one ksoftirqd
> is consuming cpu power.
> [...]
> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

I don't know of anything (simple) for this, but I wonder if you have
already tried any htb tweaking, like the htb_hysteresis module
parameter or the burst/cburst class parameters, to cut some possibly
useless resolution/overhead?

Regards,
Jarek P.

* Re: htb parallelism on multi-core platforms
From: Denys Fedoryschenko @ 2009-04-18 0:21 UTC
To: Jarek Poplawski; +Cc: Radu Rendec, netdev

On Saturday 18 April 2009 01:41:38 Jarek Poplawski wrote:
> I don't know of anything (simple) for this, but I wonder if you have
> already tried any htb tweaking, like the htb_hysteresis module
> parameter or the burst/cburst class parameters, to cut some possibly
> useless resolution/overhead?

Like adding HZ=1000 as an environment variable in scripts :-)
For me it helps...

HFSC is also worth trying.

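A minimal sketch of that workaround, assuming an iproute2 tc whose
get_hz() honors the HZ environment variable (device, rates and burst
values here are placeholders); setting burst/cburst explicitly
sidesteps the bad defaults entirely:

  export HZ=1000   # override the clock rate tc uses to size default bursts
  tc class add dev eth0 parent 1: classid 1:10 htb \
      rate 1mbit ceil 2mbit burst 15k cburst 15k   # explicit bursts
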
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-18 7:56 UTC
To: Denys Fedoryschenko; +Cc: Radu Rendec, netdev

On Sat, Apr 18, 2009 at 03:21:50AM +0300, Denys Fedoryschenko wrote:
> Like adding HZ=1000 as an environment variable in scripts :-)
> For me it helps...

Right, if you're using high resolution; there is a bug in tc, found by
Denys, which causes wrong (too low) defaults for burst/cburst.

> HFSC is also worth trying.

Yes, it seems to be especially interesting for 64 bit boxes.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-22 14:02 UTC
To: Jarek Poplawski; +Cc: Denys Fedoryschenko, netdev

On Sat, 2009-04-18 at 09:56 +0200, Jarek Poplawski wrote:
> Right, if you're using high resolution; there is a bug in tc, found by
> Denys, which causes wrong (too low) defaults for burst/cburst.
>
> Yes, it seems to be especially interesting for 64 bit boxes.

Hi Jarek,

Thanks for the hints! As far as I understand, HFSC is also implemented
as a queue discipline (like HTB), so I guess it suffers from the same
design limitation (it doesn't span multiple CPUs). Is this assumption
correct?

As for htb_hysteresis, I actually haven't tried it. Although it is
definitely worth a try (especially if the average traffic grows), I
don't think it can compensate for the lack of multithreading / parallel
execution. At least half of the packet processing time is consumed by
classification (although I am using hashes). I guess htb_hysteresis
only affects the actual shaping (which takes place after the packet is
classified).

Thanks,

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-22 21:29 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 22 Apr 2009, Radu Rendec wrote:

> Thanks for the hints! As far as I understand, HFSC is also implemented
> as a queue discipline (like HTB), so I guess it suffers from the same
> design limitation (it doesn't span multiple CPUs). Is this assumption
> correct?

Yes.

> As for htb_hysteresis, I actually haven't tried it. Although it is
> definitely worth a try (especially if the average traffic grows), I
> don't think it can compensate for the lack of multithreading / parallel
> execution.

It's runtime-adjustable, so it's easy to try out, via
/sys/module/sch_htb/parameters/htb_hysteresis.

> At least half of the packet processing time is consumed by
> classification (although I am using hashes).

The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixed that up in 2.6.26. What kernel version are you
using?

Could you explain how you do classification? And perhaps outline where
your possible scalability issue is located?

If you are interested in how I do scalable classification, see my
presentation from Netfilter Workshop 2008:
http://nfws.inl.fr/en/?p=115
http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf

> I guess htb_hysteresis only affects the actual shaping (which takes
> place after the packet is classified).

Yes, htb_hysteresis is basically a hack to allow extra bursts... we
actually considered removing it completely...

Hilsen
Jesper Brouer

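A minimal sketch of flipping it at runtime (path as given above; the
0/1 boolean semantics are an assumption):

  cat /sys/module/sch_htb/parameters/htb_hysteresis       # current value
  echo 1 > /sys/module/sch_htb/parameters/htb_hysteresis  # allow extra bursts
  echo 0 > /sys/module/sch_htb/parameters/htb_hysteresis  # strict mode
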
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-23 8:20 UTC
To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On 22-04-2009 23:29, Jesper Dangaard Brouer wrote:
> On Wed, 22 Apr 2009, Radu Rendec wrote:
>> Thanks for the hints! As far as I understand, HFSC is also implemented
>> as a queue discipline (like HTB), so I guess it suffers from the same
>> design limitation (it doesn't span multiple CPUs). Is this assumption
>> correct?
>
> Yes.

Within a common tree of classes it would need finer locking to separate
some jobs, but considering cache problems I doubt there would be much
gain from such a redesign for smp. On the other hand, a common tree is
only necessary if these classes really have to share every byte, which
I doubt. Then we could think of config and maybe a tiny hardware
"redesign" (towards more qdiscs/roots). So, e.g. using additional
(cheap) NICs and even a switch, if possible, looks like quite a natural
way of spanning. A similar thing (multiple htb qdiscs) should be
possible in the future with one multiqueue NIC too.

There is also an interesting thread "Software receive packet steering"
nearby, but using this for shaping only looks "less simple":
http://lwn.net/Articles/328339/

BTW, I hope you add filters after the classes they point to.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-23 13:56 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, 2009-04-23 at 08:20 +0000, Jarek Poplawski wrote:
> Within a common tree of classes it would need finer locking to separate
> some jobs, but considering cache problems I doubt there would be much
> gain from such a redesign for smp. [...] So, e.g. using additional
> (cheap) NICs and even a switch, if possible, looks like quite a natural
> way of spanning. A similar thing (multiple htb qdiscs) should be
> possible in the future with one multiqueue NIC too.

Since htb has a tree structure by design, I think it's pretty difficult
to distribute shaping across different htb-enabled queues. Actually we
had thought of using completely separate machines, but soon we realized
there are some issues. Consider the following example:

Customer A and customer B share 2 Mbit of bandwidth. Each of them is
guaranteed to reach 1 Mbit and in addition is able to "borrow" up to 1
Mbit from the other's bandwidth (depending on the other's traffic).

This is done like this:

* bucket C -> rate 2 Mbit, ceil 2 Mbit
* bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
* bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C

IP filters for customer A classify packets to bucket A, and similarly
for customer B to bucket B.

It's obvious that buckets A, B and C must be in the same htb tree,
otherwise customers A and B would not be able to borrow from each
other's bandwidth. One simple rule would be to allocate all buckets
(with all their child buckets) that have rate = ceil to the same tree /
queue / whatever. I don't know if this is enough.

> There is also an interesting thread "Software receive packet steering"
> nearby, but using this for shaping only looks "less simple":
> http://lwn.net/Articles/328339/

I am aware of the thread and even tried out the author's patch (despite
the fact that David Miller suggested it was not sane). Under heavy
(simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
one CPU at 100%, the others idle. This only confirms what I've already
been told: htb is single threaded by design. It also proves that most
of the packet processing work is actually in htb.

> BTW, I hope you add filters after the classes they point to.

Do you mean the actual order I use for the "tc filter add" and "tc
class add" commands? Does it make any difference?

Anyway, speaking of htb redesign or improvement (to use multiple
threads / CPUs), I think classification rules could be cloned on a
per-thread basis (to avoid synchronization issues). This means
sacrificing memory for the benefit of performance, but it is probably
better to do it this way.

However, shaping data structures must be shared between all threads as
long as it's not certain that all packets corresponding to a certain IP
address are processed in the same thread (they most probably would not
be, if a round-robin algorithm is used).

While searching the Internet for what has already been accomplished in
this area, I ran several times across the per-CPU cache issue. The
commonly accepted opinion seems to be that CPU parallelism in packet
processing implies synchronization issues, which in turn imply cache
misses, which ultimately result in performance loss. However, with only
one core at 100% and the other 7 cores idle, I doubt that the CPU cache
is really the bottleneck (it's just a guess and it definitely needs
real tests as evidence).

Thanks,

Radu Rendec

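The bucket hierarchy above maps onto tc roughly like this (a minimal
sketch; the device name, class ids and customer subnets are
assumptions):

  tc qdisc add dev eth0 root handle 1: htb
  tc class add dev eth0 parent 1:  classid 1:1  htb rate 2mbit ceil 2mbit  # bucket C
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit  # bucket A
  tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit  # bucket B
  # classes first, then the filters that point to them
  tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 10.0.1.0/24 flowid 1:10
  tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 10.0.2.0/24 flowid 1:20
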
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-23 18:19 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, Apr 23, 2009 at 04:56:42PM +0300, Radu Rendec wrote:
> It's obvious that buckets A, B and C must be in the same htb tree,
> otherwise customers A and B would not be able to borrow from each
> other's bandwidth. One simple rule would be to allocate all buckets
> (with all their child buckets) that have rate = ceil to the same tree /
> queue / whatever. I don't know if this is enough.

Yes, what I meant was rather a config with more individual clients,
e.g. 20 x rate 50kbit ceil 100kbit. But if you have many such
rate = ceil classes, separating them onto another qdisc/NIC looks even
better (no problem with unbalanced load).

> I am aware of the thread and even tried out the author's patch (despite
> the fact that David Miller suggested it was not sane). Under heavy
> (simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
> one CPU at 100%, the others idle.

But, I wrote it's not simple. (And it was told about single
threadedness too.) This method is intended for local traffic (to
sockets) AFAIK, so I thought about using some trick with virtual devs
instead, but maybe I'm totally wrong.

> Do you mean the actual order I use for the "tc filter add" and "tc
> class add" commands? Does it make any difference?

Yes, I mean this order:
tc class add ... classid 1:23 ...
tc filter add ... flowid 1:23

> While searching the Internet for what has already been accomplished in
> this area, I ran several times across the per-CPU cache issue. [...]

There are many things to learn and to do around smp yet, just as this
"Software receive packet steering" thread shows. Anyway, there are
really big htb traffics handled as it is (look at Vyacheslav's mail in
this thread), so I guess you have something to do around your
config/hardware too.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-23 20:19 UTC
To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, 23 Apr 2009, Jarek Poplawski wrote:
> On Thu, Apr 23, 2009 at 04:56:42PM +0300, Radu Rendec wrote:
>> I am aware of the thread and even tried out the author's patch (despite
>> the fact that David Miller suggested it was not sane). Under heavy
>> (simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
>> one CPU at 100%, the others idle. This only confirms what I've already
>> been told: htb is single threaded by design.

It's more general than just HTB. We have a general qdisc serialization
point in net/sched/sch_generic.c via qdisc_lock(q).

>> It also proves that most of the packet processing work is actually in
>> htb.

I'm not sure that statement is true.
Can you run oprofile on the system? That will tell us exactly where the
time is spent...

> ...
> I thought about using some trick with virtual devs instead, but maybe
> I'm totally wrong.

I like the idea of virtual devices, as each virtual device could be
bound to a hardware tx-queue. Then you just have to construct your HTB
trees on each virtual device and assign customers accordingly.

I just realized: you don't use a multiqueue-capable NIC, right? Then it
would be difficult to use the hardware tx-queue idea. Have you thought
of using several physical NICs?

Hilsen
Jesper Brouer

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-24 9:42 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Thu, 2009-04-23 at 22:19 +0200, Jesper Dangaard Brouer wrote:
> I'm not sure that statement is true.
> Can you run oprofile on the system? That will tell us exactly where the
> time is spent...

I've never used oprofile, but it looks very powerful and simple to use.
I'll compile a 2.6.29 (so that I also benefit from the htb patch you
told me about), then put oprofile on top of it. I'll get back to you by
evening (or maybe Monday noon) with real facts :)

> I like the idea of virtual devices, as each virtual device could be
> bound to a hardware tx-queue.

Is there any current support for this, or do you talk about it as an
approach to use in future development? The idea looks interesting
indeed. If there's current support for it, I'd like to try it out. If
not, perhaps I can help at least with testing (or even some coding as
well).

> Then you just have to construct your HTB trees on each virtual
> device and assign customers accordingly.

I don't think it's that easy. Let's say we have the same HTB trees on
both virtual devices A and B (each of them bound to a different
hardware tx queue). If packets for a specific destination ip address
(pseudo)randomly arrive at both A and B, tokens will be extracted from
both the A and B trees, resulting in an erroneous overall bandwidth (at
worst double the ceil, if packets reach the ceil on both A and B).

I have to make sure packets belonging to a certain customer (or ip
address) always go through a specific virtual device. Then the HTB
trees don't even need to be identical.

However, this is not trivial at all. A single customer can have
different subnets (even from different class-B networks) but share a
single HTB bucket for all of them. Using a simple hash function on the
ip address to determine which virtual device to send through doesn't
seem to be an option, since it does not guarantee that all packets for
a certain customer will go together.

What I had in mind for parallel shaping was this:

NIC0 -> mux -----> Thread 0: classify/shape -----> NIC2
          \/
          /\
NIC1 -> mux -----> Thread 1: classify/shape -----> NIC3

Of course the number of input NICs, processing threads and output NICs
would be adjustable. But this idea has 2 major problems:

* shaping data must be shared between processing threads (in order to
  extract tokens from the same bucket regardless of the thread that
  does the actual processing)
* it seems to be impossible to do this with (unmodified) HTB

> I just realized: you don't use a multiqueue-capable NIC, right? Then it
> would be difficult to use the hardware tx-queue idea. Have you thought
> of using several physical NICs?

The machine we are preparing for production has this:

2 x Intel Corporation 82571EB Gigabit Ethernet Controller
2 x Intel Corporation 80003ES2LAN Gigabit Ethernet Controller

All 4 NICs use the e1000e driver and I think they are multi-queue
capable. So in theory I can use several NICs and/or multi-queue.

Thanks,

Radu Rendec

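One quick way to check whether those NICs actually expose multiple
queues (a hedged sketch; interface names and exact IRQ naming vary by
driver):

  grep eth /proc/interrupts
  # a multiqueue driver registers one MSI-X vector per queue,
  # e.g. eth0-rx-0, eth0-rx-1, eth0-tx-0, ...; a single "eth0" line
  # suggests one queue (or no MSI-X)
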
* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-28 10:15 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Fri, 24 Apr 2009, Radu Rendec wrote:

> I've never used oprofile, but it looks very powerful and simple to use.
> I'll compile a 2.6.29 (so that I also benefit from the htb patch you
> told me about), then put oprofile on top of it.

Remember to keep/copy the file "vmlinux".

Here are the steps I usually use:

opcontrol --vmlinux=/boot/vmlinux-`uname -r`

opcontrol --stop
opcontrol --reset
opcontrol --start

<perform stuff that needs profiling>

opcontrol --stop

"Normal" report:
opreport --symbols --image-path=/lib/modules/`uname -r`/kernel/ | less

Looking at a specific module, "sch_htb":
opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/

> Is there any current support for this, or do you talk about it as an
> approach to use in future development?

This is definitely only ideas for future development...

> I have to make sure packets belonging to a certain customer (or ip
> address) always go through a specific virtual device. Then the HTB
> trees don't even need to be identical.

Correct...

> However, this is not trivial at all. A single customer can have
> different subnets (even from different class-B networks) but share a
> single HTB bucket for all of them.

Well, I know the problem; our customers' IPs are also allocated ad hoc
and not grouped nicely :-(

> All 4 NICs use the e1000e driver and I think they are multi-queue
> capable. So in theory I can use several NICs and/or multi-queue.

I'm not sure the e1000e driver has multiqueue support for your devices.
The 82571EB chip should have 2 rx and 2 tx queues [1]. Looking through
the code, the multiqueue-capable IRQ MSI-X code first went in in kernel
version v2.6.28-rc1, BUT the driver still uses alloc_etherdev() and not
alloc_etherdev_mq().

Cheers,
Jesper Brouer

[1]: http://www.intel.com/products/ethernet/index.htm?iid=embnav1+eth#s1=Gigabit%20Ethernet&s2=82571EB&s3=all

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 10:21 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

Thanks for the oprofile newbie guide - it saved much time and digging
through man pages.

The normal report looks like this:

samples  %        image name   app name   symbol name
38424    30.7350  cls_u32.ko   cls_u32    u32_classify
5321      4.2562  e1000e.ko    e1000e     e1000_clean_rx_irq
4690      3.7515  vmlinux      vmlinux    ipt_do_table
3825      3.0596  sch_htb.ko   sch_htb    htb_dequeue
3458      2.7660  vmlinux      vmlinux    __hash_conntrack
2597      2.0773  vmlinux      vmlinux    nf_nat_setup_info
2531      2.0245  vmlinux      vmlinux    kmem_cache_alloc
2229      1.7830  vmlinux      vmlinux    ip_route_input
1722      1.3774  vmlinux      vmlinux    nf_conntrack_in
1547      1.2374  sch_htb.ko   sch_htb    htb_enqueue
1519      1.2150  vmlinux      vmlinux    kmem_cache_free
1471      1.1766  vmlinux      vmlinux    __slab_free
1435      1.1478  vmlinux      vmlinux    dev_queue_xmit
1313      1.0503  vmlinux      vmlinux    __qdisc_run
1277      1.0215  vmlinux      vmlinux    netif_receive_skb

All other symbols are below 1%.

The sch_htb.ko report is this:

samples  %        image name  symbol name
-------------------------------------------------------------------------------
3825     49.0762  sch_htb.ko  htb_dequeue
  3825   100.000  sch_htb.ko  htb_dequeue [self]
-------------------------------------------------------------------------------
1547     19.8486  sch_htb.ko  htb_enqueue
  1547   100.000  sch_htb.ko  htb_enqueue [self]
-------------------------------------------------------------------------------
608       7.8009  sch_htb.ko  htb_lookup_leaf
  608    100.000  sch_htb.ko  htb_lookup_leaf [self]
-------------------------------------------------------------------------------
459       5.8891  sch_htb.ko  htb_deactivate_prios
  459    100.000  sch_htb.ko  htb_deactivate_prios [self]
-------------------------------------------------------------------------------
417       5.3503  sch_htb.ko  htb_add_to_wait_tree
  417    100.000  sch_htb.ko  htb_add_to_wait_tree [self]
-------------------------------------------------------------------------------
372       4.7729  sch_htb.ko  htb_change_class_mode
  372    100.000  sch_htb.ko  htb_change_class_mode [self]
-------------------------------------------------------------------------------
276       3.5412  sch_htb.ko  htb_activate_prios
  276    100.000  sch_htb.ko  htb_activate_prios [self]
-------------------------------------------------------------------------------
189       2.4249  sch_htb.ko  htb_add_to_id_tree
  189    100.000  sch_htb.ko  htb_add_to_id_tree [self]
-------------------------------------------------------------------------------
101       1.2959  sch_htb.ko  htb_safe_rb_erase
  101    100.000  sch_htb.ko  htb_safe_rb_erase [self]
-------------------------------------------------------------------------------

Am I misinterpreting the results, or does it look like the real problem
is actually packet classification?

Thanks,

Radu Rendec

On Tue, 2009-04-28 at 12:15 +0200, Jesper Dangaard Brouer wrote:
> Remember to keep/copy the file "vmlinux".
> [...]
> Looking at a specific module, "sch_htb":
> opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/

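The per-symbol entries above all show 100% [self] because no call-graph
data was collected. A hedged sketch of getting caller information out
of oprofile (the depth value of 10 is arbitrary):

  opcontrol --callgraph=10   # record call stacks up to 10 frames deep
  opcontrol --start
  # ... run test traffic ...
  opcontrol --stop
  opreport --callgraph --image-path=/lib/modules/`uname -r`/kernel/ | less
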
* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-29 10:31 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 29 Apr 2009, Radu Rendec wrote:

> Thanks for the oprofile newbie guide - it saved much time and digging
> through man pages.

You are welcome :-)

Just noticed that Jeremy Kerr has made some python scripts to make it
even easier to use oprofile. See
http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/

> The normal report looks like this:
> samples  %        image name   app name   symbol name
> 38424    30.7350  cls_u32.ko   cls_u32    u32_classify
> 5321      4.2562  e1000e.ko    e1000e     e1000_clean_rx_irq
> [...]

I would rather want to see the output from cls_u32.ko:

opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/

> Am I misinterpreting the results, or does it look like the real problem
> is actually packet classification?

Yes, it looks like the problem is your u32 classification setup...
Perhaps it's not doing what you think it's doing... didn't Jarek
provide some hints for you to follow?

Hilsen
Jesper Brouer

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 11:03 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 2009-04-29 at 12:31 +0200, Jesper Dangaard Brouer wrote:
> Just noticed that Jeremy Kerr has made some python scripts to make it
> even easier to use oprofile. See
> http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/

Thanks for the hint; I'll have a look at the scripts too.

> I would rather want to see the output from cls_u32.ko:
>
> opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/

samples  %        image name  symbol name
-------------------------------------------------------------------------------
38424    100.000  cls_u32.ko  u32_classify
  38424  100.000  cls_u32.ko  u32_classify [self]
-------------------------------------------------------------------------------

Well, this doesn't tell us much more, but I think it's pretty obvious
what cls_u32 is doing :)

> Yes, it looks like the problem is your u32 classification setup...
> Perhaps it's not doing what you think it's doing... didn't Jarek
> provide some hints for you to follow?

I've just realized that I might be hitting the worst-case bucket with
the (ip) destinations I chose for the test traffic. I'll try again with
different ones.

I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
hints) - it's debatable whether it would help, since the real problem
seems to be in u32 (not htb), but I'll give it a try anyway.

Another hint was to make sure that "tc class add" goes before the
corresponding "tc filter add" - checked: it's ok.

Another interesting hint came from Calin Velea, whose tests suggest
that overall performance is better with NAPI turned off, since the (rx)
interrupt work is distributed to all cpus/cores. I'll try to replicate
this as soon as I make some small changes to my test setup, so that I'm
able to measure the overall htb throughput on the egress nic (bps and
pps).

Thanks,

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-29 12:23 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, Apr 29, 2009 at 02:03:26PM +0300, Radu Rendec wrote:

> I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
> hints) - it's debatable whether it would help, since the real problem
> seems to be in u32 (not htb), but I'll give it a try anyway.

According to the author's(?) comment on hysteresis, "The speed gain is
about 1/6", so not very much here considering the htb_dequeue time.

> Another interesting hint came from Calin Velea, whose tests suggest
> that overall performance is better with NAPI turned off, since the (rx)
> interrupt work is distributed to all cpus/cores.

Radu, since not only your worst-case but also your real-case u32
lookups are very long, I think you should mainly have a look at Calin's
u32 hash generator, or at least his method, and only after optimizing
that try these other tricks. Btw. I hope Calin made this nice program
known to the networking/admin lists too.

Btw. #2: I think you wrote you didn't use iptables...

Cheers,
Jarek P.

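A minimal sketch of the kind of u32 hashing being suggested (a generic
LARTC-style two-level hash; the device, table id, subnet and class ids
are assumptions):

  # 256-bucket hash table for destination lookups (table id 2: is arbitrary)
  tc filter add dev eth0 parent 1: prio 5 handle 2: protocol ip u32 divisor 256
  # hash on the last octet of the destination address (bytes 16-19 of the IP header)
  tc filter add dev eth0 parent 1: prio 5 protocol ip u32 ht 800:: \
      match ip dst 10.0.0.0/24 hashkey mask 0x000000ff at 16 link 2:
  # per-customer rules go straight into their bucket: 10.0.0.37 -> bucket 0x25
  tc filter add dev eth0 parent 1: prio 5 protocol ip u32 ht 2:25: \
      match ip dst 10.0.0.37/32 flowid 1:10

With such a layout a packet costs a few hash steps instead of a linear
walk over thousands of filter rules.
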
* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 13:15 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, 2009-04-29 at 14:23 +0200, Jarek Poplawski wrote:
> According to the author's(?) comment on hysteresis, "The speed gain is
> about 1/6", so not very much here considering the htb_dequeue time.

Thought so :)

> Radu, since not only your worst-case but also your real-case u32
> lookups are very long, I think you should mainly have a look at Calin's
> u32 hash generator, or at least his method, and only after optimizing
> that try these other tricks. Btw. I hope Calin made this nice program
> known to the networking/admin lists too.

I've just had a look at Calin's approach to optimizing u32 lookups. It
does indeed make very nice use of u32 hash capabilities, resulting in a
maximum of 4 lookups. The algorithm he uses takes advantage of the fact
that only a (small) subset of the whole ipv4 address space is actually
used in an ISP's network.

Unfortunately his approach makes it a bit difficult to adjust the
configuration dynamically, since the controller (program/application)
must remember the exact hash tables, filters etc. in order to be able
to add/remove CIDRs without rewriting the entire configuration. Unused
hash tables also need to be "garbage collected" and reused, otherwise
the hash table id space could be exhausted.

Since I only use IP lookups (and u32 is very generic), I'm starting to
ask myself whether a different kind of data structure and classifier
would be more appropriate. For instance, I think a binary search tree
that is matched against the bits in the ip address would give pretty
nice performance. It would take at most 32 iterations (descending
through the tree) with less overhead than the (complex) u32 rule match.

> Btw. #2: I think you wrote you didn't use iptables...

No, I don't use iptables.

Btw, the e1000e driver seems to have no way to disable NAPI. Am I
missing something (like a global kernel config option that disables
NAPI completely)?

Thanks,

Radu

* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-29 13:38 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, Apr 29, 2009 at 04:15:51PM +0300, Radu Rendec wrote:

> I've just had a look at Calin's approach to optimizing u32 lookups. It
> does indeed make very nice use of u32 hash capabilities, resulting in a
> maximum of 4 lookups. The algorithm he uses takes advantage of the fact
> that only a (small) subset of the whole ipv4 address space is actually
> used in an ISP's network.

Anyway, it looks like your main problem, and I doubt even dividing the
current work by e.g. 4 cores (if it were multi-threaded) would be
enough. These lookups are simply too long.

> No, I don't use iptables.

But your oprofile shows them. Maybe you shouldn't compile it into the
kernel at all?

> Btw, the e1000e driver seems to have no way to disable NAPI. Am I
> missing something (like a global kernel config option that disables
> NAPI completely)?

Calin uses an older kernel, and maybe the e1000 driver; I don't know.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 16:21 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

I finally managed to disable NAPI on e1000e - apparently it can only be
done with the "official" Intel driver (downloaded from their website),
by compiling with "make CFLAGS_EXTRA=-DE1000E_NO_NAPI". This doesn't
seem to be available in the in-kernel (2.6.29) driver.

With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one),
but the overall throughput *decreases* from ~110K pps (with NAPI) to
~80K pps. This makes sense, since the h/w interrupt path is much more
time consuming than polling (that's the whole idea behind NAPI anyway).

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Calin Velea @ 2009-04-29 22:49 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Wednesday, April 29, 2009, 7:21:11 PM, you wrote:

> With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one),
> but the overall throughput *decreases* from ~110K pps (with NAPI) to
> ~80K pps. This makes sense, since the h/w interrupt path is much more
> time consuming than polling (that's the whole idea behind NAPI anyway).

I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores. For 8 cores I suppose you have 2 quad-core
CPUs. If the cores actually used belong to different physical CPUs, L2
cache sharing does not occur - maybe this could explain the performance
drop in your case. Or there may be another explanation...

Anyway - coming back to David Miller's words:

"HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.
There really isn't any way around this."

It could be that the only way to get more power is to increase the
number of devices you are shaping on. You could split the IP space into
4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:

-d 0.0.0.0/2 -j IMQ --todev imq0
-d 64.0.0.0/2 -j IMQ --todev imq1
etc...

Or you can customize the split depending on the traffic distribution.
The ipset nethash match can also be used. The 4 devices can have the
same htb ruleset; only the right parts of it will match.

You should test with 4 flows that use all the devices simultaneously
and see what the aggregate throughput is. The performance gained
through parallelism might be a lot higher than the added overhead of
iptables and/or the ipset nethash match. Anyway - this is more of a
"hack" than a clean solution :)

p.s.: the latest IMQ at http://www.linuximq.net/ is for 2.6.26, so you
will need to try with that.

--
Best regards,
Calin
mailto:calin.velea@gemenii.ro

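A minimal sketch of wiring that up (assuming a 2.6.26 kernel with the
linuximq.net patch; option spelling varies between IMQ patch versions,
and the qdisc details are placeholders):

  for i in 0 1 2 3; do
      ip link set imq$i up
      tc qdisc add dev imq$i root handle 1: htb
      # ... load the same class/filter set on every imq device ...
  done
  iptables -t mangle -A PREROUTING -d 0.0.0.0/2   -j IMQ --todev 0
  iptables -t mangle -A PREROUTING -d 64.0.0.0/2  -j IMQ --todev 1
  iptables -t mangle -A PREROUTING -d 128.0.0.0/2 -j IMQ --todev 2
  iptables -t mangle -A PREROUTING -d 192.0.0.0/2 -j IMQ --todev 3
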
* Re[2]: htb parallelism on multi-core platforms
From: Calin Velea @ 2009-04-29 23:00 UTC
To: Calin Velea; +Cc: Radu Rendec, Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Thursday, April 30, 2009, 1:49:46 AM, you wrote:

> It could be that the only way to get more power is to increase the
> number of devices you are shaping on. You could split the IP space into
> 4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:
> [...]

You will also need -i ethX (router), or -m physdev --physdev-in ethX
(bridge), to differentiate between upload and download in the iptables
rules.

--
Best regards,
Calin
mailto:calin.velea@gemenii.ro

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-30 11:19 UTC
To: Calin Velea; +Cc: Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, 2009-04-30 at 01:49 +0300, Calin Velea wrote:
> I tested with e1000 only, on a single quad-core CPU - the L2 cache was
> shared between the cores. For 8 cores I suppose you have 2 quad-core
> CPUs. If the cores actually used belong to different physical CPUs, L2
> cache sharing does not occur - maybe this could explain the performance
> drop in your case. Or there may be another explanation...

It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) -
and it is very probable - then I think the L2 cache was actually
shared. That's because the CPUs in use were either 0-3 or 4-7, but
never a mix of them. So perhaps there is another explanation (maybe
driver/hardware).

> It could be that the only way to get more power is to increase the
> number of devices you are shaping on. You could split the IP space into
> 4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:
>
> -d 0.0.0.0/2 -j IMQ --todev imq0
> -d 64.0.0.0/2 -j IMQ --todev imq1

Yes, but what if, let's say, 10.0.0.0/24 and 70.0.0.0/24 need to share
bandwidth? 10.a.b.c goes to the imq0 qdisc, 70.x.y.z goes to the imq1
qdisc, and the two qdiscs (HTB sets) are independent. This will result
in a maximum of double the allocated bandwidth (if the HTB sets are
identical and the traffic is equally distributed).

> The performance gained through parallelism might be a lot higher than
> the added overhead of iptables and/or the ipset nethash match. Anyway -
> this is more of a "hack" than a clean solution :)
>
> p.s.: the latest IMQ at http://www.linuximq.net/ is for 2.6.26, so you
> will need to try with that.

Yes, the performance gained through parallelism is expected to be
higher than the loss from the additional overhead. That's why I asked
for parallel HTB in the first place, but got very disappointed after
David Miller's reply :)

Thanks a lot for all the hints and for the imq link. Imq is very
interesting regardless of whether it proves to be useful for this
project of mine or not.

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-30 11:44 UTC
To: Radu Rendec; +Cc: Calin Velea, Jarek Poplawski, Denys Fedoryschenko, netdev

On Thu, 30 Apr 2009, Radu Rendec wrote:

> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) -
> and it is very probable - then I think the L2 cache was actually
> shared. That's because the CPUs in use were either 0-3 or 4-7, but
> never a mix of them. So perhaps there is another explanation (maybe
> driver/hardware).

WRONG assumption regarding the CPU ids. Look in /proc/cpuinfo for the
correct answer. (From a: model name : Intel(R) Xeon(R) CPU E5420 @ 2.50GHz)

cat /proc/cpuinfo | egrep -e '(processor|physical id|core id)'
processor   : 0
physical id : 0
core id     : 0
processor   : 1
physical id : 1
core id     : 0
processor   : 2
physical id : 0
core id     : 2
processor   : 3
physical id : 1
core id     : 2
processor   : 4
physical id : 0
core id     : 1
processor   : 5
physical id : 1
core id     : 1
processor   : 6
physical id : 0
core id     : 3
processor   : 7
physical id : 1
core id     : 3

E.g. here CPU0 and CPU4 are sharing the same L2 cache.

Hilsen
Jesper Brouer

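The same sibling information can be read straight from sysfs on kernels
that expose cacheinfo (a hedged sketch; index2 is typically the L2
cache on these CPUs, but check the level file):

  cat /sys/devices/system/cpu/cpu0/cache/index2/level
  cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
  # e.g. "0,4" would confirm CPU0 and CPU4 share that cache
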
* Re[2]: htb parallelism on multi-core platforms
  2009-04-30 11:19 ` Radu Rendec
  2009-04-30 11:44   ` Jesper Dangaard Brouer
@ 2009-04-30 14:04   ` Calin Velea
  2009-05-08 10:15     ` Paweł Staszewski
  1 sibling, 1 reply; 39+ messages in thread
From: Calin Velea @ 2009-04-30 14:04 UTC (permalink / raw)
  To: Radu Rendec
  Cc: Calin Velea, Jarek Poplawski, Jesper Dangaard Brouer,
      Denys Fedoryschenko, netdev

Thursday, April 30, 2009, 2:19:36 PM, you wrote:

> On Thu, 2009-04-30 at 01:49 +0300, Calin Velea wrote:
>> I tested with e1000 only, on a single quad-core CPU - the L2 cache was
>> shared between the cores.
>>
>> For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
>> used belong to different physical CPUs, L2 cache sharing does not occur -
>> maybe this could explain the performance drop in your case.
>> Or there may be another explanation...

> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> it is very probable - then I think the L2 cache was actually shared.
> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> them. So perhaps there is another explanation (maybe driver/hardware).

>> It could be the only way to get more power is to increase the number
>> of devices where you are shaping. You could split the IP space into 4 groups
>> and direct the traffic to 4 IMQ devices with 4 iptables rules -
>>
>> -d 0.0.0.0/2 -j IMQ --todev 0,
>> -d 64.0.0.0/2 -j IMQ --todev 1, etc...

> Yes, but what if, let's say, 10.0.0.0/24 and 70.0.0.0/24 need to share
> bandwidth? 10.a.b.c goes to the imq0 qdisc, and 70.x.y.z goes to the imq1
> qdisc, and the two qdiscs (HTB sets) are independent. This will result in a
> maximum of double the allocated bandwidth (if HTB sets are identical and
> traffic is equally distributed).

>> The performance gained through parallelism might be a lot higher than the
>> added overhead of iptables and/or ipset nethash match. Anyway - this is more of
>> a "hack" than a clean solution :)
>>
>> p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will need to try with that

> Yes, the performance gained through parallelism is expected to be higher
> than the loss of the additional overhead. That's why I asked for
> parallel HTB in the first place, but got very disappointed after David
> Miller's reply :)

> Thanks a lot for all the hints and for the imq link. Imq is very
> interesting regardless of whether it proves to be useful for this
> project of mine or not.

> Radu Rendec

Indeed, you need to use ipset with nethash to avoid bandwidth doubling.
Let's say we have a shaping bridge: the customer side (download) is
on eth0, the upstream side (upload) is on eth1.

Create customer groups with ipset (http://ipset.netfilter.org/):

 ipset -N cust_group1_ips nethash
 ipset -A cust_group1_ips <subnet/mask>
 ....
 ....for each subnet

To shape the upload with multiple IMQs:

 -m physdev --physdev-in eth0 -m set --set cust_group1_ips src -j IMQ --todev 0
 -m physdev --physdev-in eth0 -m set --set cust_group2_ips src -j IMQ --todev 1
 -m physdev --physdev-in eth0 -m set --set cust_group3_ips src -j IMQ --todev 2
 -m physdev --physdev-in eth0 -m set --set cust_group4_ips src -j IMQ --todev 3

You will apply the same htb upload limits to imq0-3. Upload for customers
having source IPs from the first group will be shaped by imq0, for the
second, by imq1, etc...

For download:

 -m physdev --physdev-in eth1 -m set --set cust_group1_ips dst -j IMQ --todev 4
 -m physdev --physdev-in eth1 -m set --set cust_group2_ips dst -j IMQ --todev 5
 -m physdev --physdev-in eth1 -m set --set cust_group3_ips dst -j IMQ --todev 6
 -m physdev --physdev-in eth1 -m set --set cust_group4_ips dst -j IMQ --todev 7

and apply the same download limits on imq4-7.

--
Best regards,
 Calin                            mailto:calin.velea@gemenii.ro

^ permalink raw reply [flat|nested] 39+ messages in thread
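To make "apply the same htb upload limits to imq0-3" concrete, the per-device qdisc setup might look like the sketch below. The handles, rates and class layout are invented for illustration; the real per-customer classes and filters would mirror whatever runs on the physical device:

    for dev in imq0 imq1 imq2 imq3; do
        ip link set $dev up
        # identical HTB tree on every IMQ device
        tc qdisc add dev $dev root handle 1: htb default 20
        tc class add dev $dev parent 1:  classid 1:1  htb rate 950mbit
        tc class add dev $dev parent 1:1 classid 1:10 htb rate 10mbit ceil 20mbit
        tc class add dev $dev parent 1:1 classid 1:20 htb rate 2mbit  ceil 10mbit
        # ... per-customer classes and u32 filters, same as on the physical NIC
    done

Since each group of customers only ever hits one IMQ device, the identical trees never hand out bandwidth twice to the same customer.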
* Re: htb parallelism on multi-core platforms
  2009-04-30 14:04 ` Re[2]: " Calin Velea
@ 2009-05-08 10:15   ` Paweł Staszewski
  2009-05-08 17:55     ` Vladimir Ivashchenko
  0 siblings, 1 reply; 39+ messages in thread
From: Paweł Staszewski @ 2009-05-08 10:15 UTC (permalink / raw)
  To: Linux Network Development list; +Cc: netdev

Radu,

I think you have something wrong with your configuration.

I do traffic management for many different nets: a /18 prefix of outside
address space plus 10.0.0.0/18 inside, and some other nets with /20, /21,
/22 and /23 prefixes.

Some stats from my router:

 tc -s -d filter show dev eth0 | grep dst | wc -l
 14087
 tc -s -d filter show dev eth1 | grep dst | wc -l
 14087

 cat /proc/cpuinfo
 processor       : 0
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 15
 model name      : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
 stepping        : 11
 cpu MHz         : 2659.843
 cache size      : 4096 KB
 physical id     : 0
 siblings        : 2
 core id         : 0
 cpu cores       : 2
 apicid          : 0
 initial apicid  : 0
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 10
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                   cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
                   tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
                   dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
                   pdcm lahf_lm tpr_shadow vnmi flexpriority
 bogomips        : 5319.68
 clflush size    : 64
 power management:

 processor       : 1
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 15
 model name      : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
 stepping        : 11
 cpu MHz         : 2659.843
 cache size      : 4096 KB
 physical id     : 0
 siblings        : 2
 core id         : 1
 cpu cores       : 2
 apicid          : 1
 initial apicid  : 1
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 10
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                   cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
                   tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
                   dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
                   pdcm lahf_lm tpr_shadow vnmi flexpriority
 bogomips        : 5320.30
 clflush size    : 64
 power management:

 mpstat -P ALL 1 10
 Average:  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal   %idle    intr/s
 Average:  all   0.00   0.00  0.15     0.00  0.00   0.10    0.00   99.75  73231.70
 Average:    0   0.00   0.00  0.20     0.00  0.00   0.10    0.00   99.70      0.00
 Average:    1   0.00   0.00  0.00     0.00  0.00   0.00    0.00  100.00  27686.80
 Average:    2   0.00   0.00  0.00     0.00  0.00   0.00    0.00    0.00      0.00

Some opreport output:

 CPU: Core 2, speed 2659.84 MHz (estimated)
 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
 unit mask of 0x00 (Unhalted core cycles) count 100000
 samples  %        app name       symbol name
 7592      8.3103  vmlinux        rb_next
 5393      5.9033  vmlinux        e1000_get_hw_control
 4514      4.9411  vmlinux        hfsc_dequeue
 4069      4.4540  vmlinux        e1000_intr_msi
 3695      4.0446  vmlinux        u32_classify
 3522      3.8552  vmlinux        poll_idle
 2234      2.4454  vmlinux        _raw_spin_lock
 2077      2.2735  vmlinux        read_tsc
 1855      2.0305  vmlinux        rb_prev
 1834      2.0075  vmlinux        getnstimeofday
 1800      1.9703  vmlinux        e1000_clean_rx_irq
 1553      1.6999  vmlinux        ip_route_input
 1509      1.6518  vmlinux        hfsc_enqueue
 1451      1.5883  vmlinux        irq_entries_start
 1419      1.5533  vmlinux        mwait_idle
 1392      1.5237  vmlinux        e1000_clean_tx_irq
 1345      1.4723  vmlinux        rb_erase
 1294      1.4164  vmlinux        sfq_enqueue
 1187      1.2993  libc-2.6.1.so  (no symbols)
 1162      1.2719  vmlinux        sfq_dequeue
 1134      1.2413  vmlinux        ipt_do_table
 1116      1.2216  vmlinux        apic_timer_interrupt
 1108      1.2128  vmlinux        cftree_insert
 1039      1.1373  vmlinux        rtsc_y2x
 985       1.0782  vmlinux        e1000_xmit_frame
 943       1.0322  vmlinux        update_vf

 bwm-ng v0.6 (probing every 5.000s), press 'h' for help
 input: /proc/net/dev type: rate
   iface      Rx             Tx             Total
   ============================================================
   lo:            0.00 KB/s      0.00 KB/s      0.00 KB/s
   eth1:      20716.35 KB/s  24258.43 KB/s  44974.78 KB/s
   eth0:      24365.31 KB/s  30691.10 KB/s  55056.42 KB/s
   ------------------------------------------------------------

 bwm-ng v0.6 (probing every 5.000s), press 'h' for help
 input: /proc/net/dev type: rate
   iface      Rx             Tx             Total
   ============================================================
   lo:            0.00 P/s       0.00 P/s       0.00 P/s
   eth1:      38034.00 P/s   36751.00 P/s   74785.00 P/s
   eth0:      37195.40 P/s   38115.00 P/s   75310.40 P/s

Maximum CPU load is during rush hour (from 5:00 pm to 10:00 pm), when it
is 20% - 30% on each CPU.

So I think you must change the type of your hash tree in u32 filtering.
I simply split big nets like /18, /20, /21 into /24 prefixes to build my
hash tree. I ran many tests and this hash layout works best for my
configuration.

Regards
Paweł Staszewski

Calin Velea pisze:
> [...]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-05-08 10:15 ` Paweł Staszewski
@ 2009-05-08 17:55   ` Vladimir Ivashchenko
  2009-05-08 18:07     ` Denys Fedoryschenko
  0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-08 17:55 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

> >> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> >> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> >> it is very probable - then I think the L2 cache was actually shared.
> >> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> >> them. So perhaps there is another explanation (maybe driver/hardware).

Keep in mind that on Intel quad-core CPUs the cache is shared between
pairs of cores, not across all four cores.

http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm

--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-05-08 17:55 ` Vladimir Ivashchenko
@ 2009-05-08 18:07   ` Denys Fedoryschenko
  0 siblings, 0 replies; 39+ messages in thread
From: Denys Fedoryschenko @ 2009-05-08 18:07 UTC (permalink / raw)
  To: Vladimir Ivashchenko
  Cc: Paweł Staszewski, Linux Network Development list

Btw, a shared L2 cache has higher latency than a dedicated one.
That's why Core i7 rules (tested recently).

On Friday 08 May 2009 20:55:12 Vladimir Ivashchenko wrote:
> > >> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> > >> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> > >> it is very probable - then I think the L2 cache was actually shared.
> > >> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> > >> them. So perhaps there is another explanation (maybe driver/hardware).
>
> Keep in mind that on Intel quad-core CPUs the cache is shared between
> pairs of cores, not across all four cores.
>
> http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-22 21:29 ` Jesper Dangaard Brouer
  2009-04-23  8:20   ` Jarek Poplawski
@ 2009-04-23 12:31   ` Radu Rendec
  2009-04-23 18:43     ` Jarek Poplawski
  ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Radu Rendec @ 2009-04-23 12:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
> It's runtime adjustable, so it's easy to try out.
>
> via /sys/module/sch_htb/parameters/htb_hysteresis

Thanks for the tip! This means I can play around with various values
while the machine is in production and see how it reacts.

> The HTB classify hash has a scalability issue in kernels below 2.6.26.
> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
> using?

I'm using 2.6.26, so I guess the fix is already there :(

> Could you explain how you do classification? And perhaps outline where
> your possible scalability issue is located?
>
> If you are interested how I do scalable classification, see my
> presentation from Netfilter Workshop 2008:
>
> http://nfws.inl.fr/en/?p=115
> http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf

I had a look at your presentation and it seems to be focused on dividing
a single iptables rule chain into multiple chains, so that rule lookup
complexity decreases from linear to logarithmic.

Since I only need to do shaping, I don't use iptables at all. Address
matching is all done on the egress side, using u32. The rule schema is
this:

1. We have two /19 networks that differ pretty much in the first bits:
80.x.y.z and 83.a.b.c; customer address spaces range from /22 nets to
individual /32 addresses.

2. The default ip hash (0x800) is size 1 (only one bucket) and has two
rules that select between two subsequent hash tables (say 0x100 and
0x101) based on the most significant bits in the address.

3. Level 2 hash tables (0x100 and 0x101) are size 256 (256 buckets);
bucket selection is done by bits b10 - b17 (with b0 being the least
significant).

4. Each bucket contains complete cidr match rules (corresponding to real
customer addresses). Since bits b10 - b31 are already checked in upper
levels, this results in a maximum of 2 ^ 10 = 1024 rules, which is the
worst case, if all customer addresses that "fall" into that bucket
are /32 (fortunately this is not the real case).

In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all others less than 70 rules and most of them less than
10 rules.

> > I guess htb_hysteresis only affects the actual shaping (which takes
> > place after the packet is classified).
>
> Yes, htb_hysteresis basically is a hack to allow extra bursts... we
> actually considered removing it completely...

It's definitely worth a try at least. Thanks for the tips!

Radu Rendec

^ permalink raw reply [flat|nested] 39+ messages in thread
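In tc terms, the two-level layout Radu describes would look roughly like the sketch below. The qdisc handle 1:, the concrete /19s (80.0.0.0/19 and 83.0.0.0/19 stand in for the redacted prefixes), the bucket id and the sample customer rule are all illustrative; only the structure follows the description:

    # level-2 tables (0x100 and 0x101), 256 buckets each
    tc filter add dev eth0 parent 1:0 prio 1 handle 100: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 handle 101: protocol ip u32 divisor 256

    # two rules in the default one-bucket table (0x800) pick the level-2
    # table by the high bits; the hashkey mask selects the bucket from
    # bits b10-b17 of the destination address (offset 16 in the IP header)
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 80.0.0.0/19 hashkey mask 0x0003fc00 at 16 link 100:
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 83.0.0.0/19 hashkey mask 0x0003fc00 at 16 link 101:

    # each bucket then holds plain cidr matches for real customer ranges,
    # e.g. (bucket id and subnet invented):
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 100:2a: \
        match ip dst 80.0.168.0/26 flowid 1:42

The kernel folds the masked value down by the mask's lowest set bit, so a mask of 0x0003fc00 with divisor 256 yields exactly the b10-b17 bucket index described above.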
* Re: htb parallelism on multi-core platforms
  2009-04-23 12:31 ` Radu Rendec
@ 2009-04-23 18:43   ` Jarek Poplawski
  2009-04-23 19:06     ` Jesper Dangaard Brouer
  2009-04-24  6:01     ` Jarek Poplawski
  1 sibling, 2 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 18:43 UTC (permalink / raw)
  To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Radu Rendec wrote, On 04/23/2009 02:31 PM:

> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
...
>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>> using?
>
> I'm using 2.6.26, so I guess the fix is already there :(

If Jesper meant the change of hash, I can see it only in 2.6.27.

...
> In conclusion each packet would be matched against at most 1026 rules
> (worst case). The real case is actually much better: only one bucket
> with 400 rules, all others less than 70 rules and most of them less than
> 10 rules.

Alas, I can't analyze this all now, and probably I'm missing something,
but your worst and real cases look suspiciously big. Do all these classes
differ so much? Maybe you should have a look at cls_flow?

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
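For the archives: cls_flow can spread packets over many classes from a single rule, instead of one u32 rule per customer. A minimal sketch of the idea (the handle, key choice and class range are invented, not taken from the thread); this maps the low byte of the destination address onto 256 consecutive class ids starting at 1:100:

    tc filter add dev eth0 parent 1:0 prio 1 protocol ip \
        flow map key dst and 0xff baseclass 1:100

This fits best when classes are laid out regularly; irregular per-customer rate plans still need explicit rules or a mapping layer in front.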
* Re: htb parallelism on multi-core platforms
  2009-04-23 18:43 ` Jarek Poplawski
@ 2009-04-23 19:06   ` Jesper Dangaard Brouer
  2009-04-23 19:14     ` Jarek Poplawski
  1 sibling, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-23 19:06 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, 23 Apr 2009, Jarek Poplawski wrote:

> Radu Rendec wrote, On 04/23/2009 02:31 PM:
>
>> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
> ...
>>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>>> using?
>>
>> I'm using 2.6.26, so I guess the fix is already there :(
>
> If Jesper meant the change of hash, I can see it only in 2.6.27.

I'm referring to:

 commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
 Author: Patrick McHardy <kaber@trash.net>
 Date:   Sat Jul 5 23:22:35 2008 -0700

     net-sched: sch_htb: use dynamic class hash helpers

Is there any easy git way to figure out which release this commit got
into?

Cheers,
  Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:06 ` Jesper Dangaard Brouer
@ 2009-04-23 19:14   ` Jarek Poplawski
  2009-04-23 19:47     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 19:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, Apr 23, 2009 at 09:06:59PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 23 Apr 2009, Jarek Poplawski wrote:
>
>> Radu Rendec wrote, On 04/23/2009 02:31 PM:
>>
>>> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
>> ...
>>>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>>>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>>>> using?
>>>
>>> I'm using 2.6.26, so I guess the fix is already there :(
>>
>> If Jesper meant the change of hash, I can see it only in 2.6.27.
>
> I'm referring to:
>
>  commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>  Author: Patrick McHardy <kaber@trash.net>
>  Date:   Sat Jul 5 23:22:35 2008 -0700
>
>      net-sched: sch_htb: use dynamic class hash helpers
>
> Is there any easy git way to figure out which release this commit got
> into?

I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:14 ` Jarek Poplawski
@ 2009-04-23 19:47   ` Jesper Dangaard Brouer
  2009-04-23 20:00     ` Jarek Poplawski
  2009-04-23 20:09     ` Jeff King
  0 siblings, 2 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-23 19:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, 23 Apr 2009, Jarek Poplawski wrote:

> On Thu, Apr 23, 2009 at 09:06:59PM +0200, Jesper Dangaard Brouer wrote:
>> I'm referring to:
>>
>>  commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>>  Author: Patrick McHardy <kaber@trash.net>
>>  Date:   Sat Jul 5 23:22:35 2008 -0700
>>
>>      net-sched: sch_htb: use dynamic class hash helpers
>>
>> Is there any easy git way to figure out which release this commit got
>> into?
>
> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

I think I prefer the command line edition "git-describe". But it seems
that the two approaches give different results.
(Cc'ing the git mailing list as they might know the reason)

 git-describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

returns "v2.6.26-rc8-1107-gf4c1f3e", while your URL returns:
"X-Git-Tag: v2.6.27-rc1~964^2~219".

I also did a:

 git log v2.6.26..v2.6.27 | grep f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
 commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

To Radu: The change I talked about is in 2.6.27, so you should try that
kernel on your system.

Hilsen
  Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:47 ` Jesper Dangaard Brouer
@ 2009-04-23 20:00   ` Jarek Poplawski
  0 siblings, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 20:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, Apr 23, 2009 at 09:47:05PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 23 Apr 2009, Jarek Poplawski wrote:
...
>> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>
> I think I prefer the command line edition "git-describe". But it seems
> that the two approaches give different results.

Probably there is something more needed around this git-describe.
I prefer the command line too, when I can remember the command line...

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:47 ` Jesper Dangaard Brouer
  2009-04-23 20:00   ` Jarek Poplawski
@ 2009-04-23 20:09   ` Jeff King
  1 sibling, 0 replies; 39+ messages in thread
From: Jeff King @ 2009-04-23 20:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jarek Poplawski, Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, Apr 23, 2009 at 09:47:05PM +0200, Jesper Dangaard Brouer wrote:

>>> Is there any easy git way to figure out which release this commit got
>>> into?
>>
>> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>
> I think I prefer the command line edition "git-describe". But it seems
> that the two approaches give different results.
> (Cc'ing the git mailing list as they might know the reason)

You want "git describe --contains". The default mode for describe is
"you are at tag $X, plus $N commits, and by the way, the sha1 is $H"
(shown as "$X-$N-g$H"). The default mode is useful for generating a
unique semi-human-readable version number (e.g., to be included in your
builds).

-Peff

^ permalink raw reply [flat|nested] 39+ messages in thread
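Side by side, for the commit in question: the plain describe output is the one Jesper quoted earlier; the --contains output is inferred from the X-Git-Tag header shown by gitweb, so treat it as illustrative:

    $ git describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
    v2.6.26-rc8-1107-gf4c1f3e

    $ git describe --contains f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
    v2.6.27-rc1~964^2~219

The first names the newest tag *behind* the commit (hence v2.6.26-rc8); the second names the oldest tag that *contains* it, which is what "which release did this land in" actually asks.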
* Re: htb parallelism on multi-core platforms
  2009-04-23 18:43 ` Jarek Poplawski
  2009-04-23 19:06   ` Jesper Dangaard Brouer
@ 2009-04-24  6:01   ` Jarek Poplawski
  1 sibling, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-24 6:01 UTC (permalink / raw)
  To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On 23-04-2009 20:43, Jarek Poplawski wrote:
> Radu Rendec wrote, On 04/23/2009 02:31 PM:
...
>> In conclusion each packet would be matched against at most 1026 rules
>> (worst case). The real case is actually much better: only one bucket
>> with 400 rules, all others less than 70 rules and most of them less than
>> 10 rules.
>
> Alas, I can't analyze this all now, and probably I'm missing something,
> but your worst and real cases look suspiciously big. Do all these classes
> differ so much? Maybe you should have a look at cls_flow?

Actually, fixing this u32 config (hashes) should be enough here.

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <1039493214.20090424135024@gemenii.ro>]
* Re: htb parallelism on multi-core platforms
  [not found] ` <1039493214.20090424135024@gemenii.ro>
@ 2009-04-24 11:19   ` Jarek Poplawski
  0 siblings, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-24 11:19 UTC (permalink / raw)
  To: Calin Velea
  Cc: Radu Rendec, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Fri, Apr 24, 2009 at 01:50:24PM +0300, Calin Velea wrote:
> Hi,

Hi,

Very interesting message, but try to use plain format next time.
I guess your mime/html original wasn't accepted by netdev@.

Jarek P.

> [...]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re[2]: htb parallelism on multi-core platforms
  2009-04-23 12:31 ` Radu Rendec
  2009-04-23 18:43   ` Jarek Poplawski
  [not found]   ` <1039493214.20090424135024@gemenii.ro>
@ 2009-04-24 11:35   ` Calin Velea
  2 siblings, 0 replies; 39+ messages in thread
From: Calin Velea @ 2009-04-24 11:35 UTC (permalink / raw)
  To: netdev

Hi,

Maybe some actual results I got some time ago could help you and others
who have had the same problems:

Hardware: quad-core Xeon X3210 (2.13GHz, 8M L2 cache), 2 Intel PCI
          Express Gigabit NICs
Kernel:   2.6.20

I did some udp flood tests in the following configurations - the machine
was configured as a traffic shaping bridge, about 10k htb rules loaded,
using hashing (see below):

A) napi on, irqs for each card statically allocated to 2 CPU cores

   when flooding, the same CPU went 100% softirq always (seems logical,
   since it is statically bound to the irq)

B) napi on, CONFIG_IRQBALANCE=y

   when flooding, a random CPU went 100% softirq always. (here, at high
   interrupt rates, NAPI kicks in and starts using polling rather than
   irqs, so no more balancing takes place since there are no more
   interrupts - checked this with /proc/interrupts - at high packet
   rates the irq counters for the network cards stalled)

C) napi off, CONFIG_IRQBALANCE=y

   this is the setup I used in the end since all CPU cores were used.
   All of them went to 100%, and the pps rate I could pass through was
   higher than in case A or B.

Also, your worst case hashing setup could be improved - I suggest you
take a look at http://vcalinus.gemenii.ro/?p=9 (see the generated
filters example). The hashing method described there will take a
constant CPU time (4 checks) for each packet, regardless of how many
filter rules you have (provided you only filter by IP address). A tree
of hashtables is constructed which matches each of the four bytes of
the IP address in succession.

Using this hashing method, the hardware above, 2.6.20 with napi off and
irq balancing on, I got throughputs of 1.3Gbps / 250.000 pps aggregated
in+out in normal usage. CPU utilization averages varied between 25 - 50%
for every core, so there was still room to grow. I expect much higher
pps rates with better hardware (higher freq/larger cache Xeons).

Thursday, April 23, 2009, 3:31:47 PM, you wrote:

> [...]

--
Best regards,
 Calin                            mailto:calin.velea@gemenii.ro

^ permalink raw reply [flat|nested] 39+ messages in thread
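For reference, the byte-at-a-time tree behind that link looks roughly like this in tc. The table handles, device name and the sample leaf are invented, and only the first two of the four levels are spelled out:

    # level 1: hash on the first octet of the destination address
    tc filter add dev eth0 parent 1:0 prio 1 handle 10: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 0.0.0.0/0 hashkey mask 0xff000000 at 16 link 10:

    # level 2, one table per used first octet; e.g. for 80.0.0.0/8,
    # whose level-1 bucket is 0x50 (= 80):
    tc filter add dev eth0 parent 1:0 prio 1 handle 11: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 10:50: \
        match ip dst 80.0.0.0/8 hashkey mask 0x00ff0000 at 16 link 11:

    # levels 3 and 4 repeat the pattern on the third and fourth octets;
    # a leaf rule finally points at the customer's class, e.g.
    # (table 13: and the address are invented, bucket 0x7b = 123):
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 13:7b: \
        match ip dst 80.1.2.123/32 flowid 1:42

Because each level hashes exactly one octet with divisor 256, the bucket index equals the octet value, and every packet costs four bucket probes no matter how many customers are loaded.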
Thread overview: 39+ messages
2009-04-17 10:40 htb parallelism on multi-core platforms Radu Rendec
2009-04-17 11:31 ` David Miller
2009-04-17 11:33 ` Badalian Vyacheslav
2009-04-17 22:41 ` Jarek Poplawski
2009-04-18 0:21 ` Denys Fedoryschenko
2009-04-18 7:56 ` Jarek Poplawski
2009-04-22 14:02 ` Radu Rendec
2009-04-22 21:29 ` Jesper Dangaard Brouer
2009-04-23 8:20 ` Jarek Poplawski
2009-04-23 13:56 ` Radu Rendec
2009-04-23 18:19 ` Jarek Poplawski
2009-04-23 20:19 ` Jesper Dangaard Brouer
2009-04-24 9:42 ` Radu Rendec
2009-04-28 10:15 ` Jesper Dangaard Brouer
2009-04-29 10:21 ` Radu Rendec
2009-04-29 10:31 ` Jesper Dangaard Brouer
2009-04-29 11:03 ` Radu Rendec
2009-04-29 12:23 ` Jarek Poplawski
2009-04-29 13:15 ` Radu Rendec
2009-04-29 13:38 ` Jarek Poplawski
2009-04-29 16:21 ` Radu Rendec
2009-04-29 22:49 ` Calin Velea
2009-04-29 23:00 ` Re[2]: " Calin Velea
2009-04-30 11:19 ` Radu Rendec
2009-04-30 11:44 ` Jesper Dangaard Brouer
2009-04-30 14:04 ` Re[2]: " Calin Velea
2009-05-08 10:15 ` Paweł Staszewski
2009-05-08 17:55 ` Vladimir Ivashchenko
2009-05-08 18:07 ` Denys Fedoryschenko
2009-04-23 12:31 ` Radu Rendec
2009-04-23 18:43 ` Jarek Poplawski
2009-04-23 19:06 ` Jesper Dangaard Brouer
2009-04-23 19:14 ` Jarek Poplawski
2009-04-23 19:47 ` Jesper Dangaard Brouer
2009-04-23 20:00 ` Jarek Poplawski
2009-04-23 20:09 ` Jeff King
2009-04-24 6:01 ` Jarek Poplawski
[not found] ` <1039493214.20090424135024@gemenii.ro>
2009-04-24 11:19 ` Jarek Poplawski
2009-04-24 11:35 ` Re[2]: " Calin Velea