* htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-17 10:40 UTC
To: netdev
Hi,
I'm using htb on a dedicated shaping machine. Under heavy traffic (high
packet rate) all htb work is done on a single cpu - only one ksoftirqd
is consuming cpu power.
I have limited network stack knowledge, but I guess all htb work for a
particular interface is done in the same softirq context. Of course this
does not scale across multiple cpus, since only one of them is ever used.
Is there any (simple) approach to distribute htb work (for one
interface) on multiple cpus?
Thanks,
Radu Rendec
* Re: htb parallelism on multi-core platforms
From: David Miller @ 2009-04-17 11:31 UTC
To: radu.rendec; +Cc: netdev

From: Radu Rendec <radu.rendec@ines.ro>
Date: Fri, 17 Apr 2009 13:40:44 +0300

> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.

There really isn't any way around this.

* Re: htb parallelism on multi-core platforms
From: Badalian Vyacheslav @ 2009-04-17 11:33 UTC
To: Radu Rendec; +Cc: netdev

Hello,

100% SI in ksoftirqd on one CPU means the PC can't forward that many
packets (with NAPI off, if I understand correctly). As an example from
our setup: a 2-CPU Xeon 2.4 GHz can forward about 400-500 Mbit/s full
duplex with about 20-30k htb rules; if we try to do more, we hit 100%
SI. We now use multiple PCs for this and will try to buy Intel 10G
NICs with A/IO that can use multiqueue.

Can anyone say how many CPUs we would need for about 5-7 Gbit/s in/out
with 2 x Intel 10G + A/IO (1x10G to LAN + 1x10G to WAN)? Is there any
statistic or formula to calculate this, in pps or Mbit/s? tc + iptables
(+ipset) currently use 10-30%; all the other CPU time goes to the
e1000e driver.

Thanks

> I'm using htb on a dedicated shaping machine. Under heavy traffic (high
> packet rate) all htb work is done on a single cpu - only one ksoftirqd
> is consuming cpu power.
> [...]
> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

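A minimal sketch of the interrupt spreading that multiqueue buys, as
Vyacheslav alludes to above (the IRQ numbers are assumptions; check
/proc/interrupts for the real ones - smp_affinity takes a hex CPU
bitmask):

  echo 1 > /proc/irq/48/smp_affinity   # queue 0 -> CPU0
  echo 2 > /proc/irq/49/smp_affinity   # queue 1 -> CPU1
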
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-17 22:41 UTC
To: Radu Rendec; +Cc: netdev

Radu Rendec wrote, On 04/17/2009 12:40 PM:
> I'm using htb on a dedicated shaping machine. Under heavy traffic (high
> packet rate) all htb work is done on a single cpu - only one ksoftirqd
> is consuming cpu power.
> [...]
> Is there any (simple) approach to distribute htb work (for one
> interface) on multiple cpus?

I don't know of anything (simple) for this, but I wonder if you have
already tried any htb tweaking, like the htb_hysteresis module
parameter or the burst/cburst class parameters, to cut some possibly
useless resolution/overhead?

Regards,
Jarek P.

* Re: htb parallelism on multi-core platforms
From: Denys Fedoryschenko @ 2009-04-18 0:21 UTC
To: Jarek Poplawski; +Cc: Radu Rendec, netdev

On Saturday 18 April 2009 01:41:38 Jarek Poplawski wrote:
> I don't know of anything (simple) for this, but I wonder if you have
> already tried any htb tweaking, like the htb_hysteresis module
> parameter or the burst/cburst class parameters, to cut some possibly
> useless resolution/overhead?

Like adding HZ=1000 as an environment variable in scripts :-)
For me it helps...

HFSC is also worth trying.

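A minimal sketch of that workaround, assuming an iproute2 tc whose
get_hz() honors the HZ environment variable (device, rates and burst
values here are placeholders); setting burst/cburst explicitly
sidesteps the bad defaults entirely:

  export HZ=1000   # override the clock rate tc uses to size default bursts
  tc class add dev eth0 parent 1: classid 1:10 htb \
      rate 1mbit ceil 2mbit burst 15k cburst 15k   # explicit bursts
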
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-18 7:56 UTC
To: Denys Fedoryschenko; +Cc: Radu Rendec, netdev

On Sat, Apr 18, 2009 at 03:21:50AM +0300, Denys Fedoryschenko wrote:
> Like adding HZ=1000 as an environment variable in scripts :-)
> For me it helps...

Right, if you're using high resolution; there is a bug in tc, found by
Denys, which causes wrong (too low) defaults for burst/cburst.

> HFSC is also worth trying.

Yes, it seems to be especially interesting for 64 bit boxes.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-22 14:02 UTC
To: Jarek Poplawski; +Cc: Denys Fedoryschenko, netdev

On Sat, 2009-04-18 at 09:56 +0200, Jarek Poplawski wrote:
> Right, if you're using high resolution; there is a bug in tc, found by
> Denys, which causes wrong (too low) defaults for burst/cburst.
>
> Yes, it seems to be especially interesting for 64 bit boxes.

Hi Jarek,

Thanks for the hints! As far as I understand, HFSC is also implemented
as a queue discipline (like HTB), so I guess it suffers from the same
design limitation (it doesn't span multiple CPUs). Is this assumption
correct?

As for htb_hysteresis, I actually haven't tried it. Although it is
definitely worth a try (especially if the average traffic grows), I
don't think it can compensate for the lack of multithreading / parallel
execution. At least half of the packet processing time is consumed by
classification (although I am using hashes). I guess htb_hysteresis
only affects the actual shaping (which takes place after the packet is
classified).

Thanks,

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-22 21:29 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 22 Apr 2009, Radu Rendec wrote:

> Thanks for the hints! As far as I understand, HFSC is also implemented
> as a queue discipline (like HTB), so I guess it suffers from the same
> design limitation (it doesn't span multiple CPUs). Is this assumption
> correct?

Yes.

> As for htb_hysteresis, I actually haven't tried it. Although it is
> definitely worth a try (especially if the average traffic grows), I
> don't think it can compensate for the lack of multithreading / parallel
> execution.

It's runtime-adjustable, so it's easy to try out, via
/sys/module/sch_htb/parameters/htb_hysteresis.

> At least half of the packet processing time is consumed by
> classification (although I am using hashes).

The HTB classify hash has a scalability issue in kernels below 2.6.26.
Patrick McHardy fixed that up in 2.6.26. What kernel version are you
using?

Could you explain how you do classification? And perhaps outline where
your possible scalability issue is located?

If you are interested in how I do scalable classification, see my
presentation from Netfilter Workshop 2008:
http://nfws.inl.fr/en/?p=115
http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf

> I guess htb_hysteresis only affects the actual shaping (which takes
> place after the packet is classified).

Yes, htb_hysteresis is basically a hack to allow extra bursts... we
actually considered removing it completely...

Hilsen
Jesper Brouer

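A minimal sketch of flipping it at runtime (path as given above; the
0/1 boolean semantics are an assumption):

  cat /sys/module/sch_htb/parameters/htb_hysteresis       # current value
  echo 1 > /sys/module/sch_htb/parameters/htb_hysteresis  # allow extra bursts
  echo 0 > /sys/module/sch_htb/parameters/htb_hysteresis  # strict mode
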
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-23 8:20 UTC
To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On 22-04-2009 23:29, Jesper Dangaard Brouer wrote:
> On Wed, 22 Apr 2009, Radu Rendec wrote:
>> Thanks for the hints! As far as I understand, HFSC is also implemented
>> as a queue discipline (like HTB), so I guess it suffers from the same
>> design limitation (it doesn't span multiple CPUs). Is this assumption
>> correct?
>
> Yes.

Within a common tree of classes it would need finer locking to separate
some jobs, but considering cache problems I doubt there would be much
gain from such a redesign for smp. On the other hand, a common tree is
only necessary if these classes really have to share every byte, which
I doubt. Then we could think of config and maybe a tiny hardware
"redesign" (towards more qdiscs/roots). So, e.g. using additional
(cheap) NICs and even a switch, if possible, looks like quite a natural
way of spanning. A similar thing (multiple htb qdiscs) should be
possible in the future with one multiqueue NIC too.

There is also an interesting thread "Software receive packet steering"
nearby, but using this for shaping only looks "less simple":
http://lwn.net/Articles/328339/

BTW, I hope you add filters after the classes they point to.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-23 13:56 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, 2009-04-23 at 08:20 +0000, Jarek Poplawski wrote:
> Within a common tree of classes it would need finer locking to separate
> some jobs, but considering cache problems I doubt there would be much
> gain from such a redesign for smp. [...] So, e.g. using additional
> (cheap) NICs and even a switch, if possible, looks like quite a natural
> way of spanning. A similar thing (multiple htb qdiscs) should be
> possible in the future with one multiqueue NIC too.

Since htb has a tree structure by design, I think it's pretty difficult
to distribute shaping across different htb-enabled queues. Actually we
had thought of using completely separate machines, but soon we realized
there are some issues. Consider the following example:

Customer A and customer B share 2 Mbit of bandwidth. Each of them is
guaranteed to reach 1 Mbit and in addition is able to "borrow" up to 1
Mbit from the other's bandwidth (depending on the other's traffic).

This is done like this:

* bucket C -> rate 2 Mbit, ceil 2 Mbit
* bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
* bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C

IP filters for customer A classify packets to bucket A, and similarly
for customer B to bucket B.

It's obvious that buckets A, B and C must be in the same htb tree,
otherwise customers A and B would not be able to borrow from each
other's bandwidth. One simple rule would be to allocate all buckets
(with all their child buckets) that have rate = ceil to the same tree /
queue / whatever. I don't know if this is enough.

> There is also an interesting thread "Software receive packet steering"
> nearby, but using this for shaping only looks "less simple":
> http://lwn.net/Articles/328339/

I am aware of the thread and even tried out the author's patch (despite
the fact that David Miller suggested it was not sane). Under heavy
(simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
one CPU at 100%, the others idle. This only confirms what I've already
been told: htb is single threaded by design. It also proves that most
of the packet processing work is actually in htb.

> BTW, I hope you add filters after the classes they point to.

Do you mean the actual order I use for the "tc filter add" and "tc
class add" commands? Does it make any difference?

Anyway, speaking of htb redesign or improvement (to use multiple
threads / CPUs), I think classification rules could be cloned on a
per-thread basis (to avoid synchronization issues). This means
sacrificing memory for the benefit of performance, but it is probably
better to do it this way.

However, shaping data structures must be shared between all threads as
long as it's not certain that all packets corresponding to a certain IP
address are processed in the same thread (they most probably would not
be, if a round-robin algorithm is used).

While searching the Internet for what has already been accomplished in
this area, I ran several times across the per-CPU cache issue. The
commonly accepted opinion seems to be that CPU parallelism in packet
processing implies synchronization issues, which in turn imply cache
misses, which ultimately result in performance loss. However, with only
one core at 100% and the other 7 cores idle, I doubt that the CPU cache
is really the bottleneck (it's just a guess and it definitely needs
real tests as evidence).

Thanks,

Radu Rendec

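The bucket hierarchy above maps onto tc roughly like this (a minimal
sketch; the device name, class ids and customer subnets are
assumptions):

  tc qdisc add dev eth0 root handle 1: htb
  tc class add dev eth0 parent 1:  classid 1:1  htb rate 2mbit ceil 2mbit  # bucket C
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit  # bucket A
  tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit  # bucket B
  # classes first, then the filters that point to them
  tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 10.0.1.0/24 flowid 1:10
  tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 10.0.2.0/24 flowid 1:20
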
* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-23 18:19 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, Apr 23, 2009 at 04:56:42PM +0300, Radu Rendec wrote:
> It's obvious that buckets A, B and C must be in the same htb tree,
> otherwise customers A and B would not be able to borrow from each
> other's bandwidth. One simple rule would be to allocate all buckets
> (with all their child buckets) that have rate = ceil to the same tree /
> queue / whatever. I don't know if this is enough.

Yes, what I meant was rather a config with more individual clients,
e.g. 20 x rate 50kbit ceil 100kbit. But if you have many such
rate = ceil classes, separating them onto another qdisc/NIC looks even
better (no problem with unbalanced load).

> I am aware of the thread and even tried out the author's patch (despite
> the fact that David Miller suggested it was not sane). Under heavy
> (simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
> one CPU at 100%, the others idle.

But, I wrote it's not simple. (And it was told about single
threadedness too.) This method is intended for local traffic (to
sockets) AFAIK, so I thought about using some trick with virtual devs
instead, but maybe I'm totally wrong.

> Do you mean the actual order I use for the "tc filter add" and "tc
> class add" commands? Does it make any difference?

Yes, I mean this order:
tc class add ... classid 1:23 ...
tc filter add ... flowid 1:23

> While searching the Internet for what has already been accomplished in
> this area, I ran several times across the per-CPU cache issue. [...]

There are many things to learn and to do around smp yet, just as this
"Software receive packet steering" thread shows. Anyway, there are
really big htb traffics handled as it is (look at Vyacheslav's mail in
this thread), so I guess you have something to do around your
config/hardware too.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-23 20:19 UTC
To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, 23 Apr 2009, Jarek Poplawski wrote:
> On Thu, Apr 23, 2009 at 04:56:42PM +0300, Radu Rendec wrote:
>> I am aware of the thread and even tried out the author's patch (despite
>> the fact that David Miller suggested it was not sane). Under heavy
>> (simulated) traffic nothing changed: only one ksoftirqd using 100% CPU,
>> one CPU at 100%, the others idle. This only confirms what I've already
>> been told: htb is single threaded by design.

It's more general than just HTB. We have a general qdisc serialization
point in net/sched/sch_generic.c via qdisc_lock(q).

>> It also proves that most of the packet processing work is actually in
>> htb.

I'm not sure that statement is true.
Can you run oprofile on the system? That will tell us exactly where the
time is spent...

> ...
> I thought about using some trick with virtual devs instead, but maybe
> I'm totally wrong.

I like the idea of virtual devices, as each virtual device could be
bound to a hardware tx-queue. Then you just have to construct your HTB
trees on each virtual device and assign customers accordingly.

I just realized: you don't use a multiqueue-capable NIC, right? Then it
would be difficult to use the hardware tx-queue idea. Have you thought
of using several physical NICs?

Hilsen
Jesper Brouer

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-24 9:42 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Thu, 2009-04-23 at 22:19 +0200, Jesper Dangaard Brouer wrote:
> I'm not sure that statement is true.
> Can you run oprofile on the system? That will tell us exactly where the
> time is spent...

I've never used oprofile, but it looks very powerful and simple to use.
I'll compile a 2.6.29 (so that I also benefit from the htb patch you
told me about), then put oprofile on top of it. I'll get back to you by
evening (or maybe Monday noon) with real facts :)

> I like the idea of virtual devices, as each virtual device could be
> bound to a hardware tx-queue.

Is there any current support for this, or do you talk about it as an
approach to use in future development? The idea looks interesting
indeed. If there's current support for it, I'd like to try it out. If
not, perhaps I can help at least with testing (or even some coding as
well).

> Then you just have to construct your HTB trees on each virtual
> device and assign customers accordingly.

I don't think it's that easy. Let's say we have the same HTB trees on
both virtual devices A and B (each of them bound to a different
hardware tx queue). If packets for a specific destination ip address
(pseudo)randomly arrive at both A and B, tokens will be extracted from
both the A and B trees, resulting in an erroneous overall bandwidth (at
worst double the ceil, if packets reach the ceil on both A and B).

I have to make sure packets belonging to a certain customer (or ip
address) always go through a specific virtual device. Then the HTB
trees don't even need to be identical.

However, this is not trivial at all. A single customer can have
different subnets (even from different class-B networks) but share a
single HTB bucket for all of them. Using a simple hash function on the
ip address to determine which virtual device to send through doesn't
seem to be an option, since it does not guarantee that all packets for
a certain customer will go together.

What I had in mind for parallel shaping was this:

NIC0 -> mux -----> Thread 0: classify/shape -----> NIC2
          \/
          /\
NIC1 -> mux -----> Thread 1: classify/shape -----> NIC3

Of course the number of input NICs, processing threads and output NICs
would be adjustable. But this idea has 2 major problems:

* shaping data must be shared between processing threads (in order to
  extract tokens from the same bucket regardless of the thread that
  does the actual processing)
* it seems to be impossible to do this with (unmodified) HTB

> I just realized: you don't use a multiqueue-capable NIC, right? Then it
> would be difficult to use the hardware tx-queue idea. Have you thought
> of using several physical NICs?

The machine we are preparing for production has this:

2 x Intel Corporation 82571EB Gigabit Ethernet Controller
2 x Intel Corporation 80003ES2LAN Gigabit Ethernet Controller

All 4 NICs use the e1000e driver and I think they are multi-queue
capable. So in theory I can use several NICs and/or multi-queue.

Thanks,

Radu Rendec

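One quick way to check whether those NICs actually expose multiple
queues (a hedged sketch; interface names and exact IRQ naming vary by
driver):

  grep eth /proc/interrupts
  # a multiqueue driver registers one MSI-X vector per queue,
  # e.g. eth0-rx-0, eth0-rx-1, eth0-tx-0, ...; a single "eth0" line
  # suggests one queue (or no MSI-X)
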
* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-28 10:15 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Fri, 24 Apr 2009, Radu Rendec wrote:

> I've never used oprofile, but it looks very powerful and simple to use.
> I'll compile a 2.6.29 (so that I also benefit from the htb patch you
> told me about), then put oprofile on top of it.

Remember to keep/copy the file "vmlinux".

Here are the steps I usually use:

opcontrol --vmlinux=/boot/vmlinux-`uname -r`

opcontrol --stop
opcontrol --reset
opcontrol --start

<perform stuff that needs profiling>

opcontrol --stop

"Normal" report:
opreport --symbols --image-path=/lib/modules/`uname -r`/kernel/ | less

Looking at a specific module, "sch_htb":
opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/

> Is there any current support for this, or do you talk about it as an
> approach to use in future development?

This is definitely only ideas for future development...

> I have to make sure packets belonging to a certain customer (or ip
> address) always go through a specific virtual device. Then the HTB
> trees don't even need to be identical.

Correct...

> However, this is not trivial at all. A single customer can have
> different subnets (even from different class-B networks) but share a
> single HTB bucket for all of them.

Well, I know the problem; our customers' IPs are also allocated ad hoc
and not grouped nicely :-(

> All 4 NICs use the e1000e driver and I think they are multi-queue
> capable. So in theory I can use several NICs and/or multi-queue.

I'm not sure the e1000e driver has multiqueue support for your devices.
The 82571EB chip should have 2 rx and 2 tx queues [1]. Looking through
the code, the multiqueue-capable IRQ MSI-X code first went in in kernel
version v2.6.28-rc1, BUT the driver still uses alloc_etherdev() and not
alloc_etherdev_mq().

Cheers,
Jesper Brouer

[1]: http://www.intel.com/products/ethernet/index.htm?iid=embnav1+eth#s1=Gigabit%20Ethernet&s2=82571EB&s3=all

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 10:21 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

Thanks for the oprofile newbie guide - it saved much time and digging
through man pages.

The normal report looks like this:

samples  %        image name   app name   symbol name
38424    30.7350  cls_u32.ko   cls_u32    u32_classify
5321      4.2562  e1000e.ko    e1000e     e1000_clean_rx_irq
4690      3.7515  vmlinux      vmlinux    ipt_do_table
3825      3.0596  sch_htb.ko   sch_htb    htb_dequeue
3458      2.7660  vmlinux      vmlinux    __hash_conntrack
2597      2.0773  vmlinux      vmlinux    nf_nat_setup_info
2531      2.0245  vmlinux      vmlinux    kmem_cache_alloc
2229      1.7830  vmlinux      vmlinux    ip_route_input
1722      1.3774  vmlinux      vmlinux    nf_conntrack_in
1547      1.2374  sch_htb.ko   sch_htb    htb_enqueue
1519      1.2150  vmlinux      vmlinux    kmem_cache_free
1471      1.1766  vmlinux      vmlinux    __slab_free
1435      1.1478  vmlinux      vmlinux    dev_queue_xmit
1313      1.0503  vmlinux      vmlinux    __qdisc_run
1277      1.0215  vmlinux      vmlinux    netif_receive_skb

All other symbols are below 1%.

The sch_htb.ko report is this:

samples  %        image name  symbol name
-------------------------------------------------------------------------------
3825     49.0762  sch_htb.ko  htb_dequeue
  3825   100.000  sch_htb.ko  htb_dequeue [self]
-------------------------------------------------------------------------------
1547     19.8486  sch_htb.ko  htb_enqueue
  1547   100.000  sch_htb.ko  htb_enqueue [self]
-------------------------------------------------------------------------------
608       7.8009  sch_htb.ko  htb_lookup_leaf
  608    100.000  sch_htb.ko  htb_lookup_leaf [self]
-------------------------------------------------------------------------------
459       5.8891  sch_htb.ko  htb_deactivate_prios
  459    100.000  sch_htb.ko  htb_deactivate_prios [self]
-------------------------------------------------------------------------------
417       5.3503  sch_htb.ko  htb_add_to_wait_tree
  417    100.000  sch_htb.ko  htb_add_to_wait_tree [self]
-------------------------------------------------------------------------------
372       4.7729  sch_htb.ko  htb_change_class_mode
  372    100.000  sch_htb.ko  htb_change_class_mode [self]
-------------------------------------------------------------------------------
276       3.5412  sch_htb.ko  htb_activate_prios
  276    100.000  sch_htb.ko  htb_activate_prios [self]
-------------------------------------------------------------------------------
189       2.4249  sch_htb.ko  htb_add_to_id_tree
  189    100.000  sch_htb.ko  htb_add_to_id_tree [self]
-------------------------------------------------------------------------------
101       1.2959  sch_htb.ko  htb_safe_rb_erase
  101    100.000  sch_htb.ko  htb_safe_rb_erase [self]
-------------------------------------------------------------------------------

Am I misinterpreting the results, or does it look like the real problem
is actually packet classification?

Thanks,

Radu Rendec

On Tue, 2009-04-28 at 12:15 +0200, Jesper Dangaard Brouer wrote:
> Remember to keep/copy the file "vmlinux".
> [...]
> Looking at a specific module, "sch_htb":
> opreport --symbols -cl sch_htb.ko --image-path=/lib/modules/`uname -r`/kernel/

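The per-symbol entries above all show 100% [self] because no call-graph
data was collected. A hedged sketch of getting caller information out
of oprofile (the depth value of 10 is arbitrary):

  opcontrol --callgraph=10   # record call stacks up to 10 frames deep
  opcontrol --start
  # ... run test traffic ...
  opcontrol --stop
  opreport --callgraph --image-path=/lib/modules/`uname -r`/kernel/ | less
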
* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-29 10:31 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 29 Apr 2009, Radu Rendec wrote:

> Thanks for the oprofile newbie guide - it saved much time and digging
> through man pages.

You are welcome :-)

Just noticed that Jeremy Kerr has made some python scripts to make it
even easier to use oprofile. See
http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/

> The normal report looks like this:
> samples  %        image name   app name   symbol name
> 38424    30.7350  cls_u32.ko   cls_u32    u32_classify
> 5321      4.2562  e1000e.ko    e1000e     e1000_clean_rx_irq
> [...]

I would rather want to see the output from cls_u32.ko:

opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/

> Am I misinterpreting the results, or does it look like the real problem
> is actually packet classification?

Yes, it looks like the problem is your u32 classification setup...
Perhaps it's not doing what you think it's doing... didn't Jarek
provide some hints for you to follow?

Hilsen
Jesper Brouer

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 11:03 UTC
To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 2009-04-29 at 12:31 +0200, Jesper Dangaard Brouer wrote:
> Just noticed that Jeremy Kerr has made some python scripts to make it
> even easier to use oprofile. See
> http://ozlabs.org/~jk/diary/tech/linux/hiprofile-v1.0.diary/

Thanks for the hint; I'll have a look at the scripts too.

> I would rather want to see the output from cls_u32.ko:
>
> opreport --symbols -cl cls_u32.ko --image-path=/lib/modules/`uname -r`/kernel/

samples  %        image name  symbol name
-------------------------------------------------------------------------------
38424    100.000  cls_u32.ko  u32_classify
  38424  100.000  cls_u32.ko  u32_classify [self]
-------------------------------------------------------------------------------

Well, this doesn't tell us much more, but I think it's pretty obvious
what cls_u32 is doing :)

> Yes, it looks like the problem is your u32 classification setup...
> Perhaps it's not doing what you think it's doing... didn't Jarek
> provide some hints for you to follow?

I've just realized that I might be hitting the worst-case bucket with
the (ip) destinations I chose for the test traffic. I'll try again with
different ones.

I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
hints) - it's debatable whether it would help, since the real problem
seems to be in u32 (not htb), but I'll give it a try anyway.

Another hint was to make sure that "tc class add" goes before the
corresponding "tc filter add" - checked: it's ok.

Another interesting hint came from Calin Velea, whose tests suggest
that overall performance is better with NAPI turned off, since the (rx)
interrupt work is distributed to all cpus/cores. I'll try to replicate
this as soon as I make some small changes to my test setup, so that I'm
able to measure the overall htb throughput on the egress nic (bps and
pps).

Thanks,

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-29 12:23 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, Apr 29, 2009 at 02:03:26PM +0300, Radu Rendec wrote:

> I haven't tried tweaking htb_hysteresis yet (that was one of Jarek's
> hints) - it's debatable whether it would help, since the real problem
> seems to be in u32 (not htb), but I'll give it a try anyway.

According to the author's(?) comment on hysteresis, "The speed gain is
about 1/6", so not very much here considering the htb_dequeue time.

> Another interesting hint came from Calin Velea, whose tests suggest
> that overall performance is better with NAPI turned off, since the (rx)
> interrupt work is distributed to all cpus/cores.

Radu, since not only your worst-case but also your real-case u32
lookups are very long, I think you should mainly have a look at Calin's
u32 hash generator, or at least his method, and only after optimizing
that try these other tricks. Btw. I hope Calin made this nice program
known to the networking/admin lists too.

Btw. #2: I think you wrote you didn't use iptables...

Cheers,
Jarek P.

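A minimal sketch of the kind of u32 hashing being suggested (a generic
LARTC-style two-level hash; the device, table id, subnet and class ids
are assumptions):

  # 256-bucket hash table for destination lookups (table id 2: is arbitrary)
  tc filter add dev eth0 parent 1: prio 5 handle 2: protocol ip u32 divisor 256
  # hash on the last octet of the destination address (bytes 16-19 of the IP header)
  tc filter add dev eth0 parent 1: prio 5 protocol ip u32 ht 800:: \
      match ip dst 10.0.0.0/24 hashkey mask 0x000000ff at 16 link 2:
  # per-customer rules go straight into their bucket: 10.0.0.37 -> bucket 0x25
  tc filter add dev eth0 parent 1: prio 5 protocol ip u32 ht 2:25: \
      match ip dst 10.0.0.37/32 flowid 1:10

With such a layout a packet costs a few hash steps instead of a linear
walk over thousands of filter rules.
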
* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 13:15 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, 2009-04-29 at 14:23 +0200, Jarek Poplawski wrote:
> According to the author's(?) comment on hysteresis, "The speed gain is
> about 1/6", so not very much here considering the htb_dequeue time.

Thought so :)

> Radu, since not only your worst-case but also your real-case u32
> lookups are very long, I think you should mainly have a look at Calin's
> u32 hash generator, or at least his method, and only after optimizing
> that try these other tricks. Btw. I hope Calin made this nice program
> known to the networking/admin lists too.

I've just had a look at Calin's approach to optimizing u32 lookups. It
does indeed make very nice use of u32 hash capabilities, resulting in a
maximum of 4 lookups. The algorithm he uses takes advantage of the fact
that only a (small) subset of the whole ipv4 address space is actually
used in an ISP's network.

Unfortunately his approach makes it a bit difficult to adjust the
configuration dynamically, since the controller (program/application)
must remember the exact hash tables, filters etc. in order to be able
to add/remove CIDRs without rewriting the entire configuration. Unused
hash tables also need to be "garbage collected" and reused, otherwise
the hash table id space could be exhausted.

Since I only use IP lookups (and u32 is very generic), I'm starting to
ask myself whether a different kind of data structure and classifier
would be more appropriate. For instance, I think a binary search tree
that is matched against the bits in the ip address would give pretty
nice performance. It would take at most 32 iterations (descending
through the tree) with less overhead than the (complex) u32 rule match.

> Btw. #2: I think you wrote you didn't use iptables...

No, I don't use iptables.

Btw, the e1000e driver seems to have no way to disable NAPI. Am I
missing something (like a global kernel config option that disables
NAPI completely)?

Thanks,

Radu

* Re: htb parallelism on multi-core platforms
From: Jarek Poplawski @ 2009-04-29 13:38 UTC
To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

On Wed, Apr 29, 2009 at 04:15:51PM +0300, Radu Rendec wrote:

> I've just had a look at Calin's approach to optimizing u32 lookups. It
> does indeed make very nice use of u32 hash capabilities, resulting in a
> maximum of 4 lookups. The algorithm he uses takes advantage of the fact
> that only a (small) subset of the whole ipv4 address space is actually
> used in an ISP's network.

Anyway, it looks like your main problem, and I doubt even dividing the
current work by e.g. 4 cores (if it were multi-threaded) would be
enough. These lookups are simply too long.

> No, I don't use iptables.

But your oprofile shows them. Maybe you shouldn't compile it into the
kernel at all?

> Btw, the e1000e driver seems to have no way to disable NAPI. Am I
> missing something (like a global kernel config option that disables
> NAPI completely)?

Calin uses an older kernel, and maybe the e1000 driver; I don't know.

Jarek P.

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-29 16:21 UTC
To: Jarek Poplawski; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev, Calin Velea

I finally managed to disable NAPI on e1000e - apparently it can only be
done with the "official" Intel driver (downloaded from their website),
by compiling with "make CFLAGS_EXTRA=-DE1000E_NO_NAPI". This doesn't
seem to be available in the in-kernel (2.6.29) driver.

With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one),
but the overall throughput *decreases* from ~110K pps (with NAPI) to
~80K pps. This makes sense, since the h/w interrupt path is much more
time consuming than polling (that's the whole idea behind NAPI anyway).

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Calin Velea @ 2009-04-29 22:49 UTC
To: Radu Rendec; +Cc: Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Wednesday, April 29, 2009, 7:21:11 PM, you wrote:

> With NAPI disabled, 4 (of 8) cores go to 100% (instead of only one),
> but the overall throughput *decreases* from ~110K pps (with NAPI) to
> ~80K pps. This makes sense, since the h/w interrupt path is much more
> time consuming than polling (that's the whole idea behind NAPI anyway).

I tested with e1000 only, on a single quad-core CPU - the L2 cache was
shared between the cores. For 8 cores I suppose you have 2 quad-core
CPUs. If the cores actually used belong to different physical CPUs, L2
cache sharing does not occur - maybe this could explain the performance
drop in your case. Or there may be another explanation...

Anyway - coming back to David Miller's words:

"HTB acts upon global state, so anything that goes into a particular
device's HTB ruleset is going to be single threaded.
There really isn't any way around this."

It could be that the only way to get more power is to increase the
number of devices you are shaping on. You could split the IP space into
4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:

-d 0.0.0.0/2 -j IMQ --todev imq0
-d 64.0.0.0/2 -j IMQ --todev imq1
etc...

Or you can customize the split depending on the traffic distribution.
The ipset nethash match can also be used. The 4 devices can have the
same htb ruleset; only the right parts of it will match.

You should test with 4 flows that use all the devices simultaneously
and see what the aggregate throughput is. The performance gained
through parallelism might be a lot higher than the added overhead of
iptables and/or the ipset nethash match. Anyway - this is more of a
"hack" than a clean solution :)

p.s.: the latest IMQ at http://www.linuximq.net/ is for 2.6.26, so you
will need to try with that.

--
Best regards,
Calin
mailto:calin.velea@gemenii.ro

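A minimal sketch of wiring that up (assuming a 2.6.26 kernel with the
linuximq.net patch; option spelling varies between IMQ patch versions,
and the qdisc details are placeholders):

  for i in 0 1 2 3; do
      ip link set imq$i up
      tc qdisc add dev imq$i root handle 1: htb
      # ... load the same class/filter set on every imq device ...
  done
  iptables -t mangle -A PREROUTING -d 0.0.0.0/2   -j IMQ --todev 0
  iptables -t mangle -A PREROUTING -d 64.0.0.0/2  -j IMQ --todev 1
  iptables -t mangle -A PREROUTING -d 128.0.0.0/2 -j IMQ --todev 2
  iptables -t mangle -A PREROUTING -d 192.0.0.0/2 -j IMQ --todev 3
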
* Re[2]: htb parallelism on multi-core platforms
From: Calin Velea @ 2009-04-29 23:00 UTC
To: Calin Velea; +Cc: Radu Rendec, Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Thursday, April 30, 2009, 1:49:46 AM, you wrote:

> It could be that the only way to get more power is to increase the
> number of devices you are shaping on. You could split the IP space into
> 4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:
> [...]

You will also need -i ethX (router), or -m physdev --physdev-in ethX
(bridge), to differentiate between upload and download in the iptables
rules.

--
Best regards,
Calin
mailto:calin.velea@gemenii.ro

* Re: htb parallelism on multi-core platforms
From: Radu Rendec @ 2009-04-30 11:19 UTC
To: Calin Velea; +Cc: Jarek Poplawski, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Thu, 2009-04-30 at 01:49 +0300, Calin Velea wrote:
> I tested with e1000 only, on a single quad-core CPU - the L2 cache was
> shared between the cores. For 8 cores I suppose you have 2 quad-core
> CPUs. If the cores actually used belong to different physical CPUs, L2
> cache sharing does not occur - maybe this could explain the performance
> drop in your case. Or there may be another explanation...

It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) -
and it is very probable - then I think the L2 cache was actually
shared. That's because the CPUs in use were either 0-3 or 4-7, but
never a mix of them. So perhaps there is another explanation (maybe
driver/hardware).

> It could be that the only way to get more power is to increase the
> number of devices you are shaping on. You could split the IP space into
> 4 groups and direct the traffic to 4 IMQ devices with 4 iptables rules:
>
> -d 0.0.0.0/2 -j IMQ --todev imq0
> -d 64.0.0.0/2 -j IMQ --todev imq1

Yes, but what if, let's say, 10.0.0.0/24 and 70.0.0.0/24 need to share
bandwidth? 10.a.b.c goes to the imq0 qdisc, 70.x.y.z goes to the imq1
qdisc, and the two qdiscs (HTB sets) are independent. This will result
in a maximum of double the allocated bandwidth (if the HTB sets are
identical and the traffic is equally distributed).

> The performance gained through parallelism might be a lot higher than
> the added overhead of iptables and/or the ipset nethash match. Anyway -
> this is more of a "hack" than a clean solution :)
>
> p.s.: the latest IMQ at http://www.linuximq.net/ is for 2.6.26, so you
> will need to try with that.

Yes, the performance gained through parallelism is expected to be
higher than the loss from the additional overhead. That's why I asked
for parallel HTB in the first place, but got very disappointed after
David Miller's reply :)

Thanks a lot for all the hints and for the imq link. Imq is very
interesting regardless of whether it proves to be useful for this
project of mine or not.

Radu Rendec

* Re: htb parallelism on multi-core platforms
From: Jesper Dangaard Brouer @ 2009-04-30 11:44 UTC
To: Radu Rendec; +Cc: Calin Velea, Jarek Poplawski, Denys Fedoryschenko, netdev

On Thu, 30 Apr 2009, Radu Rendec wrote:

> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) -
> and it is very probable - then I think the L2 cache was actually
> shared. That's because the CPUs in use were either 0-3 or 4-7, but
> never a mix of them. So perhaps there is another explanation (maybe
> driver/hardware).

WRONG assumption regarding the CPU ids. Look in /proc/cpuinfo for the
correct answer. (From a: model name : Intel(R) Xeon(R) CPU E5420 @ 2.50GHz)

cat /proc/cpuinfo | egrep -e '(processor|physical id|core id)'
processor   : 0
physical id : 0
core id     : 0
processor   : 1
physical id : 1
core id     : 0
processor   : 2
physical id : 0
core id     : 2
processor   : 3
physical id : 1
core id     : 2
processor   : 4
physical id : 0
core id     : 1
processor   : 5
physical id : 1
core id     : 1
processor   : 6
physical id : 0
core id     : 3
processor   : 7
physical id : 1
core id     : 3

E.g. here CPU0 and CPU4 are sharing the same L2 cache.

Hilsen
Jesper Brouer

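The same sibling information can be read straight from sysfs on kernels
that expose cacheinfo (a hedged sketch; index2 is typically the L2
cache on these CPUs, but check the level file):

  cat /sys/devices/system/cpu/cpu0/cache/index2/level
  cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
  # e.g. "0,4" would confirm CPU0 and CPU4 share that cache
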
* Re[2]: htb parallelism on multi-core platforms
  2009-04-30 11:19 ` Radu Rendec
  2009-04-30 11:44   ` Jesper Dangaard Brouer
@ 2009-04-30 14:04   ` Calin Velea
  2009-05-08 10:15     ` Paweł Staszewski
  1 sibling, 1 reply; 39+ messages in thread
From: Calin Velea @ 2009-04-30 14:04 UTC (permalink / raw)
  To: Radu Rendec
  Cc: Calin Velea, Jarek Poplawski, Jesper Dangaard Brouer,
      Denys Fedoryschenko, netdev

Thursday, April 30, 2009, 2:19:36 PM, you wrote:

> On Thu, 2009-04-30 at 01:49 +0300, Calin Velea wrote:
>> I tested with e1000 only, on a single quad-core CPU - the L2 cache was
>> shared between the cores.
>>
>> For 8 cores I suppose you have 2 quad-core CPUs. If the cores actually
>> used belong to different physical CPUs, L2 cache sharing does not occur -
>> maybe this could explain the performance drop in your case.
>> Or there may be another explanation...

> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> it is very probable - then I think the L2 cache was actually shared.
> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> them. So perhaps there is another explanation (maybe driver/hardware).

>> It could be the only way to get more power is to increase the number
>> of devices where you are shaping. You could split the IP space into 4 groups
>> and direct the traffic to 4 IMQ devices with 4 iptables rules -
>>
>> -d 0.0.0.0/2 -j IMQ --todev 0,
>> -d 64.0.0.0/2 -j IMQ --todev 1, etc...

> Yes, but what if, let's say, 10.0.0.0/24 and 70.0.0.0/24 need to share
> bandwidth? 10.a.b.c goes to the imq0 qdisc, and 70.x.y.z goes to the imq1
> qdisc, and the two qdiscs (HTB sets) are independent. This will result in a
> maximum of double the allocated bandwidth (if HTB sets are identical and
> traffic is equally distributed).

>> The performance gained through parallelism might be a lot higher than the
>> added overhead of iptables and/or ipset nethash match. Anyway - this is more of
>> a "hack" than a clean solution :)
>>
>> p.s.: latest IMQ at http://www.linuximq.net/ is for 2.6.26 so you will need to try with that

> Yes, the performance gained through parallelism is expected to be higher
> than the loss of the additional overhead. That's why I asked for
> parallel HTB in the first place, but got very disappointed after David
> Miller's reply :)

> Thanks a lot for all the hints and for the imq link. Imq is very
> interesting regardless of whether it proves to be useful for this
> project of mine or not.

> Radu Rendec

Indeed, you need to use ipset with nethash to avoid bandwidth doubling.
Let's say we have a shaping bridge: the customer side (download) is
on eth0, the upstream side (upload) is on eth1.

Create customer groups with ipset (http://ipset.netfilter.org/):

 ipset -N cust_group1_ips nethash
 ipset -A cust_group1_ips <subnet/mask>
 ....
 ....for each subnet

To shape the upload with multiple IMQs:

 -m physdev --physdev-in eth0 -m set --set cust_group1_ips src -j IMQ --todev 0
 -m physdev --physdev-in eth0 -m set --set cust_group2_ips src -j IMQ --todev 1
 -m physdev --physdev-in eth0 -m set --set cust_group3_ips src -j IMQ --todev 2
 -m physdev --physdev-in eth0 -m set --set cust_group4_ips src -j IMQ --todev 3

You will apply the same htb upload limits to imq0-3. Upload for customers
having source IPs from the first group will be shaped by imq0, for the
second, by imq1, etc...

For download:

 -m physdev --physdev-in eth1 -m set --set cust_group1_ips dst -j IMQ --todev 4
 -m physdev --physdev-in eth1 -m set --set cust_group2_ips dst -j IMQ --todev 5
 -m physdev --physdev-in eth1 -m set --set cust_group3_ips dst -j IMQ --todev 6
 -m physdev --physdev-in eth1 -m set --set cust_group4_ips dst -j IMQ --todev 7

and apply the same download limits on imq4-7.

--
Best regards,
 Calin                            mailto:calin.velea@gemenii.ro

^ permalink raw reply [flat|nested] 39+ messages in thread
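To make "apply the same htb upload limits to imq0-3" concrete, the per-device qdisc setup might look like the sketch below. The handles, rates and class layout are invented for illustration; the real per-customer classes and filters would mirror whatever runs on the physical device:

    for dev in imq0 imq1 imq2 imq3; do
        ip link set $dev up
        # identical HTB tree on every IMQ device
        tc qdisc add dev $dev root handle 1: htb default 20
        tc class add dev $dev parent 1:  classid 1:1  htb rate 950mbit
        tc class add dev $dev parent 1:1 classid 1:10 htb rate 10mbit ceil 20mbit
        tc class add dev $dev parent 1:1 classid 1:20 htb rate 2mbit  ceil 10mbit
        # ... per-customer classes and u32 filters, same as on the physical NIC
    done

Since each group of customers only ever hits one IMQ device, the identical trees never hand out bandwidth twice to the same customer.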
* Re: htb parallelism on multi-core platforms
  2009-04-30 14:04 ` Re[2]: " Calin Velea
@ 2009-05-08 10:15   ` Paweł Staszewski
  2009-05-08 17:55     ` Vladimir Ivashchenko
  0 siblings, 1 reply; 39+ messages in thread
From: Paweł Staszewski @ 2009-05-08 10:15 UTC (permalink / raw)
  To: Linux Network Development list; +Cc: netdev

Radu,

I think you have something wrong with your configuration.

I do traffic management for many different nets: a /18 prefix of outside
address space plus 10.0.0.0/18 inside, and some other nets with /20, /21,
/22 and /23 prefixes.

Some stats from my router:

 tc -s -d filter show dev eth0 | grep dst | wc -l
 14087
 tc -s -d filter show dev eth1 | grep dst | wc -l
 14087

 cat /proc/cpuinfo
 processor       : 0
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 15
 model name      : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
 stepping        : 11
 cpu MHz         : 2659.843
 cache size      : 4096 KB
 physical id     : 0
 siblings        : 2
 core id         : 0
 cpu cores       : 2
 apicid          : 0
 initial apicid  : 0
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 10
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                   cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
                   tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
                   dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
                   pdcm lahf_lm tpr_shadow vnmi flexpriority
 bogomips        : 5319.68
 clflush size    : 64
 power management:

 processor       : 1
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 15
 model name      : Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
 stepping        : 11
 cpu MHz         : 2659.843
 cache size      : 4096 KB
 physical id     : 0
 siblings        : 2
 core id         : 1
 cpu cores       : 2
 apicid          : 1
 initial apicid  : 1
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 10
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                   cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
                   tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
                   dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
                   pdcm lahf_lm tpr_shadow vnmi flexpriority
 bogomips        : 5320.30
 clflush size    : 64
 power management:

 mpstat -P ALL 1 10
 Average:  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal   %idle    intr/s
 Average:  all   0.00   0.00  0.15     0.00  0.00   0.10    0.00   99.75  73231.70
 Average:    0   0.00   0.00  0.20     0.00  0.00   0.10    0.00   99.70      0.00
 Average:    1   0.00   0.00  0.00     0.00  0.00   0.00    0.00  100.00  27686.80
 Average:    2   0.00   0.00  0.00     0.00  0.00   0.00    0.00    0.00      0.00

Some opreport output:

 CPU: Core 2, speed 2659.84 MHz (estimated)
 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
 unit mask of 0x00 (Unhalted core cycles) count 100000
 samples  %        app name       symbol name
 7592      8.3103  vmlinux        rb_next
 5393      5.9033  vmlinux        e1000_get_hw_control
 4514      4.9411  vmlinux        hfsc_dequeue
 4069      4.4540  vmlinux        e1000_intr_msi
 3695      4.0446  vmlinux        u32_classify
 3522      3.8552  vmlinux        poll_idle
 2234      2.4454  vmlinux        _raw_spin_lock
 2077      2.2735  vmlinux        read_tsc
 1855      2.0305  vmlinux        rb_prev
 1834      2.0075  vmlinux        getnstimeofday
 1800      1.9703  vmlinux        e1000_clean_rx_irq
 1553      1.6999  vmlinux        ip_route_input
 1509      1.6518  vmlinux        hfsc_enqueue
 1451      1.5883  vmlinux        irq_entries_start
 1419      1.5533  vmlinux        mwait_idle
 1392      1.5237  vmlinux        e1000_clean_tx_irq
 1345      1.4723  vmlinux        rb_erase
 1294      1.4164  vmlinux        sfq_enqueue
 1187      1.2993  libc-2.6.1.so  (no symbols)
 1162      1.2719  vmlinux        sfq_dequeue
 1134      1.2413  vmlinux        ipt_do_table
 1116      1.2216  vmlinux        apic_timer_interrupt
 1108      1.2128  vmlinux        cftree_insert
 1039      1.1373  vmlinux        rtsc_y2x
 985       1.0782  vmlinux        e1000_xmit_frame
 943       1.0322  vmlinux        update_vf

 bwm-ng v0.6 (probing every 5.000s), press 'h' for help
 input: /proc/net/dev type: rate
   iface      Rx             Tx             Total
   ============================================================
   lo:            0.00 KB/s      0.00 KB/s      0.00 KB/s
   eth1:      20716.35 KB/s  24258.43 KB/s  44974.78 KB/s
   eth0:      24365.31 KB/s  30691.10 KB/s  55056.42 KB/s
   ------------------------------------------------------------

 bwm-ng v0.6 (probing every 5.000s), press 'h' for help
 input: /proc/net/dev type: rate
   iface      Rx             Tx             Total
   ============================================================
   lo:            0.00 P/s       0.00 P/s       0.00 P/s
   eth1:      38034.00 P/s   36751.00 P/s   74785.00 P/s
   eth0:      37195.40 P/s   38115.00 P/s   75310.40 P/s

Maximum CPU load is during rush hour (from 5:00 pm to 10:00 pm), when it
is 20% - 30% on each CPU.

So I think you must change the type of your hash tree in u32 filtering.
I simply split big nets like /18, /20, /21 into /24 prefixes to build my
hash tree. I ran many tests and this hash layout works best for my
configuration.

Regards
Paweł Staszewski

Calin Velea pisze:
> [...]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-05-08 10:15 ` Paweł Staszewski
@ 2009-05-08 17:55   ` Vladimir Ivashchenko
  2009-05-08 18:07     ` Denys Fedoryschenko
  0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-08 17:55 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

> >> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> >> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> >> it is very probable - then I think the L2 cache was actually shared.
> >> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> >> them. So perhaps there is another explanation (maybe driver/hardware).

Keep in mind that on Intel quad-core CPUs the cache is shared between
pairs of cores, not across all four cores.

http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm

--
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-05-08 17:55 ` Vladimir Ivashchenko
@ 2009-05-08 18:07   ` Denys Fedoryschenko
  0 siblings, 0 replies; 39+ messages in thread
From: Denys Fedoryschenko @ 2009-05-08 18:07 UTC (permalink / raw)
  To: Vladimir Ivashchenko
  Cc: Paweł Staszewski, Linux Network Development list

Btw, a shared L2 cache has higher latency than a dedicated one.
That's why Core i7 rules (tested recently).

On Friday 08 May 2009 20:55:12 Vladimir Ivashchenko wrote:
> > >> It is correct, I have 2 quad-core CPUs. If adjacent kernel-identified
> > >> CPUs are on the same physical CPU (e.g. CPU0, CPU1, CPU2 and CPU3) - and
> > >> it is very probable - then I think the L2 cache was actually shared.
> > >> That's because the CPUs used were either 0-3 or 4-7 but never a mix of
> > >> them. So perhaps there is another explanation (maybe driver/hardware).
>
> Keep in mind that on Intel quad-core CPUs the cache is shared between
> pairs of cores, not across all four cores.
>
> http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/desktop/processor/processors/core2quad/feature/index.htm

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-22 21:29 ` Jesper Dangaard Brouer
  2009-04-23  8:20   ` Jarek Poplawski
@ 2009-04-23 12:31   ` Radu Rendec
  2009-04-23 18:43     ` Jarek Poplawski
  ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Radu Rendec @ 2009-04-23 12:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Jarek Poplawski, Denys Fedoryschenko, netdev

On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
> It's runtime adjustable, so it's easy to try out.
>
> via /sys/module/sch_htb/parameters/htb_hysteresis

Thanks for the tip! This means I can play around with various values
while the machine is in production and see how it reacts.

> The HTB classify hash has a scalability issue in kernels below 2.6.26.
> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
> using?

I'm using 2.6.26, so I guess the fix is already there :(

> Could you explain how you do classification? And perhaps outline where
> your possible scalability issue is located?
>
> If you are interested how I do scalable classification, see my
> presentation from Netfilter Workshop 2008:
>
> http://nfws.inl.fr/en/?p=115
> http://www.netoptimizer.dk/presentations/nfsw2008/Jesper-Brouer_Large-iptables-rulesets.pdf

I had a look at your presentation and it seems to be focused on dividing
a single iptables rule chain into multiple chains, so that rule lookup
complexity decreases from linear to logarithmic.

Since I only need to do shaping, I don't use iptables at all. Address
matching is all done on the egress side, using u32. The rule schema is
this:

1. We have two /19 networks that differ pretty much in the first bits:
80.x.y.z and 83.a.b.c; customer address spaces range from /22 nets to
individual /32 addresses.

2. The default ip hash (0x800) is size 1 (only one bucket) and has two
rules that select between two subsequent hash tables (say 0x100 and
0x101) based on the most significant bits in the address.

3. Level 2 hash tables (0x100 and 0x101) are size 256 (256 buckets);
bucket selection is done by bits b10 - b17 (with b0 being the least
significant).

4. Each bucket contains complete cidr match rules (corresponding to real
customer addresses). Since bits b10 - b31 are already checked in upper
levels, this results in a maximum of 2 ^ 10 = 1024 rules, which is the
worst case, if all customer addresses that "fall" into that bucket
are /32 (fortunately this is not the real case).

In conclusion each packet would be matched against at most 1026 rules
(worst case). The real case is actually much better: only one bucket
with 400 rules, all others less than 70 rules and most of them less than
10 rules.

> > I guess htb_hysteresis only affects the actual shaping (which takes
> > place after the packet is classified).
>
> Yes, htb_hysteresis basically is a hack to allow extra bursts... we
> actually considered removing it completely...

It's definitely worth a try at least. Thanks for the tips!

Radu Rendec

^ permalink raw reply [flat|nested] 39+ messages in thread
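In tc terms, the two-level layout Radu describes would look roughly like the sketch below. The qdisc handle 1:, the concrete /19s (80.0.0.0/19 and 83.0.0.0/19 stand in for the redacted prefixes), the bucket id and the sample customer rule are all illustrative; only the structure follows the description:

    # level-2 tables (0x100 and 0x101), 256 buckets each
    tc filter add dev eth0 parent 1:0 prio 1 handle 100: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 handle 101: protocol ip u32 divisor 256

    # two rules in the default one-bucket table (0x800) pick the level-2
    # table by the high bits; the hashkey mask selects the bucket from
    # bits b10-b17 of the destination address (offset 16 in the IP header)
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 80.0.0.0/19 hashkey mask 0x0003fc00 at 16 link 100:
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 83.0.0.0/19 hashkey mask 0x0003fc00 at 16 link 101:

    # each bucket then holds plain cidr matches for real customer ranges,
    # e.g. (bucket id and subnet invented):
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 100:2a: \
        match ip dst 80.0.168.0/26 flowid 1:42

The kernel folds the masked value down by the mask's lowest set bit, so a mask of 0x0003fc00 with divisor 256 yields exactly the b10-b17 bucket index described above.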
* Re: htb parallelism on multi-core platforms
  2009-04-23 12:31 ` Radu Rendec
@ 2009-04-23 18:43   ` Jarek Poplawski
  2009-04-23 19:06     ` Jesper Dangaard Brouer
  2009-04-24  6:01     ` Jarek Poplawski
  1 sibling, 2 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 18:43 UTC (permalink / raw)
  To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

Radu Rendec wrote, On 04/23/2009 02:31 PM:

> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
...
>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>> using?
>
> I'm using 2.6.26, so I guess the fix is already there :(

If Jesper meant the change of hash, I can see it only in 2.6.27.

...
> In conclusion each packet would be matched against at most 1026 rules
> (worst case). The real case is actually much better: only one bucket
> with 400 rules, all others less than 70 rules and most of them less than
> 10 rules.

Alas, I can't analyze this all now, and probably I'm missing something,
but your worst and real cases look suspiciously big. Do all these classes
differ so much? Maybe you should have a look at cls_flow?

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
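For the archives: cls_flow can spread packets over many classes from a single rule, instead of one u32 rule per customer. A minimal sketch of the idea (the handle, key choice and class range are invented, not taken from the thread); this maps the low byte of the destination address onto 256 consecutive class ids starting at 1:100:

    tc filter add dev eth0 parent 1:0 prio 1 protocol ip \
        flow map key dst and 0xff baseclass 1:100

This fits best when classes are laid out regularly; irregular per-customer rate plans still need explicit rules or a mapping layer in front.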
* Re: htb parallelism on multi-core platforms
  2009-04-23 18:43 ` Jarek Poplawski
@ 2009-04-23 19:06   ` Jesper Dangaard Brouer
  2009-04-23 19:14     ` Jarek Poplawski
  1 sibling, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-23 19:06 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, 23 Apr 2009, Jarek Poplawski wrote:

> Radu Rendec wrote, On 04/23/2009 02:31 PM:
>
>> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
> ...
>>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>>> using?
>>
>> I'm using 2.6.26, so I guess the fix is already there :(
>
> If Jesper meant the change of hash, I can see it only in 2.6.27.

I'm referring to:

 commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
 Author: Patrick McHardy <kaber@trash.net>
 Date:   Sat Jul 5 23:22:35 2008 -0700

     net-sched: sch_htb: use dynamic class hash helpers

Is there any easy git way to figure out which release this commit got
into?

Cheers,
  Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:06 ` Jesper Dangaard Brouer
@ 2009-04-23 19:14   ` Jarek Poplawski
  2009-04-23 19:47     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 19:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev

On Thu, Apr 23, 2009 at 09:06:59PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 23 Apr 2009, Jarek Poplawski wrote:
>
>> Radu Rendec wrote, On 04/23/2009 02:31 PM:
>>
>>> On Wed, 2009-04-22 at 23:29 +0200, Jesper Dangaard Brouer wrote:
>> ...
>>>> The HTB classify hash has a scalability issue in kernels below 2.6.26.
>>>> Patrick McHardy fixes that up in 2.6.26. What kernel version are you
>>>> using?
>>>
>>> I'm using 2.6.26, so I guess the fix is already there :(
>>
>> If Jesper meant the change of hash, I can see it only in 2.6.27.
>
> I'm referring to:
>
>  commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>  Author: Patrick McHardy <kaber@trash.net>
>  Date:   Sat Jul 5 23:22:35 2008 -0700
>
>      net-sched: sch_htb: use dynamic class hash helpers
>
> Is there any easy git way to figure out which release this commit got
> into?

I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:14 ` Jarek Poplawski
@ 2009-04-23 19:47   ` Jesper Dangaard Brouer
  2009-04-23 20:00     ` Jarek Poplawski
  2009-04-23 20:09     ` Jeff King
  0 siblings, 2 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2009-04-23 19:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, 23 Apr 2009, Jarek Poplawski wrote:

> On Thu, Apr 23, 2009 at 09:06:59PM +0200, Jesper Dangaard Brouer wrote:
>> I'm referring to:
>>
>>  commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>>  Author: Patrick McHardy <kaber@trash.net>
>>  Date:   Sat Jul 5 23:22:35 2008 -0700
>>
>>      net-sched: sch_htb: use dynamic class hash helpers
>>
>> Is there any easy git way to figure out which release this commit got
>> into?
>
> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

I think I prefer the command line edition "git-describe". But it seems
that the two approaches give different results.
(Cc'ing the git mailing list as they might know the reason)

 git-describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

returns "v2.6.26-rc8-1107-gf4c1f3e", while your URL returns:
"X-Git-Tag: v2.6.27-rc1~964^2~219".

I also did a:

 git log v2.6.26..v2.6.27 | grep f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
 commit f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2

To Radu: The change I talked about is in 2.6.27, so you should try that
kernel on your system.

Hilsen
  Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:47 ` Jesper Dangaard Brouer
@ 2009-04-23 20:00   ` Jarek Poplawski
  0 siblings, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-23 20:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, Apr 23, 2009 at 09:47:05PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 23 Apr 2009, Jarek Poplawski wrote:
...
>> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>
> I think I prefer the command line edition "git-describe". But it seems
> that the two approaches give different results.

Probably there is something more needed around this git-describe.
I prefer the command line too, when I can remember the command line...

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: htb parallelism on multi-core platforms
  2009-04-23 19:47 ` Jesper Dangaard Brouer
  2009-04-23 20:00   ` Jarek Poplawski
@ 2009-04-23 20:09   ` Jeff King
  1 sibling, 0 replies; 39+ messages in thread
From: Jeff King @ 2009-04-23 20:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jarek Poplawski, Radu Rendec, Denys Fedoryschenko, netdev, git

On Thu, Apr 23, 2009 at 09:47:05PM +0200, Jesper Dangaard Brouer wrote:

>>> Is there any easy git way to figure out which release this commit got
>>> into?
>>
>> I guess git-describe, but I prefer clicking on the "raw" (X-Git-Tag):
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
>
> I think I prefer the command line edition "git-describe". But it seems
> that the two approaches give different results.
> (Cc'ing the git mailing list as they might know the reason)

You want "git describe --contains". The default mode for describe is
"you are at tag $X, plus $N commits, and by the way, the sha1 is $H"
(shown as "$X-$N-g$H"). The default mode is useful for generating a
unique semi-human-readable version number (e.g., to be included in your
builds).

-Peff

^ permalink raw reply [flat|nested] 39+ messages in thread
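Side by side, for the commit in question: the plain describe output is the one Jesper quoted earlier; the --contains output is inferred from the X-Git-Tag header shown by gitweb, so treat it as illustrative:

    $ git describe f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
    v2.6.26-rc8-1107-gf4c1f3e

    $ git describe --contains f4c1f3e0c59be0e6566d9c00b1d8b204ffb861a2
    v2.6.27-rc1~964^2~219

The first names the newest tag *behind* the commit (hence v2.6.26-rc8); the second names the oldest tag that *contains* it, which is what "which release did this land in" actually asks.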
* Re: htb parallelism on multi-core platforms
  2009-04-23 18:43 ` Jarek Poplawski
  2009-04-23 19:06   ` Jesper Dangaard Brouer
@ 2009-04-24  6:01   ` Jarek Poplawski
  1 sibling, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-24 6:01 UTC (permalink / raw)
  To: Radu Rendec; +Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On 23-04-2009 20:43, Jarek Poplawski wrote:
> Radu Rendec wrote, On 04/23/2009 02:31 PM:
...
>> In conclusion each packet would be matched against at most 1026 rules
>> (worst case). The real case is actually much better: only one bucket
>> with 400 rules, all others less than 70 rules and most of them less than
>> 10 rules.
>
> Alas, I can't analyze this all now, and probably I'm missing something,
> but your worst and real cases look suspiciously big. Do all these classes
> differ so much? Maybe you should have a look at cls_flow?

Actually, fixing this u32 config (hashes) should be enough here.

Jarek P.

^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <1039493214.20090424135024@gemenii.ro>]
* Re: htb parallelism on multi-core platforms
  [not found] ` <1039493214.20090424135024@gemenii.ro>
@ 2009-04-24 11:19   ` Jarek Poplawski
  0 siblings, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-04-24 11:19 UTC (permalink / raw)
  To: Calin Velea
  Cc: Radu Rendec, Jesper Dangaard Brouer, Denys Fedoryschenko, netdev

On Fri, Apr 24, 2009 at 01:50:24PM +0300, Calin Velea wrote:
> Hi,

Hi,

Very interesting message, but try to use plain format next time.
I guess your mime/html original wasn't accepted by netdev@.

Jarek P.

> [...]

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re[2]: htb parallelism on multi-core platforms
  2009-04-23 12:31 ` Radu Rendec
  2009-04-23 18:43   ` Jarek Poplawski
  [not found]   ` <1039493214.20090424135024@gemenii.ro>
@ 2009-04-24 11:35   ` Calin Velea
  2 siblings, 0 replies; 39+ messages in thread
From: Calin Velea @ 2009-04-24 11:35 UTC (permalink / raw)
  To: netdev

Hi,

Maybe some actual results I got some time ago could help you and others
who have had the same problems:

Hardware: quad-core Xeon X3210 (2.13GHz, 8M L2 cache), 2 Intel PCI
          Express Gigabit NICs
Kernel:   2.6.20

I did some udp flood tests in the following configurations - the machine
was configured as a traffic shaping bridge, about 10k htb rules loaded,
using hashing (see below):

A) napi on, irqs for each card statically allocated to 2 CPU cores

   when flooding, the same CPU went 100% softirq always (seems logical,
   since it is statically bound to the irq)

B) napi on, CONFIG_IRQBALANCE=y

   when flooding, a random CPU went 100% softirq always. (here, at high
   interrupt rates, NAPI kicks in and starts using polling rather than
   irqs, so no more balancing takes place since there are no more
   interrupts - checked this with /proc/interrupts - at high packet
   rates the irq counters for the network cards stalled)

C) napi off, CONFIG_IRQBALANCE=y

   this is the setup I used in the end since all CPU cores were used.
   All of them went to 100%, and the pps rate I could pass through was
   higher than in case A or B.

Also, your worst case hashing setup could be improved - I suggest you
take a look at http://vcalinus.gemenii.ro/?p=9 (see the generated
filters example). The hashing method described there will take a
constant CPU time (4 checks) for each packet, regardless of how many
filter rules you have (provided you only filter by IP address). A tree
of hashtables is constructed which matches each of the four bytes of
the IP address in succession.

Using this hashing method, the hardware above, 2.6.20 with napi off and
irq balancing on, I got throughputs of 1.3Gbps / 250.000 pps aggregated
in+out in normal usage. CPU utilization averages varied between 25 - 50%
for every core, so there was still room to grow. I expect much higher
pps rates with better hardware (higher freq/larger cache Xeons).

Thursday, April 23, 2009, 3:31:47 PM, you wrote:

> [...]

--
Best regards,
 Calin                            mailto:calin.velea@gemenii.ro

^ permalink raw reply [flat|nested] 39+ messages in thread
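For reference, the byte-at-a-time tree behind that link looks roughly like this in tc. The table handles, device name and the sample leaf are invented, and only the first two of the four levels are spelled out:

    # level 1: hash on the first octet of the destination address
    tc filter add dev eth0 parent 1:0 prio 1 handle 10: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 800:: \
        match ip dst 0.0.0.0/0 hashkey mask 0xff000000 at 16 link 10:

    # level 2, one table per used first octet; e.g. for 80.0.0.0/8,
    # whose level-1 bucket is 0x50 (= 80):
    tc filter add dev eth0 parent 1:0 prio 1 handle 11: protocol ip u32 divisor 256
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 10:50: \
        match ip dst 80.0.0.0/8 hashkey mask 0x00ff0000 at 16 link 11:

    # levels 3 and 4 repeat the pattern on the third and fourth octets;
    # a leaf rule finally points at the customer's class, e.g.
    # (table 13: and the address are invented, bucket 0x7b = 123):
    tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 13:7b: \
        match ip dst 80.1.2.123/32 flowid 1:42

Because each level hashes exactly one octet with divisor 256, the bucket index equals the octet value, and every packet costs four bucket probes no matter how many customers are loaded.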
Thread overview: 39+ messages
2009-04-17 10:40 htb parallelism on multi-core platforms Radu Rendec
2009-04-17 11:31 ` David Miller
2009-04-17 11:33 ` Badalian Vyacheslav
2009-04-17 22:41 ` Jarek Poplawski
2009-04-18 0:21 ` Denys Fedoryschenko
2009-04-18 7:56 ` Jarek Poplawski
2009-04-22 14:02 ` Radu Rendec
2009-04-22 21:29 ` Jesper Dangaard Brouer
2009-04-23 8:20 ` Jarek Poplawski
2009-04-23 13:56 ` Radu Rendec
2009-04-23 18:19 ` Jarek Poplawski
2009-04-23 20:19 ` Jesper Dangaard Brouer
2009-04-24 9:42 ` Radu Rendec
2009-04-28 10:15 ` Jesper Dangaard Brouer
2009-04-29 10:21 ` Radu Rendec
2009-04-29 10:31 ` Jesper Dangaard Brouer
2009-04-29 11:03 ` Radu Rendec
2009-04-29 12:23 ` Jarek Poplawski
2009-04-29 13:15 ` Radu Rendec
2009-04-29 13:38 ` Jarek Poplawski
2009-04-29 16:21 ` Radu Rendec
2009-04-29 22:49 ` Calin Velea
2009-04-29 23:00 ` Re[2]: " Calin Velea
2009-04-30 11:19 ` Radu Rendec
2009-04-30 11:44 ` Jesper Dangaard Brouer
2009-04-30 14:04 ` Re[2]: " Calin Velea
2009-05-08 10:15 ` Paweł Staszewski
2009-05-08 17:55 ` Vladimir Ivashchenko
2009-05-08 18:07 ` Denys Fedoryschenko
2009-04-23 12:31 ` Radu Rendec
2009-04-23 18:43 ` Jarek Poplawski
2009-04-23 19:06 ` Jesper Dangaard Brouer
2009-04-23 19:14 ` Jarek Poplawski
2009-04-23 19:47 ` Jesper Dangaard Brouer
2009-04-23 20:00 ` Jarek Poplawski
2009-04-23 20:09 ` Jeff King
2009-04-24 6:01 ` Jarek Poplawski
[not found] ` <1039493214.20090424135024@gemenii.ro>
2009-04-24 11:19 ` Jarek Poplawski
2009-04-24 11:35 ` Re[2]: " Calin Velea