From: Radu Rendec
Subject: Re: htb parallelism on multi-core platforms
Date: Thu, 23 Apr 2009 16:56:42 +0300
Message-ID: <1240495002.6554.155.camel@blade.ines.ro>
References: <20090423082052.GA4243@ff.dom.local>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: Jesper Dangaard Brouer, Denys Fedoryschenko, netdev@vger.kernel.org
To: Jarek Poplawski
In-Reply-To: <20090423082052.GA4243@ff.dom.local>

On Thu, 2009-04-23 at 08:20 +0000, Jarek Poplawski wrote:
> Within a common tree of classes it would a need finer locking to
> separate some jobs but considering cache problems I doubt there would
> be much gain from such redesigning for smp. On the other hand, a
> common tree is necessary if these classes really have to share every
> byte, which I doubt. Then we could think of config and maybe tiny
> hardware "redesign" (to more qdiscs/roots). So, e.g. using additional
> (cheap) NICs and even switch, if possible, looks quite natural way of
> spanning. Similar thing (multiple htb qdiscs) should be possible in
> the future with one multiqueue NIC too.

Since htb has a tree structure by default, I think it is pretty
difficult to distribute shaping across different htb-enabled queues.
Actually, we had thought of using completely separate machines, but we
soon realized there are some issues. Consider the following example:

Customer A and customer B share 2 Mbit of bandwidth. Each of them is
guaranteed to get 1 Mbit and in addition is able to "borrow" up to
1 Mbit from the other's bandwidth (depending on the other's traffic).
This is done like this:

 * bucket C -> rate 2 Mbit, ceil 2 Mbit
 * bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
 * bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C

IP filters classify customer A's packets to bucket A, and similarly
customer B's packets to bucket B. It is obvious that buckets A, B and
C must be in the same htb tree, otherwise customers A and B would not
be able to borrow from each other's bandwidth.

One simple rule would be to allocate all buckets (with all their child
buckets) that have rate = ceil to the same tree / queue / whatever. I
don't know if this is enough.
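
For reference, the hierarchy above would translate into tc commands
roughly like this (just a sketch: the device name, the handle/classid
numbers and the customer addresses below are made up for illustration,
and burst sizes / a default class are left out):

tc qdisc add dev eth0 root handle 1: htb

# bucket C -> rate 2 Mbit, ceil 2 Mbit
tc class add dev eth0 parent 1:  classid 1:1  htb rate 2mbit ceil 2mbit
# bucket A -> rate 1 Mbit, ceil 2 Mbit, parent C
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit
# bucket B -> rate 1 Mbit, ceil 2 Mbit, parent C
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit

# IP filters: steer each customer's traffic into its own bucket
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.0.2.1/32 flowid 1:10
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.0.2.2/32 flowid 1:20

The point is simply that both filters and all three classes hang off
the same root qdisc; borrowing between 1:10 and 1:20 only works
because they share parent 1:1.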

> There is also an interesting thread "Software receive packet steering"
> nearby, but using this for shaping only looks like "less simple":
> http://lwn.net/Articles/328339/

I am aware of the thread and even tried out the author's patch
(despite the fact that David Miller suggested it was not sane). Under
heavy (simulated) traffic nothing changed: still only one ksoftirqd
using 100% CPU, one core at 100% and the others idle. This only
confirms what I have already been told: htb is single threaded by
design. It also proves that most of the packet processing work is
actually in htb.

> BTW, I hope you add filters after classes they point to.

Do you mean the actual order I use for the "tc filter add" and "tc
class add" commands? Does it make any difference?

Anyway, speaking of an htb redesign or improvement (to use multiple
threads / CPUs), I think the classification rules can be cloned on a
per-thread basis (to avoid synchronization issues). This means
sacrificing memory for the benefit of performance, but it is probably
better to do it this way. However, the shaping data structures must be
shared between all threads as long as it is not guaranteed that all
packets belonging to a certain IP address are processed in the same
thread (they most probably would not be, if a round-robin algorithm is
used).

While searching the Internet for what has already been accomplished in
this area, I ran several times across the per-CPU cache issue. The
commonly accepted opinion seems to be that CPU parallelism in packet
processing implies synchronization issues, which in turn imply cache
misses, which ultimately result in performance loss. However, with
only one core at 100% and the other 7 cores idle, I doubt the CPU
cache penalty is really what matters here (this is just a guess and it
definitely needs real tests as evidence).

Thanks,

Radu Rendec