* Remarkably increase iptables' speed on SMP systems
From: John Ye @ 2007-09-28 2:15 UTC
To: netfilter-devel; +Cc: john ye, YE QY
All,
iptables can't make full use of SMP because it runs in softirq context.
There are many reports complaining that when netfilter runs, only one or two CPUs are busy doing softirq work while
the others sit idle; see http://www.ussg.iu.edu/hypermail/linux/kernel/0702.0/1833.html, and you can find many more
such reports by googling 'iptables SMP softirq'.
The situation gets especially bad when iptables' load is high, for example when there are too many rules to match or
too many connections to track.
irqbalance looks like it solves this problem, but it does NOT. Balancing IRQs among CPUs does not take full advantage
of SMP in any sense: periodically shifting the NIC IRQ among CPUs gains no extra processing speed (because the CPUs
do not run the softirq concurrently); when the IRQ is shifted from CPU0 to CPU1, CPU1 becomes busy and CPU0 goes idle.
The Linux network IRQ handling code tends to collect the same IRQ from different CPUs onto one CPU when a NIC is busy.
This tendency keeps irqbalance from working well: after running iptables for some time, the originally balanced IRQs
may become unbalanced. And even when irqbalance works well, iptables' processing capacity does not go up.
There is a kernel patch that lets softirq network code (iptables included) run concurrently on every CPU of an SMP system.
We wrote the kernel patch, and a loadable module as well, to fully resolve the iptables SMP issue.
We have discussed it with kernel netdev experts; it should work.
The patch (module) will greatly increase the speed of iptables by making full use of every CPU in an SMP system.
It can be viewed and downloaded from the blog http://blog.chinaunix.net/u/12848/showart.php?id=389602
You are welcome to review and test it without patching and re-compiling the kernel.
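To make the mechanism concrete, here is a minimal user-space sketch of the same dispatch pattern (a flow hash into
per-worker queues). The names (worker, dispatch, flow id) are illustrative only; the real patch queues sk_buffs and
runs its bottom-half work on keventd:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_WORKERS 2
#define QUEUE_LEN 64

/* One queue per worker, like the patch's per-CPU sk_buff queues.
 * No overflow handling: this is a sketch, not the implementation. */
struct queue {
        int pkts[QUEUE_LEN];
        int head, tail;
        pthread_mutex_t lock;
        pthread_cond_t more;
};

static struct queue queues[NR_WORKERS];

/* "Bottom half": each worker drains only its own queue. */
static void *worker(void *arg)
{
        struct queue *q = &queues[(long)arg];
        for (;;) {
                pthread_mutex_lock(&q->lock);
                while (q->head == q->tail)
                        pthread_cond_wait(&q->more, &q->lock);
                int pkt = q->pkts[q->head++ % QUEUE_LEN];
                pthread_mutex_unlock(&q->lock);
                printf("worker %ld handles packet %d\n", (long)arg, pkt);
        }
        return NULL;
}

/* "Top half": pick a worker by flow hash, enqueue, and return at once. */
static void dispatch(int flow_id, int pkt)
{
        struct queue *q = &queues[flow_id % NR_WORKERS];
        pthread_mutex_lock(&q->lock);
        q->pkts[q->tail++ % QUEUE_LEN] = pkt;
        pthread_cond_signal(&q->more);
        pthread_mutex_unlock(&q->lock);
}

int main(void)
{
        pthread_t tid[NR_WORKERS];
        for (long i = 0; i < NR_WORKERS; i++) {
                pthread_mutex_init(&queues[i].lock, NULL);
                pthread_cond_init(&queues[i].more, NULL);
                pthread_create(&tid[i], NULL, worker, (void *)i);
        }
        for (int pkt = 0; pkt < 8; pkt++)
                dispatch(pkt % 3 /* stand-in for a flow hash */, pkt);
        sleep(1); /* let the workers drain before exiting */
        return 0;
}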
Thanks.
John Ye & Qianyu Ye
* Re: Remarkably increase iptables' speed on SMP systems
From: Amin Azez @ 2007-09-28 12:18 UTC
To: John Ye; +Cc: netfilter-devel, YE QY
* John Ye wrote, On 28/09/07 03:15:
> There is a kernel patch that lets softirq network code (iptables included) run concurrently on every CPU of an SMP system.
> We wrote the kernel patch, and a loadable module as well, to fully resolve the iptables SMP issue.
> We have discussed it with kernel netdev experts; it should work.
>
> The patch (module) will greatly increase the speed of iptables by making full use of every CPU in an SMP system.
>
> It can be viewed and downloaded from the blog http://blog.chinaunix.net/u/12848/showart.php?id=389602
> You are welcome to review and test it without patching and re-compiling the kernel.
This looks interesting, and I hope worthwhile.

I wonder if it is likely to re-order packets within the same flow?

I.e. packets which take more processing may leave the bridge/router
after a packet of the same flow which arrived later.

Cases where this seems more likely are generally those where not every
packet of the same flow requires the same level of processing.

Obvious examples are:
* UDP SNAT, where only the first packet traverses the nat table
* layer7, where once it stops matching, the very next packet may get
through before the one that was the last to be matched
* packet-count or rate-based rules that only sometimes call secondary
chains, so a packet may be delayed more than the next packet if that
one doesn't match.

TCP (with SACK) may not be much bothered by this, but some GRE or UDP
protocols may care (and get degraded service). It may also foil upstream
flow analysis, which makes me realise that the layer7 match (and probably
the string match) is open to being deceived by deliberately out-of-order
packets or intermediate fake packets with bad TCP sequence numbers. Hmm.
Sam
* Re: Remarkably increase iptables' speed on SMP systems
From: Henrik Nordstrom @ 2007-09-28 13:29 UTC
To: Amin Azez; +Cc: John Ye, netfilter-devel, YE QY
On Fri, 2007-09-28 at 13:18 +0100, Amin Azez wrote:
> I.e. packets which take more processing may leave the bridge/router
> after a packet of the same flow which arrived later.

From what I can tell, the patch already deals with that by distributing
work based on source/destination if you want (tunable; see bs_policy).
Regards
Henrik
* Re: Remarkably increase iptables' speed on SMP systems
From: Jan Engelhardt @ 2007-09-28 13:52 UTC
To: John Ye; +Cc: netfilter-devel, YE QY
On Sep 28 2007 10:15, John Ye wrote:
>
>It can be viewed and downloaded from the blog http://blog.chinaunix.net/u/12848/showart.php?id=389602
>You are welcome to review and test it without patching and re-compiling the kernel.

Well, send a patch. I have no idea what to make of that single file,
which obviously even has some code that does not look nice.
* Re: Remarkably increase iptables' speed on SMP systems
From: Rennie deGraaf @ 2007-09-28 16:01 UTC
To: Amin Azez; +Cc: John Ye, netfilter-devel, YE QY
Amin Azez wrote:
> * John Ye wrote, On 28/09/07 03:15:
>
>> There is a kernel patch that lets softirq network code (iptables included) run concurrently on every CPU of an SMP system.
>> We wrote the kernel patch, and a loadable module as well, to fully resolve the iptables SMP issue.
>> We have discussed it with kernel netdev experts; it should work.
>>
>> The patch (module) will greatly increase the speed of iptables by making full use of every CPU in an SMP system.
>>
>> It can be viewed and downloaded from the blog http://blog.chinaunix.net/u/12848/showart.php?id=389602
>> You are welcome to review and test it without patching and re-compiling the kernel.
>
>
> This looks interesting, and I hope worthwhile.
>
> I wonder if it is likely to re-order packets within the same flow?
>
> I.e. packets which take more processing may leave the bridge/router
> after a packet of the same flow which arrived later.
>
> Cases where this seems more likely are generally those where not every
> packet of the same flow requires the same level of processing.
>
> Obvious examples are:
> * UDP SNAT, where only the first packet traverses the nat table
> * layer7, where once it stops matching, the very next packet may get
> through before the one that was the last to be matched
> * packet-count or rate-based rules that only sometimes call secondary
> chains, so a packet may be delayed more than the next packet if that
> one doesn't match.
>
> TCP (with SACK) may not be much bothered by this, but some GRE or UDP
> protocols may care (and get degraded service). It may also foil upstream
> flow analysis, which makes me realise that the layer7 match (and probably
> the string match) is open to being deceived by deliberately out-of-order
> packets or intermediate fake packets with bad TCP sequence numbers. Hmm.
Unless something is seriously wrong with netfilter or the patch, packets
should almost never be re-ordered unless their inter-arrival times are
less than a few milliseconds. If the packets are that close together,
then existing network equipment will re-order them with non-trivial
probability, so any additional re-ordering introduced by this patch
shouldn't matter. Applications need to be built to handle this, and if
anything in netfilter depends on packets arriving in the correct order
(other than things like TCP SYN segments, which can't be delivered out of
order if both endpoints are following the protocol), it should be fixed.

There is a great paper by Bennett, Partridge and Shectman titled "Packet
Reordering is Not Pathological Network Behavior", published in IEEE/ACM
Transactions on Networking in December 1999, that examines this issue and
its effects on TCP; the observation that parallelism in network devices
causes packet re-ordering is nothing new. I did an experimental analysis
of the effect of inter-packet time on delivery order last winter; my
results are in Appendix A of my MSc thesis, which is available at
http://pages.cpsc.ucalgary.ca/~degraaf/papers/thesis-degraaf.pdf
Rennie deGraaf
* Re: Remarkably increase iptables' speed on SMP systems
From: John Ye @ 2007-09-29 9:52 UTC
To: Amin Azez; +Cc: netfilter-devel, johny, iceburgue
Sam,
Thanks for your reply.
In terms of packet re-ordering, TCP has no problem, because I hash the CPU as
cpu = (IP_SRC + IP_DST + tcp_src_port + tcp_dst_port) % nr_cpus.
This makes sure one TCP connection is only ever processed on one CPU.
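As a minimal user-space sketch of that hash (the function name is illustrative; the in-kernel code reads these fields
from the skb instead):

#include <stdio.h>

/* Every packet of one connection produces the same sum, so a whole
 * flow always lands on the same CPU and keeps its ordering. */
static unsigned int flow_cpu(unsigned int saddr, unsigned int daddr,
                             unsigned int sport, unsigned int dport,
                             unsigned int nr_cpus)
{
        return (saddr + daddr + sport + dport) % nr_cpus;
}

int main(void)
{
        /* Two packets of the same flow on a 4-CPU box: same CPU twice. */
        printf("%u\n", flow_cpu(0x0a000001, 0x0a000002, 12345, 80, 4));
        printf("%u\n", flow_cpu(0x0a000001, 0x0a000002, 12345, 80, 4));
        return 0;
}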
For UDP (SNAT), the code doesn't consider the problem yet. We need to test, and if this
turns out to be a real problem, we will change the code.
The main purpose of my message is simply to let you review the BS patch code, point out
potential problems for netfilter/iptables on SMP, and give suggestions; you are all
network experts.
GRE is the protocol for MS-style VPNs, if I remember correctly? How GRE re-ordering
behaves, I have no idea yet.
Bridging does not work with BS yet. I checked the code in net/core/dev.c; bridging is
handled before IP(?). I can make it SMP-able later if needed.
Ha-ha, for networking, a packet being processed too quickly is not always a good thing.
BS_POL_RANDOM (/proc/sys/net/bs_policy) simply dispatches each skb to a random CPU
without considering re-ordering; it is for testing only.
Even with the random CPU hash, network speed can be doubled when iptables' load is very
high, because the other CPUs join the work.
John Ye
----- Original Message -----
From: "Amin Azez" <azez@ufomechanic.net>
To: "John Ye" <johny@asimco.com.cn>
Cc: <netfilter-devel@vger.kernel.org>; "YE QY" <iceburgue@gmail.com>
Sent: Friday, September 28, 2007 8:18 PM
Subject: Re: Remarkably increase iptables' speed on SMP systems
[-- quoted text trimmed; see Amin Azez's message above --]
* Re: Remarkably increase iptables' speed on SMP systems
From: john ye @ 2007-09-29 13:23 UTC
To: jengelh, netfilter-devel, iceburgue, John Ye
Dear Everyone,
OK, I will send you the patch soon. I have thought that a loadable module would be much
better than a kernel patch, because you don't need to recompile and rebuild the kernel.
The packet re-ordering can be avoided by hashing the CPU with a simple and quick formula:
cpu = (iph->saddr + iph->daddr + skb->h.th->source + skb->h.th->dest) % nr_cpus (for TCP).
So you can see, one TCP connection is always dispatched to one CPU.
This is an issue similar to CONNTRACK, but we don't need anything as complicated as
connection tracking; a simple hash should be enough.
As I said in a previous email, we have not considered the re-ordering issue for other
protocols, such as UDP (SNAT), GRE, etc.
The key is to hash to a CPU (0 to nr_cpus-1) based on the packet; the hash should be
simple and quick.
John Ye
----- Original Message -----
From: "Jan Engelhardt" <jengelh@computergmbh.de>
To: "John Ye" <johny@asimco.com.cn>
Cc: <netfilter-devel@vger.kernel.org>; "YE QY" <iceburgue@gmail.com>
Sent: Friday, September 28, 2007 9:52 PM
Subject: Re: Remarkably increase iptables' speed on SMP systems
[-- quoted text trimmed; see Jan Engelhardt's message above --]
* Re: Remarkably increase iptables' speed on SMP systems
From: john ye @ 2007-10-01 7:21 UTC
To: Amin Azez; +Cc: netfilter-devel, YE QY, John Ye
All,
This is the patch for kernel 2.6.13-15-smp. I don't have build environments for other kernel versions.
If you want the module for multiple kernel versions, you can fetch it via FTP from
ftp://218.247.5.185
login: netfilter
pass: iptables
The file is main.c.
John Ye
Thanks.
--------------------------------------------------------------------------------
--- old/net/ipv4/ip_input.c 2007-09-20 20:50:31.000000000 +0800
+++ new/net/ipv4/ip_input.c 2007-10-02 00:43:37.000000000 +0800
@@ -362,6 +362,187 @@
return NET_RX_DROP;
}
+
+#define CONFIG_BOTTOM_SOFTIRQ_SMP
+#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
+
+/*
+ *
+Bottom Softirq Implementation. John Ye, 2007.08.27
+
+Why this patch:
+It makes the kernel able to execute softirq net code concurrently on SMP systems,
+taking full advantage of SMP to handle more packets and greatly raising NIC throughput.
+The current kernel's net packet processing logic is:
+1) The CPU which handles a hardirq also executes its related softirq.
+2) One softirq instance (the irqs handled by one CPU) can't execute on more than
+one CPU at the same time.
+These limitations make it hard for kernel networking to take advantage of SMP.
+
+How this patch works:
+It splits the current softirq code into two parts: the cpu-sensitive top half,
+and the cpu-insensitive bottom half (called BS), then makes the bottom half
+execute concurrently across the SMP system.
+The two parts are not equal in size or load. The top part has constant code
+size (mainly in net/core/dev.c and the NIC drivers), while the bottom part involves
+netfilter (iptables), whose load varies greatly. An iptables setup with 1000 rules
+to match will make the bottom part's load very high. So, if the bottom-part softirq
+can be distributed across processors and run concurrently on them, the network
+gains much more packet handling capacity and throughput increases remarkably.
+
+Where it is useful:
+It's useful on SMP machines that meet the following two conditions:
+1) high kernel network load (for example, running iptables with thousands of rules);
+2) more CPUs than active NICs (e.g. a 4-CPU machine with 2 NICs).
+On such systems, as softirq load increases, some CPUs stay idle while others
+(as many as there are NICs) keep busy.
+irqbalance may help, but it only shifts IRQs among CPUs and creates no softirq
+concurrency; balancing the load across CPUs will not remarkably increase network speed.
+
+Where it is NOT useful:
+If the bottom half of the softirq is too small (no iptables running), or the network
+is too idle, the BS patch will show no visible effect. But it has no negative
+effect either.
+Users can turn BS functionality on/off via the /proc/sys/net/bs_enable switch.
+
+How to test:
+On a Linux box, run iptables and add 2000 rules to the filter and nat tables to
+simulate a huge softirq load. Then open 20 ftp sessions downloading a big file.
+On another machine (which uses this test machine as its gateway), open 20 more
+ftp download sessions. Compare the speed with BS disabled and with BS enabled.
+cat /proc/sys/net/bs_enable - the switch to turn BS on/off
+cat /proc/sys/net/bs_status - shows the usage of each CPU
+Tests showed that when bottom-softirq load is high, network throughput can be
+nearly doubled on a 2-CPU machine; hopefully it may be quadrupled on a 4-CPU box.
+
+Bugs:
+It does NOT allow CPU hotplug.
+It only allows consecutive CPU ids, from 0 to num_online_cpus()-1;
+for example, 0,1,2,3 is OK; 0,1,8,9 is not.
+
+Some considerations for the future:
+1) With the BS patch, the irq balance code in arch/i386/kernel/io_apic.c seems
+unnecessary, at least for network irqs.
+2) Softirq load will become very small. It only runs the top half of the old
+softirq, which is much less expensive than the bottom half (the netfilter code).
+To let the top softirq process more packets, can't these three network parameters
+be enlarged?
+extern int netdev_max_backlog = 1000;
+extern int netdev_budget = 300;
+extern int weight_p = 64;
+3) BS currently runs on the built-in keventd threads; should we create new
+workqueues for it to run on?
+
+Signed-off-by: John Ye (Seeker) <johny@asimco.com.cn>
+ *
+ */
+
+struct cpu_stat {
+ unsigned long irqs; //total irqs I have
+ unsigned long dids; //I did myself
+ unsigned long others; //help others
+ unsigned long works; //# of enqueues
+};
+#define BS_CPU_STAT_DEFINED
+
+static int nr_cpus = 0;
+
+static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues); // cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct work_struct, bs_works);
+struct cpu_stat bs_cpu_status[NR_CPUS];
+
+int bs_enable = 1;
+
+#define BS_POL_LINK 1
+#define BS_POL_RANDOM 2
+int bs_policy = BS_POL_LINK;
+
+static int ip_rcv1(struct sk_buff *skb, struct net_device *dev)
+{
+ return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish, nf_hook_input_cond(skb));
+}
+
+
+static void bs_func(void *data)
+{
+ unsigned long flags; /* spin_lock_irqsave() needs unsigned long */
+ int num, cpu;
+ struct sk_buff *skb;
+ struct work_struct *bs_works;
+ struct sk_buff_head *q;
+ cpu = smp_processor_id();
+
+ bs_works = &per_cpu(bs_works, cpu);
+ q = &per_cpu(bs_cpu_queues, cpu);
+
+ local_bh_disable();
+restart:
+ num = 0;
+ while(1) {
+ spin_lock_irqsave(&q->lock, flags);
+ skb = __skb_dequeue(q);
+ spin_unlock_irqrestore(&q->lock, flags);
+ if(!skb) break;
+ num++;
+ //local_bh_disable();
+ ip_rcv1(skb, skb->dev);
+ //__local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ }
+
+ bs_cpu_status[cpu].others += num;
+ if(num > 0) { goto restart; }
+
+ __local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ bs_works->func = 0;
+
+ return;
+}
+
+/* COPY_IN_START_FROM kernel/workqueue.c */
+struct cpu_workqueue_struct {
+
+ spinlock_t lock;
+
+ long remove_sequence; /* Least-recently added (next to run) */
+ long insert_sequence; /* Next to add */
+
+ struct list_head worklist;
+ wait_queue_head_t more_work;
+ wait_queue_head_t work_done;
+
+ struct workqueue_struct *wq;
+ struct task_struct *thread;
+
+ int run_depth; /* Detect run_workqueue() recursion depth */
+} ____cacheline_aligned;
+
+
+struct workqueue_struct {
+ struct cpu_workqueue_struct cpu_wq[NR_CPUS];
+ const char *name;
+ struct list_head list; /* Empty if single thread */
+};
+/* COPY_IN_END_FROM kernel/worqueue.c */
+
+extern struct workqueue_struct *keventd_wq;
+
+/* Preempt must be disabled. */
+static void __queue_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cwq->lock, flags);
+ work->wq_data = cwq;
+ list_add_tail(&work->entry, &cwq->worklist);
+ cwq->insert_sequence++;
+ wake_up(&cwq->more_work);
+ spin_unlock_irqrestore(&cwq->lock, flags);
+}
+#endif //CONFIG_BOTTOM_SOFTIRQ_SMP
+
+
/*
* Main IP Receive routine.
*/
@@ -424,8 +605,67 @@
}
}
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
+ if(!nr_cpus)
+ nr_cpus = num_online_cpus();
+
+ if(bs_enable && nr_cpus > 1 && iph->protocol != IPPROTO_ICMP) {
+ //if(bs_enable && iph->protocol == IPPROTO_ICMP) { //test on icmp first
+ unsigned long flags; unsigned int cur, cpu;
+ struct work_struct *bs_works;
+ struct sk_buff_head *q;
+
+ cpu = cur = smp_processor_id();
+
+ bs_cpu_status[cur].irqs++;
+
+ //good point from Jamal, thanks: no reordering
+ if(bs_policy == BS_POL_LINK) {
+ int seed = 0;
+ if(iph->protocol == IPPROTO_TCP)
+ seed = skb->h.th->source + skb->h.th->dest;
+ cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;
+ } else
+ //random distribute
+ if(bs_policy == BS_POL_RANDOM)
+ cpu = (bs_cpu_status[cur].irqs % nr_cpus);
+
+ if(cpu == cur) {
+ bs_cpu_status[cpu].dids++;
+ return ip_rcv1(skb, dev);
+ }
+
+ q = &per_cpu(bs_cpu_queues, cpu);
+
+ if(!q->next) {
+ skb_queue_head_init(q);
+ }
+
+ bs_works = &per_cpu(bs_works, cpu);
+ spin_lock_irqsave(&q->lock, flags);
+ __skb_queue_tail(q, skb);
+ spin_unlock_irqrestore(&q->lock, flags);
+ //if(net_ratelimit()) printk("qlen %d\n", q->qlen);
+
+ if (!bs_works->func) {
+ INIT_WORK(bs_works, bs_func, q);
+ bs_cpu_status[cpu].works++;
+ preempt_disable();
+ __queue_work(keventd_wq->cpu_wq + cpu, bs_works);
+ preempt_enable();
+ }
+ } else {
+ int cpu = smp_processor_id();
+ bs_cpu_status[cpu].irqs++;
+ bs_cpu_status[cpu].dids++;
+ return ip_rcv1(skb, dev);
+ }
+ return 0;
+#else
return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
- ip_rcv_finish, nf_hook_input_cond(skb));
+ ip_rcv_finish, nf_hook_input_cond(skb));
+#endif //CONFIG_BOTTOM_SOFTIRQ_SMP
+
inhdr_error:
IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
--- old/net/sysctl_net.c 2007-09-20 23:30:29.000000000 +0800
+++ new/net/sysctl_net.c 2007-10-02 00:32:42.000000000 +0800
@@ -30,6 +30,22 @@
extern struct ctl_table tr_table[];
#endif
+
+#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+#if !defined(BS_CPU_STAT_DEFINED)
+struct cpu_stat {
+ unsigned long irqs; //total irqs
+ unsigned long dids; //I did,
+ unsigned long others;
+ unsigned long works;
+};
+#endif
+extern struct cpu_stat bs_cpu_status[NR_CPUS];
+
+extern int bs_enable;
+#endif
+
struct ctl_table net_table[] = {
{
.ctl_name = NET_CORE,
@@ -61,5 +77,33 @@
.child = tr_table,
},
#endif
+
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+ {
+ .ctl_name = 99,
+ .procname = "bs_status",
+ .data = &bs_cpu_status,
+ .maxlen = sizeof(bs_cpu_status),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = 99,
+ .procname = "bs_policy",
+ .data = &bs_policy,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = 99,
+ .procname = "bs_enable",
+ .data = &bs_enable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
+
{ 0 },
};
--- old/kernel/workqueue.c 2007-09-21 04:48:13.000000000 +0800
+++ new/kernel/workqueue.c 2007-10-02 00:39:05.000000000 +0800
@@ -384,7 +384,12 @@
kfree(wq);
}
+/*
static struct workqueue_struct *keventd_wq;
+*/
+/* EXPORTed so I have access */
+struct workqueue_struct *keventd_wq;
+EXPORT_SYMBOL(keventd_wq);
int fastcall schedule_work(struct work_struct *work)
{
* Re: Remarkably increase iptables' speed on SMP systems
From: john ye @ 2007-10-01 12:10 UTC
To: john ye, Amin Azez; +Cc: netfilter-devel, YE QY, John Ye
All,
I am very sorry to be sending a patch for the patch.
The following lines:
+ if(bs_policy == BS_POL_LINK) {
+ int seed = 0;
+ if(iph->protocol == IPPROTO_TCP)
+ seed = skb->h.th->source + skb->h.th->dest;
+ cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;
+ } else
should be changed to:
if(bs_policy == BS_POL_LINK) {
int seed = 0;
if(iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
struct tcphdr *th = (struct tcphdr *)(skb->nh.iph + 1); //udp is same as tcp
seed = ntohs(th->source) + ntohs(th->dest);
}
cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;
} else
This is because skb->h.th has not yet been filled in with the correct value at ip_rcv time
(it is only set later, before tcp_rcv is called);
use the raw IP packet to get the TCP/UDP source and destination ports.
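As a side note, pointing just past the IP header with iph + 1 assumes a 20-byte header with no IP options. A safer
sketch (user-space struct iphdr and our own helper name, not the patch's code) derives the offset from ihl:

#include <stdio.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

/* Locate the TCP/UDP header behind an IPv4 header. Deriving the offset
 * from ihl (header length in 32-bit words) also covers packets that
 * carry IP options, which "iph + 1" would silently misparse. */
static struct tcphdr *transport_header(struct iphdr *iph)
{
        return (struct tcphdr *)((unsigned char *)iph + iph->ihl * 4);
}

int main(void)
{
        struct iphdr iph = { .ihl = 6 }; /* 20-byte header + 4 bytes of options */
        printf("transport header offset: %ld bytes\n",
               (long)((unsigned char *)transport_header(&iph) - (unsigned char *)&iph));
        return 0;
}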
Also, I think you had better get the module for testing;
recompiling the kernel is neither a good nor a quick thing to do.
John Ye & Qianye Ye.
----- Original Message -----
From: "john ye" <johny@asimco.com.cn>
To: "Amin Azez" <azez@ufomechanic.net>
Cc: <netfilter-devel@vger.kernel.org>; "YE QY" <iceburgue@gmail.com>; "John Ye" <johny@asimco.com.cn>
Sent: Monday, October 01, 2007 3:21 PM
Subject: Re: Remarkably increase iptables' speed on SMP systems
[-- quoted message trimmed; the quoted text and patch are identical to the previous message --]
* Re: Remarkably increase iptables' speed on SMP systems
From: john ye @ 2007-10-08 12:04 UTC
To: Amin Azez; +Cc: netfilter-devel
All,
Now, the BS version 2 patch module is available. It supports all protocols (version 1 only worked for IPv4).
So now all netfilter code (for example, bridge netfilter and IPv6 netfilter) can be parallelized on SMP.
I have made the module compile and run on the following kernel versions:
#define KERNEL_VERSION_2_6_13__ //2.6.13-15 OK
#define KERNEL_VERSION_2_6_16 //2.6.16.53 OK ------------ this one is selected.
#define KERNEL_VERSION_2_6_17__ //2.6.17.9 #3 OK
#define KERNEL_VERSION_2_6_18__ //2.6.18.8 & 2.6.18.2-34 OK
#define KERNEL_VERSION_2_6_19__ //2.6.19 #1 OK
#define KERNEL_VERSION_2_6_20__ //2.6.20 OK
#define KERNEL_VERSION_2_6_21__ //2.6.21.1 OK
#define KERNEL_VERSION_2_6_22__ //2.6.22.5 OK
#define KERNEL_VERSION_2_6_23__ //2.6.23-rc8 OK
This makes testing much easier, with no need to rebuild the kernel.
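For testing, the knobs can also be driven from a tiny C helper instead of the shell (a sketch, run as root; the /proc
entries are the ones the patch registers, the helper itself is ours):

#include <stdio.h>

/* Set the dispatch policy and dump the per-CPU counters.
 * Per the patch: 0 = off, 1 = flow-hash dispatch, 2 = random dispatch. */
int main(void)
{
        FILE *f = fopen("/proc/sys/net/bs_policy", "w");
        if (!f) { perror("bs_policy"); return 1; }
        fprintf(f, "1\n");
        fclose(f);

        f = fopen("/proc/sys/net/bs_status", "r");
        if (!f) { perror("bs_status"); return 1; }
        char line[256];
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}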
The file can be downloaded at
ftp://218.247.5.185
login: netfilter
password: iptables
files:
main.c is version 1 of BS
bs_smp2.c is version 2 of BS
bs2.tar.gz is a tar file of bs_smp2.c
You are welcome to review the code and test the performance.
John Ye
----- Original Message -----
From: "Amin Azez" <azez@ufomechanic.net>
To: "John Ye" <johny@asimco.com.cn>
Cc: <netfilter-devel@vger.kernel.org>; "YE QY" <iceburgue@gmail.com>
Sent: Friday, September 28, 2007 8:18 PM
Subject: Re: Remarkably increase iptables' speed on SMP systems
[-- quoted text trimmed; see Amin Azez's message above --]
* Re: Remarkably increase iptables' speed on SMP systems
From: Patrick McHardy @ 2007-10-08 16:40 UTC
To: john ye; +Cc: Amin Azez, netfilter-devel
john ye wrote:
> The file can be downloaded at
> ftp://218.247.5.185
> login: netfilter
> password: iptables
>
> files:
> main.c is version 1 of BS
> bs_smp2.c is version 2 of BS
> bs2.tar.gz is a tar file of bs_smp2.c
>
> You are welcome to review the code and test the performance.
If you don't post patches to the list, chances are good that nobody
is even going to look at them.
* Re: Remarkably increase iptables' speed on SMP systems
From: John Ye @ 2007-10-10 1:48 UTC
To: Patrick McHardy; +Cc: Amin Azez, netfilter-devel, john ye
All,
This is the BS version 2 patch, for kernel 2.6.23-rc8.
If you need the patch for another version, I can make it and mail it to you.
BS version 2 moves the parallelization point from ip_rcv to netif_receive_skb.
So it can support every protocol's netfilter hooks, not only IPv4.
Our preliminary tests showed good results when the iptables load is high.
I need your review and tests; only your tests are trustworthy.
John Ye
-------------------------------------------------------------------------------
--- linux-2.6.23-rc8/net/core/dev.c 2007-09-25 08:33:10.000000000 +0800
+++ linux-2.6.23-rc8/net/core/dev.c 2007-10-10 09:30:30.000000000 +0800
@@ -1919,12 +1919,269 @@
}
#endif
+
+#define CONFIG_BOTTOM_SOFTIRQ_SMP
+#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+
+
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
+
+/*
+[PATCH: 2.6.13-15-SMP 1/2] network: concurrently run softirq network code on SMP
+Bottom Softirq Implementation. John Ye, 2007.08.27
+
+This is the version 2 BS patch. It parallelizes every protocol's netfilter code
+running in softirq: IPv4, IPv6, bridge, etc.
+
+Why this patch:
+It makes the kernel able to execute softirq net code concurrently on SMP systems,
+taking full advantage of SMP to handle more packets and greatly raising NIC throughput.
+The current kernel's net packet processing logic is:
+1) The CPU which handles a hardirq also executes its related softirq.
+2) One softirq instance (the irqs handled by one CPU) can't execute on more than
+one CPU at the same time.
+These limitations make it hard for kernel networking to take advantage of SMP.
+
+How this patch works:
+It splits the current softirq code into two parts: the cpu-sensitive top half,
+and the cpu-insensitive bottom half (called BS), then makes the bottom half
+execute concurrently across the SMP system.
+The two parts are not equal in size or load. The top part has constant code
+size (mainly in net/core/dev.c and the NIC drivers), while the bottom part involves
+netfilter (iptables), whose load varies greatly. An iptables setup with 1000 rules
+to match will make the bottom part's load very high. So, if the bottom-part softirq
+can be distributed across processors and run concurrently on them, the network
+gains much more packet handling capacity and throughput increases remarkably.
+
+Where it is useful:
+It's useful on SMP machines that meet the following two conditions:
+1) high kernel network load (for example, running iptables with thousands of rules);
+2) more CPUs than active NICs (e.g. a 4-CPU machine with 2 NICs).
+On such systems, as softirq load increases, some CPUs stay idle while others
+(as many as there are NICs) keep busy.
+irqbalance may help, but it only shifts IRQs among CPUs and creates no softirq
+concurrency; balancing the load across CPUs will not remarkably increase network speed.
+
+Where it is NOT useful:
+If the bottom half of the softirq is too small (no iptables running), or the network
+is too idle, the BS patch will show no visible effect. But it has no negative
+effect either.
+Users can turn off BS functionality by setting /proc/sys/net/bs_policy to 0.
+
+How to test:
+On a Linux box, run iptables and add 2000 rules to the filter and nat tables to
+simulate a huge softirq load. Then open 20 ftp sessions downloading a big file.
+On another machine (which uses this test machine as its gateway), open 20 more
+ftp download sessions. Compare the speed with BS disabled and with BS enabled.
+cat /proc/sys/net/bs_policy - 1 for flow dispatch, 2 for random dispatch, 0 for none
+cat /proc/sys/net/bs_status - shows the usage of each CPU
+Tests showed that when bottom-softirq load is high, network throughput can be
+nearly doubled on a 2-CPU machine; hopefully it may be quadrupled on a 4-CPU box.
+
+Bugs:
+It does NOT allow CPU hotplug.
+It only allows consecutive CPU ids, from 0 to num_online_cpus()-1;
+for example, 0,1,2,3 is OK; 0,1,8,9 is not.
+
+Some considerations for the future:
+1) With the BS patch, the irq balance code in arch/i386/kernel/io_apic.c seems
+unnecessary, at least for network irqs.
+2) Softirq load will become very small. It only runs the top half of the old
+softirq, which is much less expensive than the bottom half (the netfilter code).
+To let the top softirq process more packets, can these three network parameters
+be given larger values?
+extern int netdev_max_backlog = 1000;
+extern int netdev_budget = 300;
+extern int weight_p = 64;
+3) BS currently runs on the built-in keventd threads; should we create new
+workqueues for it to run on?
+
+Signed-off-by: John Ye (Seeker) <johny@webizmail.com>
+*/
+
+
+#define CBPTR( skb ) (*((void **)(skb->cb)))
+#define BS_USE_PERCPU_DATA
+struct cpu_stat
+{
+ unsigned long irqs; //total irqs
+ unsigned long dids; //I did,
+ unsigned long works;
+};
+#define BS_CPU_STAT_DEFINED
+
+static int nr_cpus = 0;
+
+#define BS_POL_LINK 1
+#define BS_POL_RANDOM 2
+int bs_policy = BS_POL_LINK; //cpu hash. 0 will turn off BS. 1 link based, 2 random
+
+static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues);
+static DEFINE_PER_CPU(struct work_struct, bs_works);
+//static DEFINE_PER_CPU(struct cpu_stat, bs_cpu_status);
+struct cpu_stat bs_cpu_status[NR_CPUS];
+
+//static int __netif_recv_skb(struct sk_buff *skb, struct net_device *odev);
+static int __netif_recv_skb(struct sk_buff *skb);
+
+static void bs_func(struct work_struct *data)
+{
+ unsigned long flags; int num, cpu; /* spin_lock_irqsave() needs unsigned long */
+ struct sk_buff *skb;
+ struct work_struct *bs_works;
+ struct sk_buff_head *q;
+ cpu = smp_processor_id();
+
+ bs_works = &per_cpu(bs_works, cpu);
+ q = &per_cpu(bs_cpu_queues, cpu);
+
+ //local_bh_disable();
+ restart:
+
+ num = 0;
+ while(1)
+ {
+ spin_lock_irqsave(&q->lock, flags);
+ if(!(skb = __skb_dequeue(q))) {
+ spin_unlock_irqrestore(&q->lock, flags);
+ break;
+ }
+ spin_unlock_irqrestore(&q->lock, flags);
+ num++;
+
+ local_bh_disable();
+ __netif_recv_skb(skb);
+ local_bh_enable(); // sub_preempt_count(SOFTIRQ_OFFSET - 1);
+ }
+
+ bs_cpu_status[cpu].dids += num;
+ //if(num > 2) printk("%d %d\n", num, cpu);
+ if(num > 0)
+ goto restart;
+
+ //__local_bh_enable();
+ bs_works->func = 0;
+
+ return;
+}
+
+struct cpu_workqueue_struct {
+
+ spinlock_t lock;
+
+ struct list_head worklist;
+ wait_queue_head_t more_work;
+ struct work_struct *current_work;
+
+ struct workqueue_struct *wq;
+ struct task_struct *thread;
+
+ int run_depth; /* Detect run_workqueue() recursion depth */
+} ____cacheline_aligned;
+
+struct workqueue_struct {
+ struct cpu_workqueue_struct *cpu_wq;
+ struct list_head list;
+ const char *name;
+ int singlethread;
+ int freezeable; /* Freeze threads during suspend */
+};
+
+#ifndef CONFIG_BOTTOM_SOFTIRQ_MODULE
+extern void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work);
+extern struct workqueue_struct *keventd_wq;
+#endif
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+
+static inline int bs_dispatch(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+
+ if(!nr_cpus)
+ nr_cpus = num_online_cpus();
+
+ if(bs_policy && nr_cpus > 1) { // && iph->protocol != IPPROTO_ICMP) {
+ //if(bs_policy && nr_cpus > 1 && iph->protocol == IPPROTO_ICMP) { //test on icmp first
+ unsigned long flags; unsigned int cur, cpu;
+ struct work_struct *bs_works;
+ struct sk_buff_head *q;
+
+ cpu = cur = smp_processor_id();
+
+ bs_cpu_status[cur].irqs++;
+
+ //good point from Jamal, thanks: no reordering
+ if(bs_policy == BS_POL_LINK) {
+ int seed = 0;
+ if(iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
+ struct tcphdr *th = (struct tcphdr*)(iph + 1); //udp is same as tcp
+ seed = ntohs(th->source) + ntohs(th->dest);
+ }
+ cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;
+
+ /*
+ if(net_ratelimit() && iph->protocol == IPPROTO_TCP) {
+ struct tcphdr *th = iph + 1;
+
+ printk("seed %u (%u %u) cpu %d. source %d dest %d\n",
+ seed, iph->saddr + iph->daddr, iph->saddr + iph->daddr + seed, cpu,
+ ntohs(th->source), ntohs(th->dest));
+ }
+ */
+ } else
+ //random distribute
+ if(bs_policy == BS_POL_RANDOM)
+ cpu = (bs_cpu_status[cur].irqs % nr_cpus);
+
+ //cpu = cur;
+ //cpu = (cur? 0: 1);
+
+ if(cpu == cur) {
+ bs_cpu_status[cpu].dids++;
+ return __netif_recv_skb(skb);
+ }
+
+ q = &per_cpu(bs_cpu_queues, cpu);
+
+ if(!q->next) { // || skb_queue_len(q) == 0 ) {
+ skb_queue_head_init(q);
+ }
+
+
+ bs_works = &per_cpu(bs_works, cpu);
+ spin_lock_irqsave(&q->lock, flags);
+ __skb_queue_tail(q, skb);
+ spin_unlock_irqrestore(&q->lock, flags);
+
+ if (!bs_works->func) {
+ INIT_WORK(bs_works, bs_func);
+ bs_cpu_status[cpu].works++;
+ preempt_disable();
+ set_bit(WORK_STRUCT_PENDING, work_data_bits(bs_works));
+ __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), bs_works);
+ preempt_enable();
+ }
+
+ } else {
+
+ bs_cpu_status[smp_processor_id()].dids++;
+ return __netif_recv_skb(skb);
+ }
+ return 0;
+}
+
+
+
+#endif
+
+
int netif_receive_skb(struct sk_buff *skb)
{
- struct packet_type *ptype, *pt_prev;
+ //struct packet_type *ptype, *pt_prev;
struct net_device *orig_dev;
- int ret = NET_RX_DROP;
- __be16 type;
+ //int ret = NET_RX_DROP;
+ //__be16 type;
/* if we've gotten here through NAPI, check netpoll */
if (skb->dev->poll && netpoll_rx(skb))
@@ -1947,6 +2204,19 @@
skb_reset_transport_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
+ CBPTR(skb) = orig_dev;
+ return bs_dispatch(skb);
+}
+
+int __netif_recv_skb(struct sk_buff *skb)
+{
+ struct packet_type *ptype, *pt_prev;
+ struct net_device *orig_dev;
+ int ret = NET_RX_DROP;
+ __be16 type;
+
+ orig_dev = CBPTR(skb);
+ CBPTR(skb) = 0;
pt_prev = NULL;
rcu_read_lock();
--- linux-2.6.23-rc8/kernel/workqueue.c 2007-09-25 08:33:10.000000000 +0800
+++ linux-2.6.23-rc8/kernel/workqueue.c 2007-10-10 08:52:05.000000000 +0800
@@ -138,7 +138,9 @@
}
/* Preempt must be disabled. */
-static void __queue_work(struct cpu_workqueue_struct *cwq,
+//static void __queue_work(struct cpu_workqueue_struct *cwq,
+// struct work_struct *work)
+void __queue_work(struct cpu_workqueue_struct *cwq,
struct work_struct *work)
{
unsigned long flags;
@@ -515,7 +517,12 @@
}
EXPORT_SYMBOL(cancel_delayed_work_sync);
+
+/*
static struct workqueue_struct *keventd_wq __read_mostly;
+*/
+struct workqueue_struct *keventd_wq __read_mostly;
+
/**
* schedule_work - put work task in global workqueue
@@ -848,5 +855,6 @@
cpu_singlethread_map = cpumask_of_cpu(singlethread_cpu);
hotcpu_notifier(workqueue_cpu_callback, 0);
keventd_wq = create_workqueue("events");
+ printk("keventd_wq %p %p OK.\n", keventd_wq, keventd_wq->cpu_wq);
BUG_ON(!keventd_wq);
}
--- linux-2.6.23-rc8/net/sysctl_net.c 2007-09-25 08:33:10.000000000 +0800
+++ linux-2.6.23-rc8/net/sysctl_net.c 2007-10-09 21:10:41.000000000 +0800
@@ -29,6 +29,15 @@
#include <linux/if_tr.h>
#endif
+struct cpu_stat
+{
+ unsigned long irqs; /* total irqs on me */
+ unsigned long dids; /* I did, */
+ unsigned long works; /* q works */
+};
+extern int bs_policy;
+extern struct cpu_stat bs_cpu_status[NR_CPUS];
+
struct ctl_table net_table[] = {
{
.ctl_name = NET_CORE,
@@ -36,6 +45,24 @@
.mode = 0555,
.child = core_table,
},
+
+ {
+ .ctl_name = 99,
+ .procname = "bs_status",
+ .data = &bs_cpu_status,
+ .maxlen = sizeof(bs_cpu_status),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = 99,
+ .procname = "bs_policy",
+ .data = &bs_policy,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+
#ifdef CONFIG_INET
{
.ctl_name = NET_IPV4,
Thread overview: 12+ messages
2007-09-28 2:15 Remarkably increase iptables' speed on SMP systems John Ye
2007-09-28 12:18 ` Amin Azez
2007-09-28 13:29 ` Henrik Nordstrom
2007-09-28 16:01 ` Rennie deGraaf
2007-09-29 9:52 ` John Ye
2007-10-01 7:21 ` john ye
2007-10-01 12:10 ` john ye
2007-10-08 12:04 ` john ye
2007-10-08 16:40 ` Patrick McHardy
2007-10-10 1:48 ` John Ye
2007-09-28 13:52 ` Jan Engelhardt
[not found] <001201c80298$3509ac10$0201a8c0@ibmea4709fd199>
2007-09-29 13:23 ` john ye