From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Zhang, Yanmin"
Subject: [RFC v2: Patch 1/3] net: hand off skb list to other cpu to submit to upper layer
Date: Wed, 11 Mar 2009 16:53:44 +0800
Message-ID: <1236761624.2567.442.camel@ymzhang>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: herbert@gondor.apana.org.au, jesse.brandeburg@intel.com,
	shemminger@vyatta.com, David Miller
To: netdev@vger.kernel.org, LKML
Return-path:
Received: from mga05.intel.com ([192.55.52.89]:60823 "EHLO fmsmga101.fm.intel.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1752174AbZCKIyP (ORCPT );
	Wed, 11 Mar 2009 04:54:15 -0400
Sender: netdev-owner@vger.kernel.org
List-ID:

I received some comments on v1. Special thanks to Stephen Hemminger for
teaching me what packet reordering is, along with other comments. Thanks
also to the other folks who raised comments.

v2 has some improvements.
1) Add a new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu.
   The admin can use it to configure the binding between an RX queue and a
   cpu number, so it's convenient for drivers to use the new capability.
2) Delete the function netif_rx_queue.
3) Optimize the IPI notification: no new notification is sent when the
   destination's input_pkt_alien_queue isn't empty.
4) Did lots of testing, mostly focusing on the slab allocators
   (slab/slub/slqb); currently I use SLUB with a big slub_max_order.

---

Subject: net: hand off skb list to other cpu to submit to upper layer
From: Zhang Yanmin

Recently, I have been investigating an ip_forward performance issue with
10G IXGBE NICs. I run the test on 2 machines. Each machine has 2 10G NICs.
The 1st machine sends packets with pktgen. The 2nd receives the packets
from one NIC and forwards them out through the other NIC.

Initial testing showed cpu cache sharing has an impact on speed.
As the NICs support multi-queue, I bind the queues to different logical cpus
of different physical cpus while considering cache sharing carefully. That
gives about a 30~40% improvement. Compared with the sending speed on the 1st
machine, the forwarding speed is still not good, only about 60% of the
sending speed. As a matter of fact, the IXGBE driver starts NAPI when an
interrupt arrives. With ip_forward=1, the receiver collects a packet and
forwards it out immediately. So although IXGBE collects packets with NAPI,
the forwarding really has much impact on collection. As IXGBE runs very
fast, it drops packets quickly. The best thing the receiving cpu can do is
nothing but collect packets.

Currently the kernel has the backlog to support a similar capability, but
process_backlog still runs on the receiving cpu. I enhance the backlog by
adding a new input_pkt_alien_queue to softnet_data. The receiving cpu
collects packets and links them into an skb list, then delivers the list to
the input_pkt_alien_queue of another cpu. process_backlog picks up the skb
list from input_pkt_alien_queue when input_pkt_queue is empty.

I tested my patch on top of 2.6.28.5. The improvement is about 43%.

Some questions:
1) Reordering: my method doesn't introduce a reordering issue, because we
   use an N:1 mapping between RX queues and cpu numbers.
2) If there is no free cpu to work on packet collection: it depends on cpu
   resource allocation. We could allocate more RX queues to the same cpu.
   In my new testing, the forwarding speed is about 4.8M pps (packets per
   second, with a packet size of 60 bytes) on the Nehalem machine; the 8
   packet-processing cpus have almost no idle time, while the receiving cpu
   is about 50% idle. I only have 4 old NICs and couldn't test this
   question further.
3) Packet delay: I didn't calculate or measure it, and might measure it
   later. The forwarding speed is close to 270M bytes/s. At least sar shows
   receiving mostly matches forwarding.
   But on the sending side, the sending speed is larger than the forwarding
   speed, although my method shrinks the difference considerably.
4) 10G NICs other than IXGBE: I have no other 10G NICs now.
5) Other kinds of machines working as the forwarder: I tested between a 2*4
   Stoakley and a 2*4*2 Nehalem. I reversed the test and found the
   improvement on Stoakley is less than 30%, not as big as on Nehalem.
6) Memory utilization: my Nehalem machine has 12GB of memory. To reach the
   maximum speed, I tried netdev_max_backlog=400000. That sometimes consumes
   10GB of memory.
7) Any impact if a driver enables the new capability but the admin doesn't
   configure it: I haven't measured the speed difference yet.
8) If the receiving cpu collects packets very fast and the processing cpu
   is slow: we can start many RX queues on the receiving cpu and bind them
   to different processing cpus.

The current patch is against 2.6.29-rc7.

Signed-off-by: Zhang Yanmin

---

--- linux-2.6.29-rc7/include/linux/netdevice.h	2009-03-09 15:20:49.000000000 +0800
+++ linux-2.6.29-rc7_backlog/include/linux/netdevice.h	2009-03-11 10:17:08.000000000 +0800
@@ -1119,6 +1119,9 @@ static inline int unregister_gifconf(uns
 /*
  * Incoming packets are placed on per-cpu queues so that
  * no locking is needed.
+ * To speed up fast network, sometimes place incoming packets
+ * to other cpu queues. Use input_pkt_alien_queue.lock to
+ * protect input_pkt_alien_queue.
  */
 struct softnet_data
 {
@@ -1127,6 +1130,7 @@ struct softnet_data
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	struct sk_buff_head	input_pkt_alien_queue;
 	struct napi_struct	backlog;
 };
 
@@ -1368,6 +1372,8 @@ extern void dev_kfree_skb_irq(struct sk_
 extern void		dev_kfree_skb_any(struct sk_buff *skb);
 
 #define HAVE_NETIF_RX 1
+extern int		raise_netif_irq(int cpu,
+				struct sk_buff_head *skb_queue);
 extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
 #define HAVE_NETIF_RECEIVE_SKB 1
--- linux-2.6.29-rc7/net/core/dev.c	2009-03-09 15:20:50.000000000 +0800
+++ linux-2.6.29-rc7_backlog/net/core/dev.c	2009-03-11 10:27:57.000000000 +0800
@@ -1997,6 +1997,114 @@ int netif_rx_ni(struct sk_buff *skb)
 
 EXPORT_SYMBOL(netif_rx_ni);
 
+static void net_drop_skb(struct sk_buff_head *skb_queue)
+{
+	struct sk_buff *skb = __skb_dequeue(skb_queue);
+
+	while (skb) {
+		__get_cpu_var(netdev_rx_stat).dropped++;
+		kfree_skb(skb);
+		skb = __skb_dequeue(skb_queue);
+	}
+}
+
+static int net_backlog_local_merge(struct sk_buff_head *skb_queue)
+{
+	struct softnet_data *queue;
+	unsigned long flags;
+
+	queue = &__get_cpu_var(softnet_data);
+	if (queue->input_pkt_queue.qlen + skb_queue->qlen <=
+			netdev_max_backlog) {
+
+		local_irq_save(flags);
+		if (!queue->input_pkt_queue.qlen)
+			napi_schedule(&queue->backlog);
+		skb_queue_splice_tail_init(skb_queue, &queue->input_pkt_queue);
+		local_irq_restore(flags);
+
+		return 0;
+	} else {
+		net_drop_skb(skb_queue);
+		return 1;
+	}
+}
+
+static void net_napi_backlog(void *data)
+{
+	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+
+	napi_schedule(&queue->backlog);
+	kfree(data);
+}
+
+static int net_backlog_notify_cpu(int cpu)
+{
+	struct call_single_data *data;
+
+	data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+	if (!data)
+		return -1;
+
+	data->func = net_napi_backlog;
+	data->info = data;
+	data->flags = 0;
+	__smp_call_function_single(cpu, data);
+
+	return 0;
+}
+
+int raise_netif_irq(int cpu, struct sk_buff_head *skb_queue)
+{
+	unsigned long flags;
+	struct softnet_data *queue;
+	int retval, need_notify = 0;
+
+	if (!skb_queue || skb_queue_empty(skb_queue))
+		return 0;
+
+	/*
+	 * If cpu is offline, we queue skb back to
+	 * the queue on current cpu.
+	 */
+	if ((unsigned)cpu >= nr_cpu_ids ||
+		!cpu_online(cpu) ||
+		cpu == smp_processor_id()) {
+		net_backlog_local_merge(skb_queue);
+		return 0;
+	}
+
+	queue = &per_cpu(softnet_data, cpu);
+	if (queue->input_pkt_alien_queue.qlen > netdev_max_backlog)
+		goto failed1;
+
+	spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+	if (skb_queue_empty(&queue->input_pkt_alien_queue))
+		need_notify = 1;
+	skb_queue_splice_tail_init(skb_queue,
+			&queue->input_pkt_alien_queue);
+	spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+			flags);
+
+	if (need_notify) {
+		retval = net_backlog_notify_cpu(cpu);
+		if (unlikely(retval))
+			goto failed2;
+	}
+
+	return 0;
+
+failed2:
+	spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+	skb_queue_splice_tail_init(&queue->input_pkt_alien_queue, skb_queue);
+	spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+			flags);
failed1:
+	net_drop_skb(skb_queue);
+
+	return 1;
+}
+
 static void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
@@ -2336,6 +2444,13 @@ static void flush_backlog(void *arg)
 	struct net_device *dev = arg;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+	skb_queue_splice_tail_init(
+			&queue->input_pkt_alien_queue,
+			&queue->input_pkt_queue);
+	spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, flags);
 
 	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
 		if (skb->dev == dev) {
@@ -2594,9 +2709,19 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
-			__napi_complete(napi);
-			local_irq_enable();
-			break;
+			if (!skb_queue_empty(&queue->input_pkt_alien_queue)) {
+				spin_lock(&queue->input_pkt_alien_queue.lock);
+				skb_queue_splice_tail_init(
+						&queue->input_pkt_alien_queue,
+						&queue->input_pkt_queue);
+				spin_unlock(&queue->input_pkt_alien_queue.lock);
+
+				skb = __skb_dequeue(&queue->input_pkt_queue);
+			} else {
+				__napi_complete(napi);
+				local_irq_enable();
+				break;
+			}
 		}
 		local_irq_enable();
 
@@ -4985,6 +5110,11 @@ static int dev_cpu_callback(struct notif
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
+	spin_lock(&oldsd->input_pkt_alien_queue.lock);
+	skb_queue_splice_tail_init(&oldsd->input_pkt_alien_queue,
+			&oldsd->input_pkt_queue);
+	spin_unlock(&oldsd->input_pkt_alien_queue.lock);
+
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
 		netif_rx(skb);
 
@@ -5184,10 +5314,13 @@ static int __init net_dev_init(void)
 		struct softnet_data *queue;
 
 		queue = &per_cpu(softnet_data, i);
+
 		skb_queue_head_init(&queue->input_pkt_queue);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		skb_queue_head_init(&queue->input_pkt_alien_queue);
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -5247,6 +5380,7 @@ EXPORT_SYMBOL(netdev_set_master);
 EXPORT_SYMBOL(netdev_state_change);
 EXPORT_SYMBOL(netif_receive_skb);
 EXPORT_SYMBOL(netif_rx);
+EXPORT_SYMBOL(raise_netif_irq);
 EXPORT_SYMBOL(register_gifconf);
 EXPORT_SYMBOL(register_netdevice);
 EXPORT_SYMBOL(register_netdevice_notifier);