From: Stephen Hemminger
Subject: Re: [RFC v1] hand off skb list to other cpu to submit to upper layer
Date: Tue, 24 Feb 2009 21:18:59 -0800
Message-ID: <20090224211859.0ee8c632@nehalam>
References: <1235525270.2604.483.camel@ymzhang>
	<20090224181153.06aa1fbd@nehalam>
	<1235529343.2604.499.camel@ymzhang>
In-Reply-To: <1235529343.2604.499.camel@ymzhang>
To: "Zhang, Yanmin"
Cc: "netdev@vger.kernel.org", LKML, jesse.brandeburg@intel.com

On Wed, 25 Feb 2009 10:35:43 +0800
"Zhang, Yanmin" wrote:

> On Tue, 2009-02-24 at 18:11 -0800, Stephen Hemminger wrote:
> > On Wed, 25 Feb 2009 09:27:49 +0800
> > "Zhang, Yanmin" wrote:
> >
> > > Subject: hand off skb list to other cpu to submit to upper layer
> > > From: Zhang Yanmin
> > >
> > > Recently, I have been investigating an ip_forward performance issue
> > > with a 10G IXGBE NIC. I run the test on 2 machines. Each machine has
> > > 2 10G NICs. The 1st machine sends packets with pktgen. The 2nd
> > > receives the packets on one NIC and forwards them out through the
> > > 2nd NIC. As the NICs support multi-queue, I bind the queues to
> > > different logical cpus of different physical cpus while considering
> > > cache sharing carefully.
> > >
> > > Compared with the sending speed on the 1st machine, the forwarding
> > > speed is not good, only about 60% of the sending speed. As a matter
> > > of fact, the IXGBE driver starts NAPI when an interrupt arrives.
> > > When ip_forward=1, the receiver collects a packet and forwards it
> > > out immediately. So although IXGBE collects packets with NAPI, the
> > > forwarding really has much impact on collection. As IXGBE runs very
> > > fast, it drops packets quickly. The best thing the receiving cpu can
> > > do is nothing but collect packets.
> > >
> > > Currently the kernel has the backlog to support a similar
> > > capability, but process_backlog still runs on the receiving cpu. I
> > > enhance the backlog by adding a new input_pkt_alien_queue to
> > > softnet_data. The receiving cpu collects packets, links them into an
> > > skb list, then delivers the list to the input_pkt_alien_queue of
> > > another cpu. process_backlog picks up the skb list from
> > > input_pkt_alien_queue when input_pkt_queue is empty.
> > >
> > > A NIC driver could use this capability with the steps below in its
> > > NAPI RX cleanup function.
> > > 1) Initialize a local var struct sk_buff_head skb_head;
> > > 2) In the packet collection loop, just call netif_rx_queue or
> > > __skb_queue_tail(skb_head, skb) to add the skb to the list;
> > > 3) Before exiting, call raise_netif_irq to submit the skb list to a
> > > specific cpu.
> > >
> > > Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget
> > > before testing.
> > >
> > > I tested my patch on top of 2.6.28.5. The improvement is about 43%.
> > >
> > > Signed-off-by: Zhang Yanmin
> > >
> > > ---
> Thanks for your comments.
>
> >
> > You can't safely put packets on another CPU queue without adding a spinlock.
> input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
> that lock to protect the queue.
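For that to hold, every producer-side hand-off has to take that lock. The
pattern I would expect is roughly the following (a sketch only, not code
from the patch: input_pkt_alien_queue is the field your description adds,
and I am assuming the stock skb_queue_splice helpers that went in around
2.6.27):

	/* Producer (receiving cpu): splice a privately built list onto
	 * the remote cpu's alien queue under the sk_buff_head lock.
	 */
	static void hand_off_skb_list(int cpu, struct sk_buff_head *local_list)
	{
		struct softnet_data *remote = &per_cpu(softnet_data, cpu);
		unsigned long flags;

		spin_lock_irqsave(&remote->input_pkt_alien_queue.lock, flags);
		skb_queue_splice_tail_init(local_list,
					   &remote->input_pkt_alien_queue);
		spin_unlock_irqrestore(&remote->input_pkt_alien_queue.lock,
				       flags);
	}

One splice per NAPI poll would at least keep the lock hold time bounded.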
I was reading netif_rx_queue() and you have it using __skb_queue_tail(),
which has no locking.

> > And if you add the spinlock, you drop the performance back down for your
> > device and all the other devices.
> My testing shows a 43% improvement. As multi-core machines are becoming
> popular, we can allocate some cores for packet collection only.
>
> I use the spinlock carefully. The delivering cpu locks it only when
> input_pkt_queue is empty, and just merges the list into input_pkt_queue.
> Later skb dequeues needn't hold the spinlock. On the other hand, the
> original receiving cpu dispatches a batch of skbs (64 packets with the
> IXGBE default) while holding the lock once.
>
> > Also, you will end up reordering
> > packets which hurts single stream TCP performance.
> Would you like to elaborate on the scenario? Do you mean multi-queue also
> hurts single stream TCP performance when we bind the multi-queue
> (interrupts) to different cpus?
>
> >
> > Is this all because the hardware doesn't do MSI-X
> IXGBE supports MSI-X and I enabled it when testing. The receiver has 16
> queues, so 16 irq numbers. I bind 2 irq numbers per logical cpu of one
> physical cpu.
>
> > or are you testing only
> > a single flow.
> What does a single flow mean here? One sender? I do start one sender for
> testing because I couldn't get enough hardware.

Multiple receive queues only give a performance gain if the packets are being
sent with different SRC/DST address pairs. That is how the hardware is supposed
to break them into queues.

Reordering is what happens when packets that are sent as [ 0, 1, 2, 3, 4 ]
get received as [ 0, 1, 4, 3, 2 ] because your receive processing happened on
different CPU's. You really need to test this with some program like 'iperf' to
see the effect it has on TCP. Older Juniper routers used to have hardware that
did this and it caused performance loss. Do some google searches and you will
see it is an active research topic whether reordering is okay or not. Existing
multiqueue is safe because it doesn't reorder inside a single flow; it only
changes order between flows: [ A1, A2, B1, B2 ] => [ A1, B1, A2, B2 ]

>
> In addition, my patch doesn't change the old interface, so there would be
> no performance hurt for old drivers.
>
> yanmin
>
>

Isn't this a problem:

> +int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
>  {
>  	struct softnet_data *queue;
>  	unsigned long flags;
> +	int this_cpu;
>
>  	/* if netpoll wants it, pretend we never saw it */
>  	if (netpoll_rx(skb))
> @@ -1943,24 +1946,31 @@ int netif_rx(struct sk_buff *skb)
>  	if (!skb->tstamp.tv64)
>  		net_timestamp(skb);
>
> +	if (skb_queue)
> +		this_cpu = 0;
> +	else
> +		this_cpu = 1;

Why bother with a special boolean? Instead just test for skb_queue != NULL.

> +
>  	/*
>  	 * The code is rearranged so that the path is the most
>  	 * short when CPU is congested, but is still operating.
>  	 */
>  	local_irq_save(flags);
> +
>  	queue = &__get_cpu_var(softnet_data);
> +	if (!skb_queue)
> +		skb_queue = &queue->input_pkt_queue;
>
>  	__get_cpu_var(netdev_rx_stat).total++;
> -	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
> -		if (queue->input_pkt_queue.qlen) {
> -enqueue:
> -			__skb_queue_tail(&queue->input_pkt_queue, skb);
> -			local_irq_restore(flags);
> -			return NET_RX_SUCCESS;
> +
> +	if (skb_queue->qlen <= netdev_max_backlog) {
> +		if (!skb_queue->qlen && this_cpu) {
> +			napi_schedule(&queue->backlog);
> +		}

Won't this break if skb_queue is NULL (the non-NAPI case)?
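If the intent is just "NULL means use this cpu's normal backlog", derive the
flag from the pointer where it is used and spell the two paths out. An
untested sketch of the same function against 2.6.28, reusing your names
(this_cpu becomes a local bool):

	int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
	{
		struct softnet_data *queue;
		unsigned long flags;
		/* NULL skb_queue means "use this cpu's normal backlog" */
		bool local = (skb_queue == NULL);

		/* ... netpoll and timestamp checks as in your patch ... */

		local_irq_save(flags);
		queue = &__get_cpu_var(softnet_data);
		if (local)
			skb_queue = &queue->input_pkt_queue;

		__get_cpu_var(netdev_rx_stat).total++;
		if (skb_queue->qlen <= netdev_max_backlog) {
			/* kick NAPI only when feeding this cpu's own backlog */
			if (!skb_queue->qlen && local)
				napi_schedule(&queue->backlog);
			__skb_queue_tail(skb_queue, skb);
			local_irq_restore(flags);
			return NET_RX_SUCCESS;
		}

		__get_cpu_var(netdev_rx_stat).dropped++;
		local_irq_restore(flags);
		kfree_skb(skb);
		return NET_RX_DROP;
	}

Whether the unlocked __skb_queue_tail() is safe here still depends on
skb_queue never pointing at another cpu's queue, which is the locking
question above.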