From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Michael S. Tsirkin"
Subject: Re: [PATCH] net-tun: restructure tun_do_read for better sleep/wakeup efficiency
Date: Mon, 12 May 2014 09:15:57 +0300
Message-ID: <20140512061557.GA12581@redhat.com>
References: <1399422244-22751-1-git-send-email-xii@google.com>
 <5369AB36.6030609@redhat.com>
 <536C4733.9020704@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Xi Wang, "David S. Miller", netdev@vger.kernel.org,
 Maxim Krasnyansky, Neal Cardwell, Eric Dumazet
To: Jason Wang
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:55451 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753192AbaELGRK
 (ORCPT ); Mon, 12 May 2014 02:17:10 -0400
Content-Disposition: inline
In-Reply-To: <536C4733.9020704@redhat.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, May 09, 2014 at 11:10:43AM +0800, Jason Wang wrote:
> On 05/09/2014 02:22 AM, Xi Wang wrote:
> > On Tue, May 6, 2014 at 8:40 PM, Jason Wang wrote:
> >> On 05/07/2014 08:24 AM, Xi Wang wrote:
> >>> tun_do_read always adds the current thread to the wait queue, even if
> >>> a packet is ready to read. This is inefficient because both sleeper
> >>> and waker want to acquire the wait queue spin lock when the packet
> >>> rate is high.
> >> After commit 61a5ff15ebdab87887861a6b128b108404e4706d, this will only
> >> help for blocking reads. It looks like performance-critical userspaces
> >> will use non-blocking reads.
> >>> We restructure the read function and use common kernel networking
> >>> routines to handle receive, sleep and wakeup. With the change,
> >>> available packets are checked first before the reading thread is
> >>> added to the wait queue.
> >> This is interesting, since it may help if we want to add an rx busy
> >> loop for tun. (In fact I worked on a similar patch.)
> >
> > Yes, this should be a good side effect and I am also interested in
> > trying it. Busy polling in user space is not ideal as it doesn't give
> > the lowest latency. Besides differences in interrupt latency etc.,
> > there is a bad case for non-blocking mode: when a packet arrives right
> > before the polling thread returns to userspace, the control flow has
> > to cross the kernel/userspace boundary 3 times before the packet can
> > be processed, while kernel blocking or busy polling only needs 1
> > boundary crossing.
>
> So if we want to implement this, we need a feature bit to turn it on.
> Then vhost may benefit from this.

IFF_TUN_POLL_BUSY_LOOP ?

I'm not sure it has to be a flag. Maybe an ioctl is better; if userspace
misconfigures this it is only hurting itself, right?
Maybe add a module parameter to control the polling timeout, or reuse
low_latency_poll (a rough, untested sketch is at the end of this mail).

> >
> >>> Ran performance tests with the following configuration:
> >>>
> >>> - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
> >>> - sender pinned to one core and receiver pinned to another core
> >>> - sender sends small UDP packets (64 bytes total) as fast as it can
> >>> - sandy bridge cores
> >>> - throughput numbers are receiver-side goodput
> >>>
> >>> The results are
> >>>
> >>> baseline: 757k pkts/sec, cpu utilization at 1.54 cpus
> >>> changed:  804k pkts/sec, cpu utilization at 1.57 cpus
> >>>
> >>> The performance difference is largely determined by packet rate and
> >>> inter-cpu communication cost.
> >>> For example, if the sender and receiver are pinned to different cpu
> >>> sockets, the results are
> >>>
> >>> baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
> >>> changed:  690k pkts/sec, cpu utilization at 1.67 cpus
> >> So I believe your consumer is using blocking reads. How about
> >> re-testing with non-blocking reads to make sure there is no regression?
> >
> > I tested non-blocking reads and found no regression. However, the
> > sender is the bottleneck in my case, so packet blasting is not a good
> > test for non-blocking mode. I switched to RR / ping-pong type traffic
> > through tap. The packet rates for both cases are ~477k and the
> > difference is way below noise.
> >
> >>> Co-authored-by: Eric Dumazet
> >>> Signed-off-by: Xi Wang
> >>> ---
> >>>  drivers/net/tun.c | 68 +++++++++++++++++++++----------------------------------
> >>>  1 file changed, 26 insertions(+), 42 deletions(-)
> >>>
> >>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>> index ee328ba..cb25385 100644
> >>> --- a/drivers/net/tun.c
> >>> +++ b/drivers/net/tun.c
> >>> @@ -133,8 +133,7 @@ struct tap_filter {
> >>>  struct tun_file {
> >>>  	struct sock sk;
> >>>  	struct socket socket;
> >>> -	struct socket_wq wq;
> >>> -	struct tun_struct __rcu *tun;
> >>> +	struct tun_struct __rcu *tun ____cacheline_aligned_in_smp;
> >> This seems to be an optimization unrelated to the topic. Maybe send it
> >> as another patch, but did you really see an improvement from this?
> >
> > There is an ~1% difference (not as reliable as the other data since
> > the difference is small). This is not a major performance contributor.
> >
> >>>  	struct net *net;
> >>>  	struct fasync_struct *fasync;
> >>>  	/* only used for fasnyc */
> >>> @@ -498,12 +497,12 @@ static void tun_detach_all(struct net_device *dev)
> >>>  	for (i = 0; i < n; i++) {
> >>>  		tfile = rtnl_dereference(tun->tfiles[i]);
> >>>  		BUG_ON(!tfile);
> >>> -		wake_up_all(&tfile->wq.wait);
> >>> +		tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >>>  		RCU_INIT_POINTER(tfile->tun, NULL);
> >>>  		--tun->numqueues;
> >>>  	}
> >>>  	list_for_each_entry(tfile, &tun->disabled, next) {
> >>> -		wake_up_all(&tfile->wq.wait);
> >>> +		tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >>>  		RCU_INIT_POINTER(tfile->tun, NULL);
> >>>  	}
> >>>  	BUG_ON(tun->numqueues != 0);
> >>> @@ -807,8 +806,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>  	/* Notify and wake up reader process */
> >>>  	if (tfile->flags & TUN_FASYNC)
> >>>  		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
> >>> -	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
> >>> -				   POLLRDNORM | POLLRDBAND);
> >>> +	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >>>
> >>>  	rcu_read_unlock();
> >>>  	return NETDEV_TX_OK;
> >>> @@ -965,7 +963,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
> >>>
> >>>  	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
> >>>
> >>> -	poll_wait(file, &tfile->wq.wait, wait);
> >>> +	poll_wait(file, sk_sleep(sk), wait);
> >>>
> >>>  	if (!skb_queue_empty(&sk->sk_receive_queue))
> >>>  		mask |= POLLIN | POLLRDNORM;
> >>> @@ -1330,46 +1328,21 @@ done:
> >>>  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >>>  			   const struct iovec *iv, ssize_t len, int noblock)
> >>>  {
> >>> -	DECLARE_WAITQUEUE(wait, current);
> >>>  	struct sk_buff *skb;
> >>>  	ssize_t ret = 0;
> >>> +	int peeked, err, off = 0;
> >>>
> >>>  	tun_debug(KERN_INFO, tun, "tun_do_read\n");
> >>>
> >>> -	if (unlikely(!noblock))
> >>> -		add_wait_queue(&tfile->wq.wait, &wait);
> >>> -	while (len) {
> >>> -		if (unlikely(!noblock))
> >>> -			current->state = TASK_INTERRUPTIBLE;
> >>> -
> >>> -		/* Read frames from the queue */
> >>> -		if (!(skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue))) {
> >>> -			if (noblock) {
> >>> -				ret = -EAGAIN;
> >>> -				break;
> >>> -			}
> >>> -			if (signal_pending(current)) {
> >>> -				ret = -ERESTARTSYS;
> >>> -				break;
> >>> -			}
> >>> -			if (tun->dev->reg_state != NETREG_REGISTERED) {
> >>> -				ret = -EIO;
> >>> -				break;
> >>> -			}
> >>> -
> >>> -			/* Nothing to read, let's sleep */
> >>> -			schedule();
> >>> -			continue;
> >>> -		}
> >>> +	if (!len)
> >>> +		return ret;
> >>>
> >>> +	/* Read frames from queue */
> >>> +	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
> >>> +				  &peeked, &off, &err);
> >>> +	if (skb) {
> >> This changes the userspace ABI a little bit. Originally, userspace
> >> could see different error codes and respond to them, but here it can
> >> only see zero.
> >
> > Thanks for catching this! It seems forwarding the &err parameter of
> > __skb_recv_datagram should get most of the error code compatibility
> > back?
>
> Seems not, -ERESTARTSYS and -EIO were missed.
>
> > I'll check the related code.
> >
> >>>  		ret = tun_put_user(tun, tfile, skb, iv, len);
> >>>  		kfree_skb(skb);
> >>> -		break;
> >>> -	}
> >>> -
> >>> -	if (unlikely(!noblock)) {
> >>> -		current->state = TASK_RUNNING;
> >>> -		remove_wait_queue(&tfile->wq.wait, &wait);
> >>>  	}
> >>>
> >>>  	return ret;
> >>> @@ -2187,20 +2160,28 @@ out:
> >>>  static int tun_chr_open(struct inode *inode, struct file * file)
> >>>  {
> >>>  	struct tun_file *tfile;
> >>> +	struct socket_wq *wq;
> >>>
> >>>  	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
> >>>
> >>> +	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
> >>> +	if (!wq)
> >>> +		return -ENOMEM;
> >>> +
> >> Why not just reuse the socket_wq structure inside the tun_file
> >> structure like we did in the past?
> >
> > There is no strong reason for going either way. Changing to dynamic
> > allocation is based on: less chance of cacheline contention, and
> > syncing the code pattern with the core stack.
>
> It seems to be another possible optimization unrelated to the topic;
> better to send it as another patch. But I doubt how much it will help
> performance.
>
> Checking the other socket implementations, such as the af_unix socket,
> the socket_wq structure is also embedded in the parent socket structure.
> >
> > -Xi
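
To make the module parameter idea above a bit more concrete, here is a
rough, completely untested sketch. The parameter name (tun_busy_poll_us)
and the helper are invented for illustration only, they are not part of the
posted patch, and a real version would probably want to hook into the
existing low_latency_poll machinery rather than open-code a loop:

/* Hypothetical knob: how long a blocking tun read may busy-poll before
 * falling back to sleeping on the wait queue.  0 disables busy polling. */
static unsigned int tun_busy_poll_us __read_mostly;
module_param(tun_busy_poll_us, uint, 0644);
MODULE_PARM_DESC(tun_busy_poll_us,
		 "Busy-poll timeout for tun reads in microseconds (0 = off)");

/* Sketch of a helper tun_do_read() could call before going to sleep:
 * spin on the receive queue for at most tun_busy_poll_us, then give up. */
static struct sk_buff *tun_busy_poll(struct tun_file *tfile)
{
	unsigned long end = jiffies + usecs_to_jiffies(tun_busy_poll_us);
	struct sk_buff *skb;

	if (!tun_busy_poll_us)
		return NULL;

	do {
		skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue);
		if (skb)
			return skb;
		cpu_relax();
	} while (time_before(jiffies, end) && !signal_pending(current));

	return NULL;
}

Whether this gets enabled per queue through an ioctl or globally through
the parameter is exactly the question above.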
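
On the error code point above, just as an illustration of the direction Xi
mentioned (untested, and not claiming to be the complete fix): forwarding
err from __skb_recv_datagram and keeping an explicit NETREG check would
preserve most of the old return values, since __skb_recv_datagram already
reports -EAGAIN for non-blocking reads and -ERESTARTSYS/-EINTR when a
signal interrupts the sleep:

	/* Read frames from the queue; err is filled in on failure. */
	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
				  &peeked, &off, &err);
	if (!skb) {
		/* Keep reporting -EIO for a dead device, like the old loop. */
		if (tun->dev->reg_state != NETREG_REGISTERED)
			return -EIO;
		return err;
	}

	ret = tun_put_user(tun, tfile, skb, iv, len);
	kfree_skb(skb);
	return ret;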