Netdev List


* RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-06  5:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@c2.user-mode-linux.org, yzhao81@gmail.com
In-Reply-To: <20100401110841.GE3323@redhat.com>

Michael,
>> 
>> For the DoS issue, I'm not sure what a reasonable limit is on how much
>> get_user_pages() can pin. Should we derive it from the bandwidth?

>There's a ulimit for locked memory. Can we use this, decreasing
>the value for rlimit array? We can do this when backend is
>enabled and re-increment when backend is disabled.

I tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found that its
initial value is 0x10000. After a right shift by PAGE_SHIFT, that is only
16 pages we can lock, which seems too small, since the guest virtio-net
driver may submit a lot of requests at one time.
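
As a rough illustration of the arithmetic above, the following user-space sketch converts a locked-memory limit into a page count. The helper names (pages_for_limit, lockable_pages) are hypothetical, and the in-kernel check would read the task's rlimit directly rather than call getrlimit().

```c
#include <sys/resource.h>
#include <unistd.h>

/* Whole pages that fit under a byte limit; same result as
 * limit >> PAGE_SHIFT when page_size is a power of two. */
unsigned long long pages_for_limit(unsigned long long limit_bytes,
				   unsigned long long page_size)
{
	return page_size ? limit_bytes / page_size : 0;
}

/* Stores the number of pages the calling process may lock in *pages.
 * Returns 0 on success, 1 if the limit is unlimited, -1 on error. */
int lockable_pages(unsigned long long *pages)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	*pages = pages_for_limit((unsigned long long)rl.rlim_cur,
				 (unsigned long long)page_size);
	return 0;
}
```

Assuming 4096-byte pages, pages_for_limit(0x10000, 4096) gives the 16 pages mentioned above.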


Thanks
Xiaohui



* Re: [RFC PATCH 1/2] netdev: buffer infrastructure to log network driver's information
From: Koki Sanagi @ 2010-04-06  5:43 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Miller, eric.dumazet, netdev, izumi.taku, kaneshige.kenji,
	jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	alexander.h.duyck, peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <20100406001034.GA2156@localhost.localdomain>

(2010/04/06 9:10), Neil Horman wrote:
> On Mon, Apr 05, 2010 at 12:31:55PM -0700, David Miller wrote:
>> From: Eric Dumazet<eric.dumazet@gmail.com>
>> Date: Mon, 05 Apr 2010 10:42:26 +0200
>>
>>> Le lundi 05 avril 2010 à 15:52 +0900, Koki Sanagi a écrit :
>>>> This patch implements buffer infrastructure under driver/net.
>>>> This buffer records information from network driver.
>>>>
>>>> Signed-off-by: Koki Sanagi<sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>>    drivers/net/Kconfig     |    8 +
>>>>    drivers/net/Makefile    |    1 +
>>>>    drivers/net/ndrvbuf.c   |  535 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>    include/linux/ndrvbuf.h |   57 +++++
>>>>    4 files changed, 601 insertions(+), 0 deletions(-)
>>>>
>>>
>>> Wow, 600 lines... thats what I call bloat...
>>
>> And we have all sorts of facilities for creating filesystem
>> streams and ring buffers of debug information.
>>
>> You could even hook into 'perf' to log and process these
>> events in probably like 12 lines of code.
>>
> I'm still having a hard time understanding why this approach is preferable to
> the previous approach you took using tracepoints.  Granted you can't get driver
> internal state as easily, but its generic and doesn't do...this.
> Neil
>
>
>

With this patch we can get the following information:

1. Whether the driver is operating normally
2. The state of the Tx ring

The previous approach covers 1, but 2 needs some hooks in the driver code,
as in this patch. With that information we can work on resolving the
"Tx Unit Hung" message, which indicates that processing of the tx descriptor
ring is not proceeding smoothly. When a countermeasure is applied to a system
that outputs "Tx Unit Hung", this state information can be used to evaluate
the countermeasure.
But what you say is true: this patch is not generic.
It may be good to rebase the previous approach to focus on 1,
and to consider 2 separately.
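
To make item 2 concrete, here is a minimal sketch of the kind of Tx ring state that could be recorded and how a stalled ring might be recognized from it. Everything here is an assumption for illustration: the structure, its field names, and the timeout check are hypothetical and are not taken from the patch or from any particular driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of a tx descriptor ring. */
struct tx_ring_snapshot {
	uint16_t next_to_use;		/* producer index: next free descriptor */
	uint16_t next_to_clean;		/* consumer index: next to be reclaimed */
	uint64_t oldest_pending_stamp;	/* when the oldest pending descriptor was queued */
};

/* The ring looks hung when descriptors are still pending and the oldest
 * one has been waiting longer than the timeout (same time unit as now). */
bool tx_ring_looks_hung(const struct tx_ring_snapshot *s,
			uint64_t now, uint64_t timeout)
{
	bool pending = s->next_to_use != s->next_to_clean;

	return pending && (now - s->oldest_pending_stamp) > timeout;
}
```

Logging such a snapshot before and after a countermeasure is one way the state information could be used to evaluate it.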




* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Xin, Xiaohui @ 2010-04-06  5:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <20100404114028.GF3189@redhat.com>

Michael,
> >>> For the write logging, do you have a function at hand that we can use
> >>> to recompute the log? If so, I think I can use it to recompute the
> >>> log info when logging is suddenly enabled.
> >>> For the outstanding requests, do you mean all the user buffers
> >>> submitted before the logging ioctl changed? That may be a lot, and
> >>> some of them are still in NIC ring descriptors. Waiting for them to
> >>> finish may take some time. I think it is also reasonable for the
> >>> logging change to take effect just after the logging ioctl changes.
 
> >>The key point is that after the logging ioctl returns, any
> >>subsequent change to memory must be logged. It does not
> >>matter when the request was submitted; otherwise we will
> >>get memory corruption on migration.

> >The change to memory happens in vhost_add_used_and_signal(), right?
> >So after the ioctl returns, it is OK to just recompute the log info for the
> >events in the async queue, since the ioctl and write log operations are all
> >protected by vq->mutex.
 
>> Thanks
>> Xiaohui

>Yes, I think this will work.

Thanks. Do you have the function to recompute the log info at hand, so that I can
use it? I vaguely remember that you mentioned it some time ago.
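
The rule discussed above (decide whether to log when a request completes, not when it was submitted) can be sketched in user space as follows. This is only a simplified model under stated assumptions: the types and names (vq_sketch, async_req_sketch, complete_req) are hypothetical, and the real code would run under vq->mutex and call the vhost logging helpers instead of counting bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request state saved at submission time. */
struct async_req_sketch {
	unsigned int head;	/* descriptor head index */
	size_t len;		/* bytes written to guest memory */
	int log_count;		/* log entries captured at submission */
};

/* Hypothetical virtqueue state; log_all models VHOST_F_LOG_ALL. */
struct vq_sketch {
	bool log_all;
	size_t logged_bytes;
	unsigned int completed;
};

/* Completion path: the logging flag is sampled here, so a request that
 * was submitted before logging was enabled is still logged if it
 * completes afterwards. */
void complete_req(struct vq_sketch *vq, const struct async_req_sketch *req)
{
	vq->completed++;
	if (vq->log_all && req->log_count > 0)
		vq->logged_bytes += req->len;
}
```

In this model, a request submitted with logging off and completed after log_all is set still contributes to logged_bytes, which is the property needed to avoid missed writes on migration.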

> > Thanks
> > Xiaohui
> > 
> >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/vhost/vhost.h |   10 +++
> >  2 files changed, 192 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 22d5fef..2aafd90 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -17,11 +17,13 @@
> >  #include <linux/workqueue.h>
> >  #include <linux/rcupdate.h>
> >  #include <linux/file.h>
> > +#include <linux/aio.h>
> >  
> >  #include <linux/net.h>
> >  #include <linux/if_packet.h>
> >  #include <linux/if_arp.h>
> >  #include <linux/if_tun.h>
> > +#include <linux/mpassthru.h>
> >  
> >  #include <net/sock.h>
> >  
> > @@ -47,6 +49,7 @@ struct vhost_net {
> >  	struct vhost_dev dev;
> >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > +	struct kmem_cache       *cache;
> >  	/* Tells us whether we are polling a socket for TX.
> >  	 * We only do this when socket buffer fills up.
> >  	 * Protected by tx vq lock. */
> > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> >  }
> >  
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > +	if (!list_empty(&vq->notifier)) {
> > +		iocb = list_first_entry(&vq->notifier,
> > +				struct kiocb, ki_list);
> > +		list_del(&iocb->ki_list);
> > +	}
> > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > +	return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	struct vhost_log *vq_log = NULL;
> > +	int rx_total_len = 0;
> > +	int log, size;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	if (vq->receiver)
> > +		vq->receiver(vq);
> > +
> > +	vq_log = unlikely(vhost_has_feature(
> > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, iocb->ki_nbytes);
> > +		log = (int)iocb->ki_user_data;
> > +		size = iocb->ki_nbytes;
> > +		rx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		kmem_cache_free(net->cache, iocb);
> > +
> > +		if (unlikely(vq_log))
> > +			vhost_log_write(vq, vq_log, log, size);
> > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	int tx_total_len = 0;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, 0);
> > +		tx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +
> > +		kmem_cache_free(net->cache, iocb);
> > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> >  /* Expects to be always run from workqueue - which acts as
> >   * read-size critical section for our kind of RCU. */
> >  static void handle_tx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, s;
> >  	struct msghdr msg = {
> >  		.msg_name = NULL,
> > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> >  		tx_poll_stop(net);
> >  	hdr_size = vq->hdr_size;
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> >  		/* Skip header. TODO: support TSO. */
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> >  		msg.msg_iovlen = out;
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->ki_pos = head;
> > +			iocb->private = (void *)vq;
> > +		}
> > +
> >  		len = iov_length(vq->iov, out);
> >  		/* Sanity check */
> >  		if (!len) {
> > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> >  			break;
> >  		}
> >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> >  		if (unlikely(err < 0)) {
> >  			vhost_discard_vq_desc(vq);
> >  			tx_poll_start(net, sock);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		if (err != len)
> >  			pr_err("Truncated TX packet: "
> >  			       " len %d != %zd\n", err, len);
> > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> >  static void handle_rx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, log, s;
> >  	struct vhost_log *vq_log;
> >  	struct msghdr msg = {
> > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> >  	int err;
> >  	size_t hdr_size;
> >  	struct socket *sock = rcu_dereference(vq->private_data);
> > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> >  		return;
> >  
> >  	use_mm(net->dev.mm);
> > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> >  	vhost_disable_notify(vq);
> >  	hdr_size = vq->hdr_size;
> >  
> > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > +	/* In async cases, for write logging, the simple way is to get
> > +	 * the log info always, and really logging is decided later.
> > +	 * Thus, when logging enabled, we can get log, and when logging
> > +	 * disabled, we can get log disabled accordingly.
> > +	 */
> > +
> > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> >  		vq->log : NULL;
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> >  		msg.msg_iovlen = in;
> >  		len = iov_length(vq->iov, in);
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->private = vq;
> > +			iocb->ki_pos = head;
> > +			iocb->ki_user_data = log;
> > +		}
> >  		/* Sanity check */
> >  		if (!len) {
> >  			vq_err(vq, "Unexpected header len for RX: "
> > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> >  			       iov_length(vq->hdr, s), hdr_size);
> >  			break;
> >  		}
> > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > +
> > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> >  		if (err < 0) {
> >  			vhost_discard_vq_desc(vq);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		/* TODO: Should check and handle checksum. */
> >  		if (err > len) {
> >  			pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> >  
> > +
> >  static void handle_tx_kick(struct work_struct *work)
> >  {
> >  	struct vhost_virtqueue *vq;
> > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > +	n->cache = NULL;
> >  	return 0;
> >  }
> >  
> > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> >  }
> >  
> > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > +{
> > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> > +	if (n->cache) {
> > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > +			kmem_cache_free(n->cache, iocb);
> > +		kmem_cache_destroy(n->cache);
> > +	}
> > +}
> > +
> >  static int vhost_net_release(struct inode *inode, struct file *f)
> >  {
> >  	struct vhost_net *n = f->private_data;
> > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	/* We do an extra flush before freeing memory,
> >  	 * since jobs can re-queue themselves. */
> >  	vhost_net_flush(n);
> > +	vhost_notifier_cleanup(n);
> >  	kfree(n);
> >  	return 0;
> >  }
> > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> >  	return sock;
> >  }
> >  
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > +	struct file *file = fget(fd);
> > +	struct socket *sock;
> > +	if (!file)
> > +		return ERR_PTR(-EBADF);
> > +	sock = mp_get_socket(file);
> > +	if (IS_ERR(sock))
> > +		fput(file);
> > +	return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> >  {
> >  	struct socket *sock;
> >  	if (fd == -1)
> > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> >  	sock = get_tun_socket(fd);
> >  	if (!IS_ERR(sock))
> >  		return sock;
> > +	sock = get_mp_socket(fd);
> > +	if (!IS_ERR(sock)) {
> > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > +		return sock;
> > +	}
> >  	return ERR_PTR(-ENOTSOCK);
> >  }
> >  
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > +	struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +		vq->receiver = NULL;
> > +		INIT_LIST_HEAD(&vq->notifier);
> > +		spin_lock_init(&vq->notify_lock);
> > +		if (!n->cache) {
> > +			n->cache = kmem_cache_create("vhost_kiocb",
> > +					sizeof(struct kiocb), 0,
> > +					SLAB_HWCACHE_ALIGN, NULL);
> > +		}
> > +	}
> > +}
> > +
> >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  {
> >  	struct socket *sock, *oldsock;
> > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	}
> >  	vq = n->vqs + index;
> >  	mutex_lock(&vq->mutex);
> > -	sock = get_socket(fd);
> > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > +	sock = get_socket(vq, fd);
> >  	if (IS_ERR(sock)) {
> >  		r = PTR_ERR(sock);
> >  		goto err;
> >  	}
> >  
> > +	vhost_init_link_state(n, index);
> > +
> >  	/* start polling new socket */
> >  	oldsock = vq->private_data;
> >  	if (sock == oldsock)
> > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	vhost_net_disable_vq(n, vq);
> >  	rcu_assign_pointer(vq->private_data, sock);
> >  	vhost_net_enable_vq(n, vq);
> > -	mutex_unlock(&vq->mutex);
> >  done:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	if (oldsock) {
> >  		vhost_net_flush_vq(n, index);
> > @@ -516,6 +690,7 @@ done:
> >  	}
> >  	return r;
> >  err:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	return r;
> >  }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..cffe39a 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> >  	u64 len;
> >  };
> >  
> > +enum vhost_vq_link_state {
> > +	VHOST_VQ_LINK_SYNC = 	0,
> > +	VHOST_VQ_LINK_ASYNC = 	1,
> > +};
> > +
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >  	struct vhost_dev *dev;
> > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> >  	/* Log write descriptors */
> >  	void __user *log_base;
> >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > +	/* Differentiate async socket for 0-copy from normal */
> > +	enum vhost_vq_link_state link_state;
> > +	struct list_head notifier;
> > +	spinlock_t notify_lock;
> > +	void (*receiver)(struct vhost_virtqueue *);
> >  };
> >  
> >  struct vhost_dev {
> > -- 
> > 1.5.4.4


* [PATCH v2] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-06  5:56 UTC (permalink / raw)
  To: davem, netdev, eric.dumazet

Version 2:
- added a u16 filler to pad rps_dev_flow structure
- define RPS_NO_CPU as 0xffff
- add inet_rps_save_rxhash helper function to copy skb's rxhash into inet_sk
- add a "voidflow" which can be used when get_rps_cpu does not return a flow (avoids some conditionals)
- use raw_smp_processor_id in rps_record_sock_flow, as there is no requirement to prevent preemption
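
The first two items can be illustrated with a small user-space sketch. The structure below mirrors the rps_dev_flow layout from the diff using fixed-width types; the _sketch suffix and the flow_cpu_is_unset helper are hypothetical names added for illustration, and the size noted afterwards assumes a 4-byte unsigned int.

```c
#include <stddef.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffff	/* largest u16 value, reserved for "no CPU" */

/* User-space mirror of the flow entry: the explicit filler makes the
 * 2-byte hole before last_qtail visible instead of leaving it implicit. */
struct rps_dev_flow_sketch {
	uint16_t cpu;
	uint16_t fill;
	unsigned int last_qtail;
};

/* A u16 field cannot hold -1, so "unset" is encoded as RPS_NO_CPU. */
int flow_cpu_is_unset(const struct rps_dev_flow_sketch *flow)
{
	return flow->cpu == RPS_NO_CPU;
}
```

On common ABIs this structure is 8 bytes with last_qtail at offset 4, with or without the filler; the filler only documents the padding.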
---
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
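
The hash-to-CPU step can be sketched in user space as below. The scaling expression is the same one used in the diff (map->cpus[((u64) skb->rxhash * map->len) >> 32]); the cpu_map_sketch type with its fixed eight-slot array and the select_cpu name are simplifications assumed for this example.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's rps_map. */
struct cpu_map_sketch {
	unsigned int len;	/* number of valid entries in cpus[] */
	uint16_t cpus[8];	/* CPU ids allowed to process packets */
};

/* Scale the 32-bit hash into [0, len) without a modulo: multiply in
 * 64 bits and keep the upper 32 bits of the product. */
uint16_t select_cpu(const struct cpu_map_sketch *map, uint32_t hash)
{
	return map->cpus[((uint64_t)hash * map->len) >> 32];
}
```

With a four-entry map, hashes in the lowest quarter of the 32-bit range pick cpus[0] and hashes in the highest quarter pick cpus[3], so packets of one flow (same hash) always land on the same CPU while the map is unchanged.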

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).
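
A minimal sketch of setting the mask from a program follows. The function names are hypothetical, the mask is limited to 32 CPUs to keep the hex formatting simple, and writing the file normally requires root; treat it as an illustration of the sysfs path layout rather than a complete tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build /sys/class/net/<device>/queues/rx-<n>/rps_cpus into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int rps_cpus_path(char *buf, size_t len, const char *dev, unsigned int queue)
{
	int n = snprintf(buf, len, "/sys/class/net/%s/queues/rx-%u/rps_cpus",
			 dev, queue);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a CPU bitmask (bit i set means CPU i may process packets)
 * as hex. Returns 0 on success, -1 on any failure. */
int write_rps_mask(const char *dev, unsigned int queue, uint32_t mask)
{
	char path[256];
	FILE *f;
	int ok;

	if (rps_cpus_path(path, sizeof(path), dev, queue) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%x\n", (unsigned int)mask) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

For example, write_rps_mask("eth0", 0, 0xf) would ask for CPUs 0 to 3 to handle packets from rx-0 of a device named eth0 (the device name is only an example).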

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarchy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU
   
bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically; however, whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not to change the masks too frequently.
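
On the ordering concern: the flow table code in the diff below only moves a flow to a new CPU once the old CPU's queue head has passed the last packet enqueued for that flow. The wraparound-safe counter comparison it relies on can be sketched as follows; the function name is hypothetical, and the cast of an out-of-range unsigned value to int is implementation-defined in C (it behaves as two's complement on the compilers the kernel supports).

```c
#include <stdbool.h>

/* True when the queue head counter has reached or passed last_qtail,
 * i.e. every packet enqueued up to last_qtail has been dequeued.
 * Subtracting first and then interpreting the difference as signed
 * keeps the test correct when the unsigned counters wrap around. */
bool flow_may_switch_cpu(unsigned int input_queue_head,
			 unsigned int last_qtail)
{
	return (int)(input_queue_head - last_qtail) >= 0;
}
```

A plain input_queue_head >= last_qtail comparison would give the wrong answer once the head counter wraps past zero while last_qtail is still near the maximum value.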

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a343a21..09c8658 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,74 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+	u16 cpu;
+	u16 fill;
+	unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct work_struct free_work;
+	struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+	unsigned int mask;
+	u16 *ents;
+};
+
+#define RPS_NO_CPU 0xffff
+
+static inline void rps_set_sock_flow(struct rps_sock_flow_table *table,
+				     u32 hash, int cpu)
+{
+	if (table->ents && hash) {
+		unsigned int index = hash & table->mask;
+
+		if (table->ents[index] != cpu)
+			table->ents[index] = cpu;
+	}
+}
+
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+					u32 hash)
+{
+	/* We only give a hint, preemption can change our cpu under us */
+	rps_set_sock_flow(table, hash, raw_smp_processor_id());
+}
+
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+				       u32 hash)
+{
+	rps_set_sock_flow(table, hash, RPS_NO_CPU);
+}
+
+extern struct rps_sock_flow_table rps_sock_flow_table;
+
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 	struct rps_map *rps_map;
+	struct rps_dev_flow_table *rps_flow_table;
 	struct kobject kobj;
 	struct netdev_rx_queue *first;
 	atomic_t count;
 } ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1331,8 +1391,9 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 
 	/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83fd344..801cd63 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -21,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/jhash.h>
+#include <linux/netdevice.h>
 
 #include <net/flow.h>
 #include <net/sock.h>
@@ -101,6 +102,7 @@ struct rtable;
  * @uc_ttl - Unicast TTL
  * @inet_sport - Source port
  * @inet_id - ID counter for DF pkts
+ * @rxhash - flow hash received from netif layer
  * @tos - TOS
  * @mc_ttl - Multicasting TTL
  * @is_icsk - is this an inet_connection_sock?
@@ -124,6 +126,9 @@ struct inet_sock {
 	__u16			cmsg_flags;
 	__be16			inet_sport;
 	__u16			inet_id;
+#ifdef CONFIG_RPS
+	__u32			rxhash;
+#endif
 
 	struct ip_options	*opt;
 	__u8			tos;
@@ -219,4 +224,27 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
 	return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
 }
 
+static inline void inet_rps_record_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	rps_record_sock_flow(&rps_sock_flow_table, inet_sk(sk)->rxhash);
+#endif
+}
+
+static inline void inet_rps_reset_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	rps_reset_sock_flow(&rps_sock_flow_table, inet_sk(sk)->rxhash);
+#endif
+}
+
+static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+{
+#ifdef CONFIG_RPS
+	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
+		inet_rps_reset_flow(sk);
+		inet_sk(sk)->rxhash = rxhash;
+	}
+#endif
+}
 #endif	/* _INET_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..8f4c625 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <linux/pci.h>
+#include <linux/bootmem.h>
 
 #include "net-sysfs.h"
 
@@ -2202,22 +2203,28 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 #ifdef CONFIG_RPS
+/* One global table that all flow-based protocols share. */
+struct rps_sock_flow_table rps_sock_flow_table;
+EXPORT_SYMBOL(rps_sock_flow_table);
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
+ * rcu_read_lock must be held on entry.
  */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+		       struct rps_dev_flow **rflowp)
 {
 	struct ipv6hdr *ip6;
 	struct iphdr *ip;
 	struct netdev_rx_queue *rxqueue;
 	struct rps_map *map;
+	struct rps_dev_flow_table *flow_table;
 	int cpu = -1;
 	u8 ip_proto;
+	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
-	rcu_read_lock();
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2232,7 +2239,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 	} else
 		rxqueue = dev->_rx;
 
-	if (!rxqueue->rps_map)
+	if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
 		goto done;
 
 	if (skb->rxhash)
@@ -2284,9 +2291,47 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 		skb->rxhash = 1;
 
 got_hash:
+	flow_table = rcu_dereference(rxqueue->rps_flow_table);
+	if (flow_table && rps_sock_flow_table.ents) {
+		u16 next_cpu;
+		struct rps_dev_flow *rflow;
+
+		rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
+		tcpu = rflow->cpu;
+
+		next_cpu = rps_sock_flow_table.ents[skb->rxhash &
+		    rps_sock_flow_table.mask];
+
+		/*
+		 * If the desired CPU (where last recvmsg was done) is
+		 * different from current CPU (one in the rx-queue flow
+		 * table entry), switch if one of the following holds:
+		 *   - Current CPU is unset (equal to RPS_NO_CPU).
+		 *   - Current CPU is offline.
+		 *   - The current CPU's queue tail has advanced beyond the
+		 *     last packet that was enqueued using this table entry.
+		 *     This guarantees that all previous packets for the flow
+		 *     have been dequeued, thus preserving in order delivery.
+		 */
+		if (unlikely(tcpu != next_cpu) &&
+		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
+		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
+		      rflow->last_qtail)) >= 0)) {
+			tcpu = rflow->cpu = next_cpu;
+			if (tcpu != RPS_NO_CPU)
+				rflow->last_qtail = per_cpu(softnet_data,
+				    tcpu).input_queue_head;
+		}
+		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
+			*rflowp = rflow;
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
 	map = rcu_dereference(rxqueue->rps_map);
 	if (map) {
-		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;
@@ -2295,7 +2340,6 @@ got_hash:
 	}
 
 done:
-	rcu_read_unlock();
 	return cpu;
 }
 
@@ -2321,13 +2365,14 @@ static void trigger_softirq(void *data)
 	__napi_schedule(&queue->backlog);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_RPS */
 
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
  */
-static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
+			      unsigned int *qtail)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
@@ -2342,6 +2387,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 		if (queue->input_pkt_queue.qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
+#ifdef CONFIG_RPS
+			*qtail = queue->input_queue_head +
+			    queue->input_pkt_queue.qlen;
+#endif
 			rps_unlock(queue);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2356,11 +2405,10 @@ enqueue:
 
 				cpu_set(cpu, rcpus->mask[rcpus->select]);
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-			} else
-				__napi_schedule(&queue->backlog);
-#else
-			__napi_schedule(&queue->backlog);
+				goto enqueue;
+			}
 #endif
+			__napi_schedule(&queue->backlog);
 		}
 		goto enqueue;
 	}
@@ -2391,7 +2439,7 @@ enqueue:
 
 int netif_rx(struct sk_buff *skb)
 {
-	int cpu;
+	unsigned int qtail;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2401,14 +2449,24 @@ int netif_rx(struct sk_buff *skb)
 		net_timestamp(skb);
 
 #ifdef CONFIG_RPS
-	cpu = get_rps_cpu(skb->dev, skb);
-	if (cpu < 0)
-		cpu = smp_processor_id();
-#else
-	cpu = smp_processor_id();
+	{
+		struct rps_dev_flow voidflow, *rflow = &voidflow;
+		int cpu, err;
+
+		rcu_read_lock();
+
+		cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		if (cpu >= 0) {
+			err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+			rcu_read_unlock();
+			return err;
+		}
+
+		rcu_read_unlock();
+	}
 #endif
 
-	return enqueue_to_backlog(skb, cpu);
+	return enqueue_to_backlog(skb, smp_processor_id(), &qtail);
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2775,17 +2833,22 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	int cpu;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
+	int cpu, err;
 
-	cpu = get_rps_cpu(skb->dev, skb);
+	rcu_read_lock();
 
-	if (cpu < 0)
-		return __netif_receive_skb(skb);
-	else
-		return enqueue_to_backlog(skb, cpu);
-#else
-	return __netif_receive_skb(skb);
+	cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+	if (cpu >= 0) {
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+		rcu_read_unlock();
+		return err;
+	}
+
+	rcu_read_unlock();
 #endif
+	return __netif_receive_skb(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
@@ -2801,6 +2864,9 @@ static void flush_backlog(void *arg)
 		if (skb->dev == dev) {
 			__skb_unlink(skb, &queue->input_pkt_queue);
 			kfree_skb(skb);
+#ifdef CONFIG_RPS
+			queue->input_queue_head++;
+#endif
 		}
 	rps_unlock(queue);
 }
@@ -3124,6 +3190,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			local_irq_enable();
 			break;
 		}
+#ifdef CONFIG_RPS
+		queue->input_queue_head++;
+#endif
 		rps_unlock(queue);
 		local_irq_enable();
 
@@ -5487,8 +5556,12 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
+#ifdef CONFIG_RPS
+		oldsd->input_queue_head++;
+#endif
+	}
 
 	return NOTIFY_OK;
 }
@@ -5669,6 +5742,42 @@ static struct pernet_operations __net_initdata default_device_ops = {
 	.exit_batch = default_device_exit_batch,
 };
 
+
+#ifdef CONFIG_RPS
+static __initdata unsigned long rps_sock_flow_entries;
+
+static int __init set_rps_sock_flow_entries(char *str)
+{
+	if (str)
+		rps_sock_flow_entries = simple_strtoul(str, &str, 0);
+
+	return 0;
+}
+
+__setup("rps_flow_entries=", set_rps_sock_flow_entries);
+
+static int alloc_rps_sock_flow_entries(void)
+{
+	unsigned int i, hash_size;
+
+	if (!rps_sock_flow_entries)
+		return 0;
+
+	rps_sock_flow_table.ents =
+	    alloc_large_system_hash("RPS flow table", sizeof(u16),
+	    rps_sock_flow_entries, 0, 0, &hash_size, NULL, 0);
+	hash_size = 1 << hash_size;
+	rps_sock_flow_table.mask = hash_size - 1;
+	for (i = 0; i < hash_size; i++)
+		rps_sock_flow_table.ents[i] = RPS_NO_CPU;
+
+	printk(KERN_INFO "RPS: flow table configured with %d entries\n",
+	    hash_size);
+
+	return 0;
+}
+#endif
+
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
  *	unhooks any devices that fail to initialise (normally hardware not
@@ -5689,6 +5798,11 @@ static int __init net_dev_init(void)
 	if (dev_proc_init())
 		goto out;
 
+#ifdef CONFIG_RPS
+	if (alloc_rps_sock_flow_entries())
+		goto out;
+#endif
+
 	if (netdev_kobject_init())
 		goto out;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1e7fdd6..95863b2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -600,22 +600,105 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	return len;
 }
 
+static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+					   struct rx_queue_attribute *attr,
+					   char *buf)
+{
+	struct rps_dev_flow_table *flow_table;
+	unsigned int val = 0;
+
+	rcu_read_lock();
+	flow_table = rcu_dereference(queue->rps_flow_table);
+	if (flow_table)
+		val = flow_table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, "%u\n", val);
+}
+
+static void rps_dev_flow_table_release_work(struct work_struct *work)
+{
+	struct rps_dev_flow_table *table = container_of(work,
+	    struct rps_dev_flow_table, free_work);
+
+	vfree(table);
+}
+
+static void rps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct rps_dev_flow_table *table = container_of(rcu,
+	    struct rps_dev_flow_table, rcu);
+
+	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
+	schedule_work(&table->free_work);
+}
+
+ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+				     struct rx_queue_attribute *attr,
+				     const char *buf, size_t len)
+{
+	unsigned int count;
+	char *endp;
+	struct rps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(rps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	count = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EINVAL;
+
+	if (count) {
+		int i;
+
+		count = roundup_pow_of_two(count);
+		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = count - 1;
+		for (i = 0; i < count; i++)
+			table->flows[i].cpu = RPS_NO_CPU;
+	} else
+		table = NULL;
+
+	spin_lock(&rps_dev_flow_lock);
+	old_table = queue->rps_flow_table;
+	rcu_assign_pointer(queue->rps_flow_table, table);
+	spin_unlock(&rps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, rps_dev_flow_table_release);
+
+	return len;
+}
+
 static struct rx_queue_attribute rps_cpus_attribute =
 	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
 
+
+static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
+	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
+	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+
 static struct attribute *rx_queue_default_attrs[] = {
 	&rps_cpus_attribute.attr,
+	&rps_dev_flow_table_cnt_attribute.attr,
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
-	struct rps_map *map = queue->rps_map;
 	struct netdev_rx_queue *first = queue->first;
 
-	if (map)
-		call_rcu(&map->rcu, rps_map_release);
+	if (queue->rps_map)
+		call_rcu(&queue->rps_map->rcu, rps_map_release);
+
+	if (queue->rps_flow_table)
+		call_rcu(&queue->rps_flow_table->rcu,
+		    rps_dev_flow_table_release);
 
 	if (atomic_dec_and_test(&first->count))
 		kfree(first);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 55e1190..eb6155a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -418,6 +418,8 @@ int inet_release(struct socket *sock)
 	if (sk) {
 		long timeout;
 
+		inet_rps_reset_flow(sk);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -714,6 +716,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -722,12 +726,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-
 static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 			     size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -737,6 +742,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 	return sock_no_sendpage(sock, page, offset, size, flags);
 }
 
+int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		 size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	inet_rps_record_flow(sk);
+
+	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
+				   flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(inet_recvmsg);
 
 int inet_shutdown(struct socket *sock, int how)
 {
@@ -866,7 +887,7 @@ const struct proto_ops inet_stream_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -893,7 +914,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -923,7 +944,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4df5f9..2f40fe0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1674,6 +1674,8 @@ process:
 
 	skb->dev = NULL;
 
+	inet_rps_save_rxhash(sk, skb->rxhash);
+
 	bh_lock_sock_nested(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {

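Note on the rps_flow_cnt handling in the patch above: store_rps_dev_flow_table_cnt() rounds the requested count up to a power of two and stores count - 1 as the mask, and show_rps_dev_flow_table_cnt() reports mask + 1. A minimal userspace sketch of that sizing follows. It assumes the table is indexed by hash & mask (the lookup itself is not part of this excerpt), and the sketch_* helpers are hypothetical stand-ins, not kernel functions.

```c
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's roundup_pow_of_two().
 * Returns 0 for n == 0 or when the result would not fit in 32 bits.
 */
static unsigned int sketch_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	if (n == 0 || n > (1u << 31))
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/* Mask for a requested entry count, mirroring table->mask = count - 1. */
static unsigned int sketch_flow_mask(unsigned int requested)
{
	return sketch_roundup_pow_of_two(requested) - 1;
}

/* Assumed slot selection: the low bits of the receive hash pick the entry. */
static unsigned int sketch_flow_index(uint32_t rxhash, unsigned int mask)
{
	return rxhash & mask;
}
```

With this sizing, writing 1000 to rps_flow_cnt would give a 1024-entry table, and reading the attribute back would show 1024 rather than 1000.
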
^ permalink raw reply related

* RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-06  6:06 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, mst@redhat.com,
	jdike@c2.user-mode-linux.org, davem@davemloft.net
In-Reply-To: <1270252268.13897.14.camel@w-sridhar.beaverton.ibm.com>

Sridhar,

>> The idea is simple, just to pin the guest VM user space and then
>> let host NIC driver has the chance to directly DMA to it. 
>> The patches are based on vhost-net backend driver. We add a device
>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. KVM guest who use the
>>vhost-net backend may bind any ethX interface in the host side to
>> get copyless data transfer thru guest virtio-net frontend.

>What is the advantage of this approach compared to PCI-passthrough
>of the host NIC to the guest?

PCI passthrough needs hardware support: an IOMMU engine is required to
translate guest physical addresses to host physical addresses.
Also, a PCI passthrough device currently cannot be live-migrated.

Zero-copy is a pure software solution; it doesn't need special hardware
support. In theory, it can support live migration.
 
>Does this require pinning of the entire guest memory? Or only the
>send/receive buffers?

We only need to pin the send/receive buffers.

Thanks
Xiaohui

>Thanks
>Sridhar
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to NIC driver for rx when
> a page constructor API is invoked. Means NICs can allocate user buffers
> from a page constructor. We add a hook in netif_receive_skb() function
> to intercept the incoming packets, and notify the zero-copy device.
> 
> For write, the zero-copy device may allocate a new host skb, put the
> payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> Here, we have ever considered 2 ways to utilize the page constructor
> API to dispense the user buffers.
> 
> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> 	structure of sk_buff, and the data pointer is pointing to a 
> 	user buffer which is coming from a page constructor API.
> 	Then the shinfo of the skb is also from guest.
> 	When packet is received from hardware, the skb->data is filled
> 	directly by h/w. What we have done is in this way.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> 		the same method with the host NIC drivers, say the size
> 		of netdev_alloc_skb() and the same reserved space in the
> 		head of skb. Many NIC drivers are the same with guest and
> 		ok for this. But some of the latest NIC drivers reserve special
> 		room in skb head. To deal with it, we suggest to provide
> 		a method in guest virtio-net driver to ask for parameter
> 		we interest from the NIC driver when we know which device 
> 		we have bind to do zero-copy. Then we ask guest to do so.
> 		Is that reasonable?
> 
> Two:	Modify driver to get user buffer allocated from a page constructor
> 	API(to substitute alloc_page()), the user buffer are used as payload
> 	buffers and filled by h/w directly when packet is received. Driver
> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> 	the head buffer side, let host allocates skb, and h/w fills it. 
> 	After that, the data filled in host skb header will be copied into
> 	guest header buffer which is submitted together with the payload buffer.
> 
> 	Pros:	We could less care the way how guest or host allocates their
> 		buffers.
> 	Cons:	We still need a bit copy here for the skb header.
> 
> We are not sure which way is the better here. This is the first thing we want
> to get comments from the community. We wish the modification to the network
> part will be generic which not used by vhost-net backend only, but a user
> application may use it as well when the zero-copy device may provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notification to
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But in a simple
> test with netperf, we found that both bandwidth and CPU % went up,
> but bandwidth increased by a much larger ratio than CPU %.
> 
> What we have not done yet:
> 	packet split support
> 	To support GRO
> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchronous mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not addressed yet in this time:
> 	the async write logging is not satisfied by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bandwidth goes from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Xin, Xiaohui @ 2010-04-06  6:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, mst@redhat.com,
	jdike@c2.user-mode-linux.org, davem@davemloft.net
In-Reply-To: <20100402085556.75a8ff7c@nehalam>


>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> 
>> The patch let host NIC driver to receive user space skb,
>> then the driver has chance to directly DMA to guest user
>> space buffers thru single ethX interface.
>> We want it to be more generic as a zero copy framework.
>> 
>> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
>> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
>> Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
>> ---
>> 
>> We consider 2 ways to utilize the user buffers, but are not sure which one
>> is better. Please give any comments.
>> 
>> One:    Modify __alloc_skb() function a bit, it can only allocate a
>>         structure of sk_buff, and the data pointer is pointing to a
>>         user buffer which is coming from a page constructor API.
>>         Then the shinfo of the skb is also from guest.
>>         When packet is received from hardware, the skb->data is filled
>>         directly by h/w. What we have done is in this way.
>> 
>>         Pros:   We can avoid any copy here.
>>         Cons:   Guest virtio-net driver needs to allocate skb as almost
>>                 the same method with the host NIC drivers, say the size
>>                 of netdev_alloc_skb() and the same reserved space in the
>>                 head of skb. Many NIC drivers are the same with guest and
>>                 ok for this. But some of the latest NIC drivers reserve special
>>                 room in skb head. To deal with it, we suggest to provide
>>                 a method in guest virtio-net driver to ask for parameter
>>                 we interest from the NIC driver when we know which device
>>                 we have bind to do zero-copy. Then we ask guest to do so.
>>                 Is that reasonable?
>> 
>> Two:    Modify driver to get user buffer allocated from a page constructor
>>         API(to substitute alloc_page()), the user buffer are used as payload
>>         buffers and filled by h/w directly when packet is received. Driver
>>         should associate the pages with skb (skb_shinfo(skb)->frags). For
>>         the head buffer side, let host allocates skb, and h/w fills it.
>>         After that, the data filled in host skb header will be copied into
>>         guest header buffer which is submitted together with the payload buffer.
>> 
>>         Pros:   We could less care the way how guest or host allocates their
>>                 buffers.
>>         Cons:   We still need a bit copy here for the skb header.
>> 
>> We are not sure which way is the better here. This is the first thing we want
>> to get comments from the community. We wish the modification to the network
>> part will be generic which not used by vhost-net backend only, but a user
>> application may use it as well when the zero-copy device may provides async
>> read/write operations later.
>> 
>> 
>> Thanks
>> Xiaohui

>How do you deal with the DoS problem of hostile user space app posting huge
>number of receives and never getting anything. 

That's a problem we are trying to deal with. It's critical in the long term.
Currently, we are trying to limit the number of pages it can pin, but we are not
sure how large a limit is reasonable. For now, the submitted buffers come from the
guest virtio-net driver, so it's safe to some extent.
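
As a rough illustration of one way such a limit could be derived, here is a minimal userspace sketch. It assumes the per-process locked-memory limit (RLIMIT_MEMLOCK) is used as the budget, as suggested elsewhere in this thread. The sketch_* helper names are hypothetical, and this is not what the patch currently does.

```c
#include <limits.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of whole pages that fit under a byte limit (0 if page_size is 0). */
static unsigned long sketch_pages_under_limit(unsigned long limit_bytes,
					      unsigned long page_size)
{
	return page_size ? limit_bytes / page_size : 0;
}

/*
 * Page budget implied by the soft RLIMIT_MEMLOCK of the calling process.
 * Returns 0 on error and ULONG_MAX when the limit is unlimited.
 */
static unsigned long sketch_memlock_page_budget(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return 0;
	if (rl.rlim_cur == RLIM_INFINITY)
		return ULONG_MAX;
	return sketch_pages_under_limit((unsigned long)rl.rlim_cur,
					(unsigned long)page_size);
}
```

With 4 KiB pages, a 0x10000-byte limit works out to only 16 pages, which is why a plain rlimit-based bound looks too small when the guest submits many requests at once.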

Thanks
Xiaohui

^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: Jiri Pirko @ 2010-04-06  7:09 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: davem, netdev
In-Reply-To: <201004050457.o354v6ec008492@94.43.138.210.xn.2iij.net>

Mon, Apr 05, 2010 at 05:59:30AM CEST, yoshfuji@linux-ipv6.org wrote:
>Fix kernel panic by NULL pointer dereference in the context of
>ieee80211_ops->prepare_multicast().
>
>This bug was introduced by commit 22bedad3c.. ("net: convert
>multicast list to list_head").
>
>Call __hw_addr_init() in ieee80211_alloc_hw() to initialize
>list_head of private device multicast list, like we do in
>bond_init().
>
>Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
>---
> net/mac80211/main.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
>diff --git a/net/mac80211/main.c b/net/mac80211/main.c
>index 84ad249..0b82cd2 100644
>--- a/net/mac80211/main.c
>+++ b/net/mac80211/main.c
>@@ -388,6 +388,9 @@ struct ieee80211_hw *ieee80211_alloc_hw(size_t priv_data_len,
> 	local->uapsd_max_sp_len = IEEE80211_DEFAULT_MAX_SP_LEN;
> 
> 	INIT_LIST_HEAD(&local->interfaces);
>+
>+	__hw_addr_init(&local->mc_list);
>+
> 	mutex_init(&local->iflist_mtx);
> 	mutex_init(&local->scan_mtx);
> 

Whoups, missed this bit. Thanks a lot.

Rewieved-by: Jiri Pirko <jpirko@redhat.com>

>-- 
>1.5.6.5
>

^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: David Miller @ 2010-04-06  7:12 UTC (permalink / raw)
  To: jpirko; +Cc: yoshfuji, netdev
In-Reply-To: <20100406070922.GE2869@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 6 Apr 2010 09:09:23 +0200

> Whoups, missed this bit. Thanks a lot.
> 
> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
> 

Applied, and patchwork doesn't know what "Rewieved-by" is so
I fixed the typo and added it to the changelog :-)

^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: Jiri Pirko @ 2010-04-06  7:17 UTC (permalink / raw)
  To: David Miller; +Cc: yoshfuji, netdev
In-Reply-To: <20100406.001259.15002237.davem@davemloft.net>

Tue, Apr 06, 2010 at 09:12:59AM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jpirko@redhat.com>
>Date: Tue, 6 Apr 2010 09:09:23 +0200
>
>> Whoups, missed this bit. Thanks a lot.
>> 
>> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
>> 
>
>Applied, and patchwork doesn't know what "Rewieved-by" is so
>I fixed the typo and added it to the changelog :-)

Oh my :) Looks like I'm still sleeping...

^ permalink raw reply

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-06  7:39 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20100404205028H.fujita.tomonori@lab.ntt.co.jp>

Thanks, Fujita.

The patch looks fine. I'll run some regression tests on the patched driver to check that things still work and if it's ok we will ack it shortly.

vlad



> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Sunday, April 04, 2010 2:51 PM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; davem@davemloft.net;
> netdev@vger.kernel.org; Eilon Greenstein
> Subject: RE: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> On Sun, 4 Apr 2010 03:24:46 -0700
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
>
> > Ok. Got it now. Thanks, Fujita. I think we should patch the bnx2x to
> > use the generic model (not just the mapping macros).
>
> I've attached the patch.
>
> There is one functional change: pci_alloc_consistent ->
> dma_alloc_coherent
>
> pci_alloc_consistent is a wrapper function of dma_alloc_coherent with
> GFP_ATOMIC flag (see include/asm-generic/pci-dma-compat.h).
>
> pci_alloc_consistent uses GFP_ATOMIC flag because of the compatibility
> for some broken drivers that use the function in interrupt. But
> GFP_ATOMIC should be avoided if possible. Looks like bnx2x doesn't use
> pci_alloc_consistent in interrupt so I replaced them with
> dma_alloc_coherent with GFP_KERNEL.
>
> Please check if that change works for bnx2x.
>
> > One last question: since which kernel version the generic DMA layer
> > may be used instead of PCI DMA layer?
>
> After 2.6.34-rc2.
>
> Well, on the majority of architectures, you have been able to use the
> generic DMA API over the PCI DMA API. The PCI DMA API is just the
> wrapper of the generic DMA API. But on some architectures, the two APIs
> worked slightly differently. Since 2.6.34-rc2, the two APIs work in exactly
> the same way on all the architectures.
>
>
> =
> From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Subject: [PATCH] bnx2x: use the DMA API instead of the pci equivalents
>
> The DMA API is preferred.
>
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
>  drivers/net/bnx2x.h      |    4 +-
>  drivers/net/bnx2x_main.c |  110 +++++++++++++++++++++++----------------------
>  2 files changed, 58 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do {
>                        \
>
>  struct sw_rx_bd {
>       struct sk_buff  *skb;
> -     DECLARE_PCI_UNMAP_ADDR(mapping)
> +     DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>
>  struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>
>  struct sw_rx_page {
>       struct page     *page;
> -     DECLARE_PCI_UNMAP_ADDR(mapping)
> +     DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>
>  union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index fa9275c..63a17d6 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -842,7 +842,7 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>       /* unmap first bd */
>       DP(BNX2X_MSG_OFF, "free bd_idx %d\n", bd_idx);
>       tx_start_bd = &fp->tx_desc_ring[bd_idx].start_bd;
> -     pci_unmap_single(bp->pdev, BD_UNMAP_ADDR(tx_start_bd),
> +     dma_unmap_single(&bp->pdev->dev, BD_UNMAP_ADDR(tx_start_bd),
>                        BD_UNMAP_LEN(tx_start_bd), PCI_DMA_TODEVICE);
>
>       nbd = le16_to_cpu(tx_start_bd->nbd) - 1;
> @@ -872,8 +872,8 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>
>               DP(BNX2X_MSG_OFF, "free frag bd_idx %d\n", bd_idx);
>               tx_data_bd = &fp->tx_desc_ring[bd_idx].reg_bd;
> -             pci_unmap_page(bp->pdev, BD_UNMAP_ADDR(tx_data_bd),
> -                            BD_UNMAP_LEN(tx_data_bd), PCI_DMA_TODEVICE);
> +             dma_unmap_page(&bp->pdev->dev, BD_UNMAP_ADDR(tx_data_bd),
> +                            BD_UNMAP_LEN(tx_data_bd), DMA_TO_DEVICE);
>               if (--nbd)
>                       bd_idx = TX_BD(NEXT_TX_IDX(bd_idx));
>       }
> @@ -1086,7 +1086,7 @@ static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
>       if (!page)
>               return;
>
> -     pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> +     dma_unmap_page(&bp->pdev->dev, dma_unmap_addr(sw_buf, mapping),
>                      SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
>       __free_pages(page, PAGES_PER_SGE_SHIFT);
>
> @@ -1115,15 +1115,15 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
>       if (unlikely(page == NULL))
>               return -ENOMEM;
>
> -     mapping = pci_map_page(bp->pdev, page, 0, SGE_PAGE_SIZE*PAGES_PER_SGE,
> -                            PCI_DMA_FROMDEVICE);
> +     mapping = dma_map_page(&bp->pdev->dev, page, 0,
> +                            SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
>       if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>               __free_pages(page, PAGES_PER_SGE_SHIFT);
>               return -ENOMEM;
>       }
>
>       sw_buf->page = page;
> -     pci_unmap_addr_set(sw_buf, mapping, mapping);
> +     dma_unmap_addr_set(sw_buf, mapping, mapping);
>
>       sge->addr_hi = cpu_to_le32(U64_HI(mapping));
>       sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1143,15 +1143,15 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
>       if (unlikely(skb == NULL))
>               return -ENOMEM;
>
> -     mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_size,
> -                              PCI_DMA_FROMDEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data, bp->rx_buf_size,
> +                              DMA_FROM_DEVICE);
>       if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>               dev_kfree_skb(skb);
>               return -ENOMEM;
>       }
>
>       rx_buf->skb = skb;
> -     pci_unmap_addr_set(rx_buf, mapping, mapping);
> +     dma_unmap_addr_set(rx_buf, mapping, mapping);
>
>       rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1173,13 +1173,13 @@ static void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
>       struct eth_rx_bd *cons_bd = &fp->rx_desc_ring[cons];
>       struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>
> -     pci_dma_sync_single_for_device(bp->pdev,
> -                                    pci_unmap_addr(cons_rx_buf, mapping),
> -                                    RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
> +     dma_sync_single_for_device(&bp->pdev->dev,
> +                                dma_unmap_addr(cons_rx_buf, mapping),
> +                                RX_COPY_THRESH, DMA_FROM_DEVICE);
>
>       prod_rx_buf->skb = cons_rx_buf->skb;
> -     pci_unmap_addr_set(prod_rx_buf, mapping,
> -                        pci_unmap_addr(cons_rx_buf, mapping));
> +     dma_unmap_addr_set(prod_rx_buf, mapping,
> +                        dma_unmap_addr(cons_rx_buf, mapping));
>       *prod_bd = *cons_bd;
>  }
>
> @@ -1283,9 +1283,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>
>       /* move empty skb from pool to prod and map it */
>       prod_rx_buf->skb = fp->tpa_pool[queue].skb;
> -     mapping = pci_map_single(bp->pdev, fp->tpa_pool[queue].skb->data,
> -                              bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> -     pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> +     mapping = dma_map_single(&bp->pdev->dev, fp->tpa_pool[queue].skb->data,
> +                              bp->rx_buf_size, DMA_FROM_DEVICE);
> +     dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>
>       /* move partial skb from cons to pool (don't unmap yet) */
>       fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,8 +1361,9 @@ static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>               }
>
>               /* Unmap the page as we r going to pass it to the stack */
> -             pci_unmap_page(bp->pdev, pci_unmap_addr(&old_rx_pg, mapping),
> -                           SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> +             dma_unmap_page(&bp->pdev->dev,
> +                            dma_unmap_addr(&old_rx_pg, mapping),
> +                            SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
>
>               /* Add one frag and update the appropriate fields in the skb */
>               skb_fill_page_desc(skb, j, old_rx_pg.page, 0, frag_len);
> @@ -1389,8 +1390,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>       /* Unmap skb in the pool anyway, as we are going to change
>          pool entry status to BNX2X_TPA_STOP even if new skb allocation
>          fails. */
> -     pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> -                      bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> +     dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
> +                      bp->rx_buf_size, DMA_FROM_DEVICE);
>
>       if (likely(new_skb)) {
>               /* fix ip xsum and give it to the stack */
> @@ -1620,10 +1621,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                               }
>                       }
>
> -                     pci_dma_sync_single_for_device(bp->pdev,
> -                                     pci_unmap_addr(rx_buf, mapping),
> -                                                    pad + RX_COPY_THRESH,
> -                                                    PCI_DMA_FROMDEVICE);
> +                     dma_sync_single_for_device(&bp->pdev->dev,
> +                                     dma_unmap_addr(rx_buf, mapping),
> +                                                pad + RX_COPY_THRESH,
> +                                                DMA_FROM_DEVICE);
>                       prefetch(skb);
>                       prefetch(((char *)(skb)) + 128);
>
> @@ -1665,10 +1666,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>
>                       } else
>                       if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> -                             pci_unmap_single(bp->pdev,
> -                                     pci_unmap_addr(rx_buf, mapping),
> +                             dma_unmap_single(&bp->pdev->dev,
> +                                     dma_unmap_addr(rx_buf, mapping),
>                                                bp->rx_buf_size,
> -                                              PCI_DMA_FROMDEVICE);
> +                                              DMA_FROM_DEVICE);
>                               skb_reserve(skb, pad);
>                               skb_put(skb, len);
>
> @@ -4940,9 +4941,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>               }
>
>               if (fp->tpa_state[i] == BNX2X_TPA_START)
> -                     pci_unmap_single(bp->pdev,
> -                                      pci_unmap_addr(rx_buf, mapping),
> -                                      bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> +                     dma_unmap_single(&bp->pdev->dev,
> +                                      dma_unmap_addr(rx_buf, mapping),
> +                                      bp->rx_buf_size, DMA_FROM_DEVICE);
>
>               dev_kfree_skb(skb);
>               rx_buf->skb = NULL;
> @@ -4978,7 +4979,7 @@ static void bnx2x_init_rx_rings(struct bnx2x *bp)
>                                       fp->disable_tpa = 1;
>                                       break;
>                               }
> -                             pci_unmap_addr_set((struct sw_rx_bd *)
> +                             dma_unmap_addr_set((struct sw_rx_bd *)
>                                                  &bp->fp->tpa_pool[i],
>                                                  mapping, 0);
>                               fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -5658,8 +5659,8 @@ static void bnx2x_nic_init(struct bnx2x *bp, u32 load_code)
>
>  static int bnx2x_gunzip_init(struct bnx2x *bp)
>  {
> -     bp->gunzip_buf = pci_alloc_consistent(bp->pdev, FW_BUF_SIZE,
> -                                           &bp->gunzip_mapping);
> +     bp->gunzip_buf = dma_alloc_coherent(&bp->pdev->dev, FW_BUF_SIZE,
> +                                         &bp->gunzip_mapping, GFP_KERNEL);
>       if (bp->gunzip_buf  == NULL)
>               goto gunzip_nomem1;
>
> @@ -5679,8 +5680,8 @@ gunzip_nomem3:
>       bp->strm = NULL;
>
>  gunzip_nomem2:
> -     pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> -                         bp->gunzip_mapping);
> +     dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> +                       bp->gunzip_mapping);
>       bp->gunzip_buf = NULL;
>
>  gunzip_nomem1:
> @@ -5696,8 +5697,8 @@ static void bnx2x_gunzip_end(struct bnx2x *bp)
>       bp->strm = NULL;
>
>       if (bp->gunzip_buf) {
> -             pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> -                                 bp->gunzip_mapping);
> +             dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> +                               bp->gunzip_mapping);
>               bp->gunzip_buf = NULL;
>       }
>  }
> @@ -6692,7 +6693,7 @@ static void bnx2x_free_mem(struct bnx2x *bp)
>  #define BNX2X_PCI_FREE(x, y, size) \
>       do { \
>               if (x) { \
> -                     pci_free_consistent(bp->pdev, size, x, y); \
> +                     dma_free_coherent(&bp->pdev->dev, size, x, y); \
>                       x = NULL; \
>                       y = 0; \
>               } \
> @@ -6773,7 +6774,7 @@ static int bnx2x_alloc_mem(struct bnx2x *bp)
>
>  #define BNX2X_PCI_ALLOC(x, y, size) \
>       do { \
> -             x = pci_alloc_consistent(bp->pdev, size, y); \
> +             x = dma_alloc_coherent(&bp->pdev->dev, size, y, GFP_KERNEL); \
>               if (x == NULL) \
>                       goto alloc_mem_err; \
>               memset(x, 0, size); \
> @@ -6906,9 +6907,9 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
>                       if (skb == NULL)
>                               continue;
>
> -                     pci_unmap_single(bp->pdev,
> -                                      pci_unmap_addr(rx_buf, mapping),
> -                                      bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> +                     dma_unmap_single(&bp->pdev->dev,
> +                                      dma_unmap_addr(rx_buf, mapping),
> +                                      bp->rx_buf_size, DMA_FROM_DEVICE);
>
>                       rx_buf->skb = NULL;
>                       dev_kfree_skb(skb);
> @@ -10269,8 +10270,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode, u8 link_up)
>
>       bd_prod = TX_BD(fp_tx->tx_bd_prod);
>       tx_start_bd = &fp_tx->tx_desc_ring[bd_prod].start_bd;
> -     mapping = pci_map_single(bp->pdev, skb->data,
> -                              skb_headlen(skb), PCI_DMA_TODEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data,
> +                              skb_headlen(skb), DMA_TO_DEVICE);
>       tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
>       tx_start_bd->nbd = cpu_to_le16(2); /* start + pbd */
> @@ -11316,8 +11317,8 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
>               }
>       }
>
> -     mapping = pci_map_single(bp->pdev, skb->data,
> -                              skb_headlen(skb), PCI_DMA_TODEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data,
> +                              skb_headlen(skb), DMA_TO_DEVICE);
>
>       tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11374,8 +11375,9 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
>               if (total_pkt_bd == NULL)
>                       total_pkt_bd = &fp->tx_desc_ring[bd_prod].reg_bd;
>
> -             mapping = pci_map_page(bp->pdev, frag->page, frag->page_offset,
> -                                    frag->size, PCI_DMA_TODEVICE);
> +             mapping = dma_map_page(&bp->pdev->dev, frag->page,
> +                                    frag->page_offset,
> +                                    frag->size, DMA_TO_DEVICE);
>
>               tx_data_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>               tx_data_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11832,15 +11834,15 @@ static int __devinit bnx2x_init_dev(struct pci_dev *pdev,
>               goto err_out_release;
>       }
>
> -     if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
> +     if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) == 0) {
>               bp->flags |= USING_DAC_FLAG;
> -             if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)) != 0) {
> -                     pr_err("pci_set_consistent_dma_mask failed, aborting\n");
> +             if (dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64)) != 0) {
> +                     pr_err("dma_set_coherent_mask failed, aborting\n");
>                       rc = -EIO;
>                       goto err_out_release;
>               }
>
> -     } else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
> +     } else if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) != 0) {
>               pr_err("System does not support DMA, aborting\n");
>               rc = -EIO;
>               goto err_out_release;
> --
> 1.7.0


^ permalink raw reply

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-06  7:49 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@c2.user-mode-linux.org, yzhao81@gmail.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BB9@shzsmsx502.ccr.corp.intel.com>

On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:
> Michael,
> >> 
> >>For the DOS issue, I'm not sure how much the limit get_user_pages()
> >> can pin is reasonable, should we compute the bandwidth to make it?
> 
> >There's a ulimit for locked memory. Can we use this, decreasing
> >the value for rlimit array? We can do this when backend is
> >enabled and re-increment when backend is disabled.
> 
> I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
> the initial value for it is 0x10000, after right shift PAGE_SHIFT,
> it's only 16 pages we can lock then, it seems too small, since the 
> guest virtio-net driver may submit a lot of requests at one time.
> 
> 
> Thanks
> Xiaohui

Yes, that's the default, but the system administrator can always increase
this value with ulimit if necessary.

-- 
MST
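
The page arithmetic in the exchange above can be sketched in userspace C.
This is only a minimal sketch under stated assumptions, not code from the
patch: memlock_pages(), current_memlock_pages() and raise_memlock_soft()
are hypothetical helper names, and the page size is passed in explicitly
instead of using the kernel's PAGE_SHIFT. With a soft limit of 0x10000
bytes and 4 KiB pages the first helper yields the 16 pages mentioned in
the quoted mail; the last one does roughly what "ulimit -l" does for the
calling process, via the standard getrlimit()/setrlimit() calls.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/* Whole pages covered by a byte limit (same result as limit >> PAGE_SHIFT). */
unsigned long memlock_pages(unsigned long limit_bytes, unsigned long page_size)
{
	assert(page_size != 0);
	return limit_bytes / page_size;
}

/* Pages lockable under the current soft limit.
 * Returns -1 on error or when the limit is unlimited. */
long current_memlock_pages(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;
	return (long)memlock_pages((unsigned long)rl.rlim_cur,
				   (unsigned long)page_size);
}

/* Raise the soft RLIMIT_MEMLOCK towards want_bytes. The request is capped
 * at the hard limit, so the result may be lower than asked for.
 * Returns 0 on success, -1 on failure. */
int raise_memlock_soft(rlim_t want_bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= want_bytes)
		return 0;	/* already large enough */
	if (rl.rlim_max != RLIM_INFINITY && want_bytes > rl.rlim_max)
		want_bytes = rl.rlim_max;
	rl.rlim_cur = want_bytes;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}
```

A process that wants more than the default would call something like
raise_memlock_soft(1UL << 20) before enabling the backend; going beyond the
hard limit still needs the administrator (or CAP_SYS_RESOURCE).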

^ permalink raw reply

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Michael S. Tsirkin @ 2010-04-06  7:51 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BC1@shzsmsx502.ccr.corp.intel.com>

On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote:
> Michael,
> > >>> For the write logging, do you have a function in hand that we can
> > >>> recompute the log? If that, I think I can use it to recompute the
> > >>>log info when the logging is suddenly enabled.
> > >>> For the outstanding requests, do you mean all the user buffers have
> > >>>submitted before the logging ioctl changed? That may be a lot, and
> > >> >some of them are still in NIC ring descriptors. Waiting them to be
> > >>>finished may be need some time. I think when logging ioctl changed,
> > >> >then the logging is changed just after that is also reasonable.
>  
> > >>The key point is that after logging ioctl returns, any
> > >>subsequent change to memory must be logged. It does not
> > >>matter when was the request submitted, otherwise we will
> > >>get memory corruption on migration.
> 
> > >The change to memory happens when vhost_add_used_and_signal(), right?
> > >So after ioctl returns, just recompute the log info to the events in the async queue,
> > >is ok. Since the ioctl and write log operations are all protected by vq->mutex.
>  
> >> Thanks
> >> Xiaohui
> 
> >Yes, I think this will work.
> 
> Thanks. Do you have a function at hand that I can use to recompute the
> log info? I vaguely remember that you mentioned it some time ago.

Doesn't just rerunning vhost_get_vq_desc work?

> > > Thanks
> > > Xiaohui
> > > 
> > >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > >  drivers/vhost/vhost.h |   10 +++
> > >  2 files changed, 192 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 22d5fef..2aafd90 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -17,11 +17,13 @@
> > >  #include <linux/workqueue.h>
> > >  #include <linux/rcupdate.h>
> > >  #include <linux/file.h>
> > > +#include <linux/aio.h>
> > >  
> > >  #include <linux/net.h>
> > >  #include <linux/if_packet.h>
> > >  #include <linux/if_arp.h>
> > >  #include <linux/if_tun.h>
> > > +#include <linux/mpassthru.h>
> > >  
> > >  #include <net/sock.h>
> > >  
> > > @@ -47,6 +49,7 @@ struct vhost_net {
> > >  	struct vhost_dev dev;
> > >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > +	struct kmem_cache       *cache;
> > >  	/* Tells us whether we are polling a socket for TX.
> > >  	 * We only do this when socket buffer fills up.
> > >  	 * Protected by tx vq lock. */
> > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > >  }
> > >  
> > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > > +	if (!list_empty(&vq->notifier)) {
> > > +		iocb = list_first_entry(&vq->notifier,
> > > +				struct kiocb, ki_list);
> > > +		list_del(&iocb->ki_list);
> > > +	}
> > > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > +	return iocb;
> > > +}
> > > +
> > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	struct vhost_log *vq_log = NULL;
> > > +	int rx_total_len = 0;
> > > +	int log, size;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	if (vq->receiver)
> > > +		vq->receiver(vq);
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(
> > > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, iocb->ki_nbytes);
> > > +		log = (int)iocb->ki_user_data;
> > > +		size = iocb->ki_nbytes;
> > > +		rx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +		kmem_cache_free(net->cache, iocb);
> > > +
> > > +		if (unlikely(vq_log))
> > > +			vhost_log_write(vq, vq_log, log, size);
> > > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	int tx_total_len = 0;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, 0);
> > > +		tx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +
> > > +		kmem_cache_free(net->cache, iocb);
> > > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  /* Expects to be always run from workqueue - which acts as
> > >   * read-size critical section for our kind of RCU. */
> > >  static void handle_tx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, s;
> > >  	struct msghdr msg = {
> > >  		.msg_name = NULL,
> > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		tx_poll_stop(net);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > >  		/* Skip header. TODO: support TSO. */
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > >  		msg.msg_iovlen = out;
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->ki_pos = head;
> > > +			iocb->private = (void *)vq;
> > > +		}
> > > +
> > >  		len = iov_length(vq->iov, out);
> > >  		/* Sanity check */
> > >  		if (!len) {
> > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > >  			break;
> > >  		}
> > >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > >  		if (unlikely(err < 0)) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			tx_poll_start(net, sock);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		if (err != len)
> > >  			pr_err("Truncated TX packet: "
> > >  			       " len %d != %zd\n", err, len);
> > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > >  static void handle_rx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, log, s;
> > >  	struct vhost_log *vq_log;
> > >  	struct msghdr msg = {
> > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > >  	int err;
> > >  	size_t hdr_size;
> > >  	struct socket *sock = rcu_dereference(vq->private_data);
> > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> > >  		return;
> > >  
> > >  	use_mm(net->dev.mm);
> > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > >  	vhost_disable_notify(vq);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > +	/* In async cases, for write logging, the simple way is to get
> > > +	 * the log info always, and really logging is decided later.
> > > +	 * Thus, when logging enabled, we can get log, and when logging
> > > +	 * disabled, we can get log disabled accordingly.
> > > +	 */
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > >  		vq->log : NULL;
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > >  		msg.msg_iovlen = in;
> > >  		len = iov_length(vq->iov, in);
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->private = vq;
> > > +			iocb->ki_pos = head;
> > > +			iocb->ki_user_data = log;
> > > +		}
> > >  		/* Sanity check */
> > >  		if (!len) {
> > >  			vq_err(vq, "Unexpected header len for RX: "
> > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > >  			       iov_length(vq->hdr, s), hdr_size);
> > >  			break;
> > >  		}
> > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > > +
> > > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> > >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> > >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> > >  		if (err < 0) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		/* TODO: Should check and handle checksum. */
> > >  		if (err > len) {
> > >  			pr_err("Discarded truncated rx packet: "
> > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > >  
> > > +
> > >  static void handle_tx_kick(struct work_struct *work)
> > >  {
> > >  	struct vhost_virtqueue *vq;
> > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > +	n->cache = NULL;
> > >  	return 0;
> > >  }
> > >  
> > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > >  }
> > >  
> > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > +{
> > > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > > +	if (n->cache) {
> > > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > > +			kmem_cache_free(n->cache, iocb);
> > > +		kmem_cache_destroy(n->cache);
> > > +	}
> > > +}
> > > +
> > >  static int vhost_net_release(struct inode *inode, struct file *f)
> > >  {
> > >  	struct vhost_net *n = f->private_data;
> > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > >  	/* We do an extra flush before freeing memory,
> > >  	 * since jobs can re-queue themselves. */
> > >  	vhost_net_flush(n);
> > > +	vhost_notifier_cleanup(n);
> > >  	kfree(n);
> > >  	return 0;
> > >  }
> > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > >  	return sock;
> > >  }
> > >  
> > > -static struct socket *get_socket(int fd)
> > > +static struct socket *get_mp_socket(int fd)
> > > +{
> > > +	struct file *file = fget(fd);
> > > +	struct socket *sock;
> > > +	if (!file)
> > > +		return ERR_PTR(-EBADF);
> > > +	sock = mp_get_socket(file);
> > > +	if (IS_ERR(sock))
> > > +		fput(file);
> > > +	return sock;
> > > +}
> > > +
> > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > >  {
> > >  	struct socket *sock;
> > >  	if (fd == -1)
> > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > >  	sock = get_tun_socket(fd);
> > >  	if (!IS_ERR(sock))
> > >  		return sock;
> > > +	sock = get_mp_socket(fd);
> > > +	if (!IS_ERR(sock)) {
> > > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > +		return sock;
> > > +	}
> > >  	return ERR_PTR(-ENOTSOCK);
> > >  }
> > >  
> > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > +{
> > > +	struct vhost_virtqueue *vq = n->vqs + index;
> > > +
> > > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +		vq->receiver = NULL;
> > > +		INIT_LIST_HEAD(&vq->notifier);
> > > +		spin_lock_init(&vq->notify_lock);
> > > +		if (!n->cache) {
> > > +			n->cache = kmem_cache_create("vhost_kiocb",
> > > +					sizeof(struct kiocb), 0,
> > > +					SLAB_HWCACHE_ALIGN, NULL);
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  {
> > >  	struct socket *sock, *oldsock;
> > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	}
> > >  	vq = n->vqs + index;
> > >  	mutex_lock(&vq->mutex);
> > > -	sock = get_socket(fd);
> > > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > > +	sock = get_socket(vq, fd);
> > >  	if (IS_ERR(sock)) {
> > >  		r = PTR_ERR(sock);
> > >  		goto err;
> > >  	}
> > >  
> > > +	vhost_init_link_state(n, index);
> > > +
> > >  	/* start polling new socket */
> > >  	oldsock = vq->private_data;
> > >  	if (sock == oldsock)
> > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	vhost_net_disable_vq(n, vq);
> > >  	rcu_assign_pointer(vq->private_data, sock);
> > >  	vhost_net_enable_vq(n, vq);
> > > -	mutex_unlock(&vq->mutex);
> > >  done:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	if (oldsock) {
> > >  		vhost_net_flush_vq(n, index);
> > > @@ -516,6 +690,7 @@ done:
> > >  	}
> > >  	return r;
> > >  err:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	return r;
> > >  }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index d1f0453..cffe39a 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -43,6 +43,11 @@ struct vhost_log {
> > >  	u64 len;
> > >  };
> > >  
> > > +enum vhost_vq_link_state {
> > > +	VHOST_VQ_LINK_SYNC = 	0,
> > > +	VHOST_VQ_LINK_ASYNC = 	1,
> > > +};
> > > +
> > >  /* The virtqueue structure describes a queue attached to a device. */
> > >  struct vhost_virtqueue {
> > >  	struct vhost_dev *dev;
> > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > >  	/* Log write descriptors */
> > >  	void __user *log_base;
> > >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > > +	/*Differiate async socket for 0-copy from normal*/
> > > +	enum vhost_vq_link_state link_state;
> > > +	struct list_head notifier;
> > > +	spinlock_t notify_lock;
> > > +	void (*receiver)(struct vhost_virtqueue *);
> > >  };
> > >  
> > >  struct vhost_dev {
> > > -- 
> > > 1.5.4.4

^ permalink raw reply

* NET: sb1250: Fix compile warning in driver
From: Ralf Baechle @ 2010-04-06  9:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

 drivers/net/sb1250-mac.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sb1250-mac.c b/drivers/net/sb1250-mac.c
index 9944e5d..142261b 100644
--- a/drivers/net/sb1250-mac.c
+++ b/drivers/net/sb1250-mac.c
@@ -2664,7 +2664,6 @@ static int sbmac_close(struct net_device *dev)
 static int sbmac_poll(struct napi_struct *napi, int budget)
 {
 	struct sbmac_softc *sc = container_of(napi, struct sbmac_softc, napi);
-	struct net_device *dev = sc->sbm_dev;
 	int work_done;
 
 	work_done = sbdma_rx_process(sc, &(sc->sbm_rxdma), budget, 1);

^ permalink raw reply related

* Re: [PATCH net-next 00/12] tg3: Bugfix, msg fixups, and checkpatch cleanups
From: David Miller @ 2010-04-06 10:59 UTC (permalink / raw)
  To: mcarlson; +Cc: netdev, andy
In-Reply-To: <1270498770-23765-1-git-send-email-mcarlson@broadcom.com>

From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Mon, 5 Apr 2010 13:19:18 -0700

> This patchset fixes a minor APD bug, elaborates on the recent messaging
> improvements, and implements some checkpatch cleanups.

These all look fine, applied to net-next-2.6, thanks!

^ permalink raw reply

* Re: NET: sb1250: Fix compile warning in driver
From: David Miller @ 2010-04-06 11:03 UTC (permalink / raw)
  To: ralf; +Cc: netdev
In-Reply-To: <20100406093320.GA31967@linux-mips.org>

From: Ralf Baechle <ralf@linux-mips.org>
Date: Tue, 6 Apr 2010 10:33:20 +0100

> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

Applied to net-next-2.6, thanks Ralf!

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2010-04-06 11:33 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


The tcp splice oops is pretty nasty... anyways.

1) Fixup rcu_deref calls done outside RCU read lock in netlabel,
   from Paul Moore.

2) gianfar fixes (memory leak on close, message alignment) from
   Andy Fleming and Kim Phillips.

3) MAC address probing fix in smc91c92_cs from Ken Kawasaki.

4) Some small wireless fixes via John Linville and co. including
   a few device ID additions.
	a) iwlwifi bool conversion to flags broke regulatory handling
	b) iwlwifi tfd counting on 4965 chips fix
	c) mac80211's reg_regdb_search_lock needs to be a mutex
	d) off-by-one test fix in wireless mesh metric handling

5) New cxgb4 driver.

6) TCP doesn't maintain queue consumed state properly across socket
   lock dropping (and thus backlog processing) during splice so this
   confuses tcp_collapse() and we crash.  Fix from Steven J. Magnani.

7) bond_uninit() deadlock fix from Amerigo Wang.

8) be2net fixes (redboot flashing, big endian flashing and VLAN rx
   issues) from Ajit Khaparde.

9) stmmac needs crc32, from Carmelo AMOROSO

10) round-robin bonding does htons() on a u8 :-)  Fix from Eric
    Dumazet.

11) Missing lock release in sgiseeq driver, from Julia Lawall

12) Need to validate socket address length before derefing in
    socket ->connect() handlers.  From Changli Gao.
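
The kind of check item 12 refers to can be illustrated in userspace C.
This is a hedged sketch of the general pattern only, not the code from the
patch: check_inet_connect_addr() is a hypothetical name, and the real
handlers live in the kernel and differ per address family. The point is
that the length is checked before sa_family is read, and again before the
family-specific fields are used.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical validation helper for an AF_INET connect() address.
 * Returns 0 if the address may be dereferenced, a negative errno otherwise. */
int check_inet_connect_addr(const struct sockaddr *uaddr, size_t addr_len)
{
	assert(uaddr != NULL);

	/* Too short to even contain the family field: do not read it. */
	if (addr_len < sizeof(uaddr->sa_family))
		return -EINVAL;

	if (uaddr->sa_family != AF_INET)
		return -EAFNOSUPPORT;

	/* The family is known now, so the full structure must be present. */
	if (addr_len < sizeof(struct sockaddr_in))
		return -EINVAL;

	return 0;
}
```

Without the first test, a caller passing a one-byte address would cause
sa_family to be read from memory beyond what was supplied.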

Please pull, thanks a lot!

The following changes since commit db217dece3003df0841bacf9556b5c06aa097dae:
  Linus Torvalds (1):
        Merge git://git.kernel.org/.../davem/sparc-2.6

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Ajit Khaparde (3):
      be2net: fix a bug in flashing the redboot section
      be2net: fix flashing on big endian architectures
      be2net: fix bug in vlan rx path for big endian architecture

Amerigo Wang (1):
      bonding: fix potential deadlock in bond_uninit()

Andy Fleming (1):
      gianfar: Fix a memory leak in gianfar close code

Ben Konrath (1):
      ar9170: add support for NEC WL300NU-G USB dongle

Benjamin Larsson (1):
      Add a pci-id to the mwl8k driver

Carmelo AMOROSO (1):
      stmmac: fix kconfig for crc32 build error

Changli Gao (1):
      net: check the length of the socket address passed to connect(2)

Dan Carpenter (1):
      iwlwifi: range checking issue

Daniel Mack (1):
      net/wireless/libertas: do not call wiphy_unregister() w/o wiphy_register()

David S. Miller (1):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6

Dimitris Michailidis (6):
      cxgb4: Add register, message, and FW definitions
      cxgb4: Add HW and FW support code
      cxgb4: Add packet queues and packet DMA code
      cxgb4: Add remaining driver headers and L2T management
      cxgb4: Add main driver file and driver Makefile
      net: Hook up cxgb4 to Kconfig and Makefile

Eric Dumazet (1):
      bonding: bond_xmit_roundrobin() fix

Gertjan van Wingerde (2):
      rt2x00: Fix typo in RF register programming of rt2800.
      rt2x00: Disable powersaving by default in rt2500usb.

Giuseppe CAVALLARO (1):
      stmmac: add documentation for the driver.

Hans de Goede (1):
      Add USB ID for Thomson SpeedTouch 120g to p54usb id table

Johannes Berg (1):
      mac80211: move netdev queue enabling to correct spot

John W. Linville (2):
      wireless: convert reg_regdb_search_lock to mutex
      mac80211: correct typos in "unavailable upon resume" warning

Julia Lawall (1):
      drivers/net: Add missing unlock

Ken Kawasaki (1):
      smc91c92_cs: fix the problem of "Unable to find hardware address"

Kim Phillips (2):
      net: gianfar - initialize per-queue statistics
      net: gianfar - align BD ring size console messages

Neil Horman (1):
      r8169: clean up my printk uglyness

Paul Moore (1):
      netlabel: Fix several rcu_dereference() calls used without RCU read locks

Porsch, Marco (1):
      mac80211: fix PREQ processing and one small bug

Reinette Chatre (1):
      iwlwifi: fix regulatory

Shanyu Zhao (1):
      iwlwifi: clear unattended interrupts in tasklet

Steven J. Magnani (1):
      net: Fix oops from tcp_collapse() when using splice()

Valentin Longchamp (1):
      setup correct int pipe type in ar9170_usb_exec_cmd

Wey-Yi Guy (1):
      iwlwifi: counting number of tfds can be free for 4965

 Documentation/networking/stmmac.txt         |  143 ++
 drivers/net/Kconfig                         |   25 +
 drivers/net/Makefile                        |    1 +
 drivers/net/benet/be_cmds.c                 |    4 +-
 drivers/net/benet/be_main.c                 |   21 +-
 drivers/net/bonding/bond_main.c             |   28 +-
 drivers/net/cxgb4/Makefile                  |    7 +
 drivers/net/cxgb4/cxgb4.h                   |  741 ++++++
 drivers/net/cxgb4/cxgb4_main.c              | 3388 +++++++++++++++++++++++++++
 drivers/net/cxgb4/cxgb4_uld.h               |  239 ++
 drivers/net/cxgb4/l2t.c                     |  624 +++++
 drivers/net/cxgb4/l2t.h                     |  110 +
 drivers/net/cxgb4/sge.c                     | 2431 +++++++++++++++++++
 drivers/net/cxgb4/t4_hw.c                   | 3131 +++++++++++++++++++++++++
 drivers/net/cxgb4/t4_hw.h                   |  100 +
 drivers/net/cxgb4/t4_msg.h                  |  664 ++++++
 drivers/net/cxgb4/t4_regs.h                 |  878 +++++++
 drivers/net/cxgb4/t4fw_api.h                | 1580 +++++++++++++
 drivers/net/gianfar.c                       |   12 +-
 drivers/net/pcmcia/smc91c92_cs.c            |   12 +-
 drivers/net/r8169.c                         |    4 +-
 drivers/net/sgiseeq.c                       |    4 +-
 drivers/net/stmmac/Kconfig                  |    1 +
 drivers/net/wireless/ath/ar9170/usb.c       |    4 +-
 drivers/net/wireless/iwlwifi/iwl-4965.c     |    6 +-
 drivers/net/wireless/iwlwifi/iwl-agn.c      |   12 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |    4 +-
 drivers/net/wireless/libertas/cfg.c         |    8 +-
 drivers/net/wireless/libertas/dev.h         |    1 +
 drivers/net/wireless/mwl8k.c                |    1 +
 drivers/net/wireless/p54/p54usb.c           |    1 +
 drivers/net/wireless/rt2x00/rt2500usb.c     |    5 +
 drivers/net/wireless/rt2x00/rt2800lib.c     |    4 +-
 net/bluetooth/l2cap.c                       |    3 +-
 net/bluetooth/rfcomm/sock.c                 |    3 +-
 net/bluetooth/sco.c                         |    3 +-
 net/can/bcm.c                               |    3 +
 net/ieee802154/af_ieee802154.c              |    3 +
 net/ipv4/af_inet.c                          |    5 +
 net/ipv4/tcp.c                              |    1 +
 net/mac80211/mesh_hwmp.c                    |    4 +-
 net/mac80211/tx.c                           |    6 +
 net/mac80211/util.c                         |   18 +-
 net/netlabel/netlabel_domainhash.c          |   28 +-
 net/netlabel/netlabel_unlabeled.c           |   66 +-
 net/netlink/af_netlink.c                    |    3 +
 net/wireless/reg.c                          |   12 +-
 47 files changed, 14221 insertions(+), 131 deletions(-)
 create mode 100644 Documentation/networking/stmmac.txt
 create mode 100644 drivers/net/cxgb4/Makefile
 create mode 100644 drivers/net/cxgb4/cxgb4.h
 create mode 100644 drivers/net/cxgb4/cxgb4_main.c
 create mode 100644 drivers/net/cxgb4/cxgb4_uld.h
 create mode 100644 drivers/net/cxgb4/l2t.c
 create mode 100644 drivers/net/cxgb4/l2t.h
 create mode 100644 drivers/net/cxgb4/sge.c
 create mode 100644 drivers/net/cxgb4/t4_hw.c
 create mode 100644 drivers/net/cxgb4/t4_hw.h
 create mode 100644 drivers/net/cxgb4/t4_msg.h
 create mode 100644 drivers/net/cxgb4/t4_regs.h
 create mode 100644 drivers/net/cxgb4/t4fw_api.h

^ permalink raw reply

* Re: patch to improve x.25 throughput negotiation
From: andrew hendry @ 2010-04-06 12:09 UTC (permalink / raw)
  To: John Hughes; +Cc: netdev
In-Reply-To: <4BB8C2CA.6040102@Calva.COM>

I have reproduced it a few ways.
1. X25_MASK_THROUGHPUT on the x25_subscrip_struct, then call
SIOCX25SSUBSCRIP, then call SIOCX25FACILITIES without setting the
throughput field. Call connect.
2. No subscrip setting, call SIOCX25FACILITIES without setting the
throughput field. Call connect.
3. No subscrip, no facilities ioctl, call connect.

The patch removes the bad facility and makes the router accept the
call for the above cases.
I don't currently have a setup to test throughput negotiation in both directions.

Tested-by: Andrew Hendry <andrew.hendry@gmail.com>

On Mon, Apr 5, 2010 at 2:48 AM, John Hughes <john@calva.com> wrote:
> The current X.25 code has some bugs in throughput negotiation:
>
>  1. It does negotiation in all cases, usually there is no need
>  2. It incorrectly attempts to negotiate the throughput class in one
>     direction only.  There are separate throughput classes for input
>     and output and if either is negotiated both must be negotiated.
>
> This is bug https://bugzilla.kernel.org/show_bug.cgi?id=15681
>
> This bug was first reported by Daniel Ferenci to the linux-x25 mailing list
> on 6/8/2004, but is still present.
>
>
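A minimal sketch of what "if either is negotiated both must be negotiated"
means in practice. It assumes the throughput class facility packs two 4-bit
classes into one byte, one nibble per direction, and that negotiation may
only lower a class. demo_negotiate_throughput() is a hypothetical helper
written for illustration, not code from the patch.

```c
#include <stdint.h>

/* Assumption: low nibble is one direction, high nibble the other. */
static uint8_t demo_min_u8(uint8_t a, uint8_t b)
{
	return a < b ? a : b;
}

/*
 * Negotiate both directions together: each nibble is lowered
 * independently to the smaller of the two proposals, and the result
 * always carries a value for both directions.
 */
uint8_t demo_negotiate_throughput(uint8_t ours, uint8_t theirs)
{
	uint8_t lo = demo_min_u8(ours & 0x0f, theirs & 0x0f);
	uint8_t hi = demo_min_u8(ours >> 4, theirs >> 4);

	return (uint8_t)((hi << 4) | lo);
}
```

For example, with ours = 0xAA and theirs = 0x9C the low nibble stays at 0xA
and the high nibble drops to 0x9, giving 0x9A.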

^ permalink raw reply

* [RFC][PATCH] ipmr: Fix struct mfcctl to be independent of MAXVIFS.
From: Eric W. Biederman @ 2010-04-06 12:16 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Patrick McHardy, Ilia K, Tom Goff


Right now, if you recompile the kernel to support more VIFs, users of
MRT_ADD_MFC and MRT_DEL_MFC will break because the ABI changes.

Correct this by forcing the number of VIFS handled in mfcctl to 32.
Allow for larger MAXVIFS by placing a second array of ttls at the end
of struct mfcctl.

struct mfcctl is insane.  The last 4 fields are dead, and the mfcc_ttls
array is only 2 byte aligned, with a 2 byte hole right after it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/mroute.h |    4 +++-
 net/ipv4/ipmr.c        |   29 +++++++++++++++++++++++------
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index c5f3d53..c5e066c 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -76,15 +76,17 @@ struct vifctl {
  *	Cache manipulation structures for mrouted and PIMd
  */
  
+#define MFCCTL_VIFS 32
 struct mfcctl {
 	struct in_addr mfcc_origin;		/* Origin of mcast	*/
 	struct in_addr mfcc_mcastgrp;		/* Group in question	*/
 	vifi_t	mfcc_parent;			/* Where it arrived	*/
-	unsigned char mfcc_ttls[MAXVIFS];	/* Where it is going	*/
+	unsigned char mfcc_ttls[MFCCTL_VIFS];	/* Where it is going	*/
 	unsigned int mfcc_pkt_cnt;		/* pkt count for src-grp */
 	unsigned int mfcc_byte_cnt;
 	unsigned int mfcc_wrong_if;
 	int	     mfcc_expire;
+	unsigned char mfcc_ttls_extra[];	/* The rest of where it is going */
 };
 
 /* 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0b9d03c..2120668 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -797,7 +797,8 @@ static int ipmr_mfc_delete(struct net *net, struct mfcctl *mfc)
 	return -ENOENT;
 }
 
-static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
+static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc,
+			unsigned char *ttls, int mrtsock)
 {
 	int line;
 	struct mfc_cache *uc, *c, **cp;
@@ -817,7 +818,7 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
 	if (c != NULL) {
 		write_lock_bh(&mrt_lock);
 		c->mfc_parent = mfc->mfcc_parent;
-		ipmr_update_thresholds(c, mfc->mfcc_ttls);
+		ipmr_update_thresholds(c, ttls);
 		if (!mrtsock)
 			c->mfc_flags |= MFC_STATIC;
 		write_unlock_bh(&mrt_lock);
@@ -834,7 +835,7 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
 	c->mfc_origin = mfc->mfcc_origin.s_addr;
 	c->mfc_mcastgrp = mfc->mfcc_mcastgrp.s_addr;
 	c->mfc_parent = mfc->mfcc_parent;
-	ipmr_update_thresholds(c, mfc->mfcc_ttls);
+	ipmr_update_thresholds(c, ttls);
 	if (!mrtsock)
 		c->mfc_flags |= MFC_STATIC;
 
@@ -954,6 +955,8 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 	int ret;
 	struct vifctl vif;
 	struct mfcctl mfc;
+	unsigned char ttls[MAXVIFS];
+	unsigned extra_oifs;
 	struct net *net = sock_net(sk);
 
 	if (optname != MRT_INIT) {
@@ -961,7 +964,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 			return -EACCES;
 	}
 
-	switch (optname) {
+	switch (optname){
 	case MRT_INIT:
 		if (sk->sk_type != SOCK_RAW ||
 		    inet_sk(sk)->inet_num != IPPROTO_IGMP)
@@ -1012,15 +1015,29 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 		 */
 	case MRT_ADD_MFC:
 	case MRT_DEL_MFC:
-		if (optlen != sizeof(mfc))
+		/* How many extra interfaces do we have information for? */
+		extra_oifs = optlen - sizeof(mfc);
+		if (extra_oifs > (MAXVIFS - MFCCTL_VIFS))
+			extra_oifs = MAXVIFS - MFCCTL_VIFS;
+
+		if (optlen < sizeof(mfc))
 			return -EINVAL;
 		if (copy_from_user(&mfc, optval, sizeof(mfc)))
 			return -EFAULT;
+
+		memcpy(ttls, mfc.mfcc_ttls, sizeof(mfc.mfcc_ttls));
+		memset(ttls + MFCCTL_VIFS, 255, MAXVIFS - MFCCTL_VIFS);
+		if (copy_from_user(ttls + MFCCTL_VIFS, optval + sizeof(mfc),
+				   extra_oifs))
+			return -EFAULT;
+
+
 		rtnl_lock();
 		if (optname == MRT_DEL_MFC)
 			ret = ipmr_mfc_delete(net, &mfc);
 		else
-			ret = ipmr_mfc_add(net, &mfc, sk == net->ipv4.mroute_sk);
+			ret = ipmr_mfc_add(net, &mfc, ttls,
+					   sk == net->ipv4.mroute_sk);
 		rtnl_unlock();
 		return ret;
 		/*
-- 
1.6.5.2.143.g8cc62
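
The length handling in the MRT_ADD_MFC path can be sketched in user space as
follows. This is only an illustration under stated assumptions: demo_mfcctl
mirrors the field layout of struct mfcctl with stand-in types, DEMO_MAXVIFS is
a hypothetical larger build-time limit, and demo_collect_ttls() is not a kernel
function. Unlike the hunk above, the sketch checks the length before
subtracting, so the unsigned subtraction can never wrap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MFCCTL_VIFS 32	/* fixed ABI size, as in the patch */
#define DEMO_MAXVIFS     64	/* hypothetical larger kernel limit */

struct demo_mfcctl {
	uint32_t origin;
	uint32_t mcastgrp;
	unsigned short parent;
	unsigned char ttls[DEMO_MFCCTL_VIFS];
	unsigned int pkt_cnt;
	unsigned int byte_cnt;
	unsigned int wrong_if;
	int expire;
	unsigned char ttls_extra[];	/* optional tail for VIFs >= 32 */
};

/*
 * Build the full ttl table from a caller-supplied buffer.
 * Returns 0 on success, -1 if the buffer is shorter than the fixed part.
 */
int demo_collect_ttls(const void *optval, size_t optlen,
		      unsigned char out[DEMO_MAXVIFS])
{
	const unsigned char *p = optval;
	size_t extra;

	if (optlen < sizeof(struct demo_mfcctl))
		return -1;

	/* How many extra ttls did the caller supply?  Clamp to our limit. */
	extra = optlen - sizeof(struct demo_mfcctl);
	if (extra > DEMO_MAXVIFS - DEMO_MFCCTL_VIFS)
		extra = DEMO_MAXVIFS - DEMO_MFCCTL_VIFS;

	memcpy(out, p + offsetof(struct demo_mfcctl, ttls), DEMO_MFCCTL_VIFS);
	/* Entries the caller did not supply default to 255. */
	memset(out + DEMO_MFCCTL_VIFS, 255, DEMO_MAXVIFS - DEMO_MFCCTL_VIFS);
	memcpy(out + DEMO_MFCCTL_VIFS, p + sizeof(struct demo_mfcctl), extra);
	return 0;
}
```

An old binary that passes exactly sizeof(struct demo_mfcctl) bytes keeps
working and gets 255 for every extra VIF; a new binary appends up to
DEMO_MAXVIFS - DEMO_MFCCTL_VIFS bytes after the fixed structure.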


^ permalink raw reply related

* Re: [PATCH] sky2: rx hash offload
From: Eric Dumazet @ 2010-04-06 12:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev, Tom Herbert
In-Reply-To: <20100405084800.3bcec66a@nehalam>

Le lundi 05 avril 2010 à 08:48 -0700, Stephen Hemminger a écrit :
> Marvell Yukon 2 hardware supports hardware receive hash calculation.
> Now that Receive Packet Steering is available, add support
> to enable it.
> 
> Note: still experimental, tested on only a few variants.
> No performance testing has been done.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> ---
>  drivers/net/sky2.c |   75 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/net/sky2.h |   23 ++++++++++++++++
>  2 files changed, 96 insertions(+), 2 deletions(-)

I am wondering if introducing hardware computed rxhash wouldn't force us
to clear rxhash in several paths (tunneling...), so that we perform a
software recompute after decapsulation, to enable RFS.

Not mandatory but recommended I would say...

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 2f302d3..3f0aba4 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -379,6 +379,7 @@ static int ipip_rcv(struct sk_buff *skb)
 		skb_dst_drop(skb);
 		nf_reset(skb);
 		ipip_ecn_decapsulate(iph, skb);
+		skb->rxhash = 0;
 		netif_rx(skb);
 		rcu_read_unlock();
 		return 0;
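
The idea behind clearing rxhash can be sketched outside the kernel. The
sketch assumes that a zero hash means "not computed yet" and that the
consumer recomputes lazily from whatever header is currently exposed;
demo_skb, demo_mix() and the other demo_* names are hypothetical and only
stand in for the real structures.

```c
#include <stdint.h>

struct demo_skb {
	uint32_t rxhash;	/* 0 means "no hash yet" */
	uint32_t saddr, daddr;	/* addresses of the exposed header */
	uint16_t sport, dport;
};

/* Hypothetical mixer; any reasonable hash over the tuple would do. */
static uint32_t demo_mix(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t h = a * 0x9e3779b1u;

	h ^= b + 0x85ebca6bu + (h << 6) + (h >> 2);
	h ^= c + 0xc2b2ae35u + (h << 6) + (h >> 2);
	return h ? h : 1;	/* keep 0 reserved for "no hash" */
}

/* Use the stored hash if present, otherwise compute it in software. */
uint32_t demo_get_rxhash(struct demo_skb *skb)
{
	if (!skb->rxhash)
		skb->rxhash = demo_mix(skb->saddr, skb->daddr,
				       ((uint32_t)skb->sport << 16) | skb->dport);
	return skb->rxhash;
}

/*
 * Decapsulation exposes the inner header.  The hardware hash covered
 * the outer header, so forget it and let the next user recompute.
 */
void demo_decapsulate(struct demo_skb *skb, uint32_t saddr, uint32_t daddr,
		      uint16_t sport, uint16_t dport)
{
	skb->saddr = saddr;
	skb->daddr = daddr;
	skb->sport = sport;
	skb->dport = dport;
	skb->rxhash = 0;
}
```

Two packets that arrive through the tunnel with different hardware hashes
but carry the same inner flow end up with the same hash after
demo_decapsulate(), which is what flow steering needs.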



^ permalink raw reply related

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-06 12:34 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270486884-10905-1-git-send-email-timo.teras@iki.fi>

On Mon, Apr 05, 2010 at 08:01:24PM +0300, Timo Teras wrote:
> This allows validating the cached object before returning it.
> It also allows destructing the object properly, if the last reference
> was held in flow cache. This is also a preparation for caching
> bundles in the flow cache.
> 
> In return for virtualizing the methods, we save on:
> - not having to regenerate the whole flow cache on policy removal:
>   each flow matching a killed policy gets refreshed as the getter
>   function notices it smartly.
> - we do not have to call flow_cache_flush from policy gc, since the
>   flow cache now properly deletes the object if it had any references
> 
> Signed-off-by: Timo Teras <timo.teras@iki.fi>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks a lot for the patch!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Herbert Xu @ 2010-04-06 12:40 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270450824-2928-3-git-send-email-timo.teras@iki.fi>

On Mon, Apr 05, 2010 at 10:00:22AM +0300, Timo Teras wrote:
>
> @@ -623,33 +618,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
>  		schedule_work(&net->xfrm.policy_hash_work);
>  
>  	read_lock_bh(&xfrm_policy_lock);
> -	gc_list = NULL;
>  	entry = &policy->bydst;
> -	hlist_for_each_entry_continue(policy, entry, bydst) {
> -		struct dst_entry *dst;
> -
> -		write_lock(&policy->lock);
> -		dst = policy->bundles;
> -		if (dst) {
> -			struct dst_entry *tail = dst;
> -			while (tail->next)
> -				tail = tail->next;
> -			tail->next = gc_list;
> -			gc_list = dst;
> -
> -			policy->bundles = NULL;
> -		}
> -		write_unlock(&policy->lock);
> -	}
> +	hlist_for_each_entry_continue(policy, entry, bydst)
> +		atomic_inc(&policy->genid);

Do we still need this since we're invalidating the whole flow
cache?

The current code is necessary since otherwise the bundles won't
get freed.  But with your new code, this is essentially doing
nothing, no?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 11/11] drivers/uwb: Rename dev_info to wdi
From: David Vrabel @ 2010-04-06 12:42 UTC (permalink / raw)
  To: David Miller; +Cc: joe, akpm, netdev, linux-kernel
In-Reply-To: <20100405.145144.207421561.davem@davemloft.net>

David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Mon, 05 Apr 2010 14:44:18 -0700
> 
>> On Mon, 2010-04-05 at 12:05 -0700, Joe Perches wrote:
>>> There is a macro called dev_info that prints struct device specific
>>> information.  Having variables with the same name can be confusing and
>>> prevents conversion of the macro to a function.
>>>
>>> Rename the existing dev_info variables to something else in preparation
>>> to converting the dev_info macro to a function.
>> http://patchwork.ozlabs.org/patch/49421/
>>
>> This is marked as RFC in patchwork.
>> It's not intended to be.
> 
> Because I can't apply the entire set, I'd like someone else
> to take this in since it's not really a networking specific
> patch.

I've taken it.

David
-- 
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park,  Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ                 http://www.csr.com/


Member of the CSR plc group of companies. CSR plc registered in England and Wales, registered number 4187346, registered office Churchill House, Cambridge Business Park, Cowley Road, Cambridge, CB4 0WZ, United Kingdom

^ permalink raw reply

* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Timo Teräs @ 2010-04-06 12:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100406124014.GA24412@gondor.apana.org.au>

Herbert Xu wrote:
> On Mon, Apr 05, 2010 at 10:00:22AM +0300, Timo Teras wrote:
>> @@ -623,33 +618,11 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
>> +	hlist_for_each_entry_continue(policy, entry, bydst)
>> +		atomic_inc(&policy->genid);
> 
> Do we still need this since we're invalidating the whole flow
> cache?
> 
> The current code is necessary since otherwise the bundles won't
> get freed.  But with your new code, this is essentially doing
> nothing, no?

You are right. I completely missed the flushing there. It was
just systematic conversion of deleting the bundles to incrementing
the genid.

Which also makes me think of another issue. The resolver does
not get notified if the genid was outdated. So it might end up using
the old policies from the bundle after xfrm_policy_insert(). I think
we should explicitly call ops->delete() in flow_cache_lookup if
the flow genid was outdated. (I remember actually doing this,
but also removing it when I was hunting that one hlist-related
corruption bug.)
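
A rough sketch of the lookup behaviour proposed above, assuming a single
global generation counter and a per-entry delete hook. All demo_* names are
hypothetical; delete_entry stands in for the ops->delete() callback, and the
counter exists only so the effect can be observed.

```c
#include <stddef.h>

struct demo_flow_entry;

struct demo_flow_ops {
	void (*delete_entry)(struct demo_flow_entry *fle);
};

struct demo_flow_entry {
	unsigned int genid;		/* generation seen when cached */
	void *object;			/* cached object, or NULL */
	const struct demo_flow_ops *ops;
};

unsigned int demo_flow_genid;		/* bumped on every policy change */
int demo_deleted_count;			/* how often the hook has run */

void demo_count_delete(struct demo_flow_entry *fle)
{
	(void)fle;
	demo_deleted_count++;
}

const struct demo_flow_ops demo_ops = { demo_count_delete };

/* Invalidate every cached entry at once by advancing the generation. */
void demo_policy_changed(void)
{
	demo_flow_genid++;
}

void demo_cache_store(struct demo_flow_entry *fle, void *object,
		      const struct demo_flow_ops *ops)
{
	fle->genid = demo_flow_genid;
	fle->object = object;
	fle->ops = ops;
}

/*
 * Return the cached object, or NULL if the caller must re-resolve.
 * A stale entry is handed to the delete hook exactly once, so the
 * resolver never sees an object from an older generation.
 */
void *demo_cache_lookup(struct demo_flow_entry *fle)
{
	if (fle->object == NULL)
		return NULL;
	if (fle->genid != demo_flow_genid) {
		if (fle->ops && fle->ops->delete_entry)
			fle->ops->delete_entry(fle);
		fle->object = NULL;
		return NULL;
	}
	return fle->object;
}
```

The point of calling the hook from the lookup path is that the stale object
is dropped before the resolver runs, instead of being passed to it as a
starting point.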

^ permalink raw reply

* Re: [PATCH v2] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-06 13:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004052248390.29212@pokey.mtv.corp.google.com>

Le lundi 05 avril 2010 à 22:56 -0700, Tom Herbert a écrit :
> Version 2:
> - added a u16 filler to pad rps_dev_flow structure
> - define RPS_NO_CPU as 0xffff
> - add inet_rps_save_rxhash helper function to copy skb's rxhash into inet_sk
> - add a "voidflow" which can be used when get_rps_cpu does not return a flow (avoids some conditionals)
> - use raw_smp_processor_id in rps_record_sock_flow, there is no requirement to prevent preemption
> ---
> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
> 
> Problem statement: Protocol processing done in the NAPI context for received
> packets is serialized per device queue and becomes a bottleneck under high
> packet load.  This substantially limits pps that can be achieved on a single
> queue NIC and provides no scaling with multiple cores.
> 
> This solution queues packets early on in the receive path on the backlog queues
> of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
> performed on packets in parallel.   For each device (or each receive queue in
> a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
> process packets. A CPU is selected on a per packet basis by hashing contents
> of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
> into the CPU mask.  The IPI mechanism is used to raise networking receive
> softirqs between CPUs.  This effectively emulates in software what a multi-queue
> NIC can provide, but is generic requiring no device support.
> 
> Many devices now provide a hash over the 4-tuple on a per packet basis
> (e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
> in an skb field, and that value in turn is used to index into the RPS maps.
> Using the HW generated hash can avoid cache misses on the packet when
> steering it to a remote CPU.
> 
> The CPU mask is set on a per device and per queue basis in the sysfs variable
> /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
> bit maps for receive queues in the device (numbered by <n>).  If a device
> does not support multi-queue, a single variable is used for the device (rx-0).
> 
> Generally, we have found this technique increases pps capabilities of a single
> queue device with good CPU utilization.  Optimal settings for the CPU mask
> seem to depend on architectures and cache hierarchy.  Below are some results
> running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
> Results show cumulative transaction rate and system CPU utilization.
> 
> e1000e on 8 core Intel
>    Without RPS: 108K tps at 33% CPU
>    With RPS:    311K tps at 64% CPU
> 
> forcedeth on 16 core AMD
>    Without RPS: 156K tps at 15% CPU
>    With RPS:    404K tps at 49% CPU
>    
> bnx2x on 16 core AMD
>    Without RPS  567K tps at 61% CPU (4 HW RX queues)
>    Without RPS  738K tps at 96% CPU (8 HW RX queues)
>    With RPS:    854K tps at 76% CPU (4 HW RX queues)
> 
> Caveats:
> - The benefits of this patch are dependent on architecture and cache hierarchy.
> Tuning the masks to get best performance is probably necessary.
> - This patch adds overhead in the path for processing a single packet.  In
> a lightly loaded server this overhead may eliminate the advantages of
> increased parallelism, and possibly cause some relative performance degradation.
> We have found that masks that are cache aware (share same caches with
> the interrupting CPU) mitigate much of this.
> - The RPS masks can be changed dynamically, however whenever the mask is changed
> this introduces the possibility of generating out of order packets.  It's
> probably best not to change the masks too frequently.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
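
For readers following along, the CPU selection step described in the quoted
text can be sketched as below. It assumes the per-queue mask has been
expanded into a small array of CPU ids and that the 32-bit hash is scaled
onto that array with a multiply and shift; demo_rps_map and the helper names
are hypothetical, and the 8-CPU limit is only to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_rps_map {
	size_t len;		/* number of CPUs enabled in the mask */
	uint16_t cpus[8];	/* their ids, in ascending order */
};

/* Expand a bit mask (bit n set means CPU n may process packets). */
void demo_map_from_mask(struct demo_rps_map *map, uint8_t mask)
{
	unsigned int cpu;

	map->len = 0;
	for (cpu = 0; cpu < 8; cpu++)
		if (mask & (1u << cpu))
			map->cpus[map->len++] = (uint16_t)cpu;
}

/*
 * Map a 32-bit flow hash onto the enabled CPUs.  Scaling with a
 * 64-bit multiply avoids a division and spreads hashes evenly.
 * Returns -1 if no CPU is enabled.
 */
int demo_pick_cpu(const struct demo_rps_map *map, uint32_t hash)
{
	if (map->len == 0)
		return -1;
	return map->cpus[((uint64_t)hash * map->len) >> 32];
}
```

Because the choice depends only on the hash, all packets of one flow land on
the same CPU as long as the mask is unchanged, which is why changing the mask
can reorder packets.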

Running on a preprod machine here, seems fine.

Some questions :

1) The need to add "rps_flow_entries=xxx" at boot time is problematic.
   Maybe we can allow it being dynamic (and use vmalloc() instead of
alloc_large_system_hash())

2) inet_rps_save_rxhash(sk, skb->rxhash);

	It should have a check to make sure some part of the stack doesn't feed
many different rxhash values for a given socket (make sure we don't pollute
the flow table with pseudo-random values)

3) UDP connected sockets don't benefit from RFS currently
   (Not sure many apps use connected UDP sockets, I do have some of them
in house)

I am trying the following code for IPv4 only:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7af756d..5c2d37a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1216,6 +1216,7 @@ int udp_disconnect(struct sock *sk, int flags)
 	sk->sk_state = TCP_CLOSE;
 	inet->inet_daddr = 0;
 	inet->inet_dport = 0;
+	inet_rps_save_rxhash(sk, 0);
 	sk->sk_bound_dev_if = 0;
 	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
 		inet_reset_saddr(sk);
@@ -1257,8 +1258,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-	int rc = sock_queue_rcv_skb(sk, skb);
+	int rc;
+
+	if (inet_sk(sk)->inet_daddr)
+		inet_rps_save_rxhash(sk, skb->rxhash);
 
+	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 



^ permalink raw reply related

* Re: [PATCH 2/4] xfrm: cache bundles instead of policies for outgoing flows
From: Herbert Xu @ 2010-04-06 13:11 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BBB2F31.7090806@iki.fi>

On Tue, Apr 06, 2010 at 03:55:13PM +0300, Timo Teräs wrote:
>
> Which also makes me think of another issue. The resolver does
> not get notified if the genid was outdated. So it might end up using
> the old policies from the bundle after xfrm_policy_insert(). I think
> we should explicitly call ops->delete() in flow_cache_lookup if
> the flow genid was outdated. (I remember actually doing this,
> but also removing it when I was hunting that one hlist-related
> corruption bug.)

Right, that makes sense.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply
