Netdev List


* Re: NET: sb1250: Fix compile warning in driver
From: David Miller @ 2010-04-06 11:03 UTC
  To: ralf; +Cc: netdev
In-Reply-To: <20100406093320.GA31967@linux-mips.org>

From: Ralf Baechle <ralf@linux-mips.org>
Date: Tue, 6 Apr 2010 10:33:20 +0100

> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

Applied to net-next-2.6, thanks Ralf!

* Re: [PATCH net-next 00/12] tg3: Bugfix, msg fixups, and checkpatch cleanups
From: David Miller @ 2010-04-06 10:59 UTC
  To: mcarlson; +Cc: netdev, andy
In-Reply-To: <1270498770-23765-1-git-send-email-mcarlson@broadcom.com>

From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Mon, 5 Apr 2010 13:19:18 -0700

> This patchset fixes a minor APD bug, elaborates on the recent messaging
> improvements, and implements some checkpatch cleanups.

These all look fine, applied to net-next-2.6, thanks!

* NET: sb1250: Fix compile warning in driver
From: Ralf Baechle @ 2010-04-06  9:33 UTC
  To: David S. Miller; +Cc: netdev

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

 drivers/net/sb1250-mac.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sb1250-mac.c b/drivers/net/sb1250-mac.c
index 9944e5d..142261b 100644
--- a/drivers/net/sb1250-mac.c
+++ b/drivers/net/sb1250-mac.c
@@ -2664,7 +2664,6 @@ static int sbmac_close(struct net_device *dev)
 static int sbmac_poll(struct napi_struct *napi, int budget)
 {
 	struct sbmac_softc *sc = container_of(napi, struct sbmac_softc, napi);
-	struct net_device *dev = sc->sbm_dev;
 	int work_done;
 
 	work_done = sbdma_rx_process(sc, &(sc->sbm_rxdma), budget, 1);

* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Michael S. Tsirkin @ 2010-04-06  7:51 UTC
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BC1@shzsmsx502.ccr.corp.intel.com>

On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote:
> Michael,
> > >>> For the write logging, do you have a function at hand that we can use
> > >>> to recompute the log? If so, I think I can use it to recompute the
> > >>> log info when logging is suddenly enabled.
> > >>> For the outstanding requests, do you mean all the user buffers that were
> > >>> submitted before the logging ioctl changed? That may be a lot, and
> > >>> some of them are still in NIC ring descriptors. Waiting for them to
> > >>> finish may take some time. I think it is also reasonable for the
> > >>> logging change to take effect just after the logging ioctl changes.
>  
> > >>The key point is that after the logging ioctl returns, any
> > >>subsequent change to memory must be logged. It does not
> > >>matter when the request was submitted; otherwise we will
> > >>get memory corruption on migration.
> 
> > >The change to memory happens in vhost_add_used_and_signal(), right?
> > >So after the ioctl returns, it is enough to recompute the log info for the
> > >events in the async queue, since the ioctl and the write-log operations are
> > >all protected by vq->mutex.
>  
> >> Thanks
> >> Xiaohui
> 
> >Yes, I think this will work.
> 
> Thanks. Do you have a function at hand for recomputing the log info that I can
> use? I vaguely remember that you pointed it out some time ago.

Doesn't just rerunning vhost_get_vq_desc work?
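
To illustrate the ordering requirement discussed above (a write that lands after the logging ioctl returns must be logged, regardless of when the request was submitted), here is a minimal userspace sketch in C. It is not vhost code: `model_vq`, `model_req`, and `model_complete` are hypothetical names, and the dirty log is reduced to a byte counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a virtqueue; not the real struct vhost_virtqueue. */
struct model_vq {
	bool log_enabled;     /* toggled by the (modelled) logging ioctl */
	size_t bytes_logged;  /* total length of writes recorded in the log */
};

/* Hypothetical stand-in for an outstanding asynchronous request. */
struct model_req {
	size_t len;           /* bytes written to guest memory on completion */
};

/*
 * Completion handler. Guest memory changes at this point, so the logging
 * flag is consulted here rather than at submit time.
 */
static void model_complete(struct model_vq *vq, const struct model_req *req)
{
	assert(vq != NULL && req != NULL);
	if (vq->log_enabled)
		vq->bytes_logged += req->len;
}
```

A request submitted while logging is off but completed after logging is switched on is still recorded, which is the behaviour the thread asks for.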

> > > Thanks
> > > Xiaohui
> > > 
> > >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > >  drivers/vhost/vhost.h |   10 +++
> > >  2 files changed, 192 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 22d5fef..2aafd90 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -17,11 +17,13 @@
> > >  #include <linux/workqueue.h>
> > >  #include <linux/rcupdate.h>
> > >  #include <linux/file.h>
> > > +#include <linux/aio.h>
> > >  
> > >  #include <linux/net.h>
> > >  #include <linux/if_packet.h>
> > >  #include <linux/if_arp.h>
> > >  #include <linux/if_tun.h>
> > > +#include <linux/mpassthru.h>
> > >  
> > >  #include <net/sock.h>
> > >  
> > > @@ -47,6 +49,7 @@ struct vhost_net {
> > >  	struct vhost_dev dev;
> > >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > > +	struct kmem_cache       *cache;
> > >  	/* Tells us whether we are polling a socket for TX.
> > >  	 * We only do this when socket buffer fills up.
> > >  	 * Protected by tx vq lock. */
> > > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > >  }
> > >  
> > > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > > +	if (!list_empty(&vq->notifier)) {
> > > +		iocb = list_first_entry(&vq->notifier,
> > > +				struct kiocb, ki_list);
> > > +		list_del(&iocb->ki_list);
> > > +	}
> > > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > > +	return iocb;
> > > +}
> > > +
> > > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	struct vhost_log *vq_log = NULL;
> > > +	int rx_total_len = 0;
> > > +	int log, size;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	if (vq->receiver)
> > > +		vq->receiver(vq);
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(
> > > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, iocb->ki_nbytes);
> > > +		log = (int)iocb->ki_user_data;
> > > +		size = iocb->ki_nbytes;
> > > +		rx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +		kmem_cache_free(net->cache, iocb);
> > > +
> > > +		if (unlikely(vq_log))
> > > +			vhost_log_write(vq, vq_log, log, size);
> > > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > > +					struct vhost_virtqueue *vq)
> > > +{
> > > +	struct kiocb *iocb = NULL;
> > > +	int tx_total_len = 0;
> > > +
> > > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > > +		return;
> > > +
> > > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > > +		vhost_add_used_and_signal(&net->dev, vq,
> > > +				iocb->ki_pos, 0);
> > > +		tx_total_len += iocb->ki_nbytes;
> > > +
> > > +		if (iocb->ki_dtor)
> > > +			iocb->ki_dtor(iocb);
> > > +
> > > +		kmem_cache_free(net->cache, iocb);
> > > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > > +			vhost_poll_queue(&vq->poll);
> > > +			break;
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  /* Expects to be always run from workqueue - which acts as
> > >   * read-size critical section for our kind of RCU. */
> > >  static void handle_tx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, s;
> > >  	struct msghdr msg = {
> > >  		.msg_name = NULL,
> > > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		tx_poll_stop(net);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > >  		/* Skip header. TODO: support TSO. */
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > >  		msg.msg_iovlen = out;
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->ki_pos = head;
> > > +			iocb->private = (void *)vq;
> > > +		}
> > > +
> > >  		len = iov_length(vq->iov, out);
> > >  		/* Sanity check */
> > >  		if (!len) {
> > > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > >  			break;
> > >  		}
> > >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > >  		if (unlikely(err < 0)) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			tx_poll_start(net, sock);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		if (err != len)
> > >  			pr_err("Truncated TX packet: "
> > >  			       " len %d != %zd\n", err, len);
> > > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_tx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > >  static void handle_rx(struct vhost_net *net)
> > >  {
> > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > >  	unsigned head, out, in, log, s;
> > >  	struct vhost_log *vq_log;
> > >  	struct msghdr msg = {
> > > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > >  	int err;
> > >  	size_t hdr_size;
> > >  	struct socket *sock = rcu_dereference(vq->private_data);
> > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> > >  		return;
> > >  
> > >  	use_mm(net->dev.mm);
> > > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > >  	vhost_disable_notify(vq);
> > >  	hdr_size = vq->hdr_size;
> > >  
> > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > +	/* In async cases, for write logging, the simple way is to get
> > > +	 * the log info always, and really logging is decided later.
> > > +	 * Thus, when logging enabled, we can get log, and when logging
> > > +	 * disabled, we can get log disabled accordingly.
> > > +	 */
> > > +
> > > +	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > > +		(vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > >  		vq->log : NULL;
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	for (;;) {
> > >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >  					 ARRAY_SIZE(vq->iov),
> > > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > >  		msg.msg_iovlen = in;
> > >  		len = iov_length(vq->iov, in);
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > > +			if (!iocb)
> > > +				break;
> > > +			iocb->private = vq;
> > > +			iocb->ki_pos = head;
> > > +			iocb->ki_user_data = log;
> > > +		}
> > >  		/* Sanity check */
> > >  		if (!len) {
> > >  			vq_err(vq, "Unexpected header len for RX: "
> > > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > >  			       iov_length(vq->hdr, s), hdr_size);
> > >  			break;
> > >  		}
> > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > > +
> > > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> > >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> > >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> > >  		if (err < 0) {
> > >  			vhost_discard_vq_desc(vq);
> > >  			break;
> > >  		}
> > > +
> > > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > > +			continue;
> > > +
> > >  		/* TODO: Should check and handle checksum. */
> > >  		if (err > len) {
> > >  			pr_err("Discarded truncated rx packet: "
> > > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > >  		}
> > >  	}
> > >  
> > > +	handle_async_rx_events_notify(net, vq);
> > > +
> > >  	mutex_unlock(&vq->mutex);
> > >  	unuse_mm(net->dev.mm);
> > >  }
> > >  
> > > +
> > >  static void handle_tx_kick(struct work_struct *work)
> > >  {
> > >  	struct vhost_virtqueue *vq;
> > > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > > +	n->cache = NULL;
> > >  	return 0;
> > >  }
> > >  
> > > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > >  }
> > >  
> > > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > > +{
> > > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > > +	struct kiocb *iocb = NULL;
> > > +	if (n->cache) {
> > > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > > +			kmem_cache_free(n->cache, iocb);
> > > +		kmem_cache_destroy(n->cache);
> > > +	}
> > > +}
> > > +
> > >  static int vhost_net_release(struct inode *inode, struct file *f)
> > >  {
> > >  	struct vhost_net *n = f->private_data;
> > > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > >  	/* We do an extra flush before freeing memory,
> > >  	 * since jobs can re-queue themselves. */
> > >  	vhost_net_flush(n);
> > > +	vhost_notifier_cleanup(n);
> > >  	kfree(n);
> > >  	return 0;
> > >  }
> > > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > >  	return sock;
> > >  }
> > >  
> > > -static struct socket *get_socket(int fd)
> > > +static struct socket *get_mp_socket(int fd)
> > > +{
> > > +	struct file *file = fget(fd);
> > > +	struct socket *sock;
> > > +	if (!file)
> > > +		return ERR_PTR(-EBADF);
> > > +	sock = mp_get_socket(file);
> > > +	if (IS_ERR(sock))
> > > +		fput(file);
> > > +	return sock;
> > > +}
> > > +
> > > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > >  {
> > >  	struct socket *sock;
> > >  	if (fd == -1)
> > > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > >  	sock = get_tun_socket(fd);
> > >  	if (!IS_ERR(sock))
> > >  		return sock;
> > > +	sock = get_mp_socket(fd);
> > > +	if (!IS_ERR(sock)) {
> > > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > > +		return sock;
> > > +	}
> > >  	return ERR_PTR(-ENOTSOCK);
> > >  }
> > >  
> > > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > > +{
> > > +	struct vhost_virtqueue *vq = n->vqs + index;
> > > +
> > > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > > +		vq->receiver = NULL;
> > > +		INIT_LIST_HEAD(&vq->notifier);
> > > +		spin_lock_init(&vq->notify_lock);
> > > +		if (!n->cache) {
> > > +			n->cache = kmem_cache_create("vhost_kiocb",
> > > +					sizeof(struct kiocb), 0,
> > > +					SLAB_HWCACHE_ALIGN, NULL);
> > > +		}
> > > +	}
> > > +}
> > > +
> > >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  {
> > >  	struct socket *sock, *oldsock;
> > > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	}
> > >  	vq = n->vqs + index;
> > >  	mutex_lock(&vq->mutex);
> > > -	sock = get_socket(fd);
> > > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > > +	sock = get_socket(vq, fd);
> > >  	if (IS_ERR(sock)) {
> > >  		r = PTR_ERR(sock);
> > >  		goto err;
> > >  	}
> > >  
> > > +	vhost_init_link_state(n, index);
> > > +
> > >  	/* start polling new socket */
> > >  	oldsock = vq->private_data;
> > >  	if (sock == oldsock)
> > > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > >  	vhost_net_disable_vq(n, vq);
> > >  	rcu_assign_pointer(vq->private_data, sock);
> > >  	vhost_net_enable_vq(n, vq);
> > > -	mutex_unlock(&vq->mutex);
> > >  done:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	if (oldsock) {
> > >  		vhost_net_flush_vq(n, index);
> > > @@ -516,6 +690,7 @@ done:
> > >  	}
> > >  	return r;
> > >  err:
> > > +	mutex_unlock(&vq->mutex);
> > >  	mutex_unlock(&n->dev.mutex);
> > >  	return r;
> > >  }
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index d1f0453..cffe39a 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -43,6 +43,11 @@ struct vhost_log {
> > >  	u64 len;
> > >  };
> > >  
> > > +enum vhost_vq_link_state {
> > > +	VHOST_VQ_LINK_SYNC = 	0,
> > > +	VHOST_VQ_LINK_ASYNC = 	1,
> > > +};
> > > +
> > >  /* The virtqueue structure describes a queue attached to a device. */
> > >  struct vhost_virtqueue {
> > >  	struct vhost_dev *dev;
> > > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > >  	/* Log write descriptors */
> > >  	void __user *log_base;
> > >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > > > +	/* Differentiate async socket for zero-copy from normal */
> > > +	enum vhost_vq_link_state link_state;
> > > +	struct list_head notifier;
> > > +	spinlock_t notify_lock;
> > > +	void (*receiver)(struct vhost_virtqueue *);
> > >  };
> > >  
> > >  struct vhost_dev {
> > > -- 
> > > 1.5.4.4

* Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-06  7:49 UTC
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@c2.user-mode-linux.org, yzhao81@gmail.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B5BB9@shzsmsx502.ccr.corp.intel.com>

On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:
> Michael,
> >> 
> >>For the DoS issue, I'm not sure what a reasonable limit is on how much
> >> get_user_pages() can pin. Should we compute it from the bandwidth?
> 
> >There's a ulimit for locked memory. Can we use this, decreasing
> >the value for rlimit array? We can do this when backend is
> >enabled and re-increment when backend is disabled.
> 
> I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found that
> its initial value is 0x10000. After a right shift by PAGE_SHIFT,
> that is only 16 pages we can lock, which seems too small, since the
> guest virtio-net driver may submit a lot of requests at one time.
> 
> 
> Thanks
> Xiaohui

Yes, that's the default, but the system administrator can always increase
this value with ulimit if necessary.

-- 
MST
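
The arithmetic in the exchange above (a 0x10000-byte limit shifted right by PAGE_SHIFT gives 16 pages) can be checked from userspace. The following is a minimal sketch, assuming a Linux or other POSIX system that defines RLIMIT_MEMLOCK; the helper names `pages_for_limit` and `memlock_limit_pages` are made up for illustration and are not part of any patch in this thread.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/* Number of whole pages that fit in limit_bytes. */
static unsigned long pages_for_limit(unsigned long limit_bytes,
				     unsigned long page_size)
{
	assert(page_size > 0);
	return limit_bytes / page_size;
}

/*
 * Current soft RLIMIT_MEMLOCK expressed in pages.
 * Returns -1 if the limit cannot be read or is unlimited.
 */
static long memlock_limit_pages(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;
	return (long)(rl.rlim_cur / (rlim_t)page_size);
}
```

With 4096-byte pages, `pages_for_limit(0x10000, 4096)` gives 16, matching the figure quoted above. In bash, the soft limit that `memlock_limit_pages()` reads can be raised with `ulimit -l`, up to the hard limit.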

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-06  7:39 UTC
  To: FUJITA Tomonori
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20100404205028H.fujita.tomonori@lab.ntt.co.jp>

Thanks, Fujita.

The patch looks fine. I'll run some regression tests on the patched driver to check that things still work and if it's ok we will ack it shortly.

vlad
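
The one functional change described in the quoted message below is that pci_alloc_consistent() is a thin wrapper that always passes GFP_ATOMIC to dma_alloc_coherent(), whereas calling dma_alloc_coherent() directly lets the driver pass GFP_KERNEL. The sketch below models that relationship in plain userspace C. Everything prefixed `model_` or `MODEL_` is a hypothetical stand-in, not the kernel API, and the device and DMA-handle arguments are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the kernel's allocation-context flags (values arbitrary). */
enum model_gfp { MODEL_GFP_ATOMIC, MODEL_GFP_KERNEL };

/* Flag used by the most recent allocation, kept for inspection. */
static enum model_gfp model_last_gfp;

/* Models dma_alloc_coherent(): the caller chooses the allocation flag. */
static void *model_dma_alloc_coherent(size_t size, enum model_gfp gfp)
{
	assert(size > 0);
	model_last_gfp = gfp;
	return calloc(1, size);
}

/* Models pci_alloc_consistent(): a wrapper that always allocates atomically. */
static void *model_pci_alloc_consistent(size_t size)
{
	return model_dma_alloc_coherent(size, MODEL_GFP_ATOMIC);
}
```

Switching a call site from the wrapper to the direct call is therefore only safe when that call site may sleep, which is why the quoted message asks for the change to be checked on bnx2x.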



> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Sunday, April 04, 2010 2:51 PM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; davem@davemloft.net;
> netdev@vger.kernel.org; Eilon Greenstein
> Subject: RE: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> On Sun, 4 Apr 2010 03:24:46 -0700
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
>
> > Ok. Got it now. Thanks, Fujita. I think we should patch the bnx2x to
> > use the generic model (not just the mapping macros).
>
> I've attached the patch.
>
> There is one functional change: pci_alloc_consistent ->
> dma_alloc_coherent
>
> pci_alloc_consistent is a wrapper function of dma_alloc_coherent with
> GFP_ATOMIC flag (see include/asm-generic/pci-dma-compat.h).
>
> pci_alloc_consistent uses the GFP_ATOMIC flag for compatibility with
> some broken drivers that call the function in interrupt context. But
> GFP_ATOMIC should be avoided if possible. It looks like bnx2x doesn't call
> pci_alloc_consistent in interrupt context, so I replaced those calls with
> dma_alloc_coherent with GFP_KERNEL.
>
> Please check if that change works for bnx2x.
>
> > One last question: since which kernel version can the generic DMA layer
> > be used instead of the PCI DMA layer?
>
> After 2.6.34-rc2.
>
> Well, on the majority of architectures, you have been able to use the
> generic DMA API instead of the PCI DMA API. The PCI DMA API is just a
> wrapper around the generic DMA API. But on some architectures, the two APIs
> worked slightly differently. Since 2.6.34-rc2, the two APIs work in exactly
> the same way on all architectures.
>
>
> =
> From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Subject: [PATCH] bnx2x: use the DMA API instead of the pci equivalents
>
> The DMA API is preferred.
>
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
>  drivers/net/bnx2x.h      |    4 +-
>  drivers/net/bnx2x_main.c |  110 +++++++++++++++++++++++----------------------
>  2 files changed, 58 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do { \
>
>  struct sw_rx_bd {
>       struct sk_buff  *skb;
> -     DECLARE_PCI_UNMAP_ADDR(mapping)
> +     DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>
>  struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>
>  struct sw_rx_page {
>       struct page     *page;
> -     DECLARE_PCI_UNMAP_ADDR(mapping)
> +     DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>
>  union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index fa9275c..63a17d6 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -842,7 +842,7 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>       /* unmap first bd */
>       DP(BNX2X_MSG_OFF, "free bd_idx %d\n", bd_idx);
>       tx_start_bd = &fp->tx_desc_ring[bd_idx].start_bd;
> -     pci_unmap_single(bp->pdev, BD_UNMAP_ADDR(tx_start_bd),
> +     dma_unmap_single(&bp->pdev->dev, BD_UNMAP_ADDR(tx_start_bd),
>                        BD_UNMAP_LEN(tx_start_bd), PCI_DMA_TODEVICE);
>
>       nbd = le16_to_cpu(tx_start_bd->nbd) - 1;
> @@ -872,8 +872,8 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>
>               DP(BNX2X_MSG_OFF, "free frag bd_idx %d\n", bd_idx);
>               tx_data_bd = &fp->tx_desc_ring[bd_idx].reg_bd;
> -             pci_unmap_page(bp->pdev, BD_UNMAP_ADDR(tx_data_bd),
> -                            BD_UNMAP_LEN(tx_data_bd), PCI_DMA_TODEVICE);
> +             dma_unmap_page(&bp->pdev->dev, BD_UNMAP_ADDR(tx_data_bd),
> +                            BD_UNMAP_LEN(tx_data_bd), DMA_TO_DEVICE);
>               if (--nbd)
>                       bd_idx = TX_BD(NEXT_TX_IDX(bd_idx));
>       }
> @@ -1086,7 +1086,7 @@ static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
>       if (!page)
>               return;
>
> -     pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> +     dma_unmap_page(&bp->pdev->dev, dma_unmap_addr(sw_buf, mapping),
>                      SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
>       __free_pages(page, PAGES_PER_SGE_SHIFT);
>
> @@ -1115,15 +1115,15 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
>       if (unlikely(page == NULL))
>               return -ENOMEM;
>
> -     mapping = pci_map_page(bp->pdev, page, 0, SGE_PAGE_SIZE*PAGES_PER_SGE,
> -                            PCI_DMA_FROMDEVICE);
> +     mapping = dma_map_page(&bp->pdev->dev, page, 0,
> +                            SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
>       if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>               __free_pages(page, PAGES_PER_SGE_SHIFT);
>               return -ENOMEM;
>       }
>
>       sw_buf->page = page;
> -     pci_unmap_addr_set(sw_buf, mapping, mapping);
> +     dma_unmap_addr_set(sw_buf, mapping, mapping);
>
>       sge->addr_hi = cpu_to_le32(U64_HI(mapping));
>       sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1143,15 +1143,15 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
>       if (unlikely(skb == NULL))
>               return -ENOMEM;
>
> -     mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_size,
> -                              PCI_DMA_FROMDEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data, bp->rx_buf_size,
> +                              DMA_FROM_DEVICE);
>       if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
>               dev_kfree_skb(skb);
>               return -ENOMEM;
>       }
>
>       rx_buf->skb = skb;
> -     pci_unmap_addr_set(rx_buf, mapping, mapping);
> +     dma_unmap_addr_set(rx_buf, mapping, mapping);
>
>       rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1173,13 +1173,13 @@ static void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
>       struct eth_rx_bd *cons_bd = &fp->rx_desc_ring[cons];
>       struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>
> -     pci_dma_sync_single_for_device(bp->pdev,
> -                                    pci_unmap_addr(cons_rx_buf, mapping),
> -                                    RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
> +     dma_sync_single_for_device(&bp->pdev->dev,
> +                                dma_unmap_addr(cons_rx_buf, mapping),
> +                                RX_COPY_THRESH, DMA_FROM_DEVICE);
>
>       prod_rx_buf->skb = cons_rx_buf->skb;
> -     pci_unmap_addr_set(prod_rx_buf, mapping,
> -                        pci_unmap_addr(cons_rx_buf, mapping));
> +     dma_unmap_addr_set(prod_rx_buf, mapping,
> +                        dma_unmap_addr(cons_rx_buf, mapping));
>       *prod_bd = *cons_bd;
>  }
>
> @@ -1283,9 +1283,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
>
>       /* move empty skb from pool to prod and map it */
>       prod_rx_buf->skb = fp->tpa_pool[queue].skb;
> -     mapping = pci_map_single(bp->pdev, fp->tpa_pool[queue].skb->data,
> -                              bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> -     pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> +     mapping = dma_map_single(&bp->pdev->dev, fp->tpa_pool[queue].skb->data,
> +                              bp->rx_buf_size, DMA_FROM_DEVICE);
> +     dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>
>       /* move partial skb from cons to pool (don't unmap yet) */
>       fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,8 +1361,9 @@ static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>               }
>
>               /* Unmap the page as we r going to pass it to the stack */
> -             pci_unmap_page(bp->pdev, pci_unmap_addr(&old_rx_pg, mapping),
> -                           SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> +             dma_unmap_page(&bp->pdev->dev,
> +                            dma_unmap_addr(&old_rx_pg, mapping),
> +                            SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
>
>               /* Add one frag and update the appropriate fields in the skb */
>               skb_fill_page_desc(skb, j, old_rx_pg.page, 0, frag_len);
> @@ -1389,8 +1390,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
>       /* Unmap skb in the pool anyway, as we are going to change
>          pool entry status to BNX2X_TPA_STOP even if new skb allocation
>          fails. */
> -     pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> -                      bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> +     dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
> +                      bp->rx_buf_size, DMA_FROM_DEVICE);
>
>       if (likely(new_skb)) {
>               /* fix ip xsum and give it to the stack */
> @@ -1620,10 +1621,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>                               }
>                       }
>
> -                     pci_dma_sync_single_for_device(bp->pdev,
> -                                     pci_unmap_addr(rx_buf, mapping),
> -                                                    pad + RX_COPY_THRESH,
> -                                                    PCI_DMA_FROMDEVICE);
> +                     dma_sync_single_for_device(&bp->pdev->dev,
> +                                     dma_unmap_addr(rx_buf, mapping),
> +                                                pad + RX_COPY_THRESH,
> +                                                DMA_FROM_DEVICE);
>                       prefetch(skb);
>                       prefetch(((char *)(skb)) + 128);
>
> @@ -1665,10 +1666,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>
>                       } else
>                       if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
> -                             pci_unmap_single(bp->pdev,
> -                                     pci_unmap_addr(rx_buf, mapping),
> +                             dma_unmap_single(&bp->pdev->dev,
> +                                     dma_unmap_addr(rx_buf, mapping),
>                                                bp->rx_buf_size,
> -                                              PCI_DMA_FROMDEVICE);
> +                                              DMA_FROM_DEVICE);
>                               skb_reserve(skb, pad);
>                               skb_put(skb, len);
>
> @@ -4940,9 +4941,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
>               }
>
>               if (fp->tpa_state[i] == BNX2X_TPA_START)
> -                     pci_unmap_single(bp->pdev,
> -                                      pci_unmap_addr(rx_buf, mapping),
> -                                      bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> +                     dma_unmap_single(&bp->pdev->dev,
> +                                      dma_unmap_addr(rx_buf, mapping),
> +                                      bp->rx_buf_size, DMA_FROM_DEVICE);
>
>               dev_kfree_skb(skb);
>               rx_buf->skb = NULL;
> @@ -4978,7 +4979,7 @@ static void bnx2x_init_rx_rings(struct bnx2x *bp)
>                                       fp->disable_tpa = 1;
>                                       break;
>                               }
> -                             pci_unmap_addr_set((struct sw_rx_bd *)
> +                             dma_unmap_addr_set((struct sw_rx_bd *)
>                                                       &bp->fp->tpa_pool[i],
>                                                  mapping, 0);
>                               fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -5658,8 +5659,8 @@ static void bnx2x_nic_init(struct bnx2x *bp, u32 load_code)
>
>  static int bnx2x_gunzip_init(struct bnx2x *bp)
>  {
> -     bp->gunzip_buf = pci_alloc_consistent(bp->pdev, FW_BUF_SIZE,
> -                                           &bp->gunzip_mapping);
> +     bp->gunzip_buf = dma_alloc_coherent(&bp->pdev->dev, FW_BUF_SIZE,
> +                                         &bp->gunzip_mapping, GFP_KERNEL);
>       if (bp->gunzip_buf  == NULL)
>               goto gunzip_nomem1;
>
> @@ -5679,8 +5680,8 @@ gunzip_nomem3:
>       bp->strm = NULL;
>
>  gunzip_nomem2:
> -     pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> -                         bp->gunzip_mapping);
> +     dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> +                       bp->gunzip_mapping);
>       bp->gunzip_buf = NULL;
>
>  gunzip_nomem1:
> @@ -5696,8 +5697,8 @@ static void bnx2x_gunzip_end(struct bnx2x *bp)
>       bp->strm = NULL;
>
>       if (bp->gunzip_buf) {
> -             pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
> -                                 bp->gunzip_mapping);
> +             dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
> +                               bp->gunzip_mapping);
>               bp->gunzip_buf = NULL;
>       }
>  }
> @@ -6692,7 +6693,7 @@ static void bnx2x_free_mem(struct bnx2x *bp)
>  #define BNX2X_PCI_FREE(x, y, size) \
>       do { \
>               if (x) { \
> -                     pci_free_consistent(bp->pdev, size, x, y); \
> +                     dma_free_coherent(&bp->pdev->dev, size, x, y); \
>                       x = NULL; \
>                       y = 0; \
>               } \
> @@ -6773,7 +6774,7 @@ static int bnx2x_alloc_mem(struct bnx2x *bp)
>
>  #define BNX2X_PCI_ALLOC(x, y, size) \
>       do { \
> -             x = pci_alloc_consistent(bp->pdev, size, y); \
> +             x = dma_alloc_coherent(&bp->pdev->dev, size, y,
> GFP_KERNEL); \
>               if (x == NULL) \
>                       goto alloc_mem_err; \
>               memset(x, 0, size); \
> @@ -6906,9 +6907,9 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
>                       if (skb == NULL)
>                               continue;
>
> -                     pci_unmap_single(bp->pdev,
> -                                      pci_unmap_addr(rx_buf,
> mapping),
> -                                      bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
> +                     dma_unmap_single(&bp->pdev->dev,
> +                                      dma_unmap_addr(rx_buf,
> mapping),
> +                                      bp->rx_buf_size,
> DMA_FROM_DEVICE);
>
>                       rx_buf->skb = NULL;
>                       dev_kfree_skb(skb);
> @@ -10269,8 +10270,8 @@ static int bnx2x_run_loopback(struct
> bnx2x *bp, int loopback_mode, u8 link_up)
>
>       bd_prod = TX_BD(fp_tx->tx_bd_prod);
>       tx_start_bd = &fp_tx->tx_desc_ring[bd_prod].start_bd;
> -     mapping = pci_map_single(bp->pdev, skb->data,
> -                              skb_headlen(skb), PCI_DMA_TODEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data,
> +                              skb_headlen(skb), DMA_TO_DEVICE);
>       tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
>       tx_start_bd->nbd = cpu_to_le16(2); /* start + pbd */
> @@ -11316,8 +11317,8 @@ static netdev_tx_t
> bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
>               }
>       }
>
> -     mapping = pci_map_single(bp->pdev, skb->data,
> -                              skb_headlen(skb), PCI_DMA_TODEVICE);
> +     mapping = dma_map_single(&bp->pdev->dev, skb->data,
> +                              skb_headlen(skb), DMA_TO_DEVICE);
>
>       tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>       tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11374,8 +11375,9 @@ static netdev_tx_t
> bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
>               if (total_pkt_bd == NULL)
>                       total_pkt_bd =
> &fp->tx_desc_ring[bd_prod].reg_bd;
>
> -             mapping = pci_map_page(bp->pdev, frag->page,
> frag->page_offset,
> -                                    frag->size, PCI_DMA_TODEVICE);
> +             mapping = dma_map_page(&bp->pdev->dev, frag->page,
> +                                    frag->page_offset,
> +                                    frag->size, DMA_TO_DEVICE);
>
>               tx_data_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>               tx_data_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -11832,15 +11834,15 @@ static int __devinit
> bnx2x_init_dev(struct pci_dev *pdev,
>               goto err_out_release;
>       }
>
> -     if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
> +     if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) == 0) {
>               bp->flags |= USING_DAC_FLAG;
> -             if (pci_set_consistent_dma_mask(pdev,
> DMA_BIT_MASK(64)) != 0) {
> -                     pr_err("pci_set_consistent_dma_mask
> failed, aborting\n");
> +             if (dma_set_coherent_mask(&pdev->dev,
> DMA_BIT_MASK(64)) != 0) {
> +                     pr_err("dma_set_coherent_mask failed,
> aborting\n");
>                       rc = -EIO;
>                       goto err_out_release;
>               }
>
> -     } else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
> +     } else if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) != 0) {
>               pr_err("System does not support DMA, aborting\n");
>               rc = -EIO;
>               goto err_out_release;
> --
> 1.7.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: Jiri Pirko @ 2010-04-06  7:17 UTC (permalink / raw)
  To: David Miller; +Cc: yoshfuji, netdev
In-Reply-To: <20100406.001259.15002237.davem@davemloft.net>

Tue, Apr 06, 2010 at 09:12:59AM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jpirko@redhat.com>
>Date: Tue, 6 Apr 2010 09:09:23 +0200
>
>> Whoups, missed this bit. Thanks a lot.
>> 
>> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
>> 
>
>Applied, and patchwork doesn't know what "Rewieved-by" is so
>I fixed the typo and added it to the changelog :-)

Oh my :) Looks like I'm still sleeping...

^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: David Miller @ 2010-04-06  7:12 UTC (permalink / raw)
  To: jpirko; +Cc: yoshfuji, netdev
In-Reply-To: <20100406070922.GE2869@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Tue, 6 Apr 2010 09:09:23 +0200

> Whoups, missed this bit. Thanks a lot.
> 
> Rewieved-by: Jiri Pirko <jpirko@redhat.com>
> 

Applied, and patchwork doesn't know what "Rewieved-by" is so
I fixed the typo and added it to the changelog :-)

^ permalink raw reply

* Re: [PATCH] mac80211: Ensure initializing private mc_list in prepare_multicast().
From: Jiri Pirko @ 2010-04-06  7:09 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: davem, netdev
In-Reply-To: <201004050457.o354v6ec008492@94.43.138.210.xn.2iij.net>

Mon, Apr 05, 2010 at 05:59:30AM CEST, yoshfuji@linux-ipv6.org wrote:
>Fix kernel panic by NULL pointer dereference in the context of
>ieee80211_ops->prepare_multicast().
>
>This bug was introduced by commit 22bedad3c.. ("net: convert
>multicast list to list_head").
>
>Call __hw_addr_init() in ieee80211_alloc_hw() to initialize
>list_head of private device multicast list, like we do in
>bond_init().
>
>Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
>---
> net/mac80211/main.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
>diff --git a/net/mac80211/main.c b/net/mac80211/main.c
>index 84ad249..0b82cd2 100644
>--- a/net/mac80211/main.c
>+++ b/net/mac80211/main.c
>@@ -388,6 +388,9 @@ struct ieee80211_hw *ieee80211_alloc_hw(size_t priv_data_len,
> 	local->uapsd_max_sp_len = IEEE80211_DEFAULT_MAX_SP_LEN;
> 
> 	INIT_LIST_HEAD(&local->interfaces);
>+
>+	__hw_addr_init(&local->mc_list);
>+
> 	mutex_init(&local->iflist_mtx);
> 	mutex_init(&local->scan_mtx);
> 

Whoups, missed this bit. Thanks a lot.

Rewieved-by: Jiri Pirko <jpirko@redhat.com>

>-- 
>1.5.6.5
>

^ permalink raw reply

* RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Xin, Xiaohui @ 2010-04-06  6:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, mst@redhat.com,
	jdike@c2.user-mode-linux.org, davem@davemloft.net
In-Reply-To: <20100402085556.75a8ff7c@nehalam>


>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> 
>> The patch let host NIC driver to receive user space skb,
>> then the driver has chance to directly DMA to guest user
>> space buffers thru single ethX interface.
>> We want it to be more generic as a zero copy framework.
>> 
>> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
>> Signed-off-by: Zhao Yu <yzhao81@gmail.com>
>> Signed-off-by: Jeff Dike <jdike@c2.user-mode-linux.org>
>> ---
>> 
>> We consider 2 ways to utilize the user buffers, but are not sure which one
>> is better. Please give any comments.
>> 
>> One:    Modify __alloc_skb() function a bit, it can only allocate a
>>         structure of sk_buff, and the data pointer is pointing to a
>>         user buffer which is coming from a page constructor API.
>>         Then the shinfo of the skb is also from guest.
>>         When packet is received from hardware, the skb->data is filled
>>         directly by h/w. What we have done is in this way.
>> 
>>         Pros:   We can avoid any copy here.
>>         Cons:   Guest virtio-net driver needs to allocate skb as almost
>>                 the same method with the host NIC drivers, say the size
>>                 of netdev_alloc_skb() and the same reserved space in the
>>                 head of skb. Many NIC drivers are the same with guest and
>>                 ok for this. But some of the latest NIC drivers reserve special
>>                 room in skb head. To deal with it, we suggest to provide
>>                 a method in guest virtio-net driver to ask for parameter
>>                 we interest from the NIC driver when we know which device
>>                 we have bind to do zero-copy. Then we ask guest to do so.
>>                 Is that reasonable?
>> 
>> Two:    Modify driver to get user buffer allocated from a page constructor
>>         API(to substitute alloc_page()), the user buffer are used as payload
>>         buffers and filled by h/w directly when packet is received. Driver
>>         should associate the pages with skb (skb_shinfo(skb)->frags). For
>>         the head buffer side, let host allocates skb, and h/w fills it.
>>         After that, the data filled in host skb header will be copied into
>>         guest header buffer which is submitted together with the payload buffer.
>> 
>>         Pros:   We could less care the way how guest or host allocates their
>>                 buffers.
>>         Cons:   We still need a bit copy here for the skb header.
>> 
>> We are not sure which way is the better here. This is the first thing we want
>> to get comments from the community. We wish the modification to the network
>> part will be generic which not used by vhost-net backend only, but a user
>> application may use it as well when the zero-copy device may provides async
>> read/write operations later.
>> 
>> 
>> Thanks
>> Xiaohui

>How do you deal with the DoS problem of hostile user space app posting huge
>number of receives and never getting anything. 

That's a problem we are trying to deal with. It's critical for the long term.
Currently, we try to limit the number of pages it can pin, but we are not sure
what limit is reasonable.
For now, the buffers are submitted by the guest virtio-net driver, so it is safe
to some extent.
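
A minimal sketch of what such a pin limit could look like, written as plain
user-space C. The struct and both helpers (pin_account, pin_account_charge,
pin_account_uncharge) are hypothetical names used only for illustration; they
are not part of the posted patches, and the right value for the limit is still
the open question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device accounting; not taken from the patch set. */
struct pin_account {
	size_t pinned;	/* pages currently pinned */
	size_t limit;	/* maximum pages that may be pinned at once */
};

/* Reserve npages against the cap; refuse if the cap would be exceeded. */
static inline bool pin_account_charge(struct pin_account *acct, size_t npages)
{
	/* pinned never exceeds limit, so the subtraction cannot underflow */
	if (npages > acct->limit - acct->pinned)
		return false;
	acct->pinned += npages;
	return true;
}

/* Give pages back once a receive completes or is cancelled. */
static inline void pin_account_uncharge(struct pin_account *acct, size_t npages)
{
	acct->pinned -= (npages > acct->pinned) ? acct->pinned : npages;
}
```

A submitter that posts receives and never completes them would then simply
have later requests refused once the cap is reached.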

Thanks
Xiaohui

^ permalink raw reply

* RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-06  6:06 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, mst@redhat.com,
	jdike@c2.user-mode-linux.org, davem@davemloft.net
In-Reply-To: <1270252268.13897.14.camel@w-sridhar.beaverton.ibm.com>

Sridhar,

>> The idea is simple, just to pin the guest VM user space and then
>> let host NIC driver has the chance to directly DMA to it. 
>> The patches are based on vhost-net backend driver. We add a device
>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. KVM guest who use the
>> vhost-net backend may bind any ethX interface in the host side to
>> get copyless data transfer thru guest virtio-net frontend.

>What is the advantage of this approach compared to PCI-passthrough
>of the host NIC to the guest?

PCI-passthrough needs hardware support: some kind of IOMMU engine has to
translate guest physical addresses to host physical addresses.
Also, a PCI-passthrough device currently cannot be live migrated.

Zero-copy is a pure software solution. It doesn't need special hardware support.
In theory, it can support live migration.
 
>Does this require pinning of the entire guest memory? Or only the
>send/receive buffers?

We need only to pin the send/receive buffers.

Thanks
Xiaohui

>Thanks
>Sridhar
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to NIC driver for rx when
> a page constructor API is invoked. Means NICs can allocate user buffers
> from a page constructor. We add a hook in netif_receive_skb() function
> to intercept the incoming packets, and notify the zero-copy device.
> 
> For write, the zero-copy device may allocate a new host skb, put the
> payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> Here, we have ever considered 2 ways to utilize the page constructor
> API to dispense the user buffers.
> 
> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> 	structure of sk_buff, and the data pointer is pointing to a 
> 	user buffer which is coming from a page constructor API.
> 	Then the shinfo of the skb is also from guest.
> 	When packet is received from hardware, the skb->data is filled
> 	directly by h/w. What we have done is in this way.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> 		the same method with the host NIC drivers, say the size
> 		of netdev_alloc_skb() and the same reserved space in the
> 		head of skb. Many NIC drivers are the same with guest and
> 		ok for this. But some of the latest NIC drivers reserve special
> 		room in skb head. To deal with it, we suggest to provide
> 		a method in guest virtio-net driver to ask for parameter
> 		we interest from the NIC driver when we know which device 
> 		we have bind to do zero-copy. Then we ask guest to do so.
> 		Is that reasonable?
> 
> Two:	Modify driver to get user buffer allocated from a page constructor
> 	API(to substitute alloc_page()), the user buffer are used as payload
> 	buffers and filled by h/w directly when packet is received. Driver
> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> 	the head buffer side, let host allocates skb, and h/w fills it. 
> 	After that, the data filled in host skb header will be copied into
> 	guest header buffer which is submitted together with the payload buffer.
> 
> 	Pros:	We could less care the way how guest or host allocates their
> 		buffers.
> 	Cons:	We still need a bit copy here for the skb header.
> 
> We are not sure which way is the better here. This is the first thing we want
> to get comments from the community. We wish the modification to the network
> part will be generic which not used by vhost-net backend only, but a user
> application may use it as well when the zero-copy device may provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notification to
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But for simple
> test with netperf, we found bandwidth up and CPU % up too,
> but the bandwidth up ratio is much more than CPU % up ratio.
> 
> What we have not done yet:
> 	packet split support
> 	To support GRO
> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchronous mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not addressed yet in this time:
> 	the async write logging is not satisfied by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bandwidth will be from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%


^ permalink raw reply

* [PATCH v2] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-06  5:56 UTC (permalink / raw)
  To: davem, netdev, eric.dumazet

Version 2:
- added a u16 filler to pad rps_dev_flow structure
- define RPS_NO_CPU as 0xffff
- add inet_rps_save_rxhash helper function to copy skb's rxhash into inet_sk
- add a "voidflow" which can be used when get_rps_cpu does not return a flow (avoids some conditionals)
- use raw_smp_processor_id in rps_record_sock_flow, there is no requirement to
prevent preemption
---
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
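
The indexing step described above can be sketched in user-space C as follows.
The arithmetic mirrors the map->cpus[((u64) skb->rxhash * map->len) >> 32]
expression used in get_rps_cpu(); the helper names rps_scale_hash and
rps_pick_cpu are made up for this illustration and do not exist in the kernel.

```c
#include <stdint.h>

/*
 * Scale a 32-bit flow hash onto [0, len) without a modulo:
 * the result is floor(hash * len / 2^32).
 */
static inline uint32_t rps_scale_hash(uint32_t hash, uint32_t len)
{
	return (uint32_t)(((uint64_t)hash * len) >> 32);
}

/* Hypothetical helper: look up the CPU id for a hash in a table of len CPUs. */
static inline uint16_t rps_pick_cpu(const uint16_t *cpus, uint32_t len,
				    uint32_t hash)
{
	return cpus[rps_scale_hash(hash, len)];
}
```

Because the hash is scaled rather than masked, the table length does not have
to be a power of two, and every packet of a flow lands on the same entry as
long as the map is unchanged.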

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architectures and cache hierarchy.  Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU
   
bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not to change the masks too frequently.
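
For the flow table part of the patch, get_rps_cpu() only moves a flow to a new
CPU when the old CPU is unset, offline, or has already dequeued everything that
was enqueued for that flow. A rough user-space sketch of that test follows. The
function names are invented for illustration, cur_online stands in for
cpu_online(), and the cast assumes a 32-bit unsigned int with the usual
two's-complement conversion, as the kernel code does.

```c
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffff

/*
 * True when the queue head counter has reached or passed the tail value
 * recorded at the last enqueue, even if the unsigned counters wrapped.
 */
static inline bool rps_queue_drained(unsigned int input_queue_head,
				     unsigned int last_qtail)
{
	return (int)(input_queue_head - last_qtail) >= 0;
}

/* Decide whether a flow may move from cur_cpu to next_cpu. */
static inline bool rps_may_switch(uint16_t cur_cpu, uint16_t next_cpu,
				  bool cur_online,
				  unsigned int input_queue_head,
				  unsigned int last_qtail)
{
	if (cur_cpu == next_cpu)
		return false;	/* nothing to switch */
	return cur_cpu == RPS_NO_CPU || !cur_online ||
	       rps_queue_drained(input_queue_head, last_qtail);
}
```

Comparing the signed difference instead of the raw counters keeps the check
correct across counter wraparound, which is what preserves in-order delivery.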

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a343a21..09c8658 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,74 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+	u16 cpu;
+	u16 fill;
+	unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct work_struct free_work;
+	struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+	unsigned int mask;
+	u16 *ents;
+};
+
+#define RPS_NO_CPU 0xffff
+
+static inline void rps_set_sock_flow(struct rps_sock_flow_table *table,
+				     u32 hash, int cpu)
+{
+	if (table->ents && hash) {
+		unsigned int index = hash & table->mask;
+
+		if (table->ents[index] != cpu)
+			table->ents[index] = cpu;
+	}
+}
+
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+					u32 hash)
+{
+	/* We only give a hint, preemption can change our cpu under us */
+	rps_set_sock_flow(table, hash, raw_smp_processor_id());
+}
+
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+				       u32 hash)
+{
+	rps_set_sock_flow(table, hash, RPS_NO_CPU);
+}
+
+extern struct rps_sock_flow_table rps_sock_flow_table;
+
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 	struct rps_map *rps_map;
+	struct rps_dev_flow_table *rps_flow_table;
 	struct kobject kobj;
 	struct netdev_rx_queue *first;
 	atomic_t count;
 } ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1331,8 +1391,9 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 
 	/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83fd344..801cd63 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -21,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/jhash.h>
+#include <linux/netdevice.h>
 
 #include <net/flow.h>
 #include <net/sock.h>
@@ -101,6 +102,7 @@ struct rtable;
  * @uc_ttl - Unicast TTL
  * @inet_sport - Source port
  * @inet_id - ID counter for DF pkts
+ * @rxhash - flow hash received from netif layer
  * @tos - TOS
  * @mc_ttl - Multicasting TTL
  * @is_icsk - is this an inet_connection_sock?
@@ -124,6 +126,9 @@ struct inet_sock {
 	__u16			cmsg_flags;
 	__be16			inet_sport;
 	__u16			inet_id;
+#ifdef CONFIG_RPS
+	__u32			rxhash;
+#endif
 
 	struct ip_options	*opt;
 	__u8			tos;
@@ -219,4 +224,27 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
 	return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
 }
 
+static inline void inet_rps_record_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	rps_record_sock_flow(&rps_sock_flow_table, inet_sk(sk)->rxhash);
+#endif
+}
+
+static inline void inet_rps_reset_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	rps_reset_sock_flow(&rps_sock_flow_table, inet_sk(sk)->rxhash);
+#endif
+}
+
+static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+{
+#ifdef CONFIG_RPS
+	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
+		inet_rps_reset_flow(sk);
+		inet_sk(sk)->rxhash = rxhash;
+	}
+#endif
+}
 #endif	/* _INET_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..8f4c625 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <linux/pci.h>
+#include <linux/bootmem.h>
 
 #include "net-sysfs.h"
 
@@ -2202,22 +2203,28 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 #ifdef CONFIG_RPS
+/* One global table that all flow-based protocols share. */
+struct rps_sock_flow_table rps_sock_flow_table;
+EXPORT_SYMBOL(rps_sock_flow_table);
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
+ * rcu_read_lock must be held on entry.
  */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+		       struct rps_dev_flow **rflowp)
 {
 	struct ipv6hdr *ip6;
 	struct iphdr *ip;
 	struct netdev_rx_queue *rxqueue;
 	struct rps_map *map;
+	struct rps_dev_flow_table *flow_table;
 	int cpu = -1;
 	u8 ip_proto;
+	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
-	rcu_read_lock();
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2232,7 +2239,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 	} else
 		rxqueue = dev->_rx;
 
-	if (!rxqueue->rps_map)
+	if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
 		goto done;
 
 	if (skb->rxhash)
@@ -2284,9 +2291,47 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 		skb->rxhash = 1;
 
 got_hash:
+	flow_table = rcu_dereference(rxqueue->rps_flow_table);
+	if (flow_table && rps_sock_flow_table.ents) {
+		u16 next_cpu;
+		struct rps_dev_flow *rflow;
+
+		rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
+		tcpu = rflow->cpu;
+
+		next_cpu = rps_sock_flow_table.ents[skb->rxhash &
+		    rps_sock_flow_table.mask];
+
+		/*
+		 * If the desired CPU (where last recvmsg was done) is
+		 * different from current CPU (one in the rx-queue flow
+		 * table entry), switch if one of the following holds:
+		 *   - Current CPU is unset (equal to RPS_NO_CPU).
+		 *   - Current CPU is offline.
+		 *   - The current CPU's queue tail has advanced beyond the
+		 *     last packet that was enqueued using this table entry.
+		 *     This guarantees that all previous packets for the flow
+		 *     have been dequeued, thus preserving in order delivery.
+		 */
+		if (unlikely(tcpu != next_cpu) &&
+		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
+		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
+		      rflow->last_qtail)) >= 0)) {
+			tcpu = rflow->cpu = next_cpu;
+			if (tcpu != RPS_NO_CPU)
+				rflow->last_qtail = per_cpu(softnet_data,
+				    tcpu).input_queue_head;
+		}
+		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
+			*rflowp = rflow;
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
 	map = rcu_dereference(rxqueue->rps_map);
 	if (map) {
-		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;
@@ -2295,7 +2340,6 @@ got_hash:
 	}
 
 done:
-	rcu_read_unlock();
 	return cpu;
 }
 
@@ -2321,13 +2365,14 @@ static void trigger_softirq(void *data)
 	__napi_schedule(&queue->backlog);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_RPS */
 
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
  */
-static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
+			      unsigned int *qtail)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
@@ -2342,6 +2387,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 		if (queue->input_pkt_queue.qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
+#ifdef CONFIG_RPS
+			*qtail = queue->input_queue_head +
+			    queue->input_pkt_queue.qlen;
+#endif
 			rps_unlock(queue);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2356,11 +2405,10 @@ enqueue:
 
 				cpu_set(cpu, rcpus->mask[rcpus->select]);
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-			} else
-				__napi_schedule(&queue->backlog);
-#else
-			__napi_schedule(&queue->backlog);
+				goto enqueue;
+			}
 #endif
+			__napi_schedule(&queue->backlog);
 		}
 		goto enqueue;
 	}
@@ -2391,7 +2439,7 @@ enqueue:
 
 int netif_rx(struct sk_buff *skb)
 {
-	int cpu;
+	unsigned int qtail;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2401,14 +2449,24 @@ int netif_rx(struct sk_buff *skb)
 		net_timestamp(skb);
 
 #ifdef CONFIG_RPS
-	cpu = get_rps_cpu(skb->dev, skb);
-	if (cpu < 0)
-		cpu = smp_processor_id();
-#else
-	cpu = smp_processor_id();
+	{
+		struct rps_dev_flow voidflow, *rflow = &voidflow;
+		int cpu, err;
+
+		rcu_read_lock();
+
+		cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		if (cpu >= 0) {
+			err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+			rcu_read_unlock();
+			return err;
+		}
+
+		rcu_read_unlock();
+	}
 #endif
 
-	return enqueue_to_backlog(skb, cpu);
+	return enqueue_to_backlog(skb, smp_processor_id(), &qtail);
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2775,17 +2833,22 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	int cpu;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
+	int cpu, err;
 
-	cpu = get_rps_cpu(skb->dev, skb);
+	rcu_read_lock();
 
-	if (cpu < 0)
-		return __netif_receive_skb(skb);
-	else
-		return enqueue_to_backlog(skb, cpu);
-#else
-	return __netif_receive_skb(skb);
+	cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+	if (cpu >= 0) {
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+		rcu_read_unlock();
+		return err;
+	}
+
+	rcu_read_unlock();
 #endif
+	return __netif_receive_skb(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
@@ -2801,6 +2864,9 @@ static void flush_backlog(void *arg)
 		if (skb->dev == dev) {
 			__skb_unlink(skb, &queue->input_pkt_queue);
 			kfree_skb(skb);
+#ifdef CONFIG_RPS
+			queue->input_queue_head++;
+#endif
 		}
 	rps_unlock(queue);
 }
@@ -3124,6 +3190,9 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			local_irq_enable();
 			break;
 		}
+#ifdef CONFIG_RPS
+		queue->input_queue_head++;
+#endif
 		rps_unlock(queue);
 		local_irq_enable();
 
@@ -5487,8 +5556,12 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
+#ifdef CONFIG_RPS
+		oldsd->input_queue_head++;
+#endif
+	}
 
 	return NOTIFY_OK;
 }
@@ -5669,6 +5742,42 @@ static struct pernet_operations __net_initdata default_device_ops = {
 	.exit_batch = default_device_exit_batch,
 };
 
+
+#ifdef CONFIG_RPS
+static __initdata unsigned long rps_sock_flow_entries;
+
+static int __init set_rps_sock_flow_entries(char *str)
+{
+	if (str)
+		rps_sock_flow_entries = simple_strtoul(str, &str, 0);
+
+	return 0;
+}
+
+__setup("rps_flow_entries=", set_rps_sock_flow_entries);
+
+static int alloc_rps_sock_flow_entries(void)
+{
+	unsigned int i, hash_size;
+
+	if (!rps_sock_flow_entries)
+		return 0;
+
+	rps_sock_flow_table.ents =
+	    alloc_large_system_hash("RPS flow table", sizeof(u16),
+	    rps_sock_flow_entries, 0, 0, &hash_size, NULL, 0);
+	hash_size = 1 << hash_size;
+	rps_sock_flow_table.mask = hash_size - 1;
+	for (i = 0; i < hash_size; i++)
+		rps_sock_flow_table.ents[i] = RPS_NO_CPU;
+
+	printk(KERN_INFO "RPS: flow table configured with %u entries\n",
+	    hash_size);
+
+	return 0;
+}
+#endif
+
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
  *	unhooks any devices that fail to initialise (normally hardware not
@@ -5689,6 +5798,11 @@ static int __init net_dev_init(void)
 	if (dev_proc_init())
 		goto out;
 
+#ifdef CONFIG_RPS
+	if (alloc_rps_sock_flow_entries())
+		goto out;
+#endif
+
 	if (netdev_kobject_init())
 		goto out;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1e7fdd6..95863b2 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -600,22 +600,105 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	return len;
 }
 
+static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+					   struct rx_queue_attribute *attr,
+					   char *buf)
+{
+	struct rps_dev_flow_table *flow_table;
+	unsigned int val = 0;
+
+	rcu_read_lock();
+	flow_table = rcu_dereference(queue->rps_flow_table);
+	if (flow_table)
+		val = flow_table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, "%u\n", val);
+}
+
+static void rps_dev_flow_table_release_work(struct work_struct *work)
+{
+	struct rps_dev_flow_table *table = container_of(work,
+	    struct rps_dev_flow_table, free_work);
+
+	vfree(table);
+}
+
+static void rps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct rps_dev_flow_table *table = container_of(rcu,
+	    struct rps_dev_flow_table, rcu);
+
+	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
+	schedule_work(&table->free_work);
+}
+
+ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+				     struct rx_queue_attribute *attr,
+				     const char *buf, size_t len)
+{
+	unsigned int count;
+	char *endp;
+	struct rps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(rps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	count = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EINVAL;
+
+	if (count) {
+		int i;
+
+		count = roundup_pow_of_two(count);
+		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = count - 1;
+		for (i = 0; i < count; i++)
+			table->flows[i].cpu = RPS_NO_CPU;
+	} else
+		table = NULL;
+
+	spin_lock(&rps_dev_flow_lock);
+	old_table = queue->rps_flow_table;
+	rcu_assign_pointer(queue->rps_flow_table, table);
+	spin_unlock(&rps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, rps_dev_flow_table_release);
+
+	return len;
+}
+
 static struct rx_queue_attribute rps_cpus_attribute =
 	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
 
+
+static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
+	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
+	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+
 static struct attribute *rx_queue_default_attrs[] = {
 	&rps_cpus_attribute.attr,
+	&rps_dev_flow_table_cnt_attribute.attr,
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
-	struct rps_map *map = queue->rps_map;
 	struct netdev_rx_queue *first = queue->first;
 
-	if (map)
-		call_rcu(&map->rcu, rps_map_release);
+	if (queue->rps_map)
+		call_rcu(&queue->rps_map->rcu, rps_map_release);
+
+	if (queue->rps_flow_table)
+		call_rcu(&queue->rps_flow_table->rcu,
+		    rps_dev_flow_table_release);
 
 	if (atomic_dec_and_test(&first->count))
 		kfree(first);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 55e1190..eb6155a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -418,6 +418,8 @@ int inet_release(struct socket *sock)
 	if (sk) {
 		long timeout;
 
+		inet_rps_reset_flow(sk);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -714,6 +716,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -722,12 +726,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-
 static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 			     size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -737,6 +742,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 	return sock_no_sendpage(sock, page, offset, size, flags);
 }
 
+int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		 size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	inet_rps_record_flow(sk);
+
+	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
+				   flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(inet_recvmsg);
 
 int inet_shutdown(struct socket *sock, int how)
 {
@@ -866,7 +887,7 @@ const struct proto_ops inet_stream_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -893,7 +914,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -923,7 +944,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4df5f9..2f40fe0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1674,6 +1674,8 @@ process:
 
 	skb->dev = NULL;
 
+	inet_rps_save_rxhash(sk, skb->rxhash);
+
 	bh_lock_sock_nested(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
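
A note on the table sizing in store_rps_dev_flow_table_cnt() above: the
requested count is rounded up to a power of two and stored as a mask
(count - 1), and the show routine reports mask + 1. A minimal userspace
sketch of that arithmetic follows. The ex_ names are hypothetical
stand-ins, not kernel functions, and the hash & mask lookup is an
assumption about how the table is indexed, since the hunks quoted here
do not show the lookup itself.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's roundup_pow_of_two().
 * Returns 0 when n is 0 or too large to round up in 32 bits.
 */
static uint32_t ex_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	if (n == 0 || n > 0x80000000u)
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Map a flow hash to a table slot. mask must be table_size - 1 with
 * table_size a power of two, so the AND acts as a cheap modulo.
 */
static uint32_t ex_flow_index(uint32_t hash, uint32_t mask)
{
	return hash & mask;
}
```

Writing 1000 to rps_flow_cnt would therefore give a 1024-entry table
with mask 1023 under this scheme.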

^ permalink raw reply related

* RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Xin, Xiaohui @ 2010-04-06  5:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <20100404114028.GF3189@redhat.com>

Michael,
> >>> For the write logging, do you have a function at hand that we can
> >>> use to recompute the log? If so, I think I can use it to recompute
> >>> the log info when logging is suddenly enabled.
> >>> For the outstanding requests, do you mean all the user buffers that
> >>> were submitted before the logging ioctl changed? That may be a lot,
> >>> and some of them are still in NIC ring descriptors. Waiting for them
> >>> to finish may take some time. I think it is also reasonable for the
> >>> logging change to take effect just after the ioctl.

> >> The key point is that after the logging ioctl returns, any
> >> subsequent change to memory must be logged. It does not
> >> matter when the request was submitted; otherwise we will
> >> get memory corruption on migration.

> > The change to memory happens in vhost_add_used_and_signal(), right?
> > So after the ioctl returns, it is enough to recompute the log info for
> > the events in the async queue, since the ioctl and the write log
> > operations are all protected by vq->mutex.
> >
> > Thanks
> > Xiaohui

> Yes, I think this will work.

Thanks. Do you have a function at hand for recomputing the log info that
I can use? I vaguely remember you mentioning one some time ago.
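
The rule discussed above (once the logging ioctl has returned, every
later write to guest memory must be logged, whenever the request was
submitted) can be sketched in plain C as below. This is only an
illustration under stated assumptions: ex_dev_state, ex_pending_req and
ex_complete_req are hypothetical names, not vhost code, and the single
bool stands in for the VHOST_F_LOG_ALL feature bit checked under
vq->mutex.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued async RX completion. */
struct ex_pending_req {
	unsigned int log_entries; /* log info captured at submit time, always */
	size_t len;               /* bytes written into guest memory */
};

/* Hypothetical stand-in for device state guarded by one mutex. */
struct ex_dev_state {
	bool log_all;             /* mirrors the "log all writes" feature bit */
	size_t bytes_logged;      /* what a dirty log would have recorded */
};

/*
 * Complete one request. Whether to log is decided here, at completion
 * time, so a request submitted before logging was enabled is still
 * logged if it completes afterwards.
 */
static void ex_complete_req(struct ex_dev_state *dev,
			    const struct ex_pending_req *req)
{
	if (dev->log_all)
		dev->bytes_logged += req->len;
}
```

Because the log info is recorded unconditionally at submit time, flipping
log_all between submit and completion needs no recomputation in this
simplified model.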

> > Thanks
> > Xiaohui
> > 
> >  drivers/vhost/net.c   |  189 +++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/vhost/vhost.h |   10 +++
> >  2 files changed, 192 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 22d5fef..2aafd90 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -17,11 +17,13 @@
> >  #include <linux/workqueue.h>
> >  #include <linux/rcupdate.h>
> >  #include <linux/file.h>
> > +#include <linux/aio.h>
> >  
> >  #include <linux/net.h>
> >  #include <linux/if_packet.h>
> >  #include <linux/if_arp.h>
> >  #include <linux/if_tun.h>
> > +#include <linux/mpassthru.h>
> >  
> >  #include <net/sock.h>
> >  
> > @@ -47,6 +49,7 @@ struct vhost_net {
> >  	struct vhost_dev dev;
> >  	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> >  	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > +	struct kmem_cache       *cache;
> >  	/* Tells us whether we are polling a socket for TX.
> >  	 * We only do this when socket buffer fills up.
> >  	 * Protected by tx vq lock. */
> > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> >  	net->tx_poll_state = VHOST_NET_POLL_STARTED;
> >  }
> >  
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&vq->notify_lock, flags);
> > +	if (!list_empty(&vq->notifier)) {
> > +		iocb = list_first_entry(&vq->notifier,
> > +				struct kiocb, ki_list);
> > +		list_del(&iocb->ki_list);
> > +	}
> > +	spin_unlock_irqrestore(&vq->notify_lock, flags);
> > +	return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	struct vhost_log *vq_log = NULL;
> > +	int rx_total_len = 0;
> > +	int log, size;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	if (vq->receiver)
> > +		vq->receiver(vq);
> > +
> > +	vq_log = unlikely(vhost_has_feature(
> > +				&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, iocb->ki_nbytes);
> > +		log = (int)iocb->ki_user_data;
> > +		size = iocb->ki_nbytes;
> > +		rx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +		kmem_cache_free(net->cache, iocb);
> > +
> > +		if (unlikely(vq_log))
> > +			vhost_log_write(vq, vq_log, log, size);
> > +		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > +					struct vhost_virtqueue *vq)
> > +{
> > +	struct kiocb *iocb = NULL;
> > +	int tx_total_len = 0;
> > +
> > +	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > +		return;
> > +
> > +	while ((iocb = notify_dequeue(vq)) != NULL) {
> > +		vhost_add_used_and_signal(&net->dev, vq,
> > +				iocb->ki_pos, 0);
> > +		tx_total_len += iocb->ki_nbytes;
> > +
> > +		if (iocb->ki_dtor)
> > +			iocb->ki_dtor(iocb);
> > +
> > +		kmem_cache_free(net->cache, iocb);
> > +		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > +			vhost_poll_queue(&vq->poll);
> > +			break;
> > +		}
> > +	}
> > +}
> > +
> >  /* Expects to be always run from workqueue - which acts as
> >   * read-size critical section for our kind of RCU. */
> >  static void handle_tx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, s;
> >  	struct msghdr msg = {
> >  		.msg_name = NULL,
> > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> >  		tx_poll_stop(net);
> >  	hdr_size = vq->hdr_size;
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> >  		/* Skip header. TODO: support TSO. */
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> >  		msg.msg_iovlen = out;
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->ki_pos = head;
> > +			iocb->private = (void *)vq;
> > +		}
> > +
> >  		len = iov_length(vq->iov, out);
> >  		/* Sanity check */
> >  		if (!len) {
> > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> >  			break;
> >  		}
> >  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> > -		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > +		err = sock->ops->sendmsg(iocb, sock, &msg, len);
> >  		if (unlikely(err < 0)) {
> >  			vhost_discard_vq_desc(vq);
> >  			tx_poll_start(net, sock);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		if (err != len)
> >  			pr_err("Truncated TX packet: "
> >  			       " len %d != %zd\n", err, len);
> > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_tx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> >  static void handle_rx(struct vhost_net *net)
> >  {
> >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> >  	unsigned head, out, in, log, s;
> >  	struct vhost_log *vq_log;
> >  	struct msghdr msg = {
> > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> >  	int err;
> >  	size_t hdr_size;
> >  	struct socket *sock = rcu_dereference(vq->private_data);
> > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > +	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > +			vq->link_state == VHOST_VQ_LINK_SYNC))
> >  		return;
> >  
> >  	use_mm(net->dev.mm);
> > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> >  	vhost_disable_notify(vq);
> >  	hdr_size = vq->hdr_size;
> >  
> > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > +	/* In the async case, the simplest approach to write logging is
> > +	 * to always collect the log info and decide later whether to
> > +	 * actually log. That way logging can be enabled or disabled
> > +	 * while requests are still outstanding.
> > +	 */
> > +
> > +	vq_log = (unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ||
> > +		  vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> >  		vq->log : NULL;
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	for (;;) {
> >  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >  					 ARRAY_SIZE(vq->iov),
> > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> >  		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> >  		msg.msg_iovlen = in;
> >  		len = iov_length(vq->iov, in);
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > +			if (!iocb)
> > +				break;
> > +			iocb->private = vq;
> > +			iocb->ki_pos = head;
> > +			iocb->ki_user_data = log;
> > +		}
> >  		/* Sanity check */
> >  		if (!len) {
> >  			vq_err(vq, "Unexpected header len for RX: "
> > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> >  			       iov_length(vq->hdr, s), hdr_size);
> >  			break;
> >  		}
> > -		err = sock->ops->recvmsg(NULL, sock, &msg,
> > +
> > +		err = sock->ops->recvmsg(iocb, sock, &msg,
> >  					 len, MSG_DONTWAIT | MSG_TRUNC);
> >  		/* TODO: Check specific error and bomb out unless EAGAIN? */
> >  		if (err < 0) {
> >  			vhost_discard_vq_desc(vq);
> >  			break;
> >  		}
> > +
> > +		if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > +			continue;
> > +
> >  		/* TODO: Should check and handle checksum. */
> >  		if (err > len) {
> >  			pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> >  		}
> >  	}
> >  
> > +	handle_async_rx_events_notify(net, vq);
> > +
> >  	mutex_unlock(&vq->mutex);
> >  	unuse_mm(net->dev.mm);
> >  }
> >  
> > +
> >  static void handle_tx_kick(struct work_struct *work)
> >  {
> >  	struct vhost_virtqueue *vq;
> > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> >  	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> >  	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > +	n->cache = NULL;
> >  	return 0;
> >  }
> >  
> > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> >  	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> >  }
> >  
> > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > +{
> > +	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > +	struct kiocb *iocb = NULL;
> > +	if (n->cache) {
> > +		while ((iocb = notify_dequeue(vq)) != NULL)
> > +			kmem_cache_free(n->cache, iocb);
> > +		kmem_cache_destroy(n->cache);
> > +	}
> > +}
> > +
> >  static int vhost_net_release(struct inode *inode, struct file *f)
> >  {
> >  	struct vhost_net *n = f->private_data;
> > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >  	/* We do an extra flush before freeing memory,
> >  	 * since jobs can re-queue themselves. */
> >  	vhost_net_flush(n);
> > +	vhost_notifier_cleanup(n);
> >  	kfree(n);
> >  	return 0;
> >  }
> > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> >  	return sock;
> >  }
> >  
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > +	struct file *file = fget(fd);
> > +	struct socket *sock;
> > +	if (!file)
> > +		return ERR_PTR(-EBADF);
> > +	sock = mp_get_socket(file);
> > +	if (IS_ERR(sock))
> > +		fput(file);
> > +	return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> >  {
> >  	struct socket *sock;
> >  	if (fd == -1)
> > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> >  	sock = get_tun_socket(fd);
> >  	if (!IS_ERR(sock))
> >  		return sock;
> > +	sock = get_mp_socket(fd);
> > +	if (!IS_ERR(sock)) {
> > +		vq->link_state = VHOST_VQ_LINK_ASYNC;
> > +		return sock;
> > +	}
> >  	return ERR_PTR(-ENOTSOCK);
> >  }
> >  
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > +	struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > +	WARN_ON(!mutex_is_locked(&vq->mutex));
> > +	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > +		vq->receiver = NULL;
> > +		INIT_LIST_HEAD(&vq->notifier);
> > +		spin_lock_init(&vq->notify_lock);
> > +		if (!n->cache) {
> > +			n->cache = kmem_cache_create("vhost_kiocb",
> > +					sizeof(struct kiocb), 0,
> > +					SLAB_HWCACHE_ALIGN, NULL);
> > +		}
> > +	}
> > +}
> > +
> >  static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  {
> >  	struct socket *sock, *oldsock;
> > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	}
> >  	vq = n->vqs + index;
> >  	mutex_lock(&vq->mutex);
> > -	sock = get_socket(fd);
> > +	vq->link_state = VHOST_VQ_LINK_SYNC;
> > +	sock = get_socket(vq, fd);
> >  	if (IS_ERR(sock)) {
> >  		r = PTR_ERR(sock);
> >  		goto err;
> >  	}
> >  
> > +	vhost_init_link_state(n, index);
> > +
> >  	/* start polling new socket */
> >  	oldsock = vq->private_data;
> >  	if (sock == oldsock)
> > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> >  	vhost_net_disable_vq(n, vq);
> >  	rcu_assign_pointer(vq->private_data, sock);
> >  	vhost_net_enable_vq(n, vq);
> > -	mutex_unlock(&vq->mutex);
> >  done:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	if (oldsock) {
> >  		vhost_net_flush_vq(n, index);
> > @@ -516,6 +690,7 @@ done:
> >  	}
> >  	return r;
> >  err:
> > +	mutex_unlock(&vq->mutex);
> >  	mutex_unlock(&n->dev.mutex);
> >  	return r;
> >  }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..cffe39a 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> >  	u64 len;
> >  };
> >  
> > +enum vhost_vq_link_state {
> > +	VHOST_VQ_LINK_SYNC = 	0,
> > +	VHOST_VQ_LINK_ASYNC = 	1,
> > +};
> > +
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >  	struct vhost_dev *dev;
> > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> >  	/* Log write descriptors */
> >  	void __user *log_base;
> >  	struct vhost_log log[VHOST_NET_MAX_SG];
> > +	/* Differentiate async (zero-copy) sockets from normal ones */
> > +	enum vhost_vq_link_state link_state;
> > +	struct list_head notifier;
> > +	spinlock_t notify_lock;
> > +	void (*receiver)(struct vhost_virtqueue *);
> >  };
> >  
> >  struct vhost_dev {
> > -- 
> > 1.5.4.4

^ permalink raw reply

* Re: [RFC PATCH 1/2] netdev: buffer infrastructure to log network driver's information
From: Koki Sanagi @ 2010-04-06  5:43 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Miller, eric.dumazet, netdev, izumi.taku, kaneshige.kenji,
	jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	alexander.h.duyck, peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <20100406001034.GA2156@localhost.localdomain>

(2010/04/06 9:10), Neil Horman wrote:
> On Mon, Apr 05, 2010 at 12:31:55PM -0700, David Miller wrote:
>> From: Eric Dumazet<eric.dumazet@gmail.com>
>> Date: Mon, 05 Apr 2010 10:42:26 +0200
>>
>>> Le lundi 05 avril 2010 à 15:52 +0900, Koki Sanagi a écrit :
>>>> This patch implements buffer infrastructure under driver/net.
>>>> This buffer records information from network driver.
>>>>
>>>> Signed-off-by: Koki Sanagi<sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>>    drivers/net/Kconfig     |    8 +
>>>>    drivers/net/Makefile    |    1 +
>>>>    drivers/net/ndrvbuf.c   |  535 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>    include/linux/ndrvbuf.h |   57 +++++
>>>>    4 files changed, 601 insertions(+), 0 deletions(-)
>>>>
>>>
>>> Wow, 600 lines... thats what I call bloat...
>>
>> And we have all sorts of facilities for creating filesystem
>> streams and ring buffers of debug information.
>>
>> You could even hook into 'perf' to log and process these
>> events in probably like 12 lines of code.
>>
> I'm still having a hard time understanding why this approach is preferable to
> the previous approach you took using tracepoints.  Granted you can't get driver
> internal state as easily, but its generic and doesn't do...this.
> Neil
>
>
>

With this patch we can get the following information:

1. Whether the driver is operating normally
2. The Tx ring's state

The previous approach covers 1, but 2 needs hooks in the driver code, as
in this patch. With the ring state we can investigate the "Tx Unit Hung"
message, which indicates that processing of the tx descriptor ring is not
proceeding smoothly. When a countermeasure is applied to a system that
outputs "Tx Unit Hung", this state information can be used to evaluate
the countermeasure.
But what you say is true: this patch is not generic.
It may be good to rebase the previous approach to focus on 1,
and to consider 2 separately.



^ permalink raw reply

* RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-06  5:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@c2.user-mode-linux.org, yzhao81@gmail.com
In-Reply-To: <20100401110841.GE3323@redhat.com>

Michael,
>> 
>> For the DoS issue, I'm not sure what limit on the pages that
>> get_user_pages() can pin is reasonable. Should we compute it from
>> the bandwidth?

> There's a ulimit for locked memory. Can we use this, decreasing
> the value for rlimit array? We can do this when backend is
> enabled and re-increment when backend is disabled.

I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found that
its initial value is 0x10000. After a right shift by PAGE_SHIFT, that
is only 16 pages we can lock, which seems too small, since the guest
virtio-net driver may submit a lot of requests at one time.
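
To make the arithmetic concrete: 0x10000 bytes is 64 KiB, which is 16
pages when the page size is 4 KiB. The userspace sketch below does the
same conversion and queries the soft RLIMIT_MEMLOCK of the calling
process. The ex_ helper names are hypothetical, and the 4 KiB page size
in the worked figure is an assumption; the code itself asks sysconf().

```c
#include <sys/resource.h>
#include <unistd.h>

/* Convert a byte limit into whole pages (same result as >> PAGE_SHIFT). */
static unsigned long ex_limit_to_pages(unsigned long long limit_bytes,
				       unsigned long page_size)
{
	return (unsigned long)(limit_bytes / page_size);
}

/*
 * Soft RLIMIT_MEMLOCK of this process, in pages.
 * Returns -1 on error or when the limit is unlimited.
 */
static long ex_memlock_limit_pages(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;
	return (long)ex_limit_to_pages(rl.rlim_cur, (unsigned long)page_size);
}
```

With a 64 KiB soft limit and 4 KiB pages, ex_limit_to_pages() yields 16,
matching the figure above.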


Thanks
Xiaohui


^ permalink raw reply

* Re: [RFC PATCH 2/2] netdev: an usage example on igb
From: Koki Sanagi @ 2010-04-06  5:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, izumi.taku, kaneshige.kenji, davem, nhorman,
	jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	alexander.h.duyck, peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <1270456224.1971.15.camel@edumazet-laptop>

(2010/04/05 17:30), Eric Dumazet wrote:
> Le lundi 05 avril 2010 à 15:54 +0900, Koki Sanagi a écrit :
>> This patch is usage example of previous patch's buffer on igb.
>> The output is like below.
>>
>> # cat /sys/kernel/debug/ndrvbuf/igb-trace-0000\:03\:00.0/buffer
>> [  1] 50462.369207: clean_tx qidx=1 ntu=154->156
>> [  0] 50462.369241: clean_rx qidx=0 ntu=111->112
>> [  0] 50462.369250: xmit qidx=1 ntu=156->158
>> [  1] 50462.369256: clean_tx qidx=1 ntu=156->158
>> [  1] 50462.369342: clean_rx qidx=0 ntu=113->114
>> [  1] 50462.369439: clean_rx qidx=0 ntu=114->115
>>
>> This example outputs original print style, because it sets original print
>> function(igb_trace_read) when registered.
>>
>> register_ndrvbuf(buname, 1000000, igb_trace_read);
>>
>> If you set NULL to arg3, outputs by ndrvbuf default style.
>> If you set 0 to size(arg2), recording is disabled at first(but small buffer is
>> alloced).
>> When you set non-zero to size, recording becomes enabled.
>>
>> Signed-off-by: Koki Sanagi<sanagi.koki@jp.fujitsu.com>
>> ---
>>    drivers/net/igb/Makefile    |    2 +-
>>    drivers/net/igb/igb.h       |    1 +
>>    drivers/net/igb/igb_main.c  |   10 +++++-
>>    drivers/net/igb/igb_trace.c |   81 +++++++++++++++++++++++++++++++++++++++++++
>>    drivers/net/igb/igb_trace.h |   21 +++++++++++
>>    5 files changed, 113 insertions(+), 2 deletions(-)
>>
>
> This depends on NDRVBUF, yet I see no Kconfig change in this patch.
>
igb can work without ndrvbuf.
If the ndrvbuf module is not loaded, igb operates as it did originally.
So it does not depend on ndrvbuf.
  



^ permalink raw reply

* Re: [PATCH 1/2] TIPC: Updated topology subscriptionprotocol according to latest spec
From: Suryanarayana.Garlapati @ 2010-04-06  5:11 UTC (permalink / raw)
  To: jon.maloy, davem; +Cc: netdev, tipc-discussion, Jon
In-Reply-To: <1270516572-27789-1-git-send-email-jon.maloy@ericsson.com>

Hi Jon,
Is this patch portable to TIPC 1.7.x versions?

Regards
Surya

> -----Original Message-----
> From: Jon Maloy [mailto:jon.maloy@ericsson.com] 
> Sent: Tuesday, April 06, 2010 6:46 AM
> To: David Miller
> Cc: Maloy; netdev@vger.kernel.org; 
> tipc-discussion@lists.sourceforge.net; Jon
> Subject: [tipc-discussion] [PATCH 1/2] TIPC: Updated topology 
> subscriptionprotocol according to latest spec
> 
> ---
>  include/linux/tipc.h |   30 ++++++++++++------------------
>  net/tipc/core.c      |    2 +-
>  net/tipc/subscr.c    |   15 ++++++++++-----
>  3 files changed, 23 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/tipc.h b/include/linux/tipc.h 
> index 3d92396..9536d8a 100644
> --- a/include/linux/tipc.h
> +++ b/include/linux/tipc.h
> @@ -127,23 +127,17 @@ static inline unsigned int tipc_node(__u32 addr)
>   * TIPC topology subscription service definitions
>   */
>  
> -#define TIPC_SUB_PORTS     	0x01  	/* filter for port availability */
> -#define TIPC_SUB_SERVICE     	0x02  	/* filter for service availability */
> -#define TIPC_SUB_CANCEL         0x04    /* cancel a subscription */
> -#if 0
> -/* The following filter options are not currently implemented */
> -#define TIPC_SUB_NO_BIND_EVTS	0x04	/* filter out "publish" events */
> -#define TIPC_SUB_NO_UNBIND_EVTS	0x08	/* filter out "withdraw" events */
> -#define TIPC_SUB_SINGLE_EVT	0x10	/* expire after first event */
> -#endif
> +#define TIPC_SUB_SERVICE     	0x00  	/* Filter for service availability    */
> +#define TIPC_SUB_PORTS     	0x01  	/* Filter for port availability  */
> +#define TIPC_SUB_CANCEL         0x04    /* Cancel a subscription         */
>  
>  #define TIPC_WAIT_FOREVER	~0	/* timeout for permanent subscription */
>  
>  struct tipc_subscr {
> -	struct tipc_name_seq seq;	/* name sequence of interest */
> -	__u32 timeout;			/* subscription duration (in ms) */
> -        __u32 filter;   		/* bitmask of filter options */
> -	char usr_handle[8];		/* available for subscriber use */
> +	struct tipc_name_seq seq;	/* NBO. Name sequence of interest */
> +	__u32 timeout;			/* NBO. Subscription duration (in ms) */
> +        __u32 filter;   		/* NBO. Bitmask of filter options */
> +	char usr_handle[8];		/* Opaque. Available for subscriber use */
>  };
>  
>  #define TIPC_PUBLISHED		1	/* publication event */
> @@ -151,11 +145,11 @@ struct tipc_subscr {
>  #define TIPC_SUBSCR_TIMEOUT	3	/* subscription timeout event */
>  
>  struct tipc_event {
> -	__u32 event;			/* event type */
> -	__u32 found_lower;		/* matching name seq instances */
> -	__u32 found_upper;		/*    "      "    "     "      */
> -	struct tipc_portid port;	/* associated port */
> -	struct tipc_subscr s;		/* associated subscription */
> +	__u32 event;			/* NBO. Event type, as defined above */
> +	__u32 found_lower;		/* NBO. Matching name seq instances  */
> +	__u32 found_upper;		/*  "      "       "   "    "        */
> +	struct tipc_portid port;	/* NBO. Associated port              */
> +	struct tipc_subscr s;		/* Original, associated subscription */
>  };
>  };
>  
>  /*
> diff --git a/net/tipc/core.c b/net/tipc/core.c
> index 52c571f..4e84c84 100644
> --- a/net/tipc/core.c
> +++ b/net/tipc/core.c
> @@ -49,7 +49,7 @@
>  #include "config.h"
>  
>  
> -#define TIPC_MOD_VER "1.6.4"
> +#define TIPC_MOD_VER "2.0.0"
>  
>  #ifndef CONFIG_TIPC_ZONES
>  #define CONFIG_TIPC_ZONES 3
> diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
> index ff123e5..ab6eab4 100644
> --- a/net/tipc/subscr.c
> +++ b/net/tipc/subscr.c
> @@ -274,7 +274,7 @@ static void subscr_cancel(struct 
> tipc_subscr *s,  {
>  	struct subscription *sub;
>  	struct subscription *sub_temp;
> -	__u32 type, lower, upper;
> +	__u32 type, lower, upper, timeout, filter;
>  	int found = 0;
>  
>  	/* Find first matching subscription, exit if not found 
> */ @@ -282,12 +282,18 @@ static void subscr_cancel(struct 
> tipc_subscr *s,
>  	type = ntohl(s->seq.type);
>  	lower = ntohl(s->seq.lower);
>  	upper = ntohl(s->seq.upper);
> +	timeout = ntohl(s->timeout);
> +	filter = ntohl(s->filter) & ~TIPC_SUB_CANCEL;
>  
>  	list_for_each_entry_safe(sub, sub_temp, 
> &subscriber->subscription_list,
>  				 subscription_list) {
>  			if ((type == sub->seq.type) &&
>  			    (lower == sub->seq.lower) &&
> -			    (upper == sub->seq.upper)) {
> +			    (upper == sub->seq.upper) &&
> +			    (timeout == sub->timeout) &&
> +                            (filter == sub->filter) &&
> +                             !memcmp(s->usr_handle,sub->evt.s.usr_handle,
> +				     sizeof(s->usr_handle)) ){
>  				found = 1;
>  				break;
>  			}
> @@ -304,7 +310,7 @@ static void subscr_cancel(struct tipc_subscr *s,
>  		k_term_timer(&sub->timer);
>  		spin_lock_bh(subscriber->lock);
>  	}
> -	dbg("Cancel: removing sub %u,%u,%u from subscriber %x list\n",
> +	dbg("Cancel: removing sub %u,%u,%u from subscriber %p list\n",
>  	    sub->seq.type, sub->seq.lower, sub->seq.upper, subscriber);
>  	subscr_del(sub);
>  }
> @@ -352,8 +358,7 @@ static struct subscription *subscr_subscribe(struct tipc_subscr *s,
>  	sub->seq.upper = ntohl(s->seq.upper);
>  	sub->timeout = ntohl(s->timeout);
>  	sub->filter = ntohl(s->filter);
> -	if ((!(sub->filter & TIPC_SUB_PORTS) ==
> -	     !(sub->filter & TIPC_SUB_SERVICE)) ||
> +	if ((sub->filter && (sub->filter != TIPC_SUB_PORTS)) ||
>  	    (sub->seq.lower > sub->seq.upper)) {
>  		warn("Subscription rejected, illegal request\n");
>  		kfree(sub);
> --
> 1.5.4.3
> 
> 
> ------------------------------------------------------------------------------
> Download Intel® Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> _______________________________________________
> tipc-discussion mailing list
> tipc-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
> 

^ permalink raw reply

* [PATCH] net/irda: Add SuperH IrDA driver support
From: Kuninori Morimoto @ 2010-04-06  4:46 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller

This is a very simple driver for SuperH Mobile IrDA,
which supports SIR/MIR/FIR.
This patch adds only SIR support for now.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 drivers/net/irda/Kconfig   |    6 +
 drivers/net/irda/Makefile  |    1 +
 drivers/net/irda/sh_irda.c |  865 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 872 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/irda/sh_irda.c

diff --git a/drivers/net/irda/Kconfig b/drivers/net/irda/Kconfig
index af10e97..25bb2a0 100644
--- a/drivers/net/irda/Kconfig
+++ b/drivers/net/irda/Kconfig
@@ -397,5 +397,11 @@ config MCS_FIR
 	  To compile it as a module, choose M here: the module will be called
 	  mcs7780.
 
+config SH_IRDA
+	tristate "SuperH IrDA driver"
+	depends on IRDA && ARCH_SHMOBILE
+	help
+	  Say Y here if you want to enable SuperH IrDA devices.
+
 endmenu
 
diff --git a/drivers/net/irda/Makefile b/drivers/net/irda/Makefile
index e030d47..dfc6453 100644
--- a/drivers/net/irda/Makefile
+++ b/drivers/net/irda/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_VIA_FIR)		+= via-ircc.o
 obj-$(CONFIG_PXA_FICP)	        += pxaficp_ir.o
 obj-$(CONFIG_MCS_FIR)	        += mcs7780.o
 obj-$(CONFIG_AU1000_FIR)	+= au1k_ir.o
+obj-$(CONFIG_SH_IRDA)		+= sh_irda.o
 # SIR drivers
 obj-$(CONFIG_IRTTY_SIR)		+= irtty-sir.o	sir-dev.o
 obj-$(CONFIG_BFIN_SIR)		+= bfin_sir.o
diff --git a/drivers/net/irda/sh_irda.c b/drivers/net/irda/sh_irda.c
new file mode 100644
index 0000000..9a828b0
--- /dev/null
+++ b/drivers/net/irda/sh_irda.c
@@ -0,0 +1,865 @@
+/*
+ * SuperH IrDA Driver
+ *
+ * Copyright (C) 2010 Renesas Solutions Corp.
+ * Kuninori Morimoto <morimoto.kuninori@renesas.com>
+ *
+ * Based on sh_sir.c
+ * Copyright (C) 2009 Renesas Solutions Corp.
+ * Copyright 2006-2009 Analog Devices Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/*
+ * CAUTION
+ *
+ * This driver is very simple.
+ * It does not yet support the following:
+ *  - MIR/FIR support
+ *  - DMA transfer support
+ *  - FIFO mode support
+ */
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/clk.h>
+#include <net/irda/wrapper.h>
+#include <net/irda/irda_device.h>
+
+#define DRIVER_NAME "sh_irda"
+
+#if defined(CONFIG_ARCH_SH7367) || defined(CONFIG_ARCH_SH7377)
+#define __IRDARAM_LEN	0x13FF
+#else
+#define __IRDARAM_LEN	0x1039
+#endif
+
+#define IRTMR		0x1F00 /* Transfer mode */
+#define IRCFR		0x1F02 /* Configuration */
+#define IRCTR		0x1F04 /* IR control */
+#define IRTFLR		0x1F20 /* Transmit frame length */
+#define IRTCTR		0x1F22 /* Transmit control */
+#define IRRFLR		0x1F40 /* Receive frame length */
+#define IRRCTR		0x1F42 /* Receive control */
+#define SIRISR		0x1F60 /* SIR-UART mode interrupt source */
+#define SIRIMR		0x1F62 /* SIR-UART mode interrupt mask */
+#define SIRICR		0x1F64 /* SIR-UART mode interrupt clear */
+#define SIRBCR		0x1F68 /* SIR-UART mode baud rate count */
+#define MFIRISR		0x1F70 /* MIR/FIR mode interrupt source */
+#define MFIRIMR		0x1F72 /* MIR/FIR mode interrupt mask */
+#define MFIRICR		0x1F74 /* MIR/FIR mode interrupt clear */
+#define CRCCTR		0x1F80 /* CRC engine control */
+#define CRCIR		0x1F86 /* CRC engine input data */
+#define CRCCR		0x1F8A /* CRC engine calculation */
+#define CRCOR		0x1F8E /* CRC engine output data */
+#define FIFOCP		0x1FC0 /* FIFO current pointer */
+#define FIFOFP		0x1FC2 /* FIFO follow pointer */
+#define FIFORSMSK	0x1FC4 /* FIFO receive status mask */
+#define FIFORSOR	0x1FC6 /* FIFO receive status OR */
+#define FIFOSEL		0x1FC8 /* FIFO select */
+#define FIFORS		0x1FCA /* FIFO receive status */
+#define FIFORFL		0x1FCC /* FIFO receive frame length */
+#define FIFORAMCP	0x1FCE /* FIFO RAM current pointer */
+#define FIFORAMFP	0x1FD0 /* FIFO RAM follow pointer */
+#define BIFCTL		0x1FD2 /* BUS interface control */
+#define IRDARAM		0x0000 /* IrDA buffer RAM */
+#define IRDARAM_LEN	__IRDARAM_LEN /* - 8/16/32 (read-only for 32) */
+
+/* IRTMR */
+#define TMD_MASK	(0x3 << 14) /* Transfer Mode */
+#define TMD_SIR		(0x0 << 14)
+#define TMD_MIR		(0x3 << 14)
+#define TMD_FIR		(0x2 << 14)
+
+#define FIFORIM		(1 << 8) /* FIFO receive interrupt mask */
+#define MIM		(1 << 4) /* MIR/FIR Interrupt Mask */
+#define SIM		(1 << 0) /* SIR Interrupt Mask */
+#define xIM_MASK	(FIFORIM | MIM | SIM)
+
+/* IRCFR */
+#define RTO_SHIFT	8 /* shift for Receive Timeout */
+#define RTO		(0x3 << RTO_SHIFT)
+
+/* IRTCTR */
+#define ARMOD		(1 << 15) /* Auto-Receive Mode */
+#define TE		(1 <<  0) /* Transmit Enable */
+
+/* IRRFLR */
+#define RFL_MASK	(0x1FFF) /* mask for Receive Frame Length */
+
+/* IRRCTR */
+#define RE		(1 <<  0) /* Receive Enable */
+
+/*
+ * SIRISR,  SIRIMR,  SIRICR,
+ * MFIRISR, MFIRIMR, MFIRICR
+ */
+#define FRE		(1 << 15) /* Frame Receive End */
+#define TROV		(1 << 11) /* Transfer Area Overflow */
+#define xIR_9		(1 << 9)
+#define TOT		xIR_9     /* for SIR     Timeout */
+#define ABTD		xIR_9     /* for MIR/FIR Abort Detection */
+#define xIR_8		(1 << 8)
+#define FER		xIR_8     /* for SIR     Framing Error */
+#define CRCER		xIR_8     /* for MIR/FIR CRC error */
+#define FTE		(1 << 7)  /* Frame Transmit End */
+#define xIR_MASK	(FRE | TROV | xIR_9 | xIR_8 | FTE)
+
+/* SIRBCR */
+#define BRC_MASK	(0x3F) /* mask for Baud Rate Count */
+
+/* CRCCTR */
+#define CRC_RST		(1 << 15) /* CRC Engine Reset */
+#define CRC_CT_MASK	0x0FFF    /* mask for CRC Engine Input Data Count */
+
+/* CRCIR */
+#define CRC_IN_MASK	0x0FFF    /* mask for CRC Engine Input Data */
+
+/************************************************************************
+
+
+			enum / structure
+
+
+************************************************************************/
+enum sh_irda_mode {
+	SH_IRDA_NONE = 0,
+	SH_IRDA_SIR,
+	SH_IRDA_MIR,
+	SH_IRDA_FIR,
+};
+
+struct sh_irda_self;
+struct sh_irda_xir_func {
+	int (*xir_fre)	(struct sh_irda_self *self);
+	int (*xir_trov)	(struct sh_irda_self *self);
+	int (*xir_9)	(struct sh_irda_self *self);
+	int (*xir_8)	(struct sh_irda_self *self);
+	int (*xir_fte)	(struct sh_irda_self *self);
+};
+
+struct sh_irda_self {
+	void __iomem		*membase;
+	unsigned int		 irq;
+	struct clk		*clk;
+
+	struct net_device	*ndev;
+
+	struct irlap_cb		*irlap;
+	struct qos_info		qos;
+
+	iobuff_t		tx_buff;
+	iobuff_t		rx_buff;
+
+	enum sh_irda_mode	mode;
+	spinlock_t		lock;
+
+	struct sh_irda_xir_func	*xir_func;
+};
+
+/************************************************************************
+
+
+			common function
+
+
+************************************************************************/
+static void sh_irda_write(struct sh_irda_self *self, u32 offset, u16 data)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&self->lock, flags);
+	iowrite16(data, self->membase + offset);
+	spin_unlock_irqrestore(&self->lock, flags);
+}
+
+static u16 sh_irda_read(struct sh_irda_self *self, u32 offset)
+{
+	unsigned long flags;
+	u16 ret;
+
+	spin_lock_irqsave(&self->lock, flags);
+	ret = ioread16(self->membase + offset);
+	spin_unlock_irqrestore(&self->lock, flags);
+
+	return ret;
+}
+
+static void sh_irda_update_bits(struct sh_irda_self *self, u32 offset,
+			       u16 mask, u16 data)
+{
+	unsigned long flags;
+	u16 old, new;
+
+	spin_lock_irqsave(&self->lock, flags);
+	old = ioread16(self->membase + offset);
+	new = (old & ~mask) | data;
+	if (old != new)
+		iowrite16(new, self->membase + offset);
+	spin_unlock_irqrestore(&self->lock, flags);
+}
+
+/************************************************************************
+
+
+			mode function
+
+
+************************************************************************/
+/*=====================================
+ *
+ *		common
+ *
+ *=====================================*/
+static void sh_irda_rcv_ctrl(struct sh_irda_self *self, int enable)
+{
+	struct device *dev = &self->ndev->dev;
+
+	sh_irda_update_bits(self, IRRCTR, RE, enable ? RE : 0);
+	dev_dbg(dev, "recv %s\n", enable ? "enable" : "disable");
+}
+
+static int sh_irda_set_timeout(struct sh_irda_self *self, int interval)
+{
+	struct device *dev = &self->ndev->dev;
+
+	if (SH_IRDA_SIR != self->mode)
+		interval = 0;
+
+	if (interval < 0 || interval > 2) {
+		dev_err(dev, "unsupported timeout interval\n");
+		return -EINVAL;
+	}
+
+	sh_irda_update_bits(self, IRCFR, RTO, interval << RTO_SHIFT);
+	return 0;
+}
+
+static int sh_irda_set_baudrate(struct sh_irda_self *self, int baudrate)
+{
+	struct device *dev = &self->ndev->dev;
+	u16 val;
+
+	if (baudrate < 0)
+		return 0;
+
+	if (SH_IRDA_SIR != self->mode) {
+		dev_err(dev, "it is not SIR mode\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Baud rate (bits/s) =
+	 *   (48 MHz / 26) / ((baud rate counter value + 1) x 16)
+	 */
+	val = (48000000 / 26 / 16 / baudrate) - 1;
+	dev_dbg(dev, "baudrate = %d,  val = 0x%02x\n", baudrate, val);
+
+	sh_irda_update_bits(self, SIRBCR, BRC_MASK, val);
+
+	return 0;
+}
+
+static int xir_get_rcv_length(struct sh_irda_self *self)
+{
+	return RFL_MASK & sh_irda_read(self, IRRFLR);
+}
+
+/*=====================================
+ *
+ *		NONE MODE
+ *
+ *=====================================*/
+static int xir_fre(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	dev_err(dev, "none mode: frame recv\n");
+	return 0;
+}
+
+static int xir_trov(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	dev_err(dev, "none mode: buffer ram over\n");
+	return 0;
+}
+
+static int xir_9(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	dev_err(dev, "none mode: time over\n");
+	return 0;
+}
+
+static int xir_8(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	dev_err(dev, "none mode: framing error\n");
+	return 0;
+}
+
+static int xir_fte(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	dev_err(dev, "none mode: frame transmit end\n");
+	return 0;
+}
+
+static struct sh_irda_xir_func xir_func = {
+	.xir_fre	= xir_fre,
+	.xir_trov	= xir_trov,
+	.xir_9		= xir_9,
+	.xir_8		= xir_8,
+	.xir_fte	= xir_fte,
+};
+
+/*=====================================
+ *
+ *		MIR/FIR MODE
+ *
+ * MIR/FIR are not supported now
+ *=====================================*/
+static struct sh_irda_xir_func mfir_func = {
+	.xir_fre	= xir_fre,
+	.xir_trov	= xir_trov,
+	.xir_9		= xir_9,
+	.xir_8		= xir_8,
+	.xir_fte	= xir_fte,
+};
+
+/*=====================================
+ *
+ *		SIR MODE
+ *
+ *=====================================*/
+static int sir_fre(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	u16 data16;
+	u8  *data = (u8 *)&data16;
+	int len = xir_get_rcv_length(self);
+	int i, j;
+
+	if (len > IRDARAM_LEN)
+		len = IRDARAM_LEN;
+
+	dev_dbg(dev, "frame recv length = %d\n", len);
+
+	for (i = 0; i < len; i++) {
+		j = i % 2;
+		if (!j)
+			data16 = sh_irda_read(self, IRDARAM + i);
+
+		async_unwrap_char(self->ndev, &self->ndev->stats,
+				  &self->rx_buff, data[j]);
+	}
+	self->ndev->last_rx = jiffies;
+
+	sh_irda_rcv_ctrl(self, 1);
+
+	return 0;
+}
+
+static int sir_trov(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+
+	dev_err(dev, "buffer ram over\n");
+	sh_irda_rcv_ctrl(self, 1);
+	return 0;
+}
+
+static int sir_tot(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+
+	dev_err(dev, "time over\n");
+	sh_irda_set_baudrate(self, 9600);
+	sh_irda_rcv_ctrl(self, 1);
+	return 0;
+}
+
+static int sir_fer(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+
+	dev_err(dev, "framing error\n");
+	sh_irda_rcv_ctrl(self, 1);
+	return 0;
+}
+
+static int sir_fte(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+
+	dev_dbg(dev, "frame transmit end\n");
+	netif_wake_queue(self->ndev);
+
+	return 0;
+}
+
+static struct sh_irda_xir_func sir_func = {
+	.xir_fre	= sir_fre,
+	.xir_trov	= sir_trov,
+	.xir_9		= sir_tot,
+	.xir_8		= sir_fer,
+	.xir_fte	= sir_fte,
+};
+
+static void sh_irda_set_mode(struct sh_irda_self *self, enum sh_irda_mode mode)
+{
+	struct device *dev = &self->ndev->dev;
+	struct sh_irda_xir_func	*func;
+	const char *name;
+	u16 data;
+
+	switch (mode) {
+	case SH_IRDA_SIR:
+		name	= "SIR";
+		data	= TMD_SIR;
+		func	= &sir_func;
+		break;
+	case SH_IRDA_MIR:
+		name	= "MIR";
+		data	= TMD_MIR;
+		func	= &mfir_func;
+		break;
+	case SH_IRDA_FIR:
+		name	= "FIR";
+		data	= TMD_FIR;
+		func	= &mfir_func;
+		break;
+	default:
+		name = "NONE";
+		data = 0;
+		func = &xir_func;
+		break;
+	}
+
+	self->mode = mode;
+	self->xir_func = func;
+	sh_irda_update_bits(self, IRTMR, TMD_MASK, data);
+
+	dev_dbg(dev, "switch to %s mode\n", name);
+}
+
+/************************************************************************
+
+
+			irq function
+
+
+************************************************************************/
+static void sh_irda_set_irq_mask(struct sh_irda_self *self)
+{
+	u16 tmr_hole;
+	u16 xir_reg;
+
+	/* set all mask */
+	sh_irda_update_bits(self, IRTMR,   xIM_MASK, xIM_MASK);
+	sh_irda_update_bits(self, SIRIMR,  xIR_MASK, xIR_MASK);
+	sh_irda_update_bits(self, MFIRIMR, xIR_MASK, xIR_MASK);
+
+	/* clear irq */
+	sh_irda_update_bits(self, SIRICR,  xIR_MASK, xIR_MASK);
+	sh_irda_update_bits(self, MFIRICR, xIR_MASK, xIR_MASK);
+
+	switch (self->mode) {
+	case SH_IRDA_SIR:
+		tmr_hole	= SIM;
+		xir_reg		= SIRIMR;
+		break;
+	case SH_IRDA_MIR:
+	case SH_IRDA_FIR:
+		tmr_hole	= MIM;
+		xir_reg		= MFIRIMR;
+		break;
+	default:
+		tmr_hole	= 0;
+		xir_reg		= 0;
+		break;
+	}
+
+	/* open mask */
+	if (xir_reg) {
+		sh_irda_update_bits(self, IRTMR, tmr_hole, 0);
+		sh_irda_update_bits(self, xir_reg, xIR_MASK, 0);
+	}
+}
+
+static irqreturn_t sh_irda_irq(int irq, void *dev_id)
+{
+	struct sh_irda_self *self = dev_id;
+	struct sh_irda_xir_func	*func = self->xir_func;
+	u16 isr = sh_irda_read(self, SIRISR);
+
+	/* clear irq */
+	sh_irda_write(self, SIRICR, isr);
+
+	if (isr & FRE)
+		func->xir_fre(self);
+	if (isr & TROV)
+		func->xir_trov(self);
+	if (isr & xIR_9)
+		func->xir_9(self);
+	if (isr & xIR_8)
+		func->xir_8(self);
+	if (isr & FTE)
+		func->xir_fte(self);
+
+	return IRQ_HANDLED;
+}
+
+/************************************************************************
+
+
+			CRC function
+
+
+************************************************************************/
+static void sh_irda_crc_reset(struct sh_irda_self *self)
+{
+	sh_irda_write(self, CRCCTR, CRC_RST);
+}
+
+static void sh_irda_crc_add(struct sh_irda_self *self, u16 data)
+{
+	sh_irda_write(self, CRCIR, data & CRC_IN_MASK);
+}
+
+static u16 sh_irda_crc_cnt(struct sh_irda_self *self)
+{
+	return CRC_CT_MASK & sh_irda_read(self, CRCCTR);
+}
+
+static u16 sh_irda_crc_out(struct sh_irda_self *self)
+{
+	return sh_irda_read(self, CRCOR);
+}
+
+static int sh_irda_crc_init(struct sh_irda_self *self)
+{
+	struct device *dev = &self->ndev->dev;
+	int ret = -EIO;
+	u16 val;
+
+	sh_irda_crc_reset(self);
+
+	sh_irda_crc_add(self, 0xCC);
+	sh_irda_crc_add(self, 0xF5);
+	sh_irda_crc_add(self, 0xF1);
+	sh_irda_crc_add(self, 0xA7);
+
+	val = sh_irda_crc_cnt(self);
+	if (4 != val) {
+		dev_err(dev, "CRC count error %x\n", val);
+		goto crc_init_out;
+	}
+
+	val = sh_irda_crc_out(self);
+	if (0x51DF != val) {
+		dev_err(dev, "CRC result error %x\n", val);
+		goto crc_init_out;
+	}
+
+	ret = 0;
+
+crc_init_out:
+
+	sh_irda_crc_reset(self);
+	return ret;
+}
+
+/************************************************************************
+
+
+			iobuf function
+
+
+************************************************************************/
+static void sh_irda_remove_iobuf(struct sh_irda_self *self)
+{
+	kfree(self->rx_buff.head);
+
+	self->tx_buff.head = NULL;
+	self->tx_buff.data = NULL;
+	self->rx_buff.head = NULL;
+	self->rx_buff.data = NULL;
+}
+
+static int sh_irda_init_iobuf(struct sh_irda_self *self, int rxsize, int txsize)
+{
+	if (self->rx_buff.head ||
+	    self->tx_buff.head) {
+		dev_err(&self->ndev->dev, "iobuff already exists.\n");
+		return -EINVAL;
+	}
+
+	/* rx_buff */
+	self->rx_buff.head = kmalloc(rxsize, GFP_KERNEL);
+	if (!self->rx_buff.head)
+		return -ENOMEM;
+
+	self->rx_buff.truesize	= rxsize;
+	self->rx_buff.in_frame	= FALSE;
+	self->rx_buff.state	= OUTSIDE_FRAME;
+	self->rx_buff.data	= self->rx_buff.head;
+
+	/* tx_buff */
+	self->tx_buff.head	= self->membase + IRDARAM;
+	self->tx_buff.truesize	= IRDARAM_LEN;
+
+	return 0;
+}
+
+/************************************************************************
+
+
+			net_device_ops function
+
+
+************************************************************************/
+static int sh_irda_hard_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct sh_irda_self *self = netdev_priv(ndev);
+	struct device *dev = &self->ndev->dev;
+	int speed = irda_get_next_speed(skb);
+	int ret;
+
+	dev_dbg(dev, "hard xmit\n");
+
+	netif_stop_queue(ndev);
+	sh_irda_rcv_ctrl(self, 0);
+
+	ret = sh_irda_set_baudrate(self, speed);
+	if (ret < 0)
+		return ret;
+
+	self->tx_buff.len = 0;
+	if (skb->len) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&self->lock, flags);
+		self->tx_buff.len = async_wrap_skb(skb,
+						   self->tx_buff.head,
+						   self->tx_buff.truesize);
+		spin_unlock_irqrestore(&self->lock, flags);
+
+		if (self->tx_buff.len > self->tx_buff.truesize)
+			self->tx_buff.len = self->tx_buff.truesize;
+
+		sh_irda_write(self, IRTFLR, self->tx_buff.len);
+		sh_irda_write(self, IRTCTR, ARMOD | TE);
+	}
+
+	dev_kfree_skb(skb);
+
+	return 0;
+}
+
+static int sh_irda_ioctl(struct net_device *ndev, struct ifreq *ifreq, int cmd)
+{
+	/*
+	 * FIXME
+	 *
+	 * This function is needed for irda framework.
+	 * But nothing to do now
+	 */
+	return 0;
+}
+
+static struct net_device_stats *sh_irda_stats(struct net_device *ndev)
+{
+	struct sh_irda_self *self = netdev_priv(ndev);
+
+	return &self->ndev->stats;
+}
+
+static int sh_irda_open(struct net_device *ndev)
+{
+	struct sh_irda_self *self = netdev_priv(ndev);
+	int err;
+
+	clk_enable(self->clk);
+	err = sh_irda_crc_init(self);
+	if (err)
+		goto open_err;
+
+	sh_irda_set_mode(self, SH_IRDA_SIR);
+	sh_irda_set_timeout(self, 2);
+	sh_irda_set_baudrate(self, 9600);
+
+	self->irlap = irlap_open(ndev, &self->qos, DRIVER_NAME);
+	if (!self->irlap) {
+		err = -ENODEV;
+		goto open_err;
+	}
+
+	netif_start_queue(ndev);
+	sh_irda_rcv_ctrl(self, 1);
+	sh_irda_set_irq_mask(self);
+
+	dev_info(&ndev->dev, "opened\n");
+
+	return 0;
+
+open_err:
+	clk_disable(self->clk);
+
+	return err;
+}
+
+static int sh_irda_stop(struct net_device *ndev)
+{
+	struct sh_irda_self *self = netdev_priv(ndev);
+
+	/* Stop IrLAP */
+	if (self->irlap) {
+		irlap_close(self->irlap);
+		self->irlap = NULL;
+	}
+
+	netif_stop_queue(ndev);
+
+	dev_info(&ndev->dev, "stopped\n");
+
+	return 0;
+}
+
+static const struct net_device_ops sh_irda_ndo = {
+	.ndo_open		= sh_irda_open,
+	.ndo_stop		= sh_irda_stop,
+	.ndo_start_xmit		= sh_irda_hard_xmit,
+	.ndo_do_ioctl		= sh_irda_ioctl,
+	.ndo_get_stats		= sh_irda_stats,
+};
+
+/************************************************************************
+
+
+			platform_driver function
+
+
+************************************************************************/
+static int __devinit sh_irda_probe(struct platform_device *pdev)
+{
+	struct net_device *ndev;
+	struct sh_irda_self *self;
+	struct resource *res;
+	char clk_name[8];
+	int irq;
+	int err = -ENOMEM;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	irq = platform_get_irq(pdev, 0);
+	if (!res || irq < 0) {
+		dev_err(&pdev->dev, "Not enough platform resources.\n");
+		goto exit;
+	}
+
+	ndev = alloc_irdadev(sizeof(*self));
+	if (!ndev)
+		goto exit;
+
+	self = netdev_priv(ndev);
+	self->membase = ioremap_nocache(res->start, resource_size(res));
+	if (!self->membase) {
+		err = -ENXIO;
+		dev_err(&pdev->dev, "Unable to ioremap.\n");
+		goto err_mem_1;
+	}
+
+	err = sh_irda_init_iobuf(self, IRDA_SKB_MAX_MTU, IRDA_SIR_MAX_FRAME);
+	if (err)
+		goto err_mem_2;
+
+	snprintf(clk_name, sizeof(clk_name), "irda%d", pdev->id);
+	self->clk = clk_get(&pdev->dev, clk_name);
+	if (IS_ERR(self->clk)) {
+		dev_err(&pdev->dev, "cannot get clock \"%s\"\n", clk_name);
+		goto err_mem_3;
+	}
+
+	irda_init_max_qos_capabilies(&self->qos);
+
+	ndev->netdev_ops	= &sh_irda_ndo;
+	ndev->irq		= irq;
+
+	self->ndev			= ndev;
+	self->qos.baud_rate.bits	&= IR_9600; /* FIXME */
+	self->qos.min_turn_time.bits	= 1; /* 10 ms or more */
+	spin_lock_init(&self->lock);
+
+	irda_qos_bits_to_value(&self->qos);
+
+	err = register_netdev(ndev);
+	if (err)
+		goto err_mem_4;
+
+	platform_set_drvdata(pdev, ndev);
+
+	if (request_irq(irq, sh_irda_irq, IRQF_DISABLED, "sh_irda", self)) {
+		dev_warn(&pdev->dev, "Unable to attach sh_irda interrupt\n");
+		goto err_mem_4;
+	}
+
+	dev_info(&pdev->dev, "SuperH IrDA probed\n");
+
+	goto exit;
+
+err_mem_4:
+	clk_put(self->clk);
+err_mem_3:
+	sh_irda_remove_iobuf(self);
+err_mem_2:
+	iounmap(self->membase);
+err_mem_1:
+	free_netdev(ndev);
+exit:
+	return err;
+}
+
+static int __devexit sh_irda_remove(struct platform_device *pdev)
+{
+	struct net_device *ndev = platform_get_drvdata(pdev);
+	struct sh_irda_self *self = netdev_priv(ndev);
+
+	if (!self)
+		return 0;
+
+	unregister_netdev(ndev);
+	clk_put(self->clk);
+	sh_irda_remove_iobuf(self);
+	iounmap(self->membase);
+	free_netdev(ndev);
+	platform_set_drvdata(pdev, NULL);
+
+	return 0;
+}
+
+static struct platform_driver sh_irda_driver = {
+	.probe   = sh_irda_probe,
+	.remove  = __devexit_p(sh_irda_remove),
+	.driver  = {
+		.name = DRIVER_NAME,
+	},
+};
+
+static int __init sh_irda_init(void)
+{
+	return platform_driver_register(&sh_irda_driver);
+}
+
+static void __exit sh_irda_exit(void)
+{
+	platform_driver_unregister(&sh_irda_driver);
+}
+
+module_init(sh_irda_init);
+module_exit(sh_irda_exit);
+
+MODULE_AUTHOR("Kuninori Morimoto <morimoto.kuninori@renesas.com>");
+MODULE_DESCRIPTION("SuperH IrDA driver");
+MODULE_LICENSE("GPL");
-- 
1.6.3.3
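
For reference, the SIRBCR calculation in sh_irda_set_baudrate() can be
tried out in userspace. This is only a sketch under assumptions: the helper
names and the range check are hypothetical and not part of the patch; the
48 MHz / 26 / 16 divider chain and the 0x3F field mask are taken from the
driver. Note that the driver itself only rejects negative rates, so a rate
of 0 would divide by zero there; the sketch rejects it explicitly.

```c
#define IRDA_CLK_HZ	48000000	/* 48 MHz source clock, per the driver comment */
#define BRC_MASK	0x3F		/* SIRBCR baud rate count field, from the patch */

/*
 * Hypothetical helper: returns the SIRBCR count for a baud rate, or -1
 * when the rate is not positive or the count does not fit in the field.
 */
static int sir_baud_to_count(int baudrate)
{
	int val;

	if (baudrate <= 0)
		return -1;

	/* same integer arithmetic as the driver */
	val = (IRDA_CLK_HZ / 26 / 16 / baudrate) - 1;
	if (val < 0 || val > BRC_MASK)
		return -1;

	return val;
}

/* Inverse mapping: the rate a given count produces, rounded down. */
static int sir_count_to_baud(int val)
{
	return IRDA_CLK_HZ / 26 / ((val + 1) * 16);
}
```

With these helpers, 9600 baud maps to a count of 11, and a count of 11 maps
back to about 9615 bits/s, so the achieved rate is within roughly 0.2% of
the requested one.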


^ permalink raw reply related

* Re: net-next: 2.6.34-rc1 regression: panic when running diagnostic on interface with IPv6
From: Stephen Hemminger @ 2010-04-06  4:39 UTC (permalink / raw)
  To: emil.s.tantilov; +Cc: David Miller, netdev
In-Reply-To: <20100405.165317.89399272.davem@davemloft.net>

I can not reproduce this with current net-next and e1000e.

Please recheck that you are running the right code.

# ip addr show dev eth3
6: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:15:17:c3:a9:fb brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:0:f101::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::215:17ff:fec3:a9fb/64 scope link 
       valid_lft forever preferred_lft forever
# ip addr show dev eth3
6: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:15:17:c3:a9:fb brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.12/24 brd 192.168.1.255 scope global eth3
    inet6 2001:db8:0:f101::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::215:17ff:fec3:a9fb/64 scope link 
       valid_lft forever preferred_lft forever

# ethtool -t eth3
The test result is PASS
The test extra info:
Register test  (offline)	 0
Eeprom test    (offline)	 0
Interrupt test (offline)	 0
Loopback test  (offline)	 0
Link test   (on/offline)	 0

No failures, no backtrace, nothing?

$ git log net/ipv6/addrconf.c

commit 4b97efdf392563bf03b4917a0b5add2df65de39a
Author: Patrick McHardy <kaber@trash.net>
Date:   Fri Mar 26 20:27:49 2010 -0700

    net: fix netlink address dumping in IPv4/IPv6
    
    When a dump is interrupted at the last device in a hash chain and
    then continued, "idx" won't get incremented past s_idx, so s_ip_idx
    is not reset when moving on to the next device. This means of all
    following devices only the last n - s_ip_idx addresses are dumped.
    
    Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
    Signed-off-by: Patrick McHardy <kaber@trash.net>

commit b79d1d54cf0672f764402fe4711ef5306f917bd3
Author: David S. Miller <davem@davemloft.net>
Date:   Thu Mar 25 21:39:21 2010 -0700

    ipv6: Fix result generation in ipv6_get_ifaddr().
    
    Finishing naturally from hlist_for_each_entry(x, ...) does not result
    in 'x' being NULL.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b54c9b98bbfb4836b1f7441c5a9db24affd3c2e9
Author: David S. Miller <davem@davemloft.net>
Date:   Thu Mar 25 21:25:30 2010 -0700

    ipv6: Preserve pervious behavior in ipv6_link_dev_addr().
    
    Use list_add_tail() to get the behavior we had before
    the list_head conversion for ipv6 address lists.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
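
The ipv6_get_ifaddr() fix in the log above rests on a general point: after
a list walk runs to completion, the entry cursor still points at the last
element visited, so it cannot double as a "found" indicator. The following
userspace sketch illustrates that with a hand-rolled singly linked list. It
is an assumption-laden illustration, not the kernel's hlist macros; all
names in it (struct node, struct addr, entry_of, find_addr) are made up.

```c
#include <stddef.h>

struct node {
	struct node *next;
};

struct addr {
	int id;
	struct node link;
};

/* hypothetical stand-in for container_of(), fixed to struct addr */
#define entry_of(ptr) \
	((struct addr *)((char *)(ptr) - offsetof(struct addr, link)))

/* Returns the matching entry, or NULL when no entry has the given id. */
static struct addr *find_addr(struct node *head, int id)
{
	struct node *n;
	struct addr *a = NULL;
	struct addr *found = NULL;

	for (n = head; n; n = n->next) {
		a = entry_of(n);
		if (a->id == id) {
			found = a;
			break;
		}
	}

	/*
	 * On a non-empty list 'a' is non-NULL here even when nothing
	 * matched, so only 'found' may be returned.
	 */
	return found;
}
```

Returning 'a' instead of 'found' would hand back the last list element for
every lookup that misses.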

^ permalink raw reply

* [PATCH 2/2] net/irda: sh_sir: Modify iounmap wrong execution
From: Kuninori Morimoto @ 2010-04-06  4:43 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller
In-Reply-To: <u39z96vk2.wl%kuninori.morimoto.gx@renesas.com>

In sh_sir_probe(), iounmap() could be executed in the error path
even though self->membase was still NULL.
This patch fixes it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 drivers/net/irda/sh_sir.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/irda/sh_sir.c b/drivers/net/irda/sh_sir.c
index a4677b8..bdfefa0 100644
--- a/drivers/net/irda/sh_sir.c
+++ b/drivers/net/irda/sh_sir.c
@@ -720,7 +720,6 @@ static int __devinit sh_sir_probe(struct platform_device *pdev)
 	struct sh_sir_self *self;
 	struct resource *res;
 	char clk_name[8];
-	void __iomem *base;
 	unsigned int irq;
 	int err = -ENOMEM;
 
@@ -735,14 +734,14 @@ static int __devinit sh_sir_probe(struct platform_device *pdev)
 	if (!ndev)
 		goto exit;
 
-	base = ioremap_nocache(res->start, resource_size(res));
-	if (!base) {
+	self = netdev_priv(ndev);
+	self->membase = ioremap_nocache(res->start, resource_size(res));
+	if (!self->membase) {
 		err = -ENXIO;
 		dev_err(&pdev->dev, "Unable to ioremap.\n");
 		goto err_mem_1;
 	}
 
-	self = netdev_priv(ndev);
 	err = sh_sir_init_iobuf(self, IRDA_SKB_MAX_MTU, IRDA_SIR_MAX_FRAME);
 	if (err)
 		goto err_mem_2;
@@ -759,7 +758,6 @@ static int __devinit sh_sir_probe(struct platform_device *pdev)
 	ndev->netdev_ops	= &sh_sir_ndo;
 	ndev->irq		= irq;
 
-	self->membase			= base;
 	self->ndev			= ndev;
 	self->qos.baud_rate.bits	&= IR_9600; /* FIXME */
 	self->qos.min_turn_time.bits	= 1; /* 10 ms or more */
-- 
1.6.3.3
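
The ordering rule behind this fix can be sketched in userspace: each error
label undoes only what was acquired before the jump, and the mapping is
stored in the structure that the cleanup path reads. This is a minimal
sketch under assumptions; struct dev_sketch, probe_sketch(), unmap_sketch()
and the fail_step parameter are hypothetical, with malloc()/free() standing
in for ioremap_nocache()/iounmap().

```c
#include <errno.h>
#include <stdlib.h>

struct dev_sketch {
	void *membase;	/* stands in for the ioremap()ed region */
	void *rx_buf;	/* stands in for the rx iobuf */
};

/* counts releases so a caller can see what the error path undid */
static int unmap_calls;

static void unmap_sketch(void *p)
{
	unmap_calls++;
	free(p);
}

/* fail_step: 0 = succeed, 1 = mapping fails, 2 = buffer allocation fails */
static int probe_sketch(struct dev_sketch *self, int fail_step)
{
	int err = -ENOMEM;

	/* store the mapping where the cleanup path will look for it */
	self->membase = (fail_step == 1) ? NULL : malloc(16);
	if (!self->membase) {
		err = -ENXIO;
		goto err_mem_1;	/* nothing mapped yet, so skip the unmap */
	}

	self->rx_buf = (fail_step == 2) ? NULL : malloc(16);
	if (!self->rx_buf)
		goto err_mem_2;

	return 0;

err_mem_2:
	unmap_sketch(self->membase);
	self->membase = NULL;
err_mem_1:
	return err;
}
```

A mapping failure returns -ENXIO without calling the unmap helper, while a
later failure unmaps exactly once.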


^ permalink raw reply related

* [PATCH 1/2] net/irda: sh_sir: fixup err return value on sh_sir_open
From: Kuninori Morimoto @ 2010-04-06  4:43 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller
In-Reply-To: <u39z96vk2.wl%kuninori.morimoto.gx@renesas.com>

In sh_sir_open(), the err variable could be returned
without ever having been assigned a value.
This patch fixes it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 drivers/net/irda/sh_sir.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/irda/sh_sir.c b/drivers/net/irda/sh_sir.c
index d7c983d..761ed01 100644
--- a/drivers/net/irda/sh_sir.c
+++ b/drivers/net/irda/sh_sir.c
@@ -645,8 +645,10 @@ static int sh_sir_open(struct net_device *ndev)
 	sh_sir_set_baudrate(self, 9600);

 	self->irlap = irlap_open(ndev, &self->qos, DRIVER_NAME);
-	if (!self->irlap)
+	if (!self->irlap) {
+		err = -ENODEV;
 		goto open_err;
+	}

 	/*
 	 * Now enable the interrupt then start the queue
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH 0/2] net/irda: sh_sir: Bug fix patches
From: Kuninori Morimoto @ 2010-04-06  4:42 UTC (permalink / raw)
  To: netdev; +Cc: Samuel Ortiz, David S. Miller


Dear David

Kuninori Morimoto (2):
      net/irda: sh_sir: fixup err return value on sh_sir_open
      net/irda: sh_sir: Modify iounmap wrong execution

These 2 patches are bug fixes for the sh_sir driver.

Best regards
--
Kuninori Morimoto
 

^ permalink raw reply

* Re: [v2 Patch 3/3] bonding: make bonding support netpoll
From: Cong Wang @ 2010-04-06  4:38 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: linux-kernel, Matt Mackall, netdev, bridge, Andy Gospodarek,
	Neil Horman, Jeff Moyer, Stephen Hemminger, bonding-devel,
	Jay Vosburgh, David Miller
In-Reply-To: <4BBA9FDB.4040909@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 279 bytes --]

Cong Wang wrote:
> Before I try to reproduce it, could you please try to replace the 
> 'read_lock()'
> in slaves_support_netpoll() with 'read_lock_bh()'? (read_unlock() too) 
> Try if this helps.
> 

Confirmed. Please use the attached patch instead, for your testing.

Thanks!


[-- Attachment #2: bonding-support-netpoll.diff --]
[-- Type: text/x-patch, Size: 4225 bytes --]

Index: linux-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- linux-2.6.orig/drivers/net/bonding/bond_main.c
+++ linux-2.6/drivers/net/bonding/bond_main.c
@@ -59,6 +59,7 @@
 #include <linux/uaccess.h>
 #include <linux/errno.h>
 #include <linux/netdevice.h>
+#include <linux/netpoll.h>
 #include <linux/inetdevice.h>
 #include <linux/igmp.h>
 #include <linux/etherdevice.h>
@@ -430,7 +431,18 @@ int bond_dev_queue_xmit(struct bonding *
 	}
 
 	skb->priority = 1;
-	dev_queue_xmit(skb);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	if (bond->dev->priv_flags & IFF_IN_NETPOLL) {
+		struct netpoll *np = bond->dev->npinfo->netpoll;
+		slave_dev->npinfo = bond->dev->npinfo;
+		np->real_dev = np->dev = skb->dev;
+		slave_dev->priv_flags |= IFF_IN_NETPOLL;
+		netpoll_send_skb(np, skb);
+		slave_dev->priv_flags &= ~IFF_IN_NETPOLL;
+		np->dev = bond->dev;
+	} else
+#endif
+		dev_queue_xmit(skb);
 
 	return 0;
 }
@@ -1329,6 +1341,61 @@ static void bond_detach_slave(struct bon
 	bond->slave_cnt--;
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/*
+ * You must hold read lock of bond->lock before calling this.
+ */
+static bool slaves_support_netpoll(struct net_device *bond_dev)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	struct slave *slave;
+	int i = 0;
+	bool ret = true;
+
+	bond_for_each_slave(bond, slave, i) {
+		if ((slave->dev->priv_flags & IFF_DISABLE_NETPOLL)
+				|| !slave->dev->netdev_ops->ndo_poll_controller)
+			ret = false;
+	}
+	return i != 0 && ret;
+}
+
+static void bond_poll_controller(struct net_device *bond_dev)
+{
+	struct net_device *dev = bond_dev->npinfo->netpoll->real_dev;
+	if (dev != bond_dev)
+		netpoll_poll_dev(dev);
+}
+
+static void bond_netpoll_cleanup(struct net_device *bond_dev)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	struct slave *slave;
+	const struct net_device_ops *ops;
+	int i;
+
+	read_lock(&bond->lock);
+	bond_dev->npinfo = NULL;
+	bond_for_each_slave(bond, slave, i) {
+		if (slave->dev) {
+			ops = slave->dev->netdev_ops;
+			if (ops->ndo_netpoll_cleanup)
+				ops->ndo_netpoll_cleanup(slave->dev);
+			else
+				slave->dev->npinfo = NULL;
+		}
+	}
+	read_unlock(&bond->lock);
+}
+
+#else
+
+static void bond_netpoll_cleanup(struct net_device *bond_dev)
+{
+}
+
+#endif
+
 /*---------------------------------- IOCTL ----------------------------------*/
 
 static int bond_sethwaddr(struct net_device *bond_dev,
@@ -1735,6 +1802,18 @@ int bond_enslave(struct net_device *bond
 
 	bond_set_carrier(bond);
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	if (slaves_support_netpoll(bond_dev)) {
+		bond_dev->priv_flags &= ~IFF_DISABLE_NETPOLL;
+		if (bond_dev->npinfo)
+			slave_dev->npinfo = bond_dev->npinfo;
+	} else if (!(bond_dev->priv_flags & IFF_DISABLE_NETPOLL)) {
+		bond_dev->priv_flags |= IFF_DISABLE_NETPOLL;
+		pr_info("New slave device %s does not support netpoll\n",
+			slave_dev->name);
+		pr_info("Disabling netpoll support for %s\n", bond_dev->name);
+	}
+#endif
 	read_unlock(&bond->lock);
 
 	res = bond_create_slave_symlinks(bond_dev, slave_dev);
@@ -1929,6 +2008,17 @@ int bond_release(struct net_device *bond
 
 	netdev_set_master(slave_dev, NULL);
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	read_lock_bh(&bond->lock);
+	if (slaves_support_netpoll(bond_dev))
+		bond_dev->priv_flags &= ~IFF_DISABLE_NETPOLL;
+	read_unlock_bh(&bond->lock);
+	if (slave_dev->netdev_ops->ndo_netpoll_cleanup)
+		slave_dev->netdev_ops->ndo_netpoll_cleanup(slave_dev);
+	else
+		slave_dev->npinfo = NULL;
+#endif
+
 	/* close slave before restoring its mac address */
 	dev_close(slave_dev);
 
@@ -4448,6 +4538,10 @@ static const struct net_device_ops bond_
 	.ndo_vlan_rx_register	= bond_vlan_rx_register,
 	.ndo_vlan_rx_add_vid 	= bond_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_netpoll_cleanup	= bond_netpoll_cleanup,
+	.ndo_poll_controller	= bond_poll_controller,
+#endif
 };
 
 static void bond_setup(struct net_device *bond_dev)
@@ -4533,6 +4627,8 @@ static void bond_uninit(struct net_devic
 {
 	struct bonding *bond = netdev_priv(bond_dev);
 
+	bond_netpoll_cleanup(bond_dev);
+
 	/* Release the bonded slaves */
 	bond_release_all(bond_dev);
 

^ permalink raw reply

* Re: [PATCH] IPVS: replace sprintf to snprintf to avoid stack buffer overflow
From: Simon Horman @ 2010-04-06  3:26 UTC (permalink / raw)
  To: Changli Gao
  Cc: wzt.wzt, linux-kernel, wensong, Julian Anastasov, netdev,
	lvs-devel, Patrick McHardy
In-Reply-To: <k2l412e6f7f1004051958h7c50f439r6b7a84379c35cba5@mail.gmail.com>

On Tue, Apr 06, 2010 at 10:58:28AM +0800, Changli Gao wrote:
> On Tue, Apr 6, 2010 at 10:50 AM,  <wzt.wzt@gmail.com> wrote:
> > IPVS does not check the length of pp->name, so using sprintf can cause a stack buffer overflow.
> > struct ip_vs_protocol{} declares name as char *, so if a protocol is registered as:
> > struct ip_vs_protocol ip_vs_test = {
> >        .name =                 "aaaaaaaa....128...aaa",
> >        .debug_packet =         ip_vs_tcpudp_debug_packet,
> > };
> >
> > then calling ip_vs_tcpudp_debug_packet(), which does sprintf(buf, "%s TRUNCATED", pp->name),
> > will cause a stack buffer overflow.
> >
> 
> Long messages will be truncated instead of overflowing the buffer. We need to
> find a way to handle long messages elegantly.

It's really a corner case. In practice protocol modules don't have really
long names. And if one was merged that did, the buffer size could be increased
at that time.

So while I think it's reasonable to protect against something unexpected
in a protocol-module name crashing the system, especially as that
can be achieved without any real overhead, I don't think we need
to sanitise the output.
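
For illustration only, here is a minimal userspace sketch of the behaviour
discussed above: snprintf() never writes more than the given size, and its
return value tells the caller whether the output was cut short. The helper
name format_truncated_msg() and the buffer sizes are assumptions made for
this example; they are not taken from the IPVS source or from the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the IPVS code): format "<name> TRUNCATED"
 * into buf without ever writing more than size bytes.
 * Returns 1 if the output had to be cut short (or on error), 0 otherwise.
 */
static int format_truncated_msg(char *buf, size_t size, const char *name)
{
	int needed = snprintf(buf, size, "%s TRUNCATED", name);

	/*
	 * snprintf() returns the length the full string would have had,
	 * so needed >= size means the output did not fit.
	 */
	return needed < 0 || (size_t)needed >= size;
}
```

With a 16-byte buffer, a short name such as "TCP" fits and the helper
returns 0, while a 128-character name is cut to 15 characters plus the
terminating NUL and the helper returns 1. That matches the position above:
the overflow is prevented cheaply, and the truncated text is simply left
as it is.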

^ permalink raw reply

* Re: [PATCH] IPVS: replace sprintf to snprintf to avoid stack buffer overflow
From: Simon Horman @ 2010-04-06  3:22 UTC (permalink / raw)
  To: wzt.wzt
  Cc: linux-kernel, Wensong Zhang, Julian Anastasov, netdev, lvs-devel,
	Patrick McHardy
In-Reply-To: <20100406025020.GA2741@localhost.localdomain>

On Tue, Apr 06, 2010 at 10:50:20AM +0800, wzt.wzt@gmail.com wrote:
> IPVS not check the length of pp->name, use sprintf will cause stack buffer overflow.
> struct ip_vs_protocol{} declare name as char *, if register a protocol as:
> struct ip_vs_protocol ip_vs_test = {
>         .name =			"aaaaaaaa....128...aaa",
> 	.debug_packet =         ip_vs_tcpudp_debug_packet,
> };
> 
> when called ip_vs_tcpudp_debug_packet(), sprintf(buf, "%s TRUNCATED", pp->name); 
> will cause stack buffer overflow.
>
> Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>

I think that the simple answer is, don't do that.
But your patch seems entirely reasonable to me.

Acked-by: Simon Horman <horms@verge.net.au>

Patrick, please consider merging this.


^ permalink raw reply
