Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next-2.6] net: make sure rtnl is held in rtnl_fill_ifinfo()
From: Eric Dumazet @ 2011-05-18  9:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Commit e67f88dd12f610 (net: dont hold rtnl mutex during netlink dump
callbacks) added a problem in rtnl_fill_ifinfo() not being always called
with RTNL held, which is racy.

1) This patch extends rtnl_mutex helper functions so that :

rtnl_lock() is able to BUG_ON() if recursively called.
[This was only provided if LOCKDEP was on]

rtnl_is_locked() is able to check if current thread owns rtnl_mutex
[ASSERT_RTNL() gets this added feature too]

2) Add one ASSERT_RTNL() in rtnl_fill_ifinfo()

3) Make sure rtnl is held in rtnl_dump_ifinfo()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/rtnetlink.c |   19 ++++++++++++++++---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d2ba259..4e09f70 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -59,15 +59,19 @@ struct rtnl_link {
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
+static struct thread_info  *rtnl_owner;
 
 void rtnl_lock(void)
 {
+	BUG_ON(rtnl_owner == current_thread_info());
 	mutex_lock(&rtnl_mutex);
+	rtnl_owner = current_thread_info();
 }
 EXPORT_SYMBOL(rtnl_lock);
 
 void __rtnl_unlock(void)
 {
+	rtnl_owner = NULL;
 	mutex_unlock(&rtnl_mutex);
 }
 
@@ -80,13 +84,17 @@ EXPORT_SYMBOL(rtnl_unlock);
 
 int rtnl_trylock(void)
 {
-	return mutex_trylock(&rtnl_mutex);
+	int ret = mutex_trylock(&rtnl_mutex);
+
+	if (ret)
+		rtnl_owner = current_thread_info();
+	return ret;
 }
 EXPORT_SYMBOL(rtnl_trylock);
 
 int rtnl_is_locked(void)
 {
-	return mutex_is_locked(&rtnl_mutex);
+	return mutex_is_locked(&rtnl_mutex) && rtnl_owner == current_thread_info();
 }
 EXPORT_SYMBOL(rtnl_is_locked);
 
@@ -850,6 +858,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	struct nlattr *attr, *af_spec;
 	struct rtnl_af_ops *af_ops;
 
+	ASSERT_RTNL();
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
 	if (nlh == NULL)
 		return -EMSGSIZE;
@@ -1003,10 +1012,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net_device *dev;
 	struct hlist_head *head;
 	struct hlist_node *node;
+	int rtnl_was_locked = rtnl_is_locked();
 
 	s_h = cb->args[0];
 	s_idx = cb->args[1];
-
+	if (!rtnl_was_locked)
+		rtnl_lock();
 	rcu_read_lock();
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
@@ -1025,6 +1036,8 @@ cont:
 	}
 out:
 	rcu_read_unlock();
+	if (!rtnl_was_locked)
+		rtnl_unlock();
 	cb->args[1] = idx;
 	cb->args[0] = h;
 



^ permalink raw reply related

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-18  9:27 UTC (permalink / raw)
  To: netdev
In-Reply-To: <54ec5cd14e5e5c76aa06c2e6899299ce@visp.net.lb>

 On Wed, 18 May 2011 01:16:29 +0300, Denys Fedoryshchenko wrote:
> Just got recently. 32Bit, PPPoE NAS, shapers, firewall, NAT
> Kernel i mention in subject, 2.6.39-rc7-git11
> If required i can give more information
>
> sharanal (sorry for ugly name) is libpcap based traffic analyser,
> sure userspace
>
 Here is some info, i hope it will be a little useful

 (gdb)  l *(cleanup_once + 0x49)
 0xc02e85cc is in cleanup_once (include/linux/list.h:88).
 83       * This is only for internal list manipulation where we know
 84       * the prev/next entries already!
 85       */
 86      static inline void __list_del(struct list_head * prev, struct 
 list_head * next)
 87      {
 88              next->prev = prev;
 89              prev->next = next;
 90      }
 91
 92      /**

 (gdb)  l *(inet_getpeer + 0x2ab)
 0xc02e8ae8 is in inet_getpeer (net/ipv4/inetpeer.c:530).
 525             if (base->total >= inet_peer_threshold)
 526                     /* Remove one less-recently-used entry. */
 527                     cleanup_once(0, stack);
 528
 529             return p;
 530     }
 531
 532     static int compute_total(void)
 533     {
 534             return v4_peers.total + v6_peers.total;



^ permalink raw reply

* Re: packet received in a wrong rx-queue?
From: David Miller @ 2011-05-18  9:13 UTC (permalink / raw)
  To: Jon.Zhou; +Cc: e1000-devel, netdev
In-Reply-To: <4A6A2125329CFD4D8CC40C9E8ABCAB9F250D748939@MILEXCH2.ds.jdsu.net>

From: Jon Zhou <Jon.Zhou@jdsu.com>
Date: Wed, 18 May 2011 02:00:50 -0700

> There are 2 packets in the traffic
> 
> #1 create PDP context request, IPV4--UDP--GTP, src_ip=A and dst_ip=B,src_port=C,dst_port=D
> 
> #2 create PDP context response, IPV4--UDP--GTP,src_ip=B, dst_ip=A, src_port=D,dst_port=C
> 
> I suppose both of them will be received in same rx-queue but actually it doesn't

Well, for hardware flow steering, it's not expected to.

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18  9:06 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305675865.10756.63.camel@localhost.localdomain>

W dniu 18 maja 2011 01:44 użytkownik Shirley Ma <mashirle@us.ibm.com> napisał:
> On Wed, 2011-05-18 at 00:58 +0200, Michał Mirosław wrote:
>> W dniu 18 maja 2011 00:28 użytkownik Shirley Ma <mashirle@us.ibm.com>
>> napisał:
>> > On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > Looks like to use a new flag requires more time/work. I am
>> thinking
>> >> > whether we can just use HIGHDMA flag to enable zero-copy in
>> macvtap
>> >> to
>> >> > avoid the new flag for now since mavctap uses real NICs as lower
>> >> device?
>> >>
>> >> Is there any other restriction besides requiring driver to not
>> recycle
>> >> the skb? Are there any drivers that recycle TX skbs?
>> > Not more other restrictions, skb clone is OK. pskb_expand_head()
>> looks
>> > OK to me from code review.
>>
>> > Currently there is no drivers recycle TX skbs.
>>
>> So why do you require the target device to have some flags at all?
> We could use macvtap to check lower device HIGHDMA to enable zero-copy,
> but I am not sure whether it is sufficient. If it's sufficient then we
> don't need to use a new flag here. To be safe, it's better to use a new
> flag to enable each device who can pass zero-copy test.

>> Do I understand correctly, that this zero-copy feature is about
>> packets received from VMs?
> Yes, packets sent from VMs, and received in local host for TX zero-copy
> here.

What is the zero-copy test? On some arches the HIGHDMA is not needed
at all so might be not enabled on anything. It looks like the correct
test would be per-packet check of !illegal_highdma() or maybe
NETIF_F_SG as returned from harmonize_features(). For virtual devices
or other software forwarding this might lead to skb_linearize() in
some cases, but is it that bad?

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Wei Yongjun @ 2011-05-18  9:02 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Eric Dumazet, netdev, Vlad Yasevich
In-Reply-To: <1305707358.2983.14.camel@edumazet-laptop>


> Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
>> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>>> If you're removing items from this list, you must be a writer here, with
>>> exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
>> I could agree to some extend ... but strict RCU section IMO is needed here.
>> I can check this if the issue exists.
>>
> I can tell you for sure rcu_read_lock() is not needed here. Its only
> showing confusion from code's author.
>
> Please read Documentation/RCU/listRCU.txt for concise explanations,
> line 117.
>
>
>>> Therefore, I guess following code is better :
>>>
>>> list_for_each_entry(addr, &bp->address_list, list) {
>>>        list_del_rcu(&addr->list);
>>>        call_rcu(&addr->rcu, sctp_local_addr_free);
>>>        SCTP_DBG_OBJCNT_DEC(addr);
>>> }
>>>
>>> Then, why dont you fix sctp_bind_addr_clean() instead ?
>>>
>>> if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
>>> be protected as well.
>> The _clean() as claimed by Vlad is called many times from various places
>> in code and this could give a overhead. I guess Vlad would need to comment.
> I guess a full review of this code is needed. You'll have to prove
> sctp_bind_addr_clean() is always called after one RCU grace period if
> you want to leave it as is.
>
> You cant get RCU for free, some rules must be followed or you risk
> crashes.
>

fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
need to remove the socket from the port hash before empty the bind address list.
some thing like this, not test.

diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index c8cc24e..924d846 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
 
 	/* Cleanup. */
 	sctp_inq_free(&ep->base.inqueue);
-	sctp_bind_addr_free(&ep->base.bind_addr);
 
 	/* Remove and free the port */
 	if (sctp_sk(ep->base.sk)->bind_hash)
 		sctp_put_port(ep->base.sk);
 
+	sctp_bind_addr_free(&ep->base.bind_addr);
+
 	/* Give up our hold on the sock. */
 	if (ep->base.sk)
 		sock_put(ep->base.sk);



^ permalink raw reply related

* packet received in a wrong rx-queue?
From: Jon Zhou @ 2011-05-18  9:00 UTC (permalink / raw)
  To: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org

There are 2 packets in the traffic

#1 create PDP context request, IPV4--UDP--GTP, src_ip=A and dst_ip=B,src_port=C,dst_port=D

#2 create PDP context response, IPV4--UDP--GTP,src_ip=B, dst_ip=A, src_port=D,dst_port=C

I suppose both of them will be received in same rx-queue but actually it doesn't
Anything need to check?

ethtool -i eth5
driver: ixgbe
version: 3.3.9-NAPI
firmware-version: 0.9-3
bus-info: 0000:08:00.1

regards
jon





------------------------------------------------------------------------------
What Every C/C++ and Fortran developer Should Know!
Read this article and learn how Intel has extended the reach of its 
next-generation tools to help Windows* and Linux* C/C++ and Fortran 
developers boost performance applications - including clusters. 
http://p.sf.net/sfu/intel-dev2devmay
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:45 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

On Tue, May 17, 2011 at 10:14:34PM -0700, Shirley Ma wrote:
> Yes, fixed it already. I might change it to PAGE_SIZE for now since the
> small message sizes regression issue.

If that fixes it, let's do that for now.
-- 
MST

^ permalink raw reply

* Re: [PATCH V6 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:43 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305699383.10756.100.camel@localhost.localdomain>

On Tue, May 17, 2011 at 11:16:23PM -0700, Shirley Ma wrote:
> Hello Michael,
> 
> Here is the update the patch based on all of your review comments except
> the completion/wait for cleanup since I am worried about outstanding
> DMAs would prevent vhost from shutting down.

Don't see what you are worried about. Doing this in a timely fashion
is just one of the things driver commits to when it published the zcopy flag.
Otherwise guest networking will get stuck as entries in the tx ring
don't free up, which is also a problem. Let's just add a comment documenting this.

> I am sending out this for your review, and test it out later.

So let's use a completion to cleanly flush outstanding requests.
Any chance kref can be reused? It's not a must.

One other piece that is currently missing is flushing
requests on memory table update. Maybe we can stick some
info in high bits of the desc field for that?

> For error handling, I update macvtap.c so we can discard the desc even
> in zero-copy case.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>  drivers/vhost/net.c   |   42 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |   13 +++++++++++++
>  3 files changed, 103 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..e87a1f8 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,10 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_PEND 128
> +#define VHOST_GOODCOPY_LEN PAGE_SIZE
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -129,6 +133,7 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	struct skb_ubuf_info pend;
>  
>  	/* TODO: check that we are running from vhost_worker? */
>  	sock = rcu_dereference_check(vq->private_data, 1);
> @@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
>  	hdr_size = vq->vhost_hlen;
>  
>  	for (;;) {
> +		/* Release DMAs done buffers first */
> +		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
> +			vhost_zerocopy_signal_used(vq, false);
> +
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> @@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
>  				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>  				break;
>  			}
> +			/* If more outstanding DMAs, queue the work */
> +			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
>  			if (unlikely(vhost_enable_notify(vq))) {
>  				vhost_disable_notify(vq);
>  				continue;
> @@ -188,6 +203,30 @@ static void handle_tx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> +		/* use msg_control to pass vhost zerocopy ubuf info to skb */
> +		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> +			vq->heads[vq->upend_idx].id = head;
> +			if (len < VHOST_GOODCOPY_LEN)
> +				/* copy don't need to wait for DMA done */
> +				vq->heads[vq->upend_idx].len =
> +							VHOST_DMA_DONE_LEN;
> +			else {
> +				vq->heads[vq->upend_idx].len = len;
> +				pend.callback = vhost_zerocopy_callback;
> +				pend.arg = vq;
> +				pend.desc = vq->upend_idx;
> +				msg.msg_control = &pend;
> +				msg.msg_controllen = sizeof(pend);
> +			}
> +			atomic_inc(&vq->refcnt);
> +			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
> +			/* if upend_idx is full, then wait for free more */
> +			if (vq->upend_idx == vq->done_idx) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
> +		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> @@ -198,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..f4c2730 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->upend_idx = 0;
> +	vq->done_idx = 0;
> +	atomic_set(&vq->refcnt, 0);
>  }
>  
>  static int vhost_worker(void *data)
> @@ -385,16 +388,49 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
>  	return 0;
>  }
>  
> +/* In case of DMA done not in order in lower device driver for some reason.
> + * upend_idx is used to track end of used idx, done_idx is used to track head
> + * of used idx. Once lower device DMA done contiguously, we will signal KVM
> + * guest used idx.
> + */
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> +	int i, j = 0;
> +
> +	for (i = vq->done_idx; i != vq->upend_idx; i = i++ % UIO_MAXIOV) {
> +		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
> +			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
> +			vhost_add_used_and_signal(vq->dev, vq,
> +						  vq->heads[i].id, 0);
> +			++j;
> +		} else
> +			break;
> +	}
> +	if (j) {
> +		vq->done_idx = i;
> +		atomic_sub(j, &vq->refcnt);
> +	}
> +}
> +
>  /* Caller should have device mutex */
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> +	unsigned long begin = jiffies;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
>  			vhost_poll_stop(&dev->vqs[i].poll);
>  			vhost_poll_flush(&dev->vqs[i].poll);
>  		}
> +		/* Wait for all lower device DMAs done, then notify virtio_net
> +		 * or just notify it without waiting for all DMA done here ?
> +		 * in case of DMAs never done for some reason */
> +		if (atomic_read(&dev->vqs[i].refcnt)) {
> +			/* how long should we wait ? */
> +			msleep(1000);
> +			vhost_zerocopy_signal_used(&dev->vqs[i], true);
> +		}
>  		if (dev->vqs[i].error_ctx)
>  			eventfd_ctx_put(dev->vqs[i].error_ctx);
>  		if (dev->vqs[i].error)
> @@ -603,6 +639,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* clean up lower device outstanding DMAs, before setting ring */
> +	if (atomic_read(&vq->refcnt))
> +		vhost_zerocopy_signal_used(vq, true);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1416,3 +1456,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
>  		vq_err(vq, "Failed to enable notification at %p: %d\n",
>  		       &vq->used->flags, r);
>  }
> +
> +void vhost_zerocopy_callback(struct sk_buff *skb)
> +{
> +	int idx = skb_shinfo(skb)->ubuf.desc;
> +	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;


Let's do some sanity checking of idx here in case
there's a driver bug.

> +
> +	/* set len = 1 to mark this desc buffers done DMA */
> +	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..d0e7ac6 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,11 @@
>  #include <linux/virtio_ring.h>
>  #include <asm/atomic.h>
>  
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN	1
> +#define VHOST_DMA_CLEAR_LEN	0
> +
>  struct vhost_device;
>  
>  struct vhost_work;
> @@ -108,6 +113,12 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* vhost zerocopy support */
> +	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> +	/* last used idx for outstanding DMA zerocopy buffers */
> +	int upend_idx;
> +	/* first used idx for DMA done zerocopy buffers */
> +	int done_idx;
>  };
>  
>  struct vhost_dev {
> @@ -154,6 +165,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(struct sk_buff *skb);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
>  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> 
> 

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

On Tue, May 17, 2011 at 10:14:34PM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 00:28 +0300, Michael S. Tsirkin wrote:
> > On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> > > Resubmit the patch with most update. This patch passed some
> > > live-migration test against RHEL6.2. I will run more stress test w/i
> > > live migration.
> > > 
> > > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> > 
> > Cool. cleanup path needs a fix - are you use you can
> > not use kobj or some other existing refcounting?
> I will look at this.
> 
> > Is perf regressiion caused by tx ring overrun gone now?
> Nope.
> 
> > I added some comments about how we might be aqble
> > to complete requests out of order but it's not a must.
> Ok.
> 
> > > ---
> > > 
> > >  drivers/vhost/net.c   |   37 +++++++++++++++++++++++++++++++-
> > >  drivers/vhost/vhost.c |   55
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  drivers/vhost/vhost.h |   12 ++++++++++
> > >  3 files changed, 101 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 2f7c76a..6bd6e28 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -32,6 +32,9 @@
> > >   * Using this limit prevents one virtqueue from starving others. */
> > >  #define VHOST_NET_WEIGHT 0x80000
> > >  
> > > +/* MAX number of TX used buffers for outstanding zerocopy */
> > > +#define VHOST_MAX_ZEROCOPY_PEND 128 
> > > +
> > >  enum {
> > >       VHOST_NET_VQ_RX = 0,
> > >       VHOST_NET_VQ_TX = 1,
> > > @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> > >       int err, wmem;
> > >       size_t hdr_size;
> > >       struct socket *sock;
> > > +     struct skb_ubuf_info pend;
> > >  
> > >       /* TODO: check that we are running from vhost_worker? */
> > >       sock = rcu_dereference_check(vq->private_data, 1);
> > > @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> > >       hdr_size = vq->vhost_hlen;
> > >  
> > >       for (;;) {
> > > +             /* Release DMAs done buffers first */
> > > +             if (atomic_read(&vq->refcnt) >
> > VHOST_MAX_ZEROCOPY_PEND)
> > > +                     vhost_zerocopy_signal_used(vq, false);
> > > +
> > >               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >                                        ARRAY_SIZE(vq->iov),
> > >                                        &out, &in,
> > > @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> > >                               set_bit(SOCK_ASYNC_NOSPACE,
> > &sock->flags);
> > >                               break;
> > >                       }
> > > +                     /* If more outstanding DMAs, queue the work */
> > > +                     if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> > > +                         (atomic_read(&vq->refcnt) >
> > VHOST_MAX_ZEROCOPY_PEND)) {
> > > +                             tx_poll_start(net, sock);
> > > +                             set_bit(SOCK_ASYNC_NOSPACE,
> > &sock->flags);
> > > +                             break;
> > > +                     }
> > >                       if (unlikely(vhost_enable_notify(vq))) {
> > >                               vhost_disable_notify(vq);
> > >                               continue;
> > > @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> > >                              iov_length(vq->hdr, s), hdr_size);
> > >                       break;
> > >               }
> > > +             /* use msg_control to pass vhost zerocopy ubuf info to
> > skb */
> > > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> > > +                     vq->heads[vq->upend_idx].id = head;
> > > +                     if (len <= 128)
> > 
> > I thought we have a constant for that?
> 
> Yes, fixed it already. I might change it to PAGE_SIZE for now since the
> small message sizes regression issue.
> 
> > > +                             vq->heads[vq->upend_idx].len =
> > VHOST_DMA_DONE_LEN;
> > > +                     else {
> > > +                             vq->heads[vq->upend_idx].len = len;
> > > +                             pend.callback =
> > vhost_zerocopy_callback;
> > > +                             pend.arg = vq;
> > > +                             pend.desc = vq->upend_idx;
> > > +                             msg.msg_control = &pend;
> > > +                             msg.msg_controllen = sizeof(pend);
> > > +                     }
> > > +                     atomic_inc(&vq->refcnt);
> > > +                     vq->upend_idx = (vq->upend_idx + 1) %
> > UIO_MAXIOV;
> > 
> > Ok, so we deal with a cyclic ring apparently? What guarantees we don't
> > overrun it?
> 
> VHOST_MAX_PEND (which is 128) should prevent it from overrun normally.
> In the case of none DMAs can be done, the maximum entries are the ring
> size, which is much smaller or equal to UIO_MAXIOV for current
> implementation. 
> 
> Maybe I should add some condition check if it is overrun then exits?

If this can't happen, maybe add a comment why. A check + WARN_ON
can't hurt either.

> > 
> > > +             }
> > >               /* TODO: Check specific error and bomb out unless
> > ENOBUFS? */
> > >               err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > >               if (unlikely(err < 0)) {
> > > -                     vhost_discard_vq_desc(vq, 1);
> > > +                     if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > > +                             vhost_discard_vq_desc(vq, 1);
> > 
> > How are errors handled with zerocopy?
> 
> I thought about it before: if error after macvtap skb is allocated, then
> that skb has a callback which will set DMA done for add_used, if the
> error before macvtap skb is allocated, then it can reverse as copy case.
> To avoid the complexity check, I decided to not handle it.

I think we need to handle it, errors do happen.
macvtap should either don't invoke callback on error or return success.
Then it's simple.

> > 
> > >                       tx_poll_start(net, sock);
> > >                       break;
> > >               }
> > >               if (err != len)
> > >                       pr_debug("Truncated TX packet: "
> > >                                " len %d != %zd\n", err, len);
> > > -             vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > > +             if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > > +                     vhost_add_used_and_signal(&net->dev, vq, head,
> > 0);
> > >               total_len += len;
> > >               if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > >                       vhost_poll_queue(&vq->poll);
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 2ab2912..ce799d6 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev
> > *dev,
> > >       vq->call_ctx = NULL;
> > >       vq->call = NULL;
> > >       vq->log_ctx = NULL;
> > > +     vq->upend_idx = 0;
> > > +     vq->done_idx = 0;
> > > +     atomic_set(&vq->refcnt, 0);
> > >  }
> > >  
> > >  static int vhost_worker(void *data)
> > > @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct
> > vhost_dev *dev)
> > >                                              UIO_MAXIOV,
> > GFP_KERNEL);
> > >               dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log *
> > UIO_MAXIOV,
> > >                                         GFP_KERNEL);
> > > -             dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads
> > *
> > > +             dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads
> > *
> > >                                           UIO_MAXIOV, GFP_KERNEL);
> > 
> > Which fields need to be initialized actually?
> 
> Nope, already fixed it with kmalloc.
> 
> > >  
> > >               if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> > > @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev
> > *dev)
> > >       return 0;
> > >  }
> > >  
> > > +/* 
> > > +     comments
> > > +*/
> > 
> > Hmm.
> 
> Fixed it in previous patch already.
> 
> > > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> > shutdown)
> > > +{
> > > +     int i, j = 0;
> > > +
> > > +     i = vq->done_idx;
> > > +     while (i != vq->upend_idx) {
> > 
> > A for loop might be clearer.
> Ok.
> 
> > > +             if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) ||
> > shutdown) {
> > 
> > On shutdown, we signal all buffers used to the guest?
> > Why?
> 
> We signal all outstand DMAs in the case of driver has some DMA issue to
> prevent us from shutting down. I am afraid vhost cleanup could wait
> forever.

Yes but then
1. guest can reuse the buffers for something else, before device reads them
2. vhost-net module can go away
etc

IMO that this happens in finite time is one of the things driver should
guarantee when it sets the zcopy flags.

BTW we must flush these on memory table change too, right?

> > > +                     /* reset len = 0 */
> > 
> > comment not very helpful.
> > Could you explain what this does instead?
> > Or better use some constant instead of 0 ...
> 
> Fixed it already.
> 
> > > +                     vq->heads[i].len = 0;
> > > +                     i = (i + 1) % UIO_MAXIOV;
> > > +                     ++j;
> > > +             } else
> > > +                     break;
> > 
> > Hmm so if the 1st entry does not complete, you do not signal anything?
> 
> No used buffers to guest, should we signal still?

This is just myself noting that we still force used entries
to be in order. It's ok from correctness pov but likely
is at least part of the reason you still see
TX ring overruns.

> > > +     }
> > 
> > Looking at this loop, done idx is the consumer and used idx
> > is the producer, right?
> 
> Yes.
> 
> > > +     if (j) {
> > > +             /* comments */
> > 
> > Yes?
> > 
> > > +             if (i > vq->done_idx)
> > > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > j);
> > > +             else {
> > > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > > +                                      UIO_MAXIOV - vq->done_idx);
> > > +                     vhost_add_used_n(vq, vq->heads, i);
> > > +             }
> > > +             vq->done_idx = i;
> > > +             vhost_signal(vq->dev, vq);
> > > +             atomic_sub(j, &vq->refcnt);
> > 
> > Code will likely be simpler if you call vhost_add_used once for
> > each head in the first loop. Possibly add_used_signal might be
> > a good idea too.
> 
> Ok.
> 
> > > +     }
> > > +}
> > > +
> > >  /* Caller should have device mutex */
> > >  void vhost_dev_cleanup(struct vhost_dev *dev)
> > >  {
> > > @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >                       vhost_poll_stop(&dev->vqs[i].poll);
> > >                       vhost_poll_flush(&dev->vqs[i].poll);
> > >               }
> > > +             /* wait for all lower device DMAs done, then notify
> > guest */
> > > +             if (atomic_read(&dev->vqs[i].refcnt)) {
> > > +                     msleep(1000);
> > > +                     vhost_zerocopy_signal_used(&dev->vqs[i],
> > true);
> > > +             }
> > 
> > This needs to be fixed somehow. Use a completion object and wait
> > on it?
> 
> Worried about what if the driver has some DMAs issue, which would
> prevent vhost from shutting down.

This needs to be fixed in the driver then.

> > >               if (dev->vqs[i].error_ctx)
> > >                       eventfd_ctx_put(dev->vqs[i].error_ctx);
> > >               if (dev->vqs[i].error)
> > > @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev
> > *d, int ioctl, void __user *argp)
> > >  
> > >       mutex_lock(&vq->mutex);
> > >  
> > > +     /* force all lower device DMAs done */
> > > +     if (atomic_read(&vq->refcnt)) 
> > > +             vhost_zerocopy_signal_used(vq, true);
> > > +
> > >       switch (ioctl) {
> > >       case VHOST_SET_VRING_NUM:
> > >               /* Resizing ring with an active backend?
> > > @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct
> > vhost_virtqueue *vq)
> > >               vq_err(vq, "Failed to enable notification at %p: %d
> > \n",
> > >                      &vq->used->flags, r);
> > >  }
> > > +
> > > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > > +{
> > > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > > +
> > > +     /* set len = 1 to mark this desc buffers done DMA */
> > 
> > this comment can now go.
> 
> Yes, it's gone already.
> 
> > > +     vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> > > +}
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index b3363ae..8e3ecc7 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -13,6 +13,10 @@
> > >  #include <linux/virtio_ring.h>
> > >  #include <asm/atomic.h>
> > >  
> > > +/* This is for zerocopy, used buffer len is set to 1 when lower
> > device DMA
> > > + * done */
> > > +#define VHOST_DMA_DONE_LEN   1
> > > +
> > >  struct vhost_device;
> > >  
> > >  struct vhost_work;
> > > @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> > >       /* Log write descriptors */
> > >       void __user *log_base;
> > >       struct vhost_log *log;
> > > +     /* vhost zerocopy support */
> > > +     atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> > 
> > future enhancement idea: this is used apparently under vq lock
> > so no need for an atomic?
> 
> It is also used in skb vhost zerocopy callback.
> 
> > > +     /* copy of avail idx to monitor outstanding DMA zerocopy
> > buffers */
> > 
> > looking at code upend_idx seems to be calculated independently
> > of guest avail idx - could you clarify pls?
> 
> Yes, you are right. Should change it to: upend_idx is used to track
> vring ids for outstanding zero-copy DMA buffers?

mention producer-consumer index in a head array used as
a circular ring datastructure.

> > > +     int upend_idx;
> > > +     /* copy of used idx to monintor DMA done zerocopy buffers */
> > 
> > monitor
> 
> Ok.
> 
> > > +     int done_idx;
> > 
> > 
> > I think in reality these are just producer and consumer
> > in the head structure which for zero copy is used
> Yes.
> 
> > 
> > >  };
> > >  
> > >  struct vhost_dev {
> > > @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue
> > *);
> > >  
> > >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log
> > *log,
> > >                   unsigned int log_num, u64 len);
> > > +void vhost_zerocopy_callback(struct sk_buff *skb);
> > > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> > shutdown);
> > >  
> > >  #define vq_err(vq, fmt, ...) do {
> > \
> > >               pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > > 

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18  8:29 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: netdev, Vlad Yasevich
In-Reply-To: <BANLkTik75iqfgvWGKLJYu6E7mheOwOZX+A@mail.gmail.com>

Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:

> > If you're removing items from this list, you must be a writer here, with
> > exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
> 
> I could agree to some extend ... but strict RCU section IMO is needed here.
> I can check this if the issue exists.
> 

I can tell you for sure rcu_read_lock() is not needed here. Its only
showing confusion from code's author.

Please read Documentation/RCU/listRCU.txt for concise explanations,
line 117.


> > Therefore, I guess following code is better :
> >
> > list_for_each_entry(addr, &bp->address_list, list) {
> >        list_del_rcu(&addr->list);
> >        call_rcu(&addr->rcu, sctp_local_addr_free);
> >        SCTP_DBG_OBJCNT_DEC(addr);
> > }
> >
> > Then, why dont you fix sctp_bind_addr_clean() instead ?
> >
> > if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
> > be protected as well.
> 
> The _clean() as claimed by Vlad is called many times from various places
> in code and this could give a overhead. I guess Vlad would need to comment.

I guess a full review of this code is needed. You'll have to prove
sctp_bind_addr_clean() is always called after one RCU grace period if
you want to leave it as is.

You cant get RCU for free, some rules must be followed or you risk
crashes.




^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18  8:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1305704885.2983.4.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 09:01 +0200, Jacek Luczak a écrit :
>> During the sctp_close() call, we do not use rcu primitives to
>> destroy the address list attached to the endpoint.  At the same
>> time, we do the removal of addresses from this list before
>> attempting to remove the socket from the port hash
>>
>> As a result, it is possible for another process to find the socket
>> in the port hash that is in the process of being closed.  It then
>> proceeds to traverse the address list to find the conflict, only
>> to have that address list suddenly disappear without rcu() critical
>> section.
>>
>> This can result in a kernel crash with general protection fault or
>> kernel NULL pointer dereference.
>>
>> Fix issue by closing address list removal inside RCU critical
>> section.
>>
>> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
>> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>>
>> ---
>>  bind_addr.c |   12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index faf71d1..19d1329 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>  /* Dispose of an SCTP_bind_addr structure  */
>>  void sctp_bind_addr_free(struct sctp_bind_addr *bp)
>>  {
>> -       /* Empty the bind address list. */
>> -       sctp_bind_addr_clean(bp);
>> +       struct sctp_sockaddr_entry *addr;
>> +
>> +       /* Empty the bind address list inside RCU section. */
>> +       rcu_read_lock();
>> +       list_for_each_entry_rcu(addr, &bp->address_list, list) {
>> +               list_del_rcu(&addr->list);
>> +               call_rcu(&addr->rcu, sctp_local_addr_free);
>> +               SCTP_DBG_OBJCNT_DEC(addr);
>> +       }
>> +       rcu_read_unlock();
>>
>
> Sorry this looks odd.
>
> If you're removing items from this list, you must be a writer here, with
> exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.

I could agree to some extend ... but strict RCU section IMO is needed here.
I can check this if the issue exists.

> Therefore, I guess following code is better :
>
> list_for_each_entry(addr, &bp->address_list, list) {
>        list_del_rcu(&addr->list);
>        call_rcu(&addr->rcu, sctp_local_addr_free);
>        SCTP_DBG_OBJCNT_DEC(addr);
> }
>
> Then, why dont you fix sctp_bind_addr_clean() instead ?
>
> if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
> be protected as well.

The _clean() as claimed by Vlad is called many times from various places
in code and this could give a overhead. I guess Vlad would need to comment.

-Jacek

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18  7:48 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: netdev, Vlad Yasevich
In-Reply-To: <BANLkTikeWzuE-384uT6RhZR6Wn=DBK+CNQ@mail.gmail.com>

Le mercredi 18 mai 2011 à 09:01 +0200, Jacek Luczak a écrit :
> During the sctp_close() call, we do not use rcu primitives to
> destroy the address list attached to the endpoint.  At the same
> time, we do the removal of addresses from this list before
> attempting to remove the socket from the port hash
> 
> As a result, it is possible for another process to find the socket
> in the port hash that is in the process of being closed.  It then
> proceeds to traverse the address list to find the conflict, only
> to have that address list suddenly disappear without rcu() critical
> section.
> 
> This can result in a kernel crash with general protection fault or
> kernel NULL pointer dereference.
> 
> Fix issue by closing address list removal inside RCU critical
> section.
> 
> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> 
> ---
>  bind_addr.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..19d1329 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>  /* Dispose of an SCTP_bind_addr structure  */
>  void sctp_bind_addr_free(struct sctp_bind_addr *bp)
>  {
> -       /* Empty the bind address list. */
> -       sctp_bind_addr_clean(bp);
> +       struct sctp_sockaddr_entry *addr;
> +
> +       /* Empty the bind address list inside RCU section. */
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(addr, &bp->address_list, list) {
> +               list_del_rcu(&addr->list);
> +               call_rcu(&addr->rcu, sctp_local_addr_free);
> +               SCTP_DBG_OBJCNT_DEC(addr);
> +       }
> +       rcu_read_unlock();
> 

Sorry this looks odd.

If you're removing items from this list, you must be a writer here, with
exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.

Therefore, I guess following code is better :

list_for_each_entry(addr, &bp->address_list, list) {
	list_del_rcu(&addr->list);
	call_rcu(&addr->rcu, sctp_local_addr_free);
	SCTP_DBG_OBJCNT_DEC(addr);
}

Then, why dont you fix sctp_bind_addr_clean() instead ?

if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
be protected as well.




^ permalink raw reply

* [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18  7:01 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

During the sctp_close() call, we do not use rcu primitives to
destroy the address list attached to the endpoint.  At the same
time, we do the removal of addresses from this list before
attempting to remove the socket from the port hash

As a result, it is possible for another process to find the socket
in the port hash that is in the process of being closed.  It then
proceeds to traverse the address list to find the conflict, only
to have that address list suddenly disappear without rcu() critical
section.

This can result in a kernel crash with general protection fault or
kernel NULL pointer dereference.

Fix issue by closing address list removal inside RCU critical
section.

Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>

---
 bind_addr.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..19d1329 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 /* Dispose of an SCTP_bind_addr structure  */
 void sctp_bind_addr_free(struct sctp_bind_addr *bp)
 {
-       /* Empty the bind address list. */
-       sctp_bind_addr_clean(bp);
+       struct sctp_sockaddr_entry *addr;
+
+       /* Empty the bind address list inside RCU section. */
+       rcu_read_lock();
+       list_for_each_entry_rcu(addr, &bp->address_list, list) {
+               list_del_rcu(&addr->list);
+               call_rcu(&addr->rcu, sctp_local_addr_free);
+               SCTP_DBG_OBJCNT_DEC(addr);
+       }
+       rcu_read_unlock();

        if (bp->malloced) {
                kfree(bp);

[-- Attachment #2: sctp_fix_close_vs_conflict_race_in_bind_addr.patch --]
[-- Type: application/octet-stream, Size: 732 bytes --]

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..19d1329 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 /* Dispose of an SCTP_bind_addr structure  */
 void sctp_bind_addr_free(struct sctp_bind_addr *bp)
 {
-	/* Empty the bind address list. */
-	sctp_bind_addr_clean(bp);
+	struct sctp_sockaddr_entry *addr;
+
+	/* Empty the bind address list inside RCU section. */
+	rcu_read_lock();
+	list_for_each_entry_rcu(addr, &bp->address_list, list) {
+		list_del_rcu(&addr->list);
+		call_rcu(&addr->rcu, sctp_local_addr_free);
+		SCTP_DBG_OBJCNT_DEC(addr);
+	}
+	rcu_read_unlock();
 
 	if (bp->malloced) {
 		kfree(bp);

^ permalink raw reply related

* Re: [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  6:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110518.023725.441358922110721042.davem@davemloft.net>

Op 2011-05-18 8:37, David Miller schreef:
> From: Micha Nelissen<micha@neli.hopto.org>
> Date: Wed, 18 May 2011 08:32:35 +0200
>
>> I'm confused. Against which tree/commit do you want it then?
>
> Linus's current tree would be fine as would:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git

Ok I see, thanks. The patch should apply just fine to your tree, there 
is only a spelling change since 2.6.38 which does not conflict.

Micha

^ permalink raw reply

* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-18  6:37 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, netdev
In-Reply-To: <BANLkTimmvKn2e5Yrkd5TtktyxTWCRxqinw@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 18 May 2011 07:59:05 +0800

> On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> 
> No. Linux emits fragments first to last now. You should check the
> current code. :)

I forgot that we rearranged this, thanks :-)

So maybe the original idea can indeed work.

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: David Miller @ 2011-05-18  6:37 UTC (permalink / raw)
  To: micha; +Cc: netdev
In-Reply-To: <4DD36803.1020001@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Wed, 18 May 2011 08:32:35 +0200

> Op 2011-05-18 8:07, David Miller schreef:
>> From: Micha Nelissen<micha@neli.hopto.org>
>> Date: Wed, 18 May 2011 07:22:41 +0200
>>
>>> David, forgive my clumsiness, the previous patch did not compile. This
>>> one applies against 2.6.38 and does compile.
>>
>> Please send patches against the current code, not some arbitrary older
>> source tree.
> 
> I'm confused. Against which tree/commit do you want it then?

Linus's current tree would be fine as would:

	git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  6:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110518.020716.447947355233874982.davem@davemloft.net>

Op 2011-05-18 8:07, David Miller schreef:
> From: Micha Nelissen<micha@neli.hopto.org>
> Date: Wed, 18 May 2011 07:22:41 +0200
>
>> David, forgive my clumsiness, the previous patch did not compile. This
>> one applies against 2.6.38 and does compile.
>
> Please send patches against the current code, not some arbitrary older
> source tree.

I'm confused. Against which tree/commit do you want it then?

Micha

^ permalink raw reply

* Re: [PATCH net-2.6] net: add skb_dst_force() in sock_queue_err_skb()
From: David Miller @ 2011-05-18  6:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, baryluk, shemminger
In-Reply-To: <1305673846.6741.35.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 May 2011 01:10:46 +0200

> Commit 7fee226ad239 (add a noref bit on skb dst) forgot to use
> skb_dst_force() on packets queued in sk_error_queue
> 
> This triggers following warning, for applications using IP_CMSG_PKTINFO
> receiving one error status
 ...
> Close bug https://bugzilla.kernel.org/show_bug.cgi?id=34622
> 
> Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: David Miller @ 2011-05-18  6:07 UTC (permalink / raw)
  To: micha; +Cc: netdev
In-Reply-To: <1305696161-18277-1-git-send-email-micha@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Wed, 18 May 2011 07:22:41 +0200

> David, forgive my clumsiness, the previous patch did not compile. This
> one applies against 2.6.38 and does compile.

Please send patches against the current code, not some arbitrary older
source tree.

Thanks.

^ permalink raw reply

* Re: [PATCH V6 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-18  6:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

Hello Michael,

Here is the update the patch based on all of your review comments except
the completion/wait for cleanup since I am worried about outstanding
DMAs would prevent vhost from shutting down. I am sending out this for
your review, and test it out later.

For error handling, I update macvtap.c so we can discard the desc even
in zero-copy case.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
 drivers/vhost/net.c   |   42 +++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |   13 +++++++++++++
 3 files changed, 103 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..e87a1f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,10 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+/* MAX number of TX used buffers for outstanding zerocopy */
+#define VHOST_MAX_PEND 128
+#define VHOST_GOODCOPY_LEN PAGE_SIZE
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -129,6 +133,7 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	struct skb_ubuf_info pend;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
+		/* Release DMAs done buffers first */
+		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
+			vhost_zerocopy_signal_used(vq, false);
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
+			/* If more outstanding DMAs, queue the work */
+			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
+				tx_poll_start(net, sock);
+				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
 				continue;
@@ -188,6 +203,30 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
+		/* use msg_control to pass vhost zerocopy ubuf info to skb */
+		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+			vq->heads[vq->upend_idx].id = head;
+			if (len < VHOST_GOODCOPY_LEN)
+				/* copy don't need to wait for DMA done */
+				vq->heads[vq->upend_idx].len =
+							VHOST_DMA_DONE_LEN;
+			else {
+				vq->heads[vq->upend_idx].len = len;
+				pend.callback = vhost_zerocopy_callback;
+				pend.arg = vq;
+				pend.desc = vq->upend_idx;
+				msg.msg_control = &pend;
+				msg.msg_controllen = sizeof(pend);
+			}
+			atomic_inc(&vq->refcnt);
+			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+			/* if upend_idx is full, then wait for free more */
+			if (vq->upend_idx == vq->done_idx) {
+				tx_poll_start(net, sock);
+				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+			}
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -198,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..f4c2730 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	vq->upend_idx = 0;
+	vq->done_idx = 0;
+	atomic_set(&vq->refcnt, 0);
 }
 
 static int vhost_worker(void *data)
@@ -385,16 +388,49 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
+{
+	int i, j = 0;
+
+	for (i = vq->done_idx; i != vq->upend_idx; i = i++ % UIO_MAXIOV) {
+		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
+			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+			vhost_add_used_and_signal(vq->dev, vq,
+						  vq->heads[i].id, 0);
+			++j;
+		} else
+			break;
+	}
+	if (j) {
+		vq->done_idx = i;
+		atomic_sub(j, &vq->refcnt);
+	}
+}
+
 /* Caller should have device mutex */
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
+	unsigned long begin = jiffies;
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
 			vhost_poll_stop(&dev->vqs[i].poll);
 			vhost_poll_flush(&dev->vqs[i].poll);
 		}
+		/* Wait for all lower device DMAs done, then notify virtio_net
+		 * or just notify it without waiting for all DMA done here ?
+		 * in case of DMAs never done for some reason */
+		if (atomic_read(&dev->vqs[i].refcnt)) {
+			/* how long should we wait ? */
+			msleep(1000);
+			vhost_zerocopy_signal_used(&dev->vqs[i], true);
+		}
 		if (dev->vqs[i].error_ctx)
 			eventfd_ctx_put(dev->vqs[i].error_ctx);
 		if (dev->vqs[i].error)
@@ -603,6 +639,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 
 	mutex_lock(&vq->mutex);
 
+	/* clean up lower device outstanding DMAs, before setting ring */
+	if (atomic_read(&vq->refcnt))
+		vhost_zerocopy_signal_used(vq, true);
+
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		/* Resizing ring with an active backend?
@@ -1416,3 +1456,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+	int idx = skb_shinfo(skb)->ubuf.desc;
+	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+	/* set len = 1 to mark this desc buffers done DMA */
+	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..d0e7ac6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,6 +13,11 @@
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
 
+/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
+ * done */
+#define VHOST_DMA_DONE_LEN	1
+#define VHOST_DMA_CLEAR_LEN	0
+
 struct vhost_device;
 
 struct vhost_work;
@@ -108,6 +113,12 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	/* vhost zerocopy support */
+	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+	/* last used idx for outstanding DMA zerocopy buffers */
+	int upend_idx;
+	/* first used idx for DMA done zerocopy buffers */
+	int done_idx;
 };
 
 struct vhost_dev {
@@ -154,6 +165,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \




^ permalink raw reply related

* Re: [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-18  5:43 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87tycsn9lt.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Wed, May 18, 2011 at 09:49:42AM +0930, Rusty Russell wrote:
> On Tue, 17 May 2011 09:10:31 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Well one can imagine a driver doing:
> > 
> > 	while (virtqueue_get_buf()) {
> > 		virtqueue_add_buf()
> > 	}
> > 	virtqueue_kick()
> > 
> > which looks sensible (batch kicks) but might
> > process any number of bufs between kicks.
> 
> No, we currently only expose the buffers in the kick, so it can only
> fill the ring doing that.
> 
> We could change that (and maybe that's worth looking at)...

Yes, I think we should - this way host and guest can process
data in parallel without a kick.

My patchset included that simply because it's one index
less to be confused about.


> > If we look at drivers closely enough, I think none
> > of them do the equivalent of the above, but not 100% sure.
> 
> I'm pretty sure we don't have this kind of 'echo' driver yet.  Drivers
> tend to take OS requests and queue them.  The only one which does
> anything even partially sophisticated is the net driver...
> 
> Thanks,
> Rusty.

^ permalink raw reply

* [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  5:22 UTC (permalink / raw)
  To: netdev; +Cc: Micha Nelissen, David Miller
In-Reply-To: <1305405386-25187-1-git-send-email-micha@neli.hopto.org>

David, forgive my clumsiness, the previous patch did not compile. This
one applies against 2.6.38 and does compile.

Currently the ip auto configuration has a hardcoded delay of 1 second.
When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
nfs root may not be found.

Remove the hardcoded delay, and wait for carrier on at least one network
device.

Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
Cc: David Miller <davem@davemloft.net>
---
 net/ipv4/ipconfig.c |   35 ++++++++++++++++++++++-------------
 1 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 2b09775..4225f3f 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -87,8 +87,8 @@
 #endif
 
 /* Define the friendly delay before and after opening net devices */
-#define CONF_PRE_OPEN		500	/* Before opening: 1/2 second */
-#define CONF_POST_OPEN		1	/* After opening: 1 second */
+#define CONF_POST_OPEN		10	/* After opening: 10 msecs */
+#define CONF_CARRIER_TIMEOUT	120000	/* Wait for carrier timeout */
 
 /* Define the timeout for waiting for a DHCP/BOOTP/RARP reply */
 #define CONF_OPEN_RETRIES 	2	/* (Re)open devices twice */
@@ -188,14 +188,14 @@ struct ic_device {
 static struct ic_device *ic_first_dev __initdata = NULL;/* List of open device */
 static struct net_device *ic_dev __initdata = NULL;	/* Selected device */
 
-static bool __init ic_device_match(struct net_device *dev)
+static bool __init ic_is_init_dev(struct net_device *dev)
 {
-	if (user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
+	if (dev->flags & IFF_LOOPBACK)
+		return 0;
+	return user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
 	    (!(dev->flags & IFF_LOOPBACK) &&
 	     (dev->flags & (IFF_POINTOPOINT|IFF_BROADCAST)) &&
-	     strncmp(dev->name, "dummy", 5)))
-		return true;
-	return false;
+	     strncmp(dev->name, "dummy", 5));
 }
 
 static int __init ic_open_devs(void)
@@ -203,6 +203,7 @@ static int __init ic_open_devs(void)
 	struct ic_device *d, **last;
 	struct net_device *dev;
 	unsigned short oflags;
+	unsigned long start;
 
 	last = &ic_first_dev;
 	rtnl_lock();
@@ -216,9 +217,7 @@ static int __init ic_open_devs(void)
 	}
 
 	for_each_netdev(&init_net, dev) {
-		if (dev->flags & IFF_LOOPBACK)
-			continue;
-		if (ic_device_match(dev)) {
+		if (ic_is_init_dev(dev)) {
 			int able = 0;
 			if (dev->mtu >= 364)
 				able |= IC_BOOTP;
@@ -252,6 +251,17 @@ static int __init ic_open_devs(void)
 				dev->name, able, d->xid));
 		}
 	}
+
+	/* wait for a carrier on at least one device */
+	start = jiffies;
+	while (jiffies - start < msecs_to_jiffies(CONF_CARRIER_TIMEOUT)) {
+		for_each_netdev(&init_net, dev)
+			if (ic_is_init_dev(dev) && netif_carrier_ok(dev))
+				goto have_carrier;
+
+		msleep(1);
+	}
+have_carrier:
 	rtnl_unlock();
 
 	*last = NULL;
@@ -1324,14 +1334,13 @@ static int __init wait_for_devices(void)
 {
 	int i;
 
-	msleep(CONF_PRE_OPEN);
 	for (i = 0; i < DEVICE_WAIT_MAX; i++) {
 		struct net_device *dev;
 		int found = 0;
 
 		rtnl_lock();
 		for_each_netdev(&init_net, dev) {
-			if (ic_device_match(dev)) {
+			if (ic_is_init_dev(dev)) {
 				found = 1;
 				break;
 			}
@@ -1378,7 +1387,7 @@ static int __init ip_auto_config(void)
 		return err;
 
 	/* Give drivers a chance to settle */
-	ssleep(CONF_POST_OPEN);
+	msleep(CONF_POST_OPEN);
 
 	/*
 	 * If the config information is insufficient (e.g., our IP address or
-- 
1.7.4.4


^ permalink raw reply related

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-18  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <20110517212827.GC7589@redhat.com>

On Wed, 2011-05-18 at 00:28 +0300, Michael S. Tsirkin wrote:
> On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> > Resubmit the patch with most update. This patch passed some
> > live-migration test against RHEL6.2. I will run more stress test w/i
> > live migration.
> > 
> > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> 
> Cool. cleanup path needs a fix - are you use you can
> not use kobj or some other existing refcounting?
I will look at this.

> Is perf regressiion caused by tx ring overrun gone now?
Nope.

> I added some comments about how we might be aqble
> to complete requests out of order but it's not a must.
Ok.

> > ---
> > 
> >  drivers/vhost/net.c   |   37 +++++++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.c |   55
> ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.h |   12 ++++++++++
> >  3 files changed, 101 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 2f7c76a..6bd6e28 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,9 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >  
> > +/* MAX number of TX used buffers for outstanding zerocopy */
> > +#define VHOST_MAX_ZEROCOPY_PEND 128 
> > +
> >  enum {
> >       VHOST_NET_VQ_RX = 0,
> >       VHOST_NET_VQ_TX = 1,
> > @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> >       int err, wmem;
> >       size_t hdr_size;
> >       struct socket *sock;
> > +     struct skb_ubuf_info pend;
> >  
> >       /* TODO: check that we are running from vhost_worker? */
> >       sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> >       hdr_size = vq->vhost_hlen;
> >  
> >       for (;;) {
> > +             /* Release DMAs done buffers first */
> > +             if (atomic_read(&vq->refcnt) >
> VHOST_MAX_ZEROCOPY_PEND)
> > +                     vhost_zerocopy_signal_used(vq, false);
> > +
> >               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >                                        ARRAY_SIZE(vq->iov),
> >                                        &out, &in,
> > @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> >                               set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> >                               break;
> >                       }
> > +                     /* If more outstanding DMAs, queue the work */
> > +                     if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> > +                         (atomic_read(&vq->refcnt) >
> VHOST_MAX_ZEROCOPY_PEND)) {
> > +                             tx_poll_start(net, sock);
> > +                             set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> > +                             break;
> > +                     }
> >                       if (unlikely(vhost_enable_notify(vq))) {
> >                               vhost_disable_notify(vq);
> >                               continue;
> > @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> >                              iov_length(vq->hdr, s), hdr_size);
> >                       break;
> >               }
> > +             /* use msg_control to pass vhost zerocopy ubuf info to
> skb */
> > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> > +                     vq->heads[vq->upend_idx].id = head;
> > +                     if (len <= 128)
> 
> I thought we have a constant for that?

Yes, fixed it already. I might change it to PAGE_SIZE for now since the
small message sizes regression issue.

> > +                             vq->heads[vq->upend_idx].len =
> VHOST_DMA_DONE_LEN;
> > +                     else {
> > +                             vq->heads[vq->upend_idx].len = len;
> > +                             pend.callback =
> vhost_zerocopy_callback;
> > +                             pend.arg = vq;
> > +                             pend.desc = vq->upend_idx;
> > +                             msg.msg_control = &pend;
> > +                             msg.msg_controllen = sizeof(pend);
> > +                     }
> > +                     atomic_inc(&vq->refcnt);
> > +                     vq->upend_idx = (vq->upend_idx + 1) %
> UIO_MAXIOV;
> 
> Ok, so we deal with a cyclic ring apparently? What guarantees we don't
> overrun it?

VHOST_MAX_PEND (which is 128) should prevent it from overrun normally.
In the case of none DMAs can be done, the maximum entries are the ring
size, which is much smaller or equal to UIO_MAXIOV for current
implementation. 

Maybe I should add some condition check if it is overrun then exits?

> 
> > +             }
> >               /* TODO: Check specific error and bomb out unless
> ENOBUFS? */
> >               err = sock->ops->sendmsg(NULL, sock, &msg, len);
> >               if (unlikely(err < 0)) {
> > -                     vhost_discard_vq_desc(vq, 1);
> > +                     if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                             vhost_discard_vq_desc(vq, 1);
> 
> How are errors handled with zerocopy?

I thought about it before: if error after macvtap skb is allocated, then
that skb has a callback which will set DMA done for add_used, if the
error before macvtap skb is allocated, then it can reverse as copy case.
To avoid the complexity check, I decided to not handle it.

> 
> >                       tx_poll_start(net, sock);
> >                       break;
> >               }
> >               if (err != len)
> >                       pr_debug("Truncated TX packet: "
> >                                " len %d != %zd\n", err, len);
> > -             vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +             if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                     vhost_add_used_and_signal(&net->dev, vq, head,
> 0);
> >               total_len += len;
> >               if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >                       vhost_poll_queue(&vq->poll);
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 2ab2912..ce799d6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev
> *dev,
> >       vq->call_ctx = NULL;
> >       vq->call = NULL;
> >       vq->log_ctx = NULL;
> > +     vq->upend_idx = 0;
> > +     vq->done_idx = 0;
> > +     atomic_set(&vq->refcnt, 0);
> >  }
> >  
> >  static int vhost_worker(void *data)
> > @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct
> vhost_dev *dev)
> >                                              UIO_MAXIOV,
> GFP_KERNEL);
> >               dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log *
> UIO_MAXIOV,
> >                                         GFP_KERNEL);
> > -             dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads
> *
> > +             dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads
> *
> >                                           UIO_MAXIOV, GFP_KERNEL);
> 
> Which fields need to be initialized actually?

Nope, already fixed it with kmalloc.

> >  
> >               if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> > @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev
> *dev)
> >       return 0;
> >  }
> >  
> > +/* 
> > +     comments
> > +*/
> 
> Hmm.

Fixed it in previous patch already.

> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> shutdown)
> > +{
> > +     int i, j = 0;
> > +
> > +     i = vq->done_idx;
> > +     while (i != vq->upend_idx) {
> 
> A for loop might be clearer.
Ok.

> > +             if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) ||
> shutdown) {
> 
> On shutdown, we signal all buffers used to the guest?
> Why?

We signal all outstand DMAs in the case of driver has some DMA issue to
prevent us from shutting down. I am afraid vhost cleanup could wait
forever.

> > +                     /* reset len = 0 */
> 
> comment not very helpful.
> Could you explain what this does instead?
> Or better use some constant instead of 0 ...

Fixed it already.

> > +                     vq->heads[i].len = 0;
> > +                     i = (i + 1) % UIO_MAXIOV;
> > +                     ++j;
> > +             } else
> > +                     break;
> 
> Hmm so if the 1st entry does not complete, you do not signal anything?

No used buffers to guest, should we signal still?


> > +     }
> 
> Looking at this loop, done idx is the consumer and used idx
> is the producer, right?

Yes.

> > +     if (j) {
> > +             /* comments */
> 
> Yes?
> 
> > +             if (i > vq->done_idx)
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> j);
> > +             else {
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > +                                      UIO_MAXIOV - vq->done_idx);
> > +                     vhost_add_used_n(vq, vq->heads, i);
> > +             }
> > +             vq->done_idx = i;
> > +             vhost_signal(vq->dev, vq);
> > +             atomic_sub(j, &vq->refcnt);
> 
> Code will likely be simpler if you call vhost_add_used once for
> each head in the first loop. Possibly add_used_signal might be
> a good idea too.

Ok.

> > +     }
> > +}
> > +
> >  /* Caller should have device mutex */
> >  void vhost_dev_cleanup(struct vhost_dev *dev)
> >  {
> > @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> >                       vhost_poll_stop(&dev->vqs[i].poll);
> >                       vhost_poll_flush(&dev->vqs[i].poll);
> >               }
> > +             /* wait for all lower device DMAs done, then notify
> guest */
> > +             if (atomic_read(&dev->vqs[i].refcnt)) {
> > +                     msleep(1000);
> > +                     vhost_zerocopy_signal_used(&dev->vqs[i],
> true);
> > +             }
> 
> This needs to be fixed somehow. Use a completion object and wait
> on it?

Worried about what if the driver has some DMAs issue, which would
prevent vhost from shutting down.

> >               if (dev->vqs[i].error_ctx)
> >                       eventfd_ctx_put(dev->vqs[i].error_ctx);
> >               if (dev->vqs[i].error)
> > @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev
> *d, int ioctl, void __user *argp)
> >  
> >       mutex_lock(&vq->mutex);
> >  
> > +     /* force all lower device DMAs done */
> > +     if (atomic_read(&vq->refcnt)) 
> > +             vhost_zerocopy_signal_used(vq, true);
> > +
> >       switch (ioctl) {
> >       case VHOST_SET_VRING_NUM:
> >               /* Resizing ring with an active backend?
> > @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct
> vhost_virtqueue *vq)
> >               vq_err(vq, "Failed to enable notification at %p: %d
> \n",
> >                      &vq->used->flags, r);
> >  }
> > +
> > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > +{
> > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > +
> > +     /* set len = 1 to mark this desc buffers done DMA */
> 
> this comment can now go.

Yes, it's gone already.

> > +     vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> > +}
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index b3363ae..8e3ecc7 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -13,6 +13,10 @@
> >  #include <linux/virtio_ring.h>
> >  #include <asm/atomic.h>
> >  
> > +/* This is for zerocopy, used buffer len is set to 1 when lower
> device DMA
> > + * done */
> > +#define VHOST_DMA_DONE_LEN   1
> > +
> >  struct vhost_device;
> >  
> >  struct vhost_work;
> > @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> >       /* Log write descriptors */
> >       void __user *log_base;
> >       struct vhost_log *log;
> > +     /* vhost zerocopy support */
> > +     atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> 
> future enhancement idea: this is used apparently under vq lock
> so no need for an atomic?

It is also used in skb vhost zerocopy callback.

> > +     /* copy of avail idx to monitor outstanding DMA zerocopy
> buffers */
> 
> looking at code upend_idx seems to be calculated independently
> of guest avail idx - could you clarify pls?

Yes, you are right. Should change it to: upend_idx is used to track
vring ids for outstanding zero-copy DMA buffers?

> > +     int upend_idx;
> > +     /* copy of used idx to monintor DMA done zerocopy buffers */
> 
> monitor

Ok.

> > +     int done_idx;
> 
> 
> I think in reality these are just producer and consumer
> in the head structure which for zero copy is used
Yes.

> 
> >  };
> >  
> >  struct vhost_dev {
> > @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue
> *);
> >  
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log
> *log,
> >                   unsigned int log_num, u64 len);
> > +void vhost_zerocopy_callback(struct sk_buff *skb);
> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> shutdown);
> >  
> >  #define vq_err(vq, fmt, ...) do {
> \
> >               pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > 

^ permalink raw reply

* Re: [PATCH] Check the value of doi before referencing it
From: David Miller @ 2011-05-18  5:04 UTC (permalink / raw)
  To: huzaifas; +Cc: netdev, kaber, yoshfuji, jmorris, pekkas, kuznet
In-Reply-To: <1305692980-4730-1-git-send-email-huzaifas@redhat.com>

From: Huzaifa Sidhpurwala <huzaifas@redhat.com>
Date: Wed, 18 May 2011 09:59:40 +0530

> Value of doi is not checked before referencing it.
> Though this does not cause any null pointer dereference since
> all the callers of cipso_v4_doi_add check the value of doi
> before calling the function, but it would be a good programming
> practice to do so anyways :)
> 
> Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com>

I don't think we should fix bugs that do not exist.

^ permalink raw reply

* [PATCH 2/2] bna: Add Generic Netlink Interface
From: Rasesh Mody @ 2011-05-18  4:57 UTC (permalink / raw)
  To: davem, netdev; +Cc: adapter_linux_open_src_team, Rasesh Mody, Debashis Dutt
In-Reply-To: <1305694621-28023-1-git-send-email-rmody@brocade.com>

This patch adds the generic netlink communication interface to BNA driver. The
in-kernel generic netlink infrastructure can be used to collect hardware
specific control information and control attributes. The driver makes use of
the "doit" handler provided by the generic netlink layer to accomplish this.

Signed-off-by: Debashis Dutt <ddutt@brocade.com>
Signed-off-by: Rasesh Mody <rmody@brocade.com>
---
 drivers/net/bna/Makefile    |    2 +-
 drivers/net/bna/bfa_defs.h  |   18 +++
 drivers/net/bna/bfa_ioc.c   |    8 +-
 drivers/net/bna/bfa_ioc.h   |    1 +
 drivers/net/bna/bnad.c      |   11 ++-
 drivers/net/bna/bnad.h      |    1 +
 drivers/net/bna/bnad_genl.c |  345 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/bna/bnad_genl.h |   87 +++++++++++
 8 files changed, 466 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/bna/bnad_genl.c
 create mode 100644 drivers/net/bna/bnad_genl.h

diff --git a/drivers/net/bna/Makefile b/drivers/net/bna/Makefile
index 4bb0d5d..f3339dc 100644
--- a/drivers/net/bna/Makefile
+++ b/drivers/net/bna/Makefile
@@ -5,7 +5,7 @@
 
 obj-$(CONFIG_BNA) += bna.o
 
-bna-objs := bnad.o bnad_debugfs.o bnad_ethtool.o
+bna-objs := bnad.o bnad_debugfs.o bnad_ethtool.o bnad_genl.o
 bna-objs += bna_ctrl.o bna_txrx.o
 bna-objs += bfa_ioc.o bfa_ioc_ct.o bfa_cee.o cna_fwimg.o
 
diff --git a/drivers/net/bna/bfa_defs.h b/drivers/net/bna/bfa_defs.h
index 2ea0dfe..2dd0898 100644
--- a/drivers/net/bna/bfa_defs.h
+++ b/drivers/net/bna/bfa_defs.h
@@ -26,6 +26,24 @@
 #define BFA_STRING_32	32
 #define BFA_VERSION_LEN 64
 
+/*
+ * Check if the card having old wwn/mac handling
+ */
+#define bfa_mfg_is_old_wwn_mac_model(type) (( \
+	(type) == BFA_MFG_TYPE_CNA10P2 || \
+	(type) == BFA_MFG_TYPE_CNA10P1 || \
+	(type) == BFA_MFG_TYPE_WANCHESE))
+
+#define bfa_mfg_increment_wwn_mac(m, i)                         \
+do {                                                            \
+	u32 t = ((u32)(m)[0] << 16) | ((u32)(m)[1] << 8) | \
+		(u32)(m)[2];  \
+	t += (i);      \
+	(m)[0] = (t >> 16) & 0xFF;                              \
+	(m)[1] = (t >> 8) & 0xFF;                               \
+	(m)[2] = t & 0xFF;                                      \
+} while (0)
+
 /**
  * ---------------------- adapter definitions ------------
  */
diff --git a/drivers/net/bna/bfa_ioc.c b/drivers/net/bna/bfa_ioc.c
index 15f9dec..eeb7250 100644
--- a/drivers/net/bna/bfa_ioc.c
+++ b/drivers/net/bna/bfa_ioc.c
@@ -82,8 +82,6 @@ static void bfa_ioc_pf_fwmismatch(struct bfa_ioc *ioc);
 static void bfa_ioc_boot(struct bfa_ioc *ioc, u32 boot_type,
 			 u32 boot_param);
 static u32 bfa_ioc_smem_pgnum(struct bfa_ioc *ioc, u32 fmaddr);
-static void bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc,
-						char *serial_num);
 static void bfa_ioc_get_adapter_fw_ver(struct bfa_ioc *ioc,
 						char *fw_ver);
 static void bfa_ioc_get_pci_chip_rev(struct bfa_ioc *ioc,
@@ -2045,7 +2043,7 @@ bfa_ioc_get_adapter_attr(struct bfa_ioc *ioc,
 
 	ioc_attr = ioc->attr;
 
-	bfa_ioc_get_adapter_serial_num(ioc, ad_attr->serial_num);
+	bfa_nw_ioc_get_adapter_serial_num(ioc, ad_attr->serial_num);
 	bfa_ioc_get_adapter_fw_ver(ioc, ad_attr->fw_ver);
 	bfa_ioc_get_adapter_optrom_ver(ioc, ad_attr->optrom_ver);
 	bfa_ioc_get_adapter_manufacturer(ioc, ad_attr->manufacturer);
@@ -2096,8 +2094,8 @@ bfa_ioc_get_type(struct bfa_ioc *ioc)
 	}
 }
 
-static void
-bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
+void
+bfa_nw_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
 {
 	memset(serial_num, 0, BFA_ADAPTER_SERIAL_NUM_LEN);
 	memcpy(serial_num,
diff --git a/drivers/net/bna/bfa_ioc.h b/drivers/net/bna/bfa_ioc.h
index 49739cd..1f71865 100644
--- a/drivers/net/bna/bfa_ioc.h
+++ b/drivers/net/bna/bfa_ioc.h
@@ -280,6 +280,7 @@ mac_t bfa_nw_ioc_get_mac(struct bfa_ioc *ioc);
 void bfa_nw_ioc_debug_memclaim(struct bfa_ioc *ioc, void *dbg_fwsave);
 int bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen);
 int bfa_nw_ioc_debug_fwsave(struct bfa_ioc *ioc, void *trcdata, int *trclen);
+void bfa_nw_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num);
 
 /*
  * Timeout APIs
diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index a997276..09aa132 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -25,6 +25,7 @@
 #include <linux/ip.h>
 
 #include "bnad.h"
+#include "bnad_genl.h"
 #include "bna.h"
 #include "cna.h"
 
@@ -721,7 +722,7 @@ bnad_disable_mbox_irq(struct bnad *bnad)
 	BNAD_UPDATE_CTR(bnad, mbox_intr_disabled);
 }
 
-static void
+void
 bnad_set_netdev_perm_addr(struct bnad *bnad)
 {
 	struct net_device *netdev = bnad->netdev;
@@ -3296,6 +3297,10 @@ bnad_module_init(void)
 		return err;
 	}
 
+	/* Register with generic netlink */
+	if (bnad_genl_init())
+		pr_err("bna: Generic Netlink Register failed\n");
+
 	return 0;
 }
 
@@ -3305,6 +3310,10 @@ bnad_module_exit(void)
 	pci_unregister_driver(&bnad_pci_driver);
 	mutex_destroy(&bnad_list_mutex);
 
+	/* Unegister with generic netlink */
+	if (bnad_genl_uninit())
+		pr_err("bna: Generic Netlink Unregister failed\n");
+
 	if (bfi_fw)
 		release_firmware(bfi_fw);
 }
diff --git a/drivers/net/bna/bnad.h b/drivers/net/bna/bnad.h
index 2c1f283..d3cb27b 100644
--- a/drivers/net/bna/bnad.h
+++ b/drivers/net/bna/bnad.h
@@ -302,6 +302,7 @@ extern u32 *cna_get_firmware_buf(struct pci_dev *pdev);
 extern void bnad_set_ethtool_ops(struct net_device *netdev);
 
 /* Configuration & setup */
+extern void bnad_set_netdev_perm_addr(struct bnad *bnad);
 extern void bnad_tx_coalescing_timeo_set(struct bnad *bnad);
 extern void bnad_rx_coalescing_timeo_set(struct bnad *bnad);
 
diff --git a/drivers/net/bna/bnad_genl.c b/drivers/net/bna/bnad_genl.c
new file mode 100644
index 0000000..eec2a56
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.c
@@ -0,0 +1,345 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#include "bnad.h"
+#include "bnad_genl.h"
+#include "bna.h"
+
+static struct nla_policy bnad_genl_policy[BNAD_GENL_ATTR_MAX + 1] = {
+	[BNAD_GENL_ATTR_IOC_INFO]
+				= { .len = sizeof(struct bnad_genl_ioc_info) },
+	[BNAD_GENL_ATTR_IOC_ATTR]
+				= { .len = sizeof(struct bnad_genl_ioc_attr) },
+};
+
+static struct genl_family bnad_genl_family = {
+	.id = GENL_ID_GENERATE,
+	.name = "BNAD_GENL",
+	.version = BNAD_GENL_VERSION,
+	.hdrsize = 0,
+	.maxattr = BNAD_GENL_ATTR_MAX,
+};
+
+static struct bnad *
+bnad_get_bnadev(int bna_id)
+{
+	struct bnad *bnad;
+
+	mutex_lock(&bnad_list_mutex);
+	list_for_each_entry(bnad, &bnad_list, list_entry) {
+		if (bnad->id == bna_id) {
+			mutex_unlock(&bnad_list_mutex);
+			return bnad;
+		}
+	}
+	mutex_unlock(&bnad_list_mutex);
+	return NULL;
+}
+
+static void
+bnad_hwpath_get(struct bnad *bnad, char *hwpath, char *adapter_hwpath)
+{
+	int i;
+
+	strcpy(hwpath, pci_name(bnad->pcidev));
+	strcpy(adapter_hwpath, pci_name(bnad->pcidev));
+	i = strlen(adapter_hwpath) - 1;
+	while (i && (adapter_hwpath[i] != '.'))
+		i--;
+	adapter_hwpath[i] = '\0';
+}
+
+static void
+bnad_get_pci_attr(struct bnad *bnad, struct bfa_ioc_pci_attr *pci_attr)
+{
+	pci_attr->vendor_id = bnad->pcidev->vendor;
+	pci_attr->device_id = bnad->pcidev->device;
+	pci_attr->ssid = bnad->pcidev->subsystem_device;
+	pci_attr->ssvid = bnad->pcidev->subsystem_vendor;
+	pci_attr->pcifn = PCI_FUNC(bnad->pcidev->devfn);
+}
+
+static int
+bnad_genl_ioc_info_rsp(struct genl_info *info,
+	struct bnad_genl_ioc_info *genlcmd, size_t attr_size, u8 cmd)
+{
+	struct sk_buff *rsp_skb = NULL;
+	void *genl_msg_hdr = NULL;
+	size_t msg_size;
+	int err = 0;
+
+	msg_size = nla_total_size(attr_size);
+
+	rsp_skb = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!rsp_skb) {
+		pr_warn("bnad[%d]: Failed to get response skb\n",
+			genlcmd->bnad_num);
+		return -ENOMEM;
+	}
+
+	genl_msg_hdr = genlmsg_put(rsp_skb, info->snd_pid, info->snd_seq,
+		&bnad_genl_family, 0, cmd);
+	if (!genl_msg_hdr) {
+		pr_warn("bnad[%d]: Failed to get the genl_msg_header\n",
+			genlcmd->bnad_num);
+		err = -ENOMEM;
+		goto failure;
+	}
+
+	NLA_PUT(rsp_skb, BNAD_GENL_ATTR_IOC_INFO, attr_size, genlcmd);
+
+	err = genlmsg_end(rsp_skb, genl_msg_hdr);
+	if (err < 0) {
+		pr_warn("bnad[%d]: Failed to do genlmsg_end\n",
+			genlcmd->bnad_num);
+		goto failure;
+	}
+
+	err = genlmsg_reply(rsp_skb, info);
+	if (err)
+		pr_warn("bnad[%d]: Could not do genlmsg_reply (%d)\n",
+			genlcmd->bnad_num, err);
+
+	return err;
+
+nla_put_failure:
+	pr_warn("bnad[%d]: Failed to do NLA_PUT\n", genlcmd->bnad_num);
+	err = -EMSGSIZE;
+failure:
+	genlmsg_cancel(rsp_skb, genl_msg_hdr);
+	kfree_skb(rsp_skb);
+	return err;
+}
+
+static int
+bnad_genl_ioc_attr_rsp(struct genl_info *info,
+	struct bnad_genl_ioc_attr *genlcmd, size_t attr_size, u8 cmd)
+{
+	struct sk_buff *rsp_skb = NULL;
+	void *genl_msg_hdr = NULL;
+	size_t msg_size;
+	int err = 0;
+
+	msg_size = nla_total_size(attr_size);
+
+	rsp_skb = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!rsp_skb) {
+		pr_warn("bnad[%d]: Failed to get response skb\n",
+			genlcmd->bnad_num);
+		return -ENOMEM;
+	}
+
+	genl_msg_hdr = genlmsg_put(rsp_skb, info->snd_pid, info->snd_seq,
+		&bnad_genl_family, 0, cmd);
+	if (!genl_msg_hdr) {
+		pr_warn("bnad[%d]: Failed to get the genl_msg_header\n",
+			genlcmd->bnad_num);
+		err = -ENOMEM;
+		goto failure;
+	}
+
+	NLA_PUT(rsp_skb, BNAD_GENL_ATTR_IOC_ATTR, attr_size, genlcmd);
+
+	err = genlmsg_end(rsp_skb, genl_msg_hdr);
+	if (err < 0) {
+		pr_warn("bnad[%d]: Failed to do genlmsg_end\n",
+			genlcmd->bnad_num);
+		goto failure;
+	}
+
+	err = genlmsg_reply(rsp_skb, info);
+	if (err)
+		pr_warn("bnad[%d]: Could not do genlmsg_reply (%d)\n",
+			genlcmd->bnad_num, err);
+
+	return err;
+
+nla_put_failure:
+	pr_warn("bnad[%d]: Failed to do NLA_PUT\n", genlcmd->bnad_num);
+	err = -EMSGSIZE;
+failure:
+	genlmsg_cancel(rsp_skb, genl_msg_hdr);
+	kfree_skb(rsp_skb);
+	return err;
+}
+
+/* Note: Should be called holding bnad_conf_lock */
+static int
+bnad_genl_ioc_get_info(struct sk_buff *skb, struct genl_info *info)
+{
+	struct bnad *bnad = NULL;
+	struct bnad_genl_ioc_info *genlcmd = NULL;
+	struct bfa_ioc *ioc = NULL;
+	unsigned long flags = 0;
+	int err;
+
+	genlcmd = (struct bnad_genl_ioc_info *)
+		nla_data(info->attrs[BNAD_GENL_ATTR_IOC_INFO]);
+
+	bnad = bnad_get_bnadev(genlcmd->bnad_num);
+	if (!bnad) {
+		pr_warn("bna: Failed to get driver instance\n");
+		return -EINVAL;
+	}
+	ioc = &bnad->bna.device.ioc;
+
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	bfa_nw_ioc_get_adapter_serial_num(ioc, genlcmd->serialnum);
+	genlcmd->mac = bfa_nw_ioc_get_mac(ioc);
+	/* Get manufacturing MAC */
+	genlcmd->factory_mac = ioc->attr->mfg_mac;
+	if (bfa_mfg_is_old_wwn_mac_model(ioc->attr->card_type))
+		genlcmd->factory_mac.mac[MAC_ADDRLEN - 1] += bfa_ioc_pcifn(ioc);
+	else
+		bfa_mfg_increment_wwn_mac(
+			&(genlcmd->factory_mac.mac[MAC_ADDRLEN-3]),
+			bfa_ioc_pcifn(ioc));
+
+	/* Get stack MAC */
+	if (is_zero_ether_addr(&bnad->perm_addr.mac[0])) {
+		bna_port_mac_get(&bnad->bna.port, &bnad->perm_addr);
+		bnad_set_netdev_perm_addr(bnad);
+	}
+	memcpy(&genlcmd->current_mac, bnad->netdev->dev_addr, sizeof(mac_t));
+
+	genlcmd->bnad_num = bnad->id;
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	bnad_hwpath_get(bnad, genlcmd->hwpath, genlcmd->adapter_hwpath);
+	sprintf(genlcmd->eth_name, "%s", bnad->netdev->name);
+	strncpy(genlcmd->name, bnad->adapter_name, sizeof(genlcmd->name) - 1);
+	strncpy(genlcmd->port_name, bnad->port_name,
+		sizeof(genlcmd->port_name) - 1);
+	genlcmd->ioc_type = BFA_IOC_TYPE_LL;
+	genlcmd->status = BFA_STATUS_OK;
+
+	err = bnad_genl_ioc_info_rsp(info, genlcmd,
+		sizeof(struct bnad_genl_ioc_info), BNAD_GENL_CMD_IOC_INFO);
+
+	return err;
+}
+
+/* Note: Should be called holding bnad_conf_lock */
+static int
+bnad_genl_ioc_get_attr(struct sk_buff *skb, struct genl_info *info)
+{
+	struct bnad *bnad = NULL;
+	struct bnad_genl_ioc_attr *genlcmd = NULL;
+	struct bfa_ioc *ioc = NULL;
+	unsigned long flags = 0;
+	int err = 0;
+
+	genlcmd = (struct bnad_genl_ioc_attr *)
+		nla_data(info->attrs[BNAD_GENL_ATTR_IOC_ATTR]);
+
+	bnad = bnad_get_bnadev(genlcmd->bnad_num);
+	if (!bnad) {
+		pr_warn("bna: Failed to get driver instance\n");
+		return -EINVAL;
+	}
+	ioc = &bnad->bna.device.ioc;
+
+	memset(&genlcmd->ioc_attr, 0, sizeof(genlcmd->ioc_attr));
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	bfa_nw_ioc_get_attr(&bnad->bna.device.ioc, &genlcmd->ioc_attr);
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	genlcmd->ioc_attr.ioc_type = BFA_IOC_TYPE_LL;
+	strcpy(genlcmd->ioc_attr.driver_attr.driver, BNAD_NAME);
+	strncpy(genlcmd->ioc_attr.driver_attr.driver_ver,
+		BNAD_VERSION, BFA_VERSION_LEN);
+	strcpy(genlcmd->ioc_attr.driver_attr.fw_ver,
+	       genlcmd->ioc_attr.adapter_attr.fw_ver);
+	strcpy(genlcmd->ioc_attr.driver_attr.bios_ver,
+	       genlcmd->ioc_attr.adapter_attr.optrom_ver);
+	bnad_get_pci_attr(bnad, &genlcmd->ioc_attr.pci_attr);
+	genlcmd->status = BFA_STATUS_OK;
+
+	err = bnad_genl_ioc_attr_rsp(info, genlcmd,
+		sizeof(struct bnad_genl_ioc_attr), BNAD_GENL_CMD_IOC_ATTR);
+
+	return err;
+}
+
+/* BNAD generic netlink ops */
+static struct genl_ops bnad_genl_ops[] = {
+	{
+	.cmd = BNAD_GENL_CMD_IOC_INFO,
+	.flags = 0,
+	.policy = bnad_genl_policy,
+	.doit = bnad_genl_ioc_get_info,
+	.dumpit = NULL,
+	},
+	{
+	.cmd = BNAD_GENL_CMD_IOC_ATTR,
+	.flags = 0,
+	.policy = bnad_genl_policy,
+	.doit = bnad_genl_ioc_get_attr,
+	.dumpit = NULL,
+	},
+};
+
+/* Initialize generic netlink for BNAD */
+int
+bnad_genl_init(void)
+{
+	int i, err = 0;
+
+	/* Register family */
+	err = genl_register_family(&bnad_genl_family);
+	if (err) {
+		pr_warn("bna: failed to register with Netlink\n");
+		return err;
+	}
+	pr_info("bna: registered with Netlink\n");
+
+	/* Register ops */
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_register_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_warn("bna: failed to register netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: registered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	return err;
+}
+
+/* Uninitialize generic netlink for BNAD */
+int
+bnad_genl_uninit(void)
+{
+	int i, err = 0;
+
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_unregister_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_warn("bna: failed to unregister netlink op %u)\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: unregistered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	err = genl_unregister_family(&bnad_genl_family);
+	if (err)
+		pr_warn("bna: failed to unregister with Netlink\n");
+	else
+		pr_info("bna: unregistered with Netlink\n");
+
+	return err;
+}
diff --git a/drivers/net/bna/bnad_genl.h b/drivers/net/bna/bnad_genl.h
new file mode 100644
index 0000000..d469fe5
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.h
@@ -0,0 +1,87 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#ifndef __BNAD_GENL_H__
+#define __BNAD_GENL_H__
+
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/gfp.h>
+#include <linux/skbuff.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <net/genetlink.h>
+
+#include "cna.h"
+
+/* Attributes */
+enum {
+	BNAD_GENL_ATTR_UNSPEC,
+	BNAD_GENL_ATTR_IOC_INFO,
+	BNAD_GENL_ATTR_IOC_ATTR,
+	__BNAD_GENL_ATTR_MAX
+};
+
+/* Effectively a single attribute */
+#define BNAD_GENL_ATTR_MAX (__BNAD_GENL_ATTR_MAX - 1)
+
+enum {
+	BNAD_GENL_VERSION = 1,
+};
+
+/* Commands/Responses */
+enum {
+	BNAD_GENL_CMD_UNSPEC,
+	BNAD_GENL_CMD_IOC_INFO,
+	BNAD_GENL_CMD_IOC_ATTR,
+	__BNAD_GENL_CMD_MAX,
+};
+
+struct bnad_genl_ioc_info {
+	int		status;
+	u16		bnad_num;
+	u16		rsvd;
+	char		serialnum[64];
+	char		hwpath[BFA_STRING_32];
+	char		adapter_hwpath[BFA_STRING_32];
+	char		guid[BFA_ADAPTER_SYM_NAME_LEN*2];
+	char		name[BFA_ADAPTER_SYM_NAME_LEN];
+	char		port_name[BFA_ADAPTER_SYM_NAME_LEN];
+	char		eth_name[BFA_ADAPTER_SYM_NAME_LEN];
+	u64		rsvd1[4];
+	mac_t		mac;
+	mac_t		factory_mac;	/* Factory mac address */
+	mac_t		current_mac;	/* Currently assigned mac address */
+	enum bfa_ioc_type     ioc_type;
+	u16		pvid;		/* Port vlan id */
+	u16		rsvd2;
+	u32		host;
+	u32		bandwidth;	/* For PF support */
+	u32		rsvd3;
+};
+
+struct bnad_genl_ioc_attr {
+	int		status;
+	u16		bnad_num;
+	u16		rsvd;
+	struct bfa_ioc_attr  ioc_attr;
+};
+
+extern int bnad_genl_init(void);
+extern int bnad_genl_uninit(void);
+
+#endif /* __BNAD_GENL_H__ */
-- 
1.7.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox