Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 1/2] net: add NETDEV_PRECHANGEMTU to notify before mtu change happens
From: Jiri Pirko @ 2014-01-16  7:18 UTC (permalink / raw)
  To: Veaceslav Falico
  Cc: netdev, David S. Miller, Eric Dumazet, Nicolas Dichtel, Cong Wang
In-Reply-To: <1389826939-20691-1-git-send-email-vfalico@redhat.com>

Thu, Jan 16, 2014 at 12:02:18AM CET, vfalico@redhat.com wrote:
>Currently, if a device changes its mtu, first the change happens (invloving
>all the side effects), and after that the NETDEV_CHANGEMTU is sent so that
>other devices can catch up with the new mtu. However, if they return
>NOTIFY_BAD, then the change is reverted and error returned.
>
>This is a really long and costy operation (sometimes). To fix this, add
>NETDEV_PRECHANGEMTU notification which is called prior to any change
>actually happening, and if any callee returns NOTIFY_BAD - the change is
>aborted. This way we're skipping all the playing with apply/revert the mtu.
>
>CC: "David S. Miller" <davem@davemloft.net>
>CC: Jiri Pirko <jiri@resnulli.us>
>CC: Eric Dumazet <edumazet@google.com>
>CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>CC: Cong Wang <amwang@redhat.com>
>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
>---
> include/linux/netdevice.h | 3 ++-
> net/core/dev.c            | 5 +++++
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 5c88ab1..6fa2c91 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1718,7 +1718,7 @@ struct pcpu_sw_netstats {
> #define NETDEV_CHANGE	0x0004	/* Notify device state change */
> #define NETDEV_REGISTER 0x0005
> #define NETDEV_UNREGISTER	0x0006
>-#define NETDEV_CHANGEMTU	0x0007
>+#define NETDEV_CHANGEMTU	0x0007 /* notify after mtu change happened */
> #define NETDEV_CHANGEADDR	0x0008
> #define NETDEV_GOING_DOWN	0x0009
> #define NETDEV_CHANGENAME	0x000A
>@@ -1734,6 +1734,7 @@ struct pcpu_sw_netstats {
> #define NETDEV_JOIN		0x0014
> #define NETDEV_CHANGEUPPER	0x0015
> #define NETDEV_RESEND_IGMP	0x0016
>+#define NETDEV_PRECHANGEMTU	0x0017 /* notify before mtu change happened */
> 
> int register_netdevice_notifier(struct notifier_block *nb);
> int unregister_netdevice_notifier(struct notifier_block *nb);
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 20c834e..cb78923 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -5360,6 +5360,11 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
> 	if (!netif_device_present(dev))
> 		return -ENODEV;
> 
>+	err = call_netdevice_notifiers(NETDEV_PRECHANGEMTU, dev);
>+	err = notifier_to_errno(err);
>+	if (err)
>+		return err;
>+
> 	orig_mtu = dev->mtu;
> 	err = __dev_set_mtu(dev, new_mtu);
> 
>-- 
>1.8.4
>

I like this 

Acked-by: Jiri Pirko <jiri@resnulli.us>

^ permalink raw reply

* Re: [patch net-next] neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions.
From: Jiri Pirko @ 2014-01-16  7:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jes
In-Reply-To: <20140113.113538.1561843824076418241.davem@davemloft.net>

Mon, Jan 13, 2014 at 08:35:38PM CET, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Thu,  9 Jan 2014 14:13:47 +0100
>
>> When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
>> not initialized yet. This might cause confusion for the people who use
>> NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
>> usage in ndo_neigh_setup.
>> 
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>
>Wouldn't it be better to move the neigh_parms_data_state_cleanall() call
>before we invoke ->ndo_neigh_setup()?  It seems that this code intended
>to work that way, no?

Even moving neigh_parms_data_state_cleanall before ndo_neigh_setup
would not solve this. In ndo_neigh_setup the original default value is
changed. If it is changed by NEIGH_VAR_SET, it touches data_state bit
and it looks as if it was changed by user. That is not desired because
default value change would be ignored for this device. Therefore there
is need for NEIGH_VAR_INIT which does not touch data_state bit.

^ permalink raw reply

* Re: [PATCH net-next 0/2] net: mvneta: simple cleanups
From: Willy Tarreau @ 2014-01-16  7:14 UTC (permalink / raw)
  To: David Miller; +Cc: thomas.petazzoni, netdev, arno, linux-arm-kernel
In-Reply-To: <20140115.225451.1012267210637180372.davem@davemloft.net>

On Wed, Jan 15, 2014 at 10:54:51PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 16 Jan 2014 07:16:15 +0100
> 
> > Would you prefer that we resend a complete series with all the patches
> > at once ?
> 
> That might, in fact, work much better.

OK I'm resending the whole series now. Thanks.
Willy

^ permalink raw reply

* Re: [PATCH net-next 0/2] net: mvneta: simple cleanups
From: David Miller @ 2014-01-16  6:54 UTC (permalink / raw)
  To: w; +Cc: arno, thomas.petazzoni, netdev, linux-arm-kernel
In-Reply-To: <20140116061615.GB15597@1wt.eu>

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 07:16:15 +0100

> Would you prefer that we resend a complete series with all the patches
> at once ?

That might, in fact, work much better.

Thanks.

^ permalink raw reply

* [PATCH net-next] virtio-net: drop rq->max and rq->num
From: Jason Wang @ 2014-01-16  6:45 UTC (permalink / raw)
  To: davem, virtualization, netdev, linux-kernel; +Cc: Michael S. Tsirkin

It looks like there's no need for those two fields:

- Unless there's a failure for the first refill try, rq->max should be always
  equal to the vring size.
- rq->num is only used to determine the condition that we need to do the refill,
  we could check vq->num_free instead.
- rq->num was required to be increased or decreased explicitly after each
  get/put which results a bad API.

So this patch removes them both to make the code simpler.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/virtio_net.c |   16 +++-------------
 1 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..9bd70aa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,9 +72,6 @@ struct receive_queue {
 
 	struct napi_struct napi;
 
-	/* Number of input buffers, and max we've ever had. */
-	unsigned int num, max;
-
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -360,7 +357,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 
 		page = virt_to_head_page(buf);
-		--rq->num;
 
 		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
 		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
@@ -406,7 +402,6 @@ err_skb:
 		}
 		page = virt_to_head_page(buf);
 		put_page(page);
-		--rq->num;
 	}
 err_buf:
 	dev->stats.rx_dropped++;
@@ -628,10 +623,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 		oom = err == -ENOMEM;
 		if (err)
 			break;
-		++rq->num;
 	} while (rq->vq->num_free);
-	if (unlikely(rq->num > rq->max))
-		rq->max = rq->num;
 	if (unlikely(!virtqueue_kick(rq->vq)))
 		return false;
 	return !oom;
@@ -699,11 +691,10 @@ again:
 	while (received < budget &&
 	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
 		receive_buf(rq, buf, len);
-		--rq->num;
 		received++;
 	}
 
-	if (rq->num < rq->max / 2) {
+	if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
 		if (!try_fill_recv(rq, GFP_ATOMIC))
 			schedule_delayed_work(&vi->refill, 0);
 	}
@@ -1398,9 +1389,7 @@ static void free_unused_bufs(struct virtnet_info *vi)
 				give_pages(&vi->rq[i], buf);
 			else
 				dev_kfree_skb(buf);
-			--vi->rq[i].num;
 		}
-		BUG_ON(vi->rq[i].num != 0);
 	}
 }
 
@@ -1671,7 +1660,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		try_fill_recv(&vi->rq[i], GFP_KERNEL);
 
 		/* If we didn't even get one input buffer, we're useless. */
-		if (vi->rq[i].num == 0) {
+		if (vi->rq[i].vq->num_free ==
+		    virtqueue_get_vring_size(vi->rq[i].vq)) {
 			free_unused_bufs(vi);
 			err = -ENOMEM;
 			goto free_recv_bufs;
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next] tun/macvtap: limit the packets queued through rcvbuf
From: Michael S. Tsirkin @ 2014-01-16  6:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: davem, netdev, linux-kernel, Vlad Yasevich, John Fastabend,
	Stephen Hemminger, Herbert Xu
In-Reply-To: <52D7763A.4040100@redhat.com>

On Thu, Jan 16, 2014 at 02:03:38PM +0800, Jason Wang wrote:
> On 01/16/2014 01:47 PM, Michael S. Tsirkin wrote:
> >On Thu, Jan 16, 2014 at 12:29:35PM +0800, Jason Wang wrote:
> >>On 01/15/2014 03:21 PM, Michael S. Tsirkin wrote:
> >>>On Wed, Jan 15, 2014 at 11:36:01AM +0800, Jason Wang wrote:
> >>>>On 01/14/2014 05:52 PM, Michael S. Tsirkin wrote:
> >>>>>On Tue, Jan 14, 2014 at 04:45:24PM +0800, Jason Wang wrote:
> >>>>>>>On 01/14/2014 04:25 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>On Tue, Jan 14, 2014 at 02:53:07PM +0800, Jason Wang wrote:
> >>>>>>>>>>>We used to limit the number of packets queued through tx_queue_length. This
> >>>>>>>>>>>has several issues:
> >>>>>>>>>>>
> >>>>>>>>>>>- tx_queue_length is the control of qdisc queue length, simply reusing it
> >>>>>>>>>>>   to control the packets queued by device may cause confusion.
> >>>>>>>>>>>- After commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 ("macvtap: Add
> >>>>>>>>>>>   support of packet capture on macvtap device."), an unexpected qdisc
> >>>>>>>>>>>   caused by non-zero tx_queue_length will lead qdisc lock contention for
> >>>>>>>>>>>   multiqueue deivce.
> >>>>>>>>>>>- What we really want is to limit the total amount of memory occupied not
> >>>>>>>>>>>   the number of packets.
> >>>>>>>>>>>
> >>>>>>>>>>>So this patch tries to solve the above issues by using socket rcvbuf to
> >>>>>>>>>>>limit the packets could be queued for tun/macvtap. This was done by using
> >>>>>>>>>>>sock_queue_rcv_skb() instead of a direct call to skb_queue_tail(). Also two
> >>>>>>>>>>>new ioctl() were introduced for userspace to change the rcvbuf like what we
> >>>>>>>>>>>have done for sndbuf.
> >>>>>>>>>>>
> >>>>>>>>>>>With this fix, we can safely change the tx_queue_len of macvtap to
> >>>>>>>>>>>zero. This will make multiqueue works without extra lock contention.
> >>>>>>>>>>>
> >>>>>>>>>>>Cc: Vlad Yasevich<vyasevic@redhat.com>
> >>>>>>>>>>>Cc: Michael S. Tsirkin<mst@redhat.com>
> >>>>>>>>>>>Cc: John Fastabend<john.r.fastabend@intel.com>
> >>>>>>>>>>>Cc: Stephen Hemminger<stephen@networkplumber.org>
> >>>>>>>>>>>Cc: Herbert Xu<herbert@gondor.apana.org.au>
> >>>>>>>>>>>Signed-off-by: Jason Wang<jasowang@redhat.com>
> >>>>>>>>>No, I don't think we can change userspace-visible behaviour like that.
> >>>>>>>>>
> >>>>>>>>>This will break any existing user that tries to control
> >>>>>>>>>queue length through sysfs,netlink or device ioctl.
> >>>>>>>But it looks like a buggy API, since tx_queue_len should be for qdisc
> >>>>>>>queue length instead of device itself.
> >>>>>Probably, but it's been like this since 2.6.x time.
> >>>>>Also, qdisc queue is unused for tun so it seemed kind of
> >>>>>reasonable to override tx_queue_len.
> >>>>>
> >>>>>>>If we really want to preserve the
> >>>>>>>behaviour, how about using a new feature flag and change the behaviour
> >>>>>>>only when the device is created (TUNSETIFF) with the new flag?
> >>>>>OK this addresses the issue partially, but there's also an issue
> >>>>>of permissions: tx_queue_len can only be changed if
> >>>>>capable(CAP_NET_ADMIN). OTOH in your patch a regular user
> >>>>>can change the amount of memory consumed per queue
> >>>>>by calling TUNSETRCVBUF.
> >>>>Yes, but we have the same issue for TUNSETSNDBUF.
> >>>To an extent, but TUNSETSNDBUF is different. It limits how much device can queue
> >>>*in the networking stack* but each queue in the stack is also
> >>>limited, when we exceed that we star dropping packets.
> >>>So while with infinite value (which is the default btw)
> >>>you can keep host pretty busy, you will not be able to run
> >>>it out of memory.
> >>>
> >>>The proposed TUNSETRCVBUF would keep configured amount
> >>>of memory around indefinitely so you can run host out of memory.
> >>>
> >>>So assuming all this
> >>>How about an ethtool or netlink command to configure this
> >>>instead?
> >>>
> >>Ok, so we can add net admin check for before trying to set rcvbuf.
> >No, in practice I think using ioctl for sndbuf was also a mistake.
> >Applications have no idea what to set it to - you need to know what else
> >is running on the system, after a while
> >QEMU ended up setting it back to infinity otherwise things kept
> >breaking.
> 
> Yes, but it's not the problem of ioctl itself.
> >ethtool or netlink would not have this problem.
> >Which of the two is preferable I'm not sure.
> >I wonder what do management tools such as libvirt prefer.
> >
> 
> Libvirt usually create file descriptor, so it can also use ioctl()
> to set sndbuf.
> >>I
> >>think it's better to use ioctl since we've already use it for
> >>sndbuf. Using ethool means you need a dedicated new ethtool method
> >>just for tuntap which seems sub-optimal.
> >>Netlink looks better, but
> >>we should also implement other ioctl also.
> >I'm not sure what this last phrase means. Can you clarify pls?
> >
> 
> Sure, I mean if we choose netlink, we'd better add other function
> such as sndbuf, persistent flag and so on. Consider lots of existed
> function were implemented through ioctl(), using it also for rcvbuf
> looks ok and can simplify the changes of userspace.

I'm not sure how long will it take for userspace to
take advantage of this, or what guidance we'll give
userspace.

The advantage of ethtool is that users can tweak settings
without updating userspace management tools at all.


Basically this new interface comes to replace the use of
tx_queue_len which can be set from sysfs and ethtool
so it should not be harder to use.

> >>>>>>>>>Take a look at my patch in msg ID 20140109071721.GD19559@redhat.com
> >>>>>>>>>which gives one way to set tx_queue_len to zero without
> >>>>>>>>>breaking userspace.
> >>>>>>>If I read the patch correctly, it will make no way for the user who
> >>>>>>>really want to change the qdisc queue length for tun.
> >>>>>Why would this matter?  As far as I can see qdisc queue is currently unused.
> >>>>>
> >>>>User may use qdisc to do port mirroring, bandwidth limitation, traffic
> >>>>prioritization or more for a VM. So we do have users and maybe more
> >>>>consider the case of vpn.
> >>>Well it's not used by default at least.
> >>>I remember that we discussed this previously actually.
> >>>
> >>>If all we want to do actually is utilize no_qdisc by default,
> >>>we can simply use Eric's patch:
> >>>
> >>>http://article.gmane.org/gmane.linux.kernel/1279597
> >>>
> >>>and a similar patch for macvtap.
> >>>I tried it at the time and it didn't seem to help performance
> >>>at all, but a lot has changed since, in particular I didn't
> >>>test mq.
> >>>
> >>>If you now have results showing how it's beneficial, pls post them.
> >>>
> >>I will have a test to see the difference.
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 2/2 Net-next v4] ixgbe: set driver_max_VFs should be done before enabling SRIOV
From: ethan zhao @ 2014-01-16  6:19 UTC (permalink / raw)
  To: aaron.f.brown, jeffrey.t.kirsher
  Cc: e1000-devel, netdev, linux-kernel, ethan.kernel


commit 43dc4e01 Limit number of reported VFs to device specific value
It doesn't work and always returns -EBUSY because VFs ware already enabled.

ixgbe_enable_sriov()
         pci_enable_sriov()
                 sriov_enable()
                 {
                 ... ..
                 iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
                 pci_cfg_access_lock(dev);
                 ... ...
                 }

pci_sriov_set_totalvfs()
{
... ...
if (dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
                 return -EBUSY;
...
}

So should set driver_max_VFs with pci_sriov_set_totalvfs() before
enable VFs with ixgbe_enable_sriov().

V2: revised for net-next tree.
V4: remove one signoff of two.

Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index bea2cec..6e6af0d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7972,8 +7972,8 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
      /* Mailbox */
      ixgbe_init_mbx_params_pf(hw);
      memcpy(&hw->mbx.ops, ii->mbx_ops, sizeof(hw->mbx.ops));
-    ixgbe_enable_sriov(adapter);
      pci_sriov_set_totalvfs(pdev, IXGBE_MAX_VFS_DRV_LIMIT);
+    ixgbe_enable_sriov(adapter);
  skip_sriov:

  #endif
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next 0/2] net: mvneta: simple cleanups
From: Willy Tarreau @ 2014-01-16  6:16 UTC (permalink / raw)
  To: David Miller; +Cc: thomas.petazzoni, netdev, arno, linux-arm-kernel
In-Reply-To: <20140115.154122.1485420525760591929.davem@davemloft.net>

Hi David,

On Wed, Jan 15, 2014 at 03:41:22PM -0800, David Miller wrote:
> From: Arnaud Ebalard <arno@natisbad.org>
> Date: Wed, 15 Jan 2014 00:45:31 +0100
> 
> > 
> > Those two patches are intended for net-next. They apply on top of
> > performance improvements patches from Willy for mvneta driver.
> > They provide some simple cleanups for unused variables, function
> > params or return values.
> > 
> > Arnaud Ebalard (2):
> >   net: mvneta: mvneta_tx_done_gbe() cleanups
> >   net: mvneta: make mvneta_txq_done() return void
> 
> These patches do not apply to net-next, please respin.

They depend on the two series I sent, I verified that they apply well
once these two series are applied to net-next.

Would you prefer that we resend a complete series with all the patches
at once ?

Thanks,
Willy

^ permalink raw reply

* [PATCH 1/2 net-next v4] ixgbe: define IXGBE_MAX_VFS_DRV_LIMIT macro and cleanup const 63
From: ethan zhao @ 2014-01-16  6:16 UTC (permalink / raw)
  To: aaron.f.brown, jeffrey.t.kirsher
  Cc: e1000-devel, netdev, ethan.kernel, linux-kernel


Because ixgbe driver limit the max number of VF functions could be enalbed
to 63, so define one macro IXGBE_MAX_VFS_DRV_LIMIT and cleanup the const 63
in code.

v3: revised for net-next tree.
V4: remove one signoff of two.

Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    4 ++--
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    4 ++--
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |    5 +++++
  3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index cc06854..bea2cec 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5028,7 +5028,7 @@ static int ixgbe_sw_init(struct ixgbe_adapter 
*adapter)

      /* assign number of SR-IOV VFs */
      if (hw->mac.type != ixgbe_mac_82598EB) {
-        if (max_vfs > 63) {
+        if (max_vfs > IXGBE_MAX_VFS_DRV_LIMIT) {
              adapter->num_vfs = 0;
              e_dev_warn("max_vfs parameter out of range. Not assigning 
any SR-IOV VFs\n");
          } else {
@@ -7973,7 +7973,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
      ixgbe_init_mbx_params_pf(hw);
      memcpy(&hw->mbx.ops, ii->mbx_ops, sizeof(hw->mbx.ops));
      ixgbe_enable_sriov(adapter);
-    pci_sriov_set_totalvfs(pdev, 63);
+    pci_sriov_set_totalvfs(pdev, IXGBE_MAX_VFS_DRV_LIMIT);
  skip_sriov:

  #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 359f6e6..b324260 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -148,7 +148,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
           * physical function.  If the user requests greater thn
           * 63 VFs then it is an error - reset to default of zero.
           */
-        adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, 63);
+        adapter->num_vfs = min_t(unsigned int, adapter->num_vfs, 
IXGBE_MAX_VFS_DRV_LIMIT);

          err = pci_enable_sriov(adapter->pdev, adapter->num_vfs);
          if (err) {
@@ -257,7 +257,7 @@ static int ixgbe_pci_sriov_enable(struct pci_dev 
*dev, int num_vfs)
       * PF.  The PCI bus driver already checks for other values out of
       * range.
       */
-    if (num_vfs > 63) {
+    if (num_vfs > IXGBE_MAX_VFS_DRV_LIMIT) {
          err = -EPERM;
          goto err_out;
      }
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 4713f9f..8bd2919 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -28,6 +28,11 @@
  #ifndef _IXGBE_SRIOV_H_
  #define _IXGBE_SRIOV_H_

+/*  ixgbe driver limit the max number of VFs could be enabled to
+ *  63 (IXGBE_MAX_VF_FUNCTIONS - 1)
+ */
+#define IXGBE_MAX_VFS_DRV_LIMIT  (IXGBE_MAX_VF_FUNCTIONS - 1)
+
  void ixgbe_restore_vf_multicasts(struct ixgbe_adapter *adapter);
  void ixgbe_msg_task(struct ixgbe_adapter *adapter);
  int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
-- 
1.7.1


------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply related

* Re: [PATCH] net: davinci_mdio: Fix sparse warning
From: Mugunthan V N @ 2014-01-16  6:14 UTC (permalink / raw)
  To: Prabhakar Lad, David S. Miller, netdev; +Cc: linux-kernel
In-Reply-To: <1389852341-15393-1-git-send-email-prabhakar.csengg@gmail.com>

On Thursday 16 January 2014 11:35 AM, Prabhakar Lad wrote:
> From: "Lad, Prabhakar" <prabhakar.csengg@gmail.com>
>
> This patch fixes following sparse warning
> davinci_mdio.c:85:27: warning: symbol 'default_pdata' was not declared. Should it be static?
> Also makes the default_pdata as a constant.
>
> Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>

Regards
Mugunthan V N

^ permalink raw reply

* Re: [PATCH 1/2 Net-next] ixgbe: define IXGBE_MAX_VFS_DRV_LIMIT macro and cleanup const 63
From: Ethan Zhao @ 2014-01-16  6:12 UTC (permalink / raw)
  To: David Miller
  Cc: ethan zhao, Brown, Aaron F, Kirsher, Jeffrey T,
	e1000-devel@lists.sourceforge.net, netdev, LKML
In-Reply-To: <20140115.215753.1619877365795484712.davem@davemloft.net>

David,
   Sorry about the signoffs, I am confused about how to sign the patch,
If I sign it with off-work mail address, some of it is done within work-hours,
maybe that why I sign if with both personal mail address and work one.
If that bother you, just remove it.

Thanks,
Ethan

On Thu, Jan 16, 2014 at 1:57 PM, David Miller <davem@davemloft.net> wrote:
> From: ethan zhao <ethan.zhao@oracle.com>
> Date: Thu, 16 Jan 2014 12:25:01 +0800
>
>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>> Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
>
> Please don't give two signoffs for yourself, it is not appropriate at
> all.
>
> I am very genuinely curious where you got the idea to do that,
> particularly as I've never seen anyone else do it before.  So it's
> really not possible that you got the idea from someone else's actions.

^ permalink raw reply

* Re: TI CPSW Ethernet Tx performance regression
From: Mugunthan V N @ 2014-01-16  6:07 UTC (permalink / raw)
  To: Florian Fainelli, Ben Hutchings; +Cc: netdev
In-Reply-To: <CAGVrzca+jeetqMbQ8NvVgddrjEuay+GWZjZadekV2-AiE4OLyw@mail.gmail.com>

Hi

On Thursday 16 January 2014 02:51 AM, Florian Fainelli wrote:
> 2014/1/15 Ben Hutchings <bhutchings@solarflare.com>:
>> On Wed, 2014-01-15 at 18:18 +0530, Mugunthan V N wrote:
>>> Hi
>>>
>>> I am seeing a performance regression with CPSW driver on AM335x EVM. AM335x EVM
>>> CPSW has 3.2 kernel support [1] and Mainline support from 3.7. When I am
>>> comparing the performance between 3.2 and 3.13-rc4. TCP receive performance of
>>> CPSW between 3.2 and 3.13-rc4 is same (~180Mbps) but TCP Transmit performance
>>> is poor comparing to 3.2 kernel. In 3.2 kernel is it *256Mbps* and in 3.13-rc4
>>> it is *70Mbps*
>>>
>>> Iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both PC and EVM
>>>
>>> On UDP transmit also performance is down comparing to 3.2 kernel. In 3.2 it is
>>> 196Mbps for 200Mbps band width and in 3.13-rc4 it is 92Mbps
>>>
>>> Can someone point me out where can I look for improving Tx performance. I also
>>> checked whether there is Tx descriptor over flow and there is none. I have
>>> tries 3.11 and some older kernel, all are giving ~75Mbps Transmit performance
>>> only.
>>>
>>> [1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary
>> If you don't get any specific suggestions, you could try bisecting to
>> find out which specific commit(s) changed the performance.
> Not necessarily related to that issue, but there are a few
> weird/unusual things done in the CPSW interrupt handler:
>
> static irqreturn_t cpsw_interrupt(int irq, void *dev_id)
> {
>         struct cpsw_priv *priv = dev_id;
>
>         cpsw_intr_disable(priv);
>         if (priv->irq_enabled == true) {
>                 cpsw_disable_irq(priv);
>                 priv->irq_enabled = false;
>         }
>
>         if (netif_running(priv->ndev)) {
>                 napi_schedule(&priv->napi);
>                 return IRQ_HANDLED;
>         }
>
> Checking for netif_running() should not be required, you should not
> get any TX/RX interrupts if your interface is not running.

The driver also supports Dual EMAC with one physical device. More
description can be found in [1] under the topic *9.2.1.5.2 Dual Mac
Mode*. If the first interface is down and the second interface is up,
without checking the interface we will not know which napi to schedule.

>
>
>         priv = cpsw_get_slave_priv(priv, 1);
>         if (!priv)
>                 return IRQ_NONE;
>
> Should not this be moved up as the very first conditional check to do?
> is not there a risk to leave the interrupts disabled and not
> re-enabled due to the first 5 lines at the top?

This has to be kept here to check if the interrupt is triggered by the
second Ethernet port interface when the first interface is down.

>
>
>         if (netif_running(priv->ndev)) {
>                 napi_schedule(&priv->napi);
>                 return IRQ_HANDLED;
>         }
>
> This was done before, why doing it again?
>
> In drivers/net/ethernet/ti/davinci_cpdma.c::cpdma_chan_process()
> treats equally an error processing a packet (and will stop there) as
> well as successfully processing num_tx packets, is that also
> intentional? Should you attempt to keep processing "quota" packets?

I tried it in my local build but no success.

>
> As Ben suggests, bisecting what is causing the regression is your best bet here.

I can do a bisect but the issue is I don't have a good commit to bisect
as 3.2 kernel is TI maintained repo and is not upstreamed as is. CPSW
with base port support is available in mainline kernel from v3.7, and I
have tested till v3.7 and the Transmit performance is poor when compared
to v3.2 kernel maintained by TI.

[1] - http://www.ti.com/lit/ug/sprugz8e/sprugz8e.pdf

Regards
Mugunthan V N

^ permalink raw reply

* [PATCH] net: davinci_mdio: Fix sparse warning
From: Prabhakar Lad @ 2014-01-16  6:05 UTC (permalink / raw)
  To: David S. Miller, netdev; +Cc: linux-kernel, Mugunthan V N, Lad, Prabhakar

From: "Lad, Prabhakar" <prabhakar.csengg@gmail.com>

This patch fixes following sparse warning
davinci_mdio.c:85:27: warning: symbol 'default_pdata' was not declared. Should it be static?
Also makes the default_pdata as a constant.

Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
---
 drivers/net/ethernet/ti/davinci_mdio.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c
index 4ec9265..0cca9de 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -82,7 +82,7 @@ struct davinci_mdio_regs {
 	}	user[0];
 };
 
-struct mdio_platform_data default_pdata = {
+static const struct mdio_platform_data default_pdata = {
 	.bus_freq = DEF_OUT_FREQ,
 };
 
-- 
1.7.9.5

^ permalink raw reply related

* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: annie li @ 2014-01-16  6:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, wei.liu2, ian.campbell, xen-devel
In-Reply-To: <20140114.152825.2110618560908841742.davem@davemloft.net>


On 2014/1/15 7:28, David Miller wrote:
> From: Annie Li <Annie.li@oracle.com>
> Date: Fri, 10 Jan 2014 06:48:38 +0800
>
>> Current netfront only grants pages for grant copy, not for grant transfer, so
>> remove corresponding transfer code and add receiving copy code in
>> xennet_release_rx_bufs.
>>
>> Signed-off-by: Annie Li <Annie.li@oracle.com>
> If a Xen networking driver would review this I'd appreciate it.
>

I will re-post a new patch with improved comments. This patch mainly 
fixes leaking grant issue in netfront, and improvement on 
gnttab_end_foreign_access_ref and gnttab_end_foreign_access can be 
implemented in a separate patch.

Thanks
Annie

^ permalink raw reply

* Re: [PATCH net-next] tun/macvtap: limit the packets queued through rcvbuf
From: Jason Wang @ 2014-01-16  6:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, netdev, linux-kernel, Vlad Yasevich, John Fastabend,
	Stephen Hemminger, Herbert Xu
In-Reply-To: <20140116054700.GC8915@redhat.com>

On 01/16/2014 01:47 PM, Michael S. Tsirkin wrote:
> On Thu, Jan 16, 2014 at 12:29:35PM +0800, Jason Wang wrote:
>> On 01/15/2014 03:21 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 15, 2014 at 11:36:01AM +0800, Jason Wang wrote:
>>>> On 01/14/2014 05:52 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Jan 14, 2014 at 04:45:24PM +0800, Jason Wang wrote:
>>>>>>> On 01/14/2014 04:25 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Tue, Jan 14, 2014 at 02:53:07PM +0800, Jason Wang wrote:
>>>>>>>>>>> We used to limit the number of packets queued through tx_queue_length. This
>>>>>>>>>>> has several issues:
>>>>>>>>>>>
>>>>>>>>>>> - tx_queue_length is the control of qdisc queue length, simply reusing it
>>>>>>>>>>>    to control the packets queued by device may cause confusion.
>>>>>>>>>>> - After commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 ("macvtap: Add
>>>>>>>>>>>    support of packet capture on macvtap device."), an unexpected qdisc
>>>>>>>>>>>    caused by non-zero tx_queue_length will lead qdisc lock contention for
>>>>>>>>>>>    multiqueue deivce.
>>>>>>>>>>> - What we really want is to limit the total amount of memory occupied not
>>>>>>>>>>>    the number of packets.
>>>>>>>>>>>
>>>>>>>>>>> So this patch tries to solve the above issues by using socket rcvbuf to
>>>>>>>>>>> limit the packets could be queued for tun/macvtap. This was done by using
>>>>>>>>>>> sock_queue_rcv_skb() instead of a direct call to skb_queue_tail(). Also two
>>>>>>>>>>> new ioctl() were introduced for userspace to change the rcvbuf like what we
>>>>>>>>>>> have done for sndbuf.
>>>>>>>>>>>
>>>>>>>>>>> With this fix, we can safely change the tx_queue_len of macvtap to
>>>>>>>>>>> zero. This will make multiqueue works without extra lock contention.
>>>>>>>>>>>
>>>>>>>>>>> Cc: Vlad Yasevich<vyasevic@redhat.com>
>>>>>>>>>>> Cc: Michael S. Tsirkin<mst@redhat.com>
>>>>>>>>>>> Cc: John Fastabend<john.r.fastabend@intel.com>
>>>>>>>>>>> Cc: Stephen Hemminger<stephen@networkplumber.org>
>>>>>>>>>>> Cc: Herbert Xu<herbert@gondor.apana.org.au>
>>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>>>>>>>>> No, I don't think we can change userspace-visible behaviour like that.
>>>>>>>>>
>>>>>>>>> This will break any existing user that tries to control
>>>>>>>>> queue length through sysfs,netlink or device ioctl.
>>>>>>> But it looks like a buggy API, since tx_queue_len should be for qdisc
>>>>>>> queue length instead of device itself.
>>>>> Probably, but it's been like this since 2.6.x time.
>>>>> Also, qdisc queue is unused for tun so it seemed kind of
>>>>> reasonable to override tx_queue_len.
>>>>>
>>>>>>> If we really want to preserve the
>>>>>>> behaviour, how about using a new feature flag and change the behaviour
>>>>>>> only when the device is created (TUNSETIFF) with the new flag?
>>>>> OK this addresses the issue partially, but there's also an issue
>>>>> of permissions: tx_queue_len can only be changed if
>>>>> capable(CAP_NET_ADMIN). OTOH in your patch a regular user
>>>>> can change the amount of memory consumed per queue
>>>>> by calling TUNSETRCVBUF.
>>>> Yes, but we have the same issue for TUNSETSNDBUF.
>>> To an extent, but TUNSETSNDBUF is different. It limits how much device can queue
>>> *in the networking stack* but each queue in the stack is also
>>> limited, when we exceed that we star dropping packets.
>>> So while with infinite value (which is the default btw)
>>> you can keep host pretty busy, you will not be able to run
>>> it out of memory.
>>>
>>> The proposed TUNSETRCVBUF would keep configured amount
>>> of memory around indefinitely so you can run host out of memory.
>>>
>>> So assuming all this
>>> How about an ethtool or netlink command to configure this
>>> instead?
>>>
>> Ok, so we can add net admin check for before trying to set rcvbuf.
> No, in practice I think using ioctl for sndbuf was also a mistake.
> Applications have no idea what to set it to - you need to know what else
> is running on the system, after a while
> QEMU ended up setting it back to infinity otherwise things kept
> breaking.

Yes, but it's not the problem of ioctl itself.
> ethtool or netlink would not have this problem.
> Which of the two is preferable I'm not sure.
> I wonder what do management tools such as libvirt prefer.
>

Libvirt usually create file descriptor, so it can also use ioctl() to 
set sndbuf.
>> I
>> think it's better to use ioctl since we've already use it for
>> sndbuf. Using ethool means you need a dedicated new ethtool method
>> just for tuntap which seems sub-optimal.
>> Netlink looks better, but
>> we should also implement other ioctl also.
> I'm not sure what this last phrase means. Can you clarify pls?
>

Sure, I mean if we choose netlink, we'd better add other function such 
as sndbuf, persistent flag and so on. Consider lots of existed function 
were implemented through ioctl(), using it also for rcvbuf looks ok and 
can simplify the changes of userspace.
>>>>>>>>> Take a look at my patch in msg ID 20140109071721.GD19559@redhat.com
>>>>>>>>> which gives one way to set tx_queue_len to zero without
>>>>>>>>> breaking userspace.
>>>>>>> If I read the patch correctly, it will make no way for the user who
>>>>>>> really want to change the qdisc queue length for tun.
>>>>> Why would this matter?  As far as I can see qdisc queue is currently unused.
>>>>>
>>>> User may use qdisc to do port mirroring, bandwidth limitation, traffic
>>>> prioritization or more for a VM. So we do have users and maybe more
>>>> consider the case of vpn.
>>> Well it's not used by default at least.
>>> I remember that we discussed this previously actually.
>>>
>>> If all we want to do actually is utilize no_qdisc by default,
>>> we can simply use Eric's patch:
>>>
>>> http://article.gmane.org/gmane.linux.kernel/1279597
>>>
>>> and a similar patch for macvtap.
>>> I tried it at the time and it didn't seem to help performance
>>> at all, but a lot has changed since, in particular I didn't
>>> test mq.
>>>
>>> If you now have results showing how it's beneficial, pls post them.
>>>
>> I will have a test to see the difference.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: David Miller @ 2014-01-16  6:01 UTC (permalink / raw)
  To: fubar; +Cc: vfalico, netdev, andy
In-Reply-To: <15764.1389848997@death.nxdomain>

From: Jay Vosburgh <fubar@us.ibm.com>
Date: Wed, 15 Jan 2014 21:09:57 -0800

> 	The main reason for preserving the non-validate behavior (any
> traffic counts) is for the loadbalance (xor and rr) modes.  In those
> modes, the switch decides which slave receives the incoming traffic, and
> so it's to our advantage to permit any incoming traffic to count for
> "up-ness."  The arp_validate option is not allowed in these modes
> because it won't work.
> 
> 	With these changes, I suspect that the loadbalance ARP monitor
> will be less reliable with these changes (granted that it's already a
> bit dodgy in its dependence on the switch to hit all slaves with
> incoming packets regularly).  Particularly if the switch ports are
> configured into an Etherchannel ("static link aggregation") group, in
> which case only one slave will receive any given frame (broadcast /
> multicast traffic will not be duplicated across all slaves).
> 
> 	I'm not sure that this change (the "only count ARPs even without
> arp_validate" bit) won't break existing configurations.  Did you test
> the -rr and -xor modes with ARP monitor after your changes (with and
> without configuring a channel group on the switch ports)?

Sorry Jay, I only read this just now.  I won't push these changes
out until you've had some time to discuss them.

To my untrained eye they looked rather straightforward :-)

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: annie li @ 2014-01-16  5:59 UTC (permalink / raw)
  To: David Vrabel; +Cc: Wei Liu, xen-devel, netdev, ian.campbell
In-Reply-To: <52D6A5B7.40503@citrix.com>


On 2014/1/15 23:13, David Vrabel wrote:
> On 15/01/14 14:17, annie li wrote:
>> I am thinking of two ways, and they can be implemented in new patches.
>> 1. If gnttab_end_foreign_access_ref succeeds, then kfree_skb is called
>> to free skb. Otherwise, using gnttab_end_foreign_access to release ref
>> and pages.
>> 2. Add a similar deferred way of gnttab_end_foreign_access in
>> gnttab_end_foreign_access_ref.
> Something like the following (untested!) patch is what I'm thinking of.
>
> Fixups to existing drivers (except netfront) are included but may not be
> quite correct.

Part of it implements the 1 above, if gnttab_end_foreign_access_ref 
fails and then use gnttab_end_foreign_access to end the grant. Another 
is splitting __free_page from gnttab_end_foreign_access. This is 
improvement for current grant end access, and all drivers that involve 
gnttab_end_foreign_access needs to do corresponding change.
I think it can be a separate patch from my clean up patch which fixes 
grant leaking in netfront.

Thanks
Annie
>
> 8<----------
>  From 76c254c8e020f4427e8f37c867622f0bfd5ac85f Mon Sep 17 00:00:00 2001
> From: David Vrabel <david.vrabel@citrix.com>
> Date: Wed, 15 Jan 2014 15:04:52 +0000
> Subject: [PATCH] HACK! xen/gnttab: make gnttab_end_foreign_access() more
> useful
>
> This is UNTESTED and is an example of the sort of change I'm looking
> for.
>
> Freeing the page in gnttab_end_foreign_access() means it cannot be
> used where the pages are freed in some other way (e.g., as part of a
> kfree_skb()).
>
> Leave the freeing of the page to the caller.  If the page still has
> foreign users, take an additional reference before adding it to the
> deferred list.
>
> Hack all the users of the call to do something resembling the right
> thing.  Not quite sure on the blkfront changes.
> ---
>   drivers/block/xen-blkfront.c    |   22 +++++++++++++---------
>   drivers/char/tpm/xen-tpmfront.c |    3 +--
>   drivers/pci/xen-pcifront.c      |    3 +--
>   drivers/xen/grant-table.c       |   19 +++++++++++--------
>   include/xen/grant_table.h       |   11 ++++++-----
>   5 files changed, 32 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index aa846a4..a586496 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -504,7 +504,7 @@ static void gnttab_handle_deferred(unsigned long unused)
>   			if (entry->page) {
>   				pr_debug("freeing g.e. %#x (pfn %#lx)\n",
>   					 entry->ref, page_to_pfn(entry->page));
> -				__free_page(entry->page);
> +				put_page(entry->page);
>   			} else
>   				pr_info("freeing g.e. %#x\n", entry->ref);
>   			kfree(entry);
> @@ -555,15 +555,18 @@ static void gnttab_add_deferred(grant_ref_t ref,
> bool readonly,
>   }
>
>   void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
> -			       unsigned long page)
> +			       struct page *page)
>   {
> -	if (gnttab_end_foreign_access_ref(ref, readonly)) {
> +	if (gnttab_end_foreign_access_ref(ref, readonly))
>   		put_free_entry(ref);
> -		if (page != 0)
> -			free_page(page);
> -	} else
> -		gnttab_add_deferred(ref, readonly,
> -				    page ? virt_to_page(page) : NULL);
> +	else {
> +		/*
> +		 * Take a reference to the page so it's not freed
> +		 * while the foreign domain still has access to it.
> +		 */
> +		get_page(page);
> +		gnttab_add_deferred(ref, readonly, page);
> +	}
>   }
>   EXPORT_SYMBOL_GPL(gnttab_end_foreign_access);
>
> diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
> index 694dcaf..ffa3ce6 100644
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -91,13 +91,14 @@ bool gnttab_trans_grants_available(void);
>   int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly);
>
>   /*
> - * Eventually end access through the given grant reference, and once that
> - * access has been ended, free the given page too.  Access will be ended
> - * immediately iff the grant entry is not in use, otherwise it will happen
> - * some time later.  page may be 0, in which case no freeing will occur.
> + * Eventually end access through the given grant reference, if the
> + * page is still in use an additional reference is taken and released
> + * when access has ended.  Access will be ended immediately iff the
> + * grant entry is not in use, otherwise it will happen some time
> + * later.
>    */
>   void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
> -			       unsigned long page);
> +			       struct page *page);
>
>   int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn);
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index c4a4c90..45a2a01 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -931,14 +931,16 @@ static void blkif_free(struct blkfront_info *info,
> int suspend)
>   	if (!list_empty(&info->grants)) {
>   		list_for_each_entry_safe(persistent_gnt, n,
>   		                         &info->grants, node) {
> +			struct page *page = pfn_to_page(persistent_gnt->pfn);
> +
>   			list_del(&persistent_gnt->node);
>   			if (persistent_gnt->gref != GRANT_INVALID_REF) {
>   				gnttab_end_foreign_access(persistent_gnt->gref,
> -				                          0, 0UL);
> +				                          0, page);
>   				info->persistent_gnts_c--;
>   			}
>   			if (info->feature_persistent)
> -				__free_page(pfn_to_page(persistent_gnt->pfn));
> +				__free_page(page);
>   			kfree(persistent_gnt);
>   		}
>   	}
> @@ -970,10 +972,13 @@ static void blkif_free(struct blkfront_info *info,
> int suspend)
>   		       info->shadow[i].req.u.indirect.nr_segments :
>   		       info->shadow[i].req.u.rw.nr_segments;
>   		for (j = 0; j < segs; j++) {
> +			struct page *page;
> +
>   			persistent_gnt = info->shadow[i].grants_used[j];
> -			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> +			page = pfn_to_page(persistent_gnt->pfn);
> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, page);
>   			if (info->feature_persistent)
> -				__free_page(pfn_to_page(persistent_gnt->pfn));
> +				__free_page(page);
>   			kfree(persistent_gnt);
>   		}
>
> @@ -1010,10 +1015,11 @@ free_shadow:
>   	/* Free resources associated with old device channel. */
>   	if (info->ring_ref != GRANT_INVALID_REF) {
>   		gnttab_end_foreign_access(info->ring_ref, 0,
> -					  (unsigned long)info->ring.sring);
> +					  virt_to_page(info->ring.sring));
>   		info->ring_ref = GRANT_INVALID_REF;
>   		info->ring.sring = NULL;
>   	}
> +	free_page((unsigned long)info->ring.sring);
>   	if (info->irq)
>   		unbind_from_irqhandler(info->irq, info);
>   	info->evtchn = info->irq = 0;
> @@ -1053,7 +1059,7 @@ static void blkif_completion(struct blk_shadow *s,
> struct blkfront_info *info,
>   	}
>   	/* Add the persistent grant into the list of free grants */
>   	for (i = 0; i < nseg; i++) {
> -		if (gnttab_query_foreign_access(s->grants_used[i]->gref)) {
> +		if (gnttab_end_foreign_access_ref(s->grants_used[i]->gref, 0)) {
>   			/*
>   			 * If the grant is still mapped by the backend (the
>   			 * backend has chosen to make this grant persistent)
> @@ -1072,14 +1078,13 @@ static void blkif_completion(struct blk_shadow
> *s, struct blkfront_info *info,
>   			 * so it will not be picked again unless we run out of
>   			 * persistent grants.
>   			 */
> -			gnttab_end_foreign_access(s->grants_used[i]->gref, 0, 0UL);
>   			s->grants_used[i]->gref = GRANT_INVALID_REF;
>   			list_add_tail(&s->grants_used[i]->node, &info->grants);
>   		}
>   	}
>   	if (s->req.operation == BLKIF_OP_INDIRECT) {
>   		for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
> -			if (gnttab_query_foreign_access(s->indirect_grants[i]->gref)) {
> +			if (gnttab_end_foreign_access_ref(s->indirect_grants[i]->gref, 0)) {
>   				if (!info->feature_persistent)
>   					pr_alert_ratelimited("backed has not unmapped grant: %u\n",
>   							     s->indirect_grants[i]->gref);
> @@ -1088,7 +1093,6 @@ static void blkif_completion(struct blk_shadow *s,
> struct blkfront_info *info,
>   			} else {
>   				struct page *indirect_page;
>
> -				gnttab_end_foreign_access(s->indirect_grants[i]->gref, 0, 0UL);
>   				/*
>   				 * Add the used indirect page back to the list of
>   				 * available pages for indirect grefs.
> diff --git a/drivers/char/tpm/xen-tpmfront.c
> b/drivers/char/tpm/xen-tpmfront.c
> index c8ff4df..00d1132 100644
> --- a/drivers/char/tpm/xen-tpmfront.c
> +++ b/drivers/char/tpm/xen-tpmfront.c
> @@ -315,8 +315,7 @@ static void ring_free(struct tpm_private *priv)
>   	if (priv->ring_ref)
>   		gnttab_end_foreign_access(priv->ring_ref, 0,
>   				(unsigned long)priv->shr);
> -	else
> -		free_page((unsigned long)priv->shr);
> +	free_page((unsigned long)priv->shr);
>
>   	if (priv->chip && priv->chip->vendor.irq)
>   		unbind_from_irqhandler(priv->chip->vendor.irq, priv);
> diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
> index f7197a7..253a129 100644
> --- a/drivers/pci/xen-pcifront.c
> +++ b/drivers/pci/xen-pcifront.c
> @@ -759,8 +759,7 @@ static void free_pdev(struct pcifront_device *pdev)
>   	if (pdev->gnt_ref != INVALID_GRANT_REF)
>   		gnttab_end_foreign_access(pdev->gnt_ref, 0 /* r/w page */,
>   					  (unsigned long)pdev->sh_info);
> -	else
> -		free_page((unsigned long)pdev->sh_info);
> +	free_page((unsigned long)pdev->sh_info);
>
>   	dev_set_drvdata(&pdev->xdev->dev, NULL);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2 Net-next] ixgbe: define IXGBE_MAX_VFS_DRV_LIMIT macro and cleanup const 63
From: David Miller @ 2014-01-16  5:57 UTC (permalink / raw)
  To: ethan.zhao
  Cc: aaron.f.brown, jeffrey.t.kirsher, e1000-devel, netdev,
	linux-kernel, ethan.kernel
In-Reply-To: <52D75F1D.9040405@oracle.com>

From: ethan zhao <ethan.zhao@oracle.com>
Date: Thu, 16 Jan 2014 12:25:01 +0800

> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>

Please don't give two signoffs for yourself, it is not appropriate at
all.

I am very genuinely curious where you got the idea to do that,
particularly as I've never seen anyone else do it before.  So it's
really not possible that you got the idea from someone else's actions.

^ permalink raw reply

* [PATCH net-next 2/2] bonding: add netlink attributes to slave link dev
From: Scott Feldman @ 2014-01-16  5:54 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm

If link is IFF_SLAVE, extend link dev netlink attributes to include
slave attributes with new IFLA_SLAVE nest.  Add netlink notification
(RTM_NEWLINK) when slave status changes from backup to active, or
visa-versa.

Adds new ndo_get_slave op to net_device_ops to fill skb with IFLA_SLAVE
attributes.  Currently only used by bonding driver, but could be
used by other aggregating devices with slaves.

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
---
 drivers/net/bonding/bond_main.c    |    1 +
 drivers/net/bonding/bond_netlink.c |   36 ++++++++++++++++++++++++
 drivers/net/bonding/bonding.h      |   11 ++++++-
 include/linux/netdevice.h          |    5 +++
 include/uapi/linux/if_link.h       |   13 +++++++++
 net/core/rtnetlink.c               |   54 ++++++++++++++++++++++++++++++++++++
 6 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 4f1adae..a9c8e3b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3865,6 +3865,7 @@ static const struct net_device_ops bond_netdev_ops = {
 #endif
 	.ndo_add_slave		= bond_enslave,
 	.ndo_del_slave		= bond_release,
+	.ndo_get_slave		= bond_get_slave,
 	.ndo_fix_features	= bond_fix_features,
 };
 
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 555c783..21c6488 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -22,6 +22,42 @@
 #include <linux/reciprocal_div.h>
 #include "bonding.h"
 
+int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb)
+{
+	struct slave *slave = bond_slave_get_rtnl(slave_dev);
+	const struct aggregator *agg;
+
+	if (nla_put_u8(skb, IFLA_SLAVE_STATE, bond_slave_state(slave)))
+		goto nla_put_failure;
+
+	if (nla_put_u8(skb, IFLA_SLAVE_MII_STATUS, slave->link))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, IFLA_SLAVE_LINK_FAILURE_COUNT,
+			slave->link_failure_count))
+		goto nla_put_failure;
+
+	if (nla_put(skb, IFLA_SLAVE_PERM_HWADDR,
+		    slave_dev->addr_len, slave->perm_hwaddr))
+		goto nla_put_failure;
+
+	if (nla_put_u16(skb, IFLA_SLAVE_QUEUE_ID, slave->queue_id))
+		goto nla_put_failure;
+
+	if (slave->bond->params.mode == BOND_MODE_8023AD) {
+		agg = SLAVE_AD_INFO(slave).port.aggregator;
+		if (agg)
+			if (nla_put_u16(skb, IFLA_SLAVE_AD_AGGREGATOR_ID,
+					agg->aggregator_identifier))
+				goto nla_put_failure;
+	}
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
 	[IFLA_BOND_MODE]		= { .type = NLA_U8 },
 	[IFLA_BOND_ACTIVE_SLAVE]	= { .type = NLA_U32 },
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 309757d..8a935f8 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -285,12 +285,18 @@ static inline bool bond_is_lb(const struct bonding *bond)
 
 static inline void bond_set_active_slave(struct slave *slave)
 {
-	slave->backup = 0;
+	if (slave->backup) {
+		slave->backup = 0;
+		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_KERNEL);
+	}
 }
 
 static inline void bond_set_backup_slave(struct slave *slave)
 {
-	slave->backup = 1;
+	if (!slave->backup) {
+		slave->backup = 1;
+		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_KERNEL);
+	}
 }
 
 static inline int bond_slave_state(struct slave *slave)
@@ -426,6 +432,7 @@ int bond_sysfs_slave_add(struct slave *slave);
 void bond_sysfs_slave_del(struct slave *slave);
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
+int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb);
 int bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, int count);
 int bond_parse_parm(const char *mode_arg, const struct bond_parm_tbl *tbl);
 int bond_parm_tbl_lookup(int mode, const struct bond_parm_tbl *tbl);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 30f6513..3d033cc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -908,6 +908,9 @@ struct netdev_phys_port_id {
  * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
  *	Called to release previously enslaved netdev.
  *
+ * int (*ndo_get_slave)(struct net_device *slave_dev, struct sk_buff *skb);
+ *	Called to fill netlink skb with slave info.
+ *
  *      Feature/offload setting functions.
  * netdev_features_t (*ndo_fix_features)(struct net_device *dev,
  *		netdev_features_t features);
@@ -1080,6 +1083,8 @@ struct net_device_ops {
 						 struct net_device *slave_dev);
 	int			(*ndo_del_slave)(struct net_device *dev,
 						 struct net_device *slave_dev);
+	int			(*ndo_get_slave)(struct net_device *slave_dev,
+						 struct sk_buff *skb);
 	netdev_features_t	(*ndo_fix_features)(struct net_device *dev,
 						    netdev_features_t features);
 	int			(*ndo_set_features)(struct net_device *dev,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 3e6bd3c..ba2f3bf 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -144,6 +144,7 @@ enum {
 	IFLA_NUM_RX_QUEUES,
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
+	IFLA_SLAVE,
 	__IFLA_MAX
 };
 
@@ -368,6 +369,18 @@ enum {
 
 #define IFLA_BOND_AD_INFO_MAX	(__IFLA_BOND_AD_INFO_MAX - 1)
 
+enum {
+	IFLA_SLAVE_STATE,
+	IFLA_SLAVE_MII_STATUS,
+	IFLA_SLAVE_LINK_FAILURE_COUNT,
+	IFLA_SLAVE_PERM_HWADDR,
+	IFLA_SLAVE_QUEUE_ID,
+	IFLA_SLAVE_AD_AGGREGATOR_ID,
+	__IFLA_SLAVE_MAX,
+};
+
+#define IFLA_SLAVE_MAX	(__IFLA_SLAVE_MAX - 1)
+
 /* SR-IOV virtual function management section */
 
 enum {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6e7d58..4f85de7 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -721,6 +721,28 @@ static size_t rtnl_port_size(const struct net_device *dev)
 		return port_self_size;
 }
 
+static size_t rtnl_bond_slave_size(const struct net_device *dev)
+{
+	struct net_device *bond;
+	size_t slave_size =
+		nla_total_size(sizeof(struct nlattr)) +	/* IFLA_SLAVE */
+		nla_total_size(1) +	/* IFLA_SLAVE_STATE */
+		nla_total_size(1) +	/* IFLA_SLAVE_MII_STATUS */
+		nla_total_size(4) +	/* IFLA_SLAVE_LINK_FAILURE_COUNT */
+		nla_total_size(MAX_ADDR_LEN) +	/* IFLA_SLAVE_PERM_HWADDR */
+		nla_total_size(2) +	/* IFLA_SLAVE_QUEUE_ID */
+		nla_total_size(2) +	/* IFLA_SLAVE_AD_AGGREGATOR_ID */
+		0;
+
+	if (netif_is_bond_slave((struct net_device *)dev)) {
+		bond = netdev_master_upper_dev_get((struct net_device *)dev);
+		if (bond && bond->netdev_ops->ndo_get_slave)
+			return slave_size;
+	}
+
+	return 0;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 				     u32 ext_filter_mask)
 {
@@ -750,6 +772,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
+	       + rtnl_bond_slave_size(dev) /* IFLA_SLAVE */
 	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
 }
 
@@ -847,6 +870,34 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static size_t rtnl_bond_slave_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct net_device *bond;
+	struct nlattr *nest;
+	int err;
+
+	if (!netif_is_bond_slave(dev))
+		return 0;
+
+	bond = netdev_master_upper_dev_get(dev);
+	if (!bond || !bond->netdev_ops->ndo_get_slave)
+		return 0;
+
+	nest = nla_nest_start(skb, IFLA_SLAVE);
+	if (!nest)
+		return -EMSGSIZE;
+
+	err = bond->netdev_ops->ndo_get_slave(dev, skb);
+	if (err) {
+		nla_nest_cancel(skb, nest);
+		return (err == -EMSGSIZE) ? err : 0;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1001,6 +1052,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_port_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_bond_slave_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;

^ permalink raw reply related

* [PATCH net-next 1/2] bonding: add sysfs /slave dir for bond slave devices.
From: Scott Feldman @ 2014-01-16  5:54 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm

Add sub-directory under /sys/class/net/<interface>/slave with
read-only attributes for slave.  Directory only appears when
<interface> is a slave.

$ tree /sys/class/net/eth2/slave/
/sys/class/net/eth2/slave/
├── ad_aggregator_id
├── link_failure_count
├── mii_status
├── perm_hwaddr
├── queue_id
└── state

$ cat /sys/class/net/eth2/slave/*
2
0
up
40:02:10:ef:06:01
0
active

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
---
 drivers/net/bonding/Makefile           |    2 
 drivers/net/bonding/bond_main.c        |   27 ++++++
 drivers/net/bonding/bond_procfs.c      |   12 ---
 drivers/net/bonding/bond_sysfs_slave.c |  142 ++++++++++++++++++++++++++++++++
 drivers/net/bonding/bonding.h          |    4 +
 5 files changed, 174 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/bonding/bond_sysfs_slave.c

diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 5a5d720..6f4e808 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_BONDING) += bonding.o
 
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o bond_netlink.o bond_options.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
 
 proc-$(CONFIG_PROC_FS) += bond_procfs.o
 bonding-objs += $(proc-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f2fe6cb..4f1adae 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -466,6 +466,22 @@ static void bond_update_speed_duplex(struct slave *slave)
 	return;
 }
 
+const char *bond_slave_link_status(s8 link)
+{
+	switch (link) {
+	case BOND_LINK_UP:
+		return "up";
+	case BOND_LINK_FAIL:
+		return "going down";
+	case BOND_LINK_DOWN:
+		return "down";
+	case BOND_LINK_BACK:
+		return "going back";
+	default:
+		return "unknown";
+	}
+}
+
 /*
  * if <dev> supports MII link status reporting, check its link status.
  *
@@ -1576,6 +1592,12 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		goto err_unregister;
 	}
 
+	res = bond_sysfs_slave_add(new_slave);
+	if (res) {
+		pr_debug("Error %d calling bond_sysfs_slave_add\n", res);
+		goto err_upper_unlink;
+	}
+
 	bond->slave_cnt++;
 	bond_compute_features(bond);
 	bond_set_carrier(bond);
@@ -1595,6 +1617,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	return 0;
 
 /* Undo stages on error */
+err_upper_unlink:
+	bond_upper_dev_unlink(bond_dev, slave_dev);
+
 err_unregister:
 	netdev_rx_handler_unregister(slave_dev);
 
@@ -1687,6 +1712,8 @@ static int __bond_release_one(struct net_device *bond_dev,
 	/* release the slave from its bond */
 	bond->slave_cnt--;
 
+	bond_sysfs_slave_del(slave);
+
 	bond_upper_dev_unlink(bond_dev, slave_dev);
 	/* unregister rx_handler early so bond_handle_frame wouldn't be called
 	 * for this slave anymore.
diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index fb868d6..8515b344 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -159,18 +159,6 @@ static void bond_info_show_master(struct seq_file *seq)
 	}
 }
 
-static const char *bond_slave_link_status(s8 link)
-{
-	static const char * const status[] = {
-		[BOND_LINK_UP] = "up",
-		[BOND_LINK_FAIL] = "going down",
-		[BOND_LINK_DOWN] = "down",
-		[BOND_LINK_BACK] = "going back",
-	};
-
-	return status[link];
-}
-
 static void bond_info_show_slave(struct seq_file *seq,
 				 const struct slave *slave)
 {
diff --git a/drivers/net/bonding/bond_sysfs_slave.c b/drivers/net/bonding/bond_sysfs_slave.c
new file mode 100644
index 0000000..28390af
--- /dev/null
+++ b/drivers/net/bonding/bond_sysfs_slave.c
@@ -0,0 +1,142 @@
+/*	Sysfs attributes of bond slaves
+ *
+ *      Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/capability.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+
+#include "bonding.h"
+
+struct slave_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct slave *, char *);
+};
+
+#define SLAVE_ATTR(_name, _mode, _show)				\
+const struct slave_attribute slave_attr_##_name = {		\
+	.attr = {.name = __stringify(_name),			\
+		 .mode = _mode },				\
+	.show	= _show,					\
+};
+#define SLAVE_ATTR_RO(_name) \
+	SLAVE_ATTR(_name, S_IRUGO, _name##_show)
+
+static ssize_t state_show(struct slave *slave, char *buf)
+{
+	switch (bond_slave_state(slave)) {
+	case BOND_STATE_ACTIVE:
+		return sprintf(buf, "active\n");
+	case BOND_STATE_BACKUP:
+		return sprintf(buf, "backup\n");
+	default:
+		return sprintf(buf, "UNKONWN\n");
+	}
+}
+static SLAVE_ATTR_RO(state);
+
+static ssize_t mii_status_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%s\n", bond_slave_link_status(slave->link));
+}
+static SLAVE_ATTR_RO(mii_status);
+
+static ssize_t link_failure_count_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%d\n", slave->link_failure_count);
+}
+static SLAVE_ATTR_RO(link_failure_count);
+
+static ssize_t perm_hwaddr_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%pM\n", slave->perm_hwaddr);
+}
+static SLAVE_ATTR_RO(perm_hwaddr);
+
+static ssize_t queue_id_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%d\n", slave->queue_id);
+}
+static SLAVE_ATTR_RO(queue_id);
+
+static ssize_t ad_aggregator_id_show(struct slave *slave, char *buf)
+{
+	const struct aggregator *agg;
+
+	if (slave->bond->params.mode == BOND_MODE_8023AD) {
+		agg = SLAVE_AD_INFO(slave).port.aggregator;
+		if (agg)
+			return sprintf(buf, "%d\n",
+				       agg->aggregator_identifier);
+	}
+
+	return sprintf(buf, "N/A\n");
+}
+static SLAVE_ATTR_RO(ad_aggregator_id);
+
+static const struct slave_attribute *slave_attrs[] = {
+	&slave_attr_state,
+	&slave_attr_mii_status,
+	&slave_attr_link_failure_count,
+	&slave_attr_perm_hwaddr,
+	&slave_attr_queue_id,
+	&slave_attr_ad_aggregator_id,
+	NULL
+};
+
+#define to_slave_attr(_at) container_of(_at, struct slave_attribute, attr)
+#define to_slave(obj)	container_of(obj, struct slave, kobj)
+
+static ssize_t slave_show(struct kobject *kobj,
+			  struct attribute *attr, char *buf)
+{
+	struct slave_attribute *slave_attr = to_slave_attr(attr);
+	struct slave *slave = to_slave(kobj);
+
+	return slave_attr->show(slave, buf);
+}
+
+const struct sysfs_ops slave_sysfs_ops = {
+	.show = slave_show,
+};
+
+static struct kobj_type slave_ktype = {
+#ifdef CONFIG_SYSFS
+	.sysfs_ops = &slave_sysfs_ops,
+#endif
+};
+
+int bond_sysfs_slave_add(struct slave *slave)
+{
+	const struct slave_attribute **a;
+	int err;
+
+	err = kobject_init_and_add(&slave->kobj, &slave_ktype,
+				   &(slave->dev->dev.kobj), "slave");
+	if (err)
+		return err;
+
+	for (a = slave_attrs; *a; ++a) {
+		err = sysfs_create_file(&slave->kobj, &((*a)->attr));
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+void bond_sysfs_slave_del(struct slave *slave)
+{
+	const struct slave_attribute **a;
+
+	for (a = slave_attrs; *a; ++a)
+		sysfs_remove_file(&slave->kobj, &((*a)->attr));
+
+	kobject_del(&slave->kobj);
+}
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 955dc48..309757d 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -203,6 +203,7 @@ struct slave {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	struct netpoll *np;
 #endif
+	struct kobject kobj;
 };
 
 /*
@@ -421,6 +422,8 @@ int bond_create(struct net *net, const char *name);
 int bond_create_sysfs(struct bond_net *net);
 void bond_destroy_sysfs(struct bond_net *net);
 void bond_prepare_sysfs_group(struct bonding *bond);
+int bond_sysfs_slave_add(struct slave *slave);
+void bond_sysfs_slave_del(struct slave *slave);
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, int count);
@@ -469,6 +472,7 @@ int bond_option_lacp_rate_set(struct bonding *bond, int lacp_rate);
 int bond_option_ad_select_set(struct bonding *bond, int ad_select);
 struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond);
 struct net_device *bond_option_active_slave_get(struct bonding *bond);
+const char *bond_slave_link_status(s8 link);
 
 struct bond_net {
 	struct net *		net;	/* Associated network namespace */

^ permalink raw reply related

* [PATCH net-next 0/2] bonding: add slave netlink and sysfs support
From: Scott Feldman @ 2014-01-16  5:54 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm

The following series adds bonding slave netlink and sysfs interfaces.
Slave interfaces get a new IFLA_SLAVE set of netlink attributes, along
with RTM_NEWLINK notification when slave's active status changes.  The
sysfs interface adds read-only nodes for slave attributes under a /slave
dir, simliar to how bond interfaces get a /bonding dir for bonding
attributes.

---

Scott Feldman (2):
      bonding: add sysfs /slave dir for bond slave devices.
      bonding: add netlink attributes to slave link dev


 drivers/net/bonding/Makefile           |    2 
 drivers/net/bonding/bond_main.c        |   28 ++++++
 drivers/net/bonding/bond_netlink.c     |   36 ++++++++
 drivers/net/bonding/bond_procfs.c      |   12 ---
 drivers/net/bonding/bond_sysfs_slave.c |  142 ++++++++++++++++++++++++++++++++
 drivers/net/bonding/bonding.h          |   15 +++
 include/linux/netdevice.h              |    5 +
 include/uapi/linux/if_link.h           |   13 +++
 net/core/rtnetlink.c                   |   54 ++++++++++++
 9 files changed, 292 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/bonding/bond_sysfs_slave.c

-- 
Signature

^ permalink raw reply

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: David Miller @ 2014-01-16  5:53 UTC (permalink / raw)
  To: vfalico; +Cc: netdev, fubar, andy
In-Reply-To: <1389837916-5377-1-git-send-email-vfalico@redhat.com>

From: Veaceslav Falico <vfalico@redhat.com>
Date: Thu, 16 Jan 2014 03:05:10 +0100

> Currently, if arp_validate is off (0), slave_last_rx() returns the
> slave->dev->last_rx, which is always updated on *any* packet received by
> slave, and not only arps. This means that, if the validation of arps is
> off, we're treating *any* incoming packet as a proof of slave being up, and
> not only arps.
> 
> This might seem logical at the first glance, however it can cause a lot of
> troubles and false-positives, one example would be:
> 
> The arp_ip_target is NOT accessible, however someone in the broadcast domain
> spams with any broadcast traffic. This way bonding will be tricked that the
> slave is still up (as in - can access arp_ip_target), while it's not.
> 
> The documentation for arp_validate also states that *ARPs* will (not) be
> validated if it's on/off, and that the arp monitoring works on arps as
> traffic generators.
> 
> Also, the net_device->last_rx is already used in a lot of drivers (even
> though the comment states to NOT do it :)), and it's also ugly to modify it
> from bonding.
> 
> So, to fix this, remove the last_rx from bonding, *always* call
> bond_arp_rcv() in slave's rx_handler (bond_handle_frame), and if we spot an
> arp there - update the slave->last_arp_rx - and use it instead of
> net_device->last_rx. Finally, rename slave_last_rx() to slave_last_arp_rx()
> to reflect the changes.
> 
> As the changes touch really sensitive parts, I've tried to split them as
> much as possible, for easier debugging/bisecting.
> 
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

This series looks really good, applied, thanks Veaceslav.

^ permalink raw reply

* Re: [net-next 0/2] Intel Wired LAN Driver Updates
From: David Miller @ 2014-01-16  5:49 UTC (permalink / raw)
  To: aaron.f.brown; +Cc: netdev, gospo, sassmann
In-Reply-To: <1389836321-17013-1-git-send-email-aaron.f.brown@intel.com>

From: Aaron Brown <aaron.f.brown@intel.com>
Date: Wed, 15 Jan 2014 17:38:39 -0800

> This series contains several updates from Alex to ixgbe.
> 
> To avoid head of line blocking in the event a VF stops cleaning Rx descriptors
> he makes sure QDE bits are set for a VF before the Rx queues are enabled.
> 
> To avoid a situation where the head write-back registers can remain set ofter 
> the driver is unloaded he clears them on a VF reset.

Series applied, thanks Aaron.

^ permalink raw reply

* Re: [PATCH net-next] tun/macvtap: limit the packets queued through rcvbuf
From: Michael S. Tsirkin @ 2014-01-16  5:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: davem, netdev, linux-kernel, Vlad Yasevich, John Fastabend,
	Stephen Hemminger, Herbert Xu
In-Reply-To: <52D7602F.8090800@redhat.com>

On Thu, Jan 16, 2014 at 12:29:35PM +0800, Jason Wang wrote:
> On 01/15/2014 03:21 PM, Michael S. Tsirkin wrote:
> >On Wed, Jan 15, 2014 at 11:36:01AM +0800, Jason Wang wrote:
> >>On 01/14/2014 05:52 PM, Michael S. Tsirkin wrote:
> >>>On Tue, Jan 14, 2014 at 04:45:24PM +0800, Jason Wang wrote:
> >>>>>On 01/14/2014 04:25 PM, Michael S. Tsirkin wrote:
> >>>>>>>On Tue, Jan 14, 2014 at 02:53:07PM +0800, Jason Wang wrote:
> >>>>>>>>>We used to limit the number of packets queued through tx_queue_length. This
> >>>>>>>>>has several issues:
> >>>>>>>>>
> >>>>>>>>>- tx_queue_length is the control of qdisc queue length, simply reusing it
> >>>>>>>>>   to control the packets queued by device may cause confusion.
> >>>>>>>>>- After commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 ("macvtap: Add
> >>>>>>>>>   support of packet capture on macvtap device."), an unexpected qdisc
> >>>>>>>>>   caused by non-zero tx_queue_length will lead qdisc lock contention for
> >>>>>>>>>   multiqueue deivce.
> >>>>>>>>>- What we really want is to limit the total amount of memory occupied not
> >>>>>>>>>   the number of packets.
> >>>>>>>>>
> >>>>>>>>>So this patch tries to solve the above issues by using socket rcvbuf to
> >>>>>>>>>limit the packets could be queued for tun/macvtap. This was done by using
> >>>>>>>>>sock_queue_rcv_skb() instead of a direct call to skb_queue_tail(). Also two
> >>>>>>>>>new ioctl() were introduced for userspace to change the rcvbuf like what we
> >>>>>>>>>have done for sndbuf.
> >>>>>>>>>
> >>>>>>>>>With this fix, we can safely change the tx_queue_len of macvtap to
> >>>>>>>>>zero. This will make multiqueue works without extra lock contention.
> >>>>>>>>>
> >>>>>>>>>Cc: Vlad Yasevich<vyasevic@redhat.com>
> >>>>>>>>>Cc: Michael S. Tsirkin<mst@redhat.com>
> >>>>>>>>>Cc: John Fastabend<john.r.fastabend@intel.com>
> >>>>>>>>>Cc: Stephen Hemminger<stephen@networkplumber.org>
> >>>>>>>>>Cc: Herbert Xu<herbert@gondor.apana.org.au>
> >>>>>>>>>Signed-off-by: Jason Wang<jasowang@redhat.com>
> >>>>>>>No, I don't think we can change userspace-visible behaviour like that.
> >>>>>>>
> >>>>>>>This will break any existing user that tries to control
> >>>>>>>queue length through sysfs,netlink or device ioctl.
> >>>>>But it looks like a buggy API, since tx_queue_len should be for qdisc
> >>>>>queue length instead of device itself.
> >>>Probably, but it's been like this since 2.6.x time.
> >>>Also, qdisc queue is unused for tun so it seemed kind of
> >>>reasonable to override tx_queue_len.
> >>>
> >>>>>If we really want to preserve the
> >>>>>behaviour, how about using a new feature flag and change the behaviour
> >>>>>only when the device is created (TUNSETIFF) with the new flag?
> >>>OK this addresses the issue partially, but there's also an issue
> >>>of permissions: tx_queue_len can only be changed if
> >>>capable(CAP_NET_ADMIN). OTOH in your patch a regular user
> >>>can change the amount of memory consumed per queue
> >>>by calling TUNSETRCVBUF.
> >>Yes, but we have the same issue for TUNSETSNDBUF.
> >To an extent, but TUNSETSNDBUF is different. It limits how much device can queue
> >*in the networking stack* but each queue in the stack is also
> >limited, when we exceed that we star dropping packets.
> >So while with infinite value (which is the default btw)
> >you can keep host pretty busy, you will not be able to run
> >it out of memory.
> >
> >The proposed TUNSETRCVBUF would keep configured amount
> >of memory around indefinitely so you can run host out of memory.
> >
> >So assuming all this
> >How about an ethtool or netlink command to configure this
> >instead?
> >
> 
> Ok, so we can add net admin check for before trying to set rcvbuf.

No, in practice I think using ioctl for sndbuf was also a mistake.
Applications have no idea what to set it to - you need to know what else
is running on the system, after a while
QEMU ended up setting it back to infinity otherwise things kept
breaking.

ethtool or netlink would not have this problem.
Which of the two is preferable I'm not sure.
I wonder what do management tools such as libvirt prefer.

> I
> think it's better to use ioctl since we've already use it for
> sndbuf. Using ethool means you need a dedicated new ethtool method
> just for tuntap which seems sub-optimal.
> Netlink looks better, but
> we should also implement other ioctl also.

I'm not sure what this last phrase means. Can you clarify pls?

> >>>>>>>Take a look at my patch in msg ID 20140109071721.GD19559@redhat.com
> >>>>>>>which gives one way to set tx_queue_len to zero without
> >>>>>>>breaking userspace.
> >>>>>If I read the patch correctly, it will make no way for the user who
> >>>>>really want to change the qdisc queue length for tun.
> >>>Why would this matter?  As far as I can see qdisc queue is currently unused.
> >>>
> >>User may use qdisc to do port mirroring, bandwidth limitation, traffic
> >>prioritization or more for a VM. So we do have users and maybe more
> >>consider the case of vpn.
> >Well it's not used by default at least.
> >I remember that we discussed this previously actually.
> >
> >If all we want to do actually is utilize no_qdisc by default,
> >we can simply use Eric's patch:
> >
> >http://article.gmane.org/gmane.linux.kernel/1279597
> >
> >and a similar patch for macvtap.
> >I tried it at the time and it didn't seem to help performance
> >at all, but a lot has changed since, in particular I didn't
> >test mq.
> >
> >If you now have results showing how it's beneficial, pls post them.
> >
> 
> I will have a test to see the difference.

^ permalink raw reply

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Jay Vosburgh @ 2014-01-16  5:09 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: netdev, Andy Gospodarek, David S. Miller
In-Reply-To: <1389837916-5377-1-git-send-email-vfalico@redhat.com>

Veaceslav Falico <vfalico@redhat.com> wrote:

>Currently, if arp_validate is off (0), slave_last_rx() returns the
>slave->dev->last_rx, which is always updated on *any* packet received by
>slave, and not only arps. This means that, if the validation of arps is
>off, we're treating *any* incoming packet as a proof of slave being up, and
>not only arps.

	The "any incoming packet" part is intentional.

>This might seem logical at the first glance, however it can cause a lot of
>troubles and false-positives, one example would be:
>
>The arp_ip_target is NOT accessible, however someone in the broadcast domain
>spams with any broadcast traffic. This way bonding will be tricked that the
>slave is still up (as in - can access arp_ip_target), while it's not.

	This type of situation is why arp_validate was added.

	The specific situation was when multiple hosts using bonding
with the ARP monitor were set up behind a common gateway (in the same
Ethernet broadcast domain).  The arp_ip_target is unreachable for
whatever reason.  In that case, the various bonding instances on the
different hosts will each issue broadcast ARP requests, and (in the
absence of arp_validate) those requests would trick the other bonds into
believing that they are up.

	I don't think this patch set will resolve that problem, since
you explicitly permit any incoming ARP to count.

>The documentation for arp_validate also states that *ARPs* will (not) be
>validated if it's on/off, and that the arp monitoring works on arps as
>traffic generators.

	I wrote most of that text in the documentation, and the intent
was not to imply that only ARPs should count for "up-ness" even without
arp_validate enabled.  The intent was to distinguish it from
"non-validate," in which any incoming traffic counted for "up-ness."

	The main reason for preserving the non-validate behavior (any
traffic counts) is for the loadbalance (xor and rr) modes.  In those
modes, the switch decides which slave receives the incoming traffic, and
so it's to our advantage to permit any incoming traffic to count for
"up-ness."  The arp_validate option is not allowed in these modes
because it won't work.

	With these changes, I suspect that the loadbalance ARP monitor
will be less reliable with these changes (granted that it's already a
bit dodgy in its dependence on the switch to hit all slaves with
incoming packets regularly).  Particularly if the switch ports are
configured into an Etherchannel ("static link aggregation") group, in
which case only one slave will receive any given frame (broadcast /
multicast traffic will not be duplicated across all slaves).

	I'm not sure that this change (the "only count ARPs even without
arp_validate" bit) won't break existing configurations.  Did you test
the -rr and -xor modes with ARP monitor after your changes (with and
without configuring a channel group on the switch ports)?

>Also, the net_device->last_rx is already used in a lot of drivers (even
>though the comment states to NOT do it :)), and it's also ugly to modify it
>from bonding.

	I didn't check, but I suspect those are mostly leftovers from
the distant past, when the drivers were expected to update last_rx, or
perhaps drivers using it for their own purposes.

	I don't really see an issue in decoupling bonding from the
net_device->last_rx; it's pretty much the same thing that was done for
trans_start some time ago.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


>So, to fix this, remove the last_rx from bonding, *always* call
>bond_arp_rcv() in slave's rx_handler (bond_handle_frame), and if we spot an
>arp there - update the slave->last_arp_rx - and use it instead of
>net_device->last_rx. Finally, rename slave_last_rx() to slave_last_arp_rx()
>to reflect the changes.
>
>As the changes touch really sensitive parts, I've tried to split them as
>much as possible, for easier debugging/bisecting.
>
>CC: Jay Vosburgh <fubar@us.ibm.com>
>CC: Andy Gospodarek <andy@greyhouse.net>
>CC: "David S. Miller" <davem@davemloft.net>
>CC: netdev@vger.kernel.org
>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
>
>---
> drivers/net/bonding/bond_main.c    | 18 ++++++++----------
> drivers/net/bonding/bond_options.c | 12 ++----------
> drivers/net/bonding/bonding.h      | 16 ++++++----------
> include/linux/netdevice.h          |  8 +-------
> 4 files changed, 17 insertions(+), 37 deletions(-)
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox