Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next 00/17][pull request] 100GbE Intel Wired LAN Driver Updates 2016-04-20
From: David Miller @ 2016-04-21 16:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene, john.ronciak
In-Reply-To: <1461218969-68578-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 20 Apr 2016 23:09:12 -0700

> This series contains updates to fm10k only.

Pulled, thanks Jeff.

^ permalink raw reply

* [RFC PATCH v3 0/2] ppp: add rtnetlink support
From: Guillaume Nault @ 2016-04-21 16:22 UTC (permalink / raw)
  To: netdev
  Cc: linux-ppp, David Miller, Paul Mackerras, Stephen Hemminger,
	walter harms

PPP devices lack the ability to be customised at creation time. In
particular they can't be created in a given netns or with a particular
name. Moving or renaming the device after creation is possible, but
creates undesirable transient effects on servers where PPP devices are
constantly created and removed, as users connect and disconnect.
Implementing rtnetlink support solves this problem.

The rtnetlink handlers implemented in this series are minimal, and can
only replace the PPPIOCNEWUNIT ioctl. The rest of PPP ioctls remains
necessary for any other operation on channels and units.
It is perfectly possible to mix PPP devices created by rtnl
and by ioctl(PPPIOCNEWUNIT). Devices will behave in the same way.

If necessary, rtnetlink support could be extended to provide some of
the functionalities brought by ppp_net_ioctl() and ppp_ioctl(). This
would let external programs, like "ip link", set or retrieve PPP device
information. However, I haven't made my mind on the usefulness of this
approach, so this isn't implemented in this series.

This series doesn't try to invert lock ordering between ppp_mutex and
rtnl_lock. mutex_trylock() is used instead, which greatly simplifies
things.
A user visible difference brought by this series is that old PPP
interfaces (those created with ioctl(PPPIOCNEWUNIT)), can now be
removed by "ip link del", just like new rtnl based PPP devices.

Changes since v2:
  - Define ->rtnl_link_ops for ioctl based PPP devices, so they can
    handle rtnl messages just like rtnl based ones (suggested by
    Stephen Hemminger).
  - Move back to original lock ordering between ppp_mutex and rtnl_lock
    to simplify patch series. Handle lock inversion issue using
    mutex_trylock() (suggested by Stephen Hemminger).
  - Do file descriptor lookup directly in ppp_nl_newlink(), to simplify
    ppp_dev_configure().

Changes since v1:
  - Rebase on net-next.
  - Invert locking order wrt. ppp_mutex and rtnl_lock and protect
    file->private_data with ppp_mutex.

Guillaume Nault (2):
  ppp: define reusable device creation functions
  ppp: add rtnetlink device creation support

 drivers/net/ppp/ppp_generic.c | 315 ++++++++++++++++++++++++++++++------------
 include/uapi/linux/if_link.h  |   8 ++
 2 files changed, 235 insertions(+), 88 deletions(-)

-- 
2.8.1

^ permalink raw reply

* [RFC PATCH v3 1/2] ppp: define reusable device creation functions
From: Guillaume Nault @ 2016-04-21 16:22 UTC (permalink / raw)
  To: netdev
  Cc: linux-ppp, David Miller, Paul Mackerras, Stephen Hemminger,
	walter harms
In-Reply-To: <cover.1461251392.git.g.nault@alphalink.fr>

Move PPP device initialisation and registration out of
ppp_create_interface().
This prepares code for device registration with rtnetlink.

While there, simplify the prototype of ppp_create_interface():
  * Since ppp_dev_configure() takes care of setting file->private_data,
    there's no need to return a ppp structure to ppp_unattached_ioctl()
    anymore.
  * The unit parameter is made read/write so that ppp_create_interface()
    can tell which unit number has been assigned.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 drivers/net/ppp/ppp_generic.c | 206 ++++++++++++++++++++++++------------------
 1 file changed, 118 insertions(+), 88 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index f572b31..f581a77 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -183,6 +183,11 @@ struct channel {
 #endif /* CONFIG_PPP_MULTILINK */
 };
 
+struct ppp_config {
+	struct file *file;
+	s32 unit;
+};
+
 /*
  * SMP locking issues:
  * Both the ppp.rlock and ppp.wlock locks protect the ppp.channels
@@ -269,8 +274,7 @@ static void ppp_ccp_peek(struct ppp *ppp, struct sk_buff *skb, int inbound);
 static void ppp_ccp_closed(struct ppp *ppp);
 static struct compressor *find_compressor(int type);
 static void ppp_get_stats(struct ppp *ppp, struct ppp_stats *st);
-static struct ppp *ppp_create_interface(struct net *net, int unit,
-					struct file *file, int *retp);
+static int ppp_create_interface(struct net *net, struct file *file, int *unit);
 static void init_ppp_file(struct ppp_file *pf, int kind);
 static void ppp_destroy_interface(struct ppp *ppp);
 static struct ppp *ppp_find_unit(struct ppp_net *pn, int unit);
@@ -853,12 +857,12 @@ static int ppp_unattached_ioctl(struct net *net, struct ppp_file *pf,
 		/* Create a new ppp unit */
 		if (get_user(unit, p))
 			break;
-		ppp = ppp_create_interface(net, unit, file, &err);
-		if (!ppp)
+		err = ppp_create_interface(net, file, &unit);
+		if (err < 0)
 			break;
-		file->private_data = &ppp->file;
+
 		err = -EFAULT;
-		if (put_user(ppp->file.index, p))
+		if (put_user(unit, p))
 			break;
 		err = 0;
 		break;
@@ -960,6 +964,94 @@ static struct pernet_operations ppp_net_ops = {
 	.size = sizeof(struct ppp_net),
 };
 
+static int ppp_unit_register(struct ppp *ppp, int unit)
+{
+	struct ppp_net *pn = ppp_pernet(ppp->ppp_net);
+	int ret;
+
+	mutex_lock(&pn->all_ppp_mutex);
+
+	if (unit < 0) {
+		ret = unit_get(&pn->units_idr, ppp);
+		if (ret < 0)
+			goto err;
+	} else {
+		/* Caller asked for a specific unit number. Fail with -EEXIST
+		 * if unavailable. For backward compatibility, return -EEXIST
+		 * too if idr allocation fails; this makes pppd retry without
+		 * requesting a specific unit number.
+		 */
+		if (unit_find(&pn->units_idr, unit)) {
+			ret = -EEXIST;
+			goto err;
+		}
+		ret = unit_set(&pn->units_idr, ppp, unit);
+		if (ret < 0) {
+			/* Rewrite error for backward compatibility */
+			ret = -EEXIST;
+			goto err;
+		}
+	}
+	ppp->file.index = ret;
+
+	snprintf(ppp->dev->name, IFNAMSIZ, "ppp%i", ppp->file.index);
+
+	ret = register_netdevice(ppp->dev);
+	if (ret < 0)
+		goto err_unit;
+
+	atomic_inc(&ppp_unit_count);
+
+	mutex_unlock(&pn->all_ppp_mutex);
+
+	return 0;
+
+err_unit:
+	unit_put(&pn->units_idr, ppp->file.index);
+err:
+	mutex_unlock(&pn->all_ppp_mutex);
+
+	return ret;
+}
+
+static int ppp_dev_configure(struct net *src_net, struct net_device *dev,
+			     const struct ppp_config *conf)
+{
+	struct ppp *ppp = netdev_priv(dev);
+	int indx;
+	int err;
+
+	ppp->dev = dev;
+	ppp->ppp_net = src_net;
+	ppp->mru = PPP_MRU;
+	ppp->owner = conf->file;
+
+	init_ppp_file(&ppp->file, INTERFACE);
+	ppp->file.hdrlen = PPP_HDRLEN - 2; /* don't count proto bytes */
+
+	for (indx = 0; indx < NUM_NP; ++indx)
+		ppp->npmode[indx] = NPMODE_PASS;
+	INIT_LIST_HEAD(&ppp->channels);
+	spin_lock_init(&ppp->rlock);
+	spin_lock_init(&ppp->wlock);
+#ifdef CONFIG_PPP_MULTILINK
+	ppp->minseq = -1;
+	skb_queue_head_init(&ppp->mrq);
+#endif /* CONFIG_PPP_MULTILINK */
+#ifdef CONFIG_PPP_FILTER
+	ppp->pass_filter = NULL;
+	ppp->active_filter = NULL;
+#endif /* CONFIG_PPP_FILTER */
+
+	err = ppp_unit_register(ppp, conf->unit);
+	if (err < 0)
+		return err;
+
+	conf->file->private_data = &ppp->file;
+
+	return 0;
+}
+
 #define PPP_MAJOR	108
 
 /* Called at boot time if ppp is compiled into the kernel,
@@ -2732,102 +2824,40 @@ ppp_get_stats(struct ppp *ppp, struct ppp_stats *st)
  * or if there is already a unit with the requested number.
  * unit == -1 means allocate a new number.
  */
-static struct ppp *ppp_create_interface(struct net *net, int unit,
-					struct file *file, int *retp)
+static int ppp_create_interface(struct net *net, struct file *file, int *unit)
 {
+	struct ppp_config conf = {
+		.file = file,
+		.unit = *unit,
+	};
 	struct ppp *ppp;
-	struct ppp_net *pn;
-	struct net_device *dev = NULL;
-	int ret = -ENOMEM;
-	int i;
+	struct net_device *dev;
+	int err;
 
 	dev = alloc_netdev(sizeof(struct ppp), "", NET_NAME_ENUM, ppp_setup);
-	if (!dev)
-		goto out1;
-
-	pn = ppp_pernet(net);
-
-	ppp = netdev_priv(dev);
-	ppp->dev = dev;
-	ppp->mru = PPP_MRU;
-	init_ppp_file(&ppp->file, INTERFACE);
-	ppp->file.hdrlen = PPP_HDRLEN - 2;	/* don't count proto bytes */
-	ppp->owner = file;
-	for (i = 0; i < NUM_NP; ++i)
-		ppp->npmode[i] = NPMODE_PASS;
-	INIT_LIST_HEAD(&ppp->channels);
-	spin_lock_init(&ppp->rlock);
-	spin_lock_init(&ppp->wlock);
-#ifdef CONFIG_PPP_MULTILINK
-	ppp->minseq = -1;
-	skb_queue_head_init(&ppp->mrq);
-#endif /* CONFIG_PPP_MULTILINK */
-#ifdef CONFIG_PPP_FILTER
-	ppp->pass_filter = NULL;
-	ppp->active_filter = NULL;
-#endif /* CONFIG_PPP_FILTER */
-
-	/*
-	 * drum roll: don't forget to set
-	 * the net device is belong to
-	 */
+	if (!dev) {
+		err = -ENOMEM;
+		goto err;
+	}
 	dev_net_set(dev, net);
 
 	rtnl_lock();
-	mutex_lock(&pn->all_ppp_mutex);
-
-	if (unit < 0) {
-		unit = unit_get(&pn->units_idr, ppp);
-		if (unit < 0) {
-			ret = unit;
-			goto out2;
-		}
-	} else {
-		ret = -EEXIST;
-		if (unit_find(&pn->units_idr, unit))
-			goto out2; /* unit already exists */
-		/*
-		 * if caller need a specified unit number
-		 * lets try to satisfy him, otherwise --
-		 * he should better ask us for new unit number
-		 *
-		 * NOTE: yes I know that returning EEXIST it's not
-		 * fair but at least pppd will ask us to allocate
-		 * new unit in this case so user is happy :)
-		 */
-		unit = unit_set(&pn->units_idr, ppp, unit);
-		if (unit < 0)
-			goto out2;
-	}
-
-	/* Initialize the new ppp unit */
-	ppp->file.index = unit;
-	sprintf(dev->name, "ppp%d", unit);
 
-	ret = register_netdevice(dev);
-	if (ret != 0) {
-		unit_put(&pn->units_idr, unit);
-		netdev_err(ppp->dev, "PPP: couldn't register device %s (%d)\n",
-			   dev->name, ret);
-		goto out2;
-	}
-
-	ppp->ppp_net = net;
+	err = ppp_dev_configure(net, dev, &conf);
+	if (err < 0)
+		goto err_dev;
+	ppp = netdev_priv(dev);
+	*unit = ppp->file.index;
 
-	atomic_inc(&ppp_unit_count);
-	mutex_unlock(&pn->all_ppp_mutex);
 	rtnl_unlock();
 
-	*retp = 0;
-	return ppp;
+	return 0;
 
-out2:
-	mutex_unlock(&pn->all_ppp_mutex);
+err_dev:
 	rtnl_unlock();
 	free_netdev(dev);
-out1:
-	*retp = ret;
-	return NULL;
+err:
+	return err;
 }
 
 /*
-- 
2.8.1


^ permalink raw reply related

* [RFC PATCH v3 2/2] ppp: add rtnetlink device creation support
From: Guillaume Nault @ 2016-04-21 16:22 UTC (permalink / raw)
  To: netdev
  Cc: linux-ppp, David Miller, Paul Mackerras, Stephen Hemminger,
	walter harms
In-Reply-To: <cover.1461251392.git.g.nault@alphalink.fr>

Define PPP device handler for use with rtnetlink.
The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and
contains the file descriptor of the associated /dev/ppp instance (the
file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in
the ioctl-based API). The PPP device is removed when this file
descriptor is released (same behaviour as with ioctl based PPP
devices).

PPP devices created with the rtnetlink API behave like the ones created
with ioctl(PPPIOCNEWUNIT). In particular existing ioctls work the same
way, no matter how the PPP device was created.
The rtnl callbacks are also assigned to ioctl based PPP devices. This
way, rtnl messages have the same effect on any PPP devices.
The immediate effect is that all PPP devices, even ioctl-based
ones, can now be removed with "ip link del".

A minor difference still exists between ioctl and rtnl based PPP
interfaces: in the device name, the number following the "ppp" prefix
corresponds to the PPP unit number for ioctl based devices, while it is
just an unrelated incrementing index for rtnl ones.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 drivers/net/ppp/ppp_generic.c | 115 ++++++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/if_link.h  |   8 +++
 2 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index f581a77..8b3db3a 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -46,6 +46,7 @@
 #include <linux/device.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <asm/unaligned.h>
 #include <net/slhc_vj.h>
 #include <linux/atomic.h>
@@ -186,6 +187,7 @@ struct channel {
 struct ppp_config {
 	struct file *file;
 	s32 unit;
+	bool ifname_is_set;
 };
 
 /*
@@ -286,6 +288,7 @@ static int unit_get(struct idr *p, void *ptr);
 static int unit_set(struct idr *p, void *ptr, int n);
 static void unit_put(struct idr *p, int n);
 static void *unit_find(struct idr *p, int n);
+static void ppp_setup(struct net_device *dev);
 
 static const struct net_device_ops ppp_netdev_ops;
 
@@ -964,7 +967,7 @@ static struct pernet_operations ppp_net_ops = {
 	.size = sizeof(struct ppp_net),
 };
 
-static int ppp_unit_register(struct ppp *ppp, int unit)
+static int ppp_unit_register(struct ppp *ppp, int unit, bool ifname_is_set)
 {
 	struct ppp_net *pn = ppp_pernet(ppp->ppp_net);
 	int ret;
@@ -994,7 +997,8 @@ static int ppp_unit_register(struct ppp *ppp, int unit)
 	}
 	ppp->file.index = ret;
 
-	snprintf(ppp->dev->name, IFNAMSIZ, "ppp%i", ppp->file.index);
+	if (!ifname_is_set)
+		snprintf(ppp->dev->name, IFNAMSIZ, "ppp%i", ppp->file.index);
 
 	ret = register_netdevice(ppp->dev);
 	if (ret < 0)
@@ -1043,7 +1047,7 @@ static int ppp_dev_configure(struct net *src_net, struct net_device *dev,
 	ppp->active_filter = NULL;
 #endif /* CONFIG_PPP_FILTER */
 
-	err = ppp_unit_register(ppp, conf->unit);
+	err = ppp_unit_register(ppp, conf->unit, conf->ifname_is_set);
 	if (err < 0)
 		return err;
 
@@ -1052,6 +1056,99 @@ static int ppp_dev_configure(struct net *src_net, struct net_device *dev,
 	return 0;
 }
 
+static const struct nla_policy ppp_nl_policy[IFLA_PPP_MAX + 1] = {
+	[IFLA_PPP_DEV_FD]	= { .type = NLA_S32 },
+};
+
+static int ppp_nl_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (!data)
+		return -EINVAL;
+
+	if (!data[IFLA_PPP_DEV_FD])
+		return -EINVAL;
+	if (nla_get_s32(data[IFLA_PPP_DEV_FD]) < 0)
+		return -EBADF;
+
+	return 0;
+}
+
+static int ppp_nl_newlink(struct net *src_net, struct net_device *dev,
+			  struct nlattr *tb[], struct nlattr *data[])
+{
+	struct ppp_config conf = {
+		.unit = -1,
+		.ifname_is_set = true,
+	};
+	struct file *file;
+	int err;
+
+	file = fget(nla_get_s32(data[IFLA_PPP_DEV_FD]));
+	if (!file)
+		return -EBADF;
+
+	/* rtnl_lock is already held here, but ppp_create_interface() locks
+	 * ppp_mutex before holding rtnl_lock. Using mutex_trylock() avoids
+	 * possible deadlock due to lock order inversion, at the cost of
+	 * pushing the problem back to userspace.
+	 */
+	if (!mutex_trylock(&ppp_mutex)) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	if (file->f_op != &ppp_device_fops || file->private_data) {
+		err = -EBADF;
+		goto out_unlock;
+	}
+
+	conf.file = file;
+	err = ppp_dev_configure(src_net, dev, &conf);
+
+out_unlock:
+	mutex_unlock(&ppp_mutex);
+out:
+	fput(file);
+
+	return err;
+}
+
+static void ppp_nl_dellink(struct net_device *dev, struct list_head *head)
+{
+	unregister_netdevice_queue(dev, head);
+}
+
+static size_t ppp_nl_get_size(const struct net_device *dev)
+{
+	return 0;
+}
+
+static int ppp_nl_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	return 0;
+}
+
+static struct net *ppp_nl_get_link_net(const struct net_device *dev)
+{
+	struct ppp *ppp = netdev_priv(dev);
+
+	return ppp->ppp_net;
+}
+
+static struct rtnl_link_ops ppp_link_ops __read_mostly = {
+	.kind		= "ppp",
+	.maxtype	= IFLA_PPP_MAX,
+	.policy		= ppp_nl_policy,
+	.priv_size	= sizeof(struct ppp),
+	.setup		= ppp_setup,
+	.validate	= ppp_nl_validate,
+	.newlink	= ppp_nl_newlink,
+	.dellink	= ppp_nl_dellink,
+	.get_size	= ppp_nl_get_size,
+	.fill_info	= ppp_nl_fill_info,
+	.get_link_net	= ppp_nl_get_link_net,
+};
+
 #define PPP_MAJOR	108
 
 /* Called at boot time if ppp is compiled into the kernel,
@@ -1080,11 +1177,19 @@ static int __init ppp_init(void)
 		goto out_chrdev;
 	}
 
+	err = rtnl_link_register(&ppp_link_ops);
+	if (err) {
+		pr_err("failed to register rtnetlink PPP handler\n");
+		goto out_class;
+	}
+
 	/* not a big deal if we fail here :-) */
 	device_create(ppp_class, NULL, MKDEV(PPP_MAJOR, 0), NULL, "ppp");
 
 	return 0;
 
+out_class:
+	class_destroy(ppp_class);
 out_chrdev:
 	unregister_chrdev(PPP_MAJOR, "ppp");
 out_net:
@@ -2829,6 +2934,7 @@ static int ppp_create_interface(struct net *net, struct file *file, int *unit)
 	struct ppp_config conf = {
 		.file = file,
 		.unit = *unit,
+		.ifname_is_set = false,
 	};
 	struct ppp *ppp;
 	struct net_device *dev;
@@ -2840,6 +2946,7 @@ static int ppp_create_interface(struct net *net, struct file *file, int *unit)
 		goto err;
 	}
 	dev_net_set(dev, net);
+	dev->rtnl_link_ops = &ppp_link_ops;
 
 	rtnl_lock();
 
@@ -3046,6 +3153,7 @@ static void __exit ppp_cleanup(void)
 	/* should never happen */
 	if (atomic_read(&ppp_unit_count) || atomic_read(&channel_count))
 		pr_err("PPP: removing module but units remain!\n");
+	rtnl_link_unregister(&ppp_link_ops);
 	unregister_chrdev(PPP_MAJOR, "ppp");
 	device_destroy(ppp_class, MKDEV(PPP_MAJOR, 0));
 	class_destroy(ppp_class);
@@ -3104,4 +3212,5 @@ EXPORT_SYMBOL(ppp_register_compressor);
 EXPORT_SYMBOL(ppp_unregister_compressor);
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_CHARDEV(PPP_MAJOR, 0);
+MODULE_ALIAS_RTNL_LINK("ppp");
 MODULE_ALIAS("devname:ppp");
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index ba69d44..6b2ec17 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -517,6 +517,14 @@ enum {
 };
 #define IFLA_GENEVE_MAX	(__IFLA_GENEVE_MAX - 1)
 
+/* PPP section */
+enum {
+	IFLA_PPP_UNSPEC,
+	IFLA_PPP_DEV_FD,
+	__IFLA_PPP_MAX
+};
+#define IFLA_PPP_MAX (__IFLA_PPP_MAX - 1)
+
 /* Bonding section */
 
 enum {
-- 
2.8.1


^ permalink raw reply related

* Re: [RFC PATCH v3 net-next 2/3] tcp: Handle eor bit when coalescing skb
From: Martin KaFai Lau @ 2016-04-21 16:56 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: netdev, Eric Dumazet, Neal Cardwell, Willem de Bruijn,
	Yuchung Cheng, Kernel Team
In-Reply-To: <CACSApvbyD7wFy+y2MrShKRMCTg-zF6UkP-Le867N=ssCYa-Yng@mail.gmail.com>

On Wed, Apr 20, 2016 at 04:04:54PM -0400, Soheil Hassas Yeganeh wrote:
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index a6e4a83..96bdf98 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2494,6 +2494,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
> >          * packet counting does not break.
> >          */
> >         TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;
> > +       TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
> >
> >         /* changed transmit queue under us so clear hints */
> >         tcp_clear_retrans_hints_partial(tp);
> > @@ -2545,6 +2546,9 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
> >                 if (!tcp_can_collapse(sk, skb))
> >                         break;
> >
> > +               if (TCP_SKB_CB(to)->eor)
> > +                       break;
> > +
>
> nit: Perhaps a better place to check for eor is right after entering
> the loop? to skip a few instructions and tcp_can_collapse, in an
> unlikely case eor is set.
hmm... Not sure I understand it.
You meant moving the unlikely case before (or after?) the more likely
cases which may have a better chance to break the loop sooner?

^ permalink raw reply

* [PATCH net-next v2 0/4] libnl: enhance API to ease 64bit alignment for attribute
From: Nicolas Dichtel @ 2016-04-21 16:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, eric.dumazet, tgraf, jhs
In-Reply-To: <1461142655-5067-1-git-send-email-nicolas.dichtel@6wind.com>

Here is a proposal to add more helpers in the libnetlink to manage 64-bit
alignment issues.
Note that this series was only tested on x86 by tweeking
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS and adding some traces.

The first patch adds helpers for 64bit alignment and other patches
use them.

We could also add helpers for nla_put_u64() and its variants if needed.

v1 -> v2:
 - remove patch #1
 - split patch #2 (now #1 and #2)
 - add nla_need_padding_for_64bit()

 include/net/netlink.h          | 39 +++++++++++++----
 include/uapi/linux/rtnetlink.h |  1 +
 lib/nlattr.c                   | 99 ++++++++++++++++++++++++++++++++++++++++++
 net/core/rtnetlink.c           | 22 +++-------
 net/ipv4/ipmr.c                |  4 +-
 net/ipv6/ip6mr.c               |  4 +-
 6 files changed, 140 insertions(+), 29 deletions(-)

Comments are welcomed,
Regards,
Nicolas

^ permalink raw reply

* [PATCH net-next v2 1/4] libnl: add more helpers to align attributes on 64-bit
From: Nicolas Dichtel @ 2016-04-21 16:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, eric.dumazet, tgraf, jhs, Nicolas Dichtel
In-Reply-To: <1461257907-4458-1-git-send-email-nicolas.dichtel@6wind.com>

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/netlink.h | 39 +++++++++++++++-----
 lib/nlattr.c          | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 3c1fd92a52c8..6f51a8a06498 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -244,13 +244,21 @@ int nla_memcpy(void *dest, const struct nlattr *src, int count);
 int nla_memcmp(const struct nlattr *nla, const void *data, size_t size);
 int nla_strcmp(const struct nlattr *nla, const char *str);
 struct nlattr *__nla_reserve(struct sk_buff *skb, int attrtype, int attrlen);
+struct nlattr *__nla_reserve_64bit(struct sk_buff *skb, int attrtype,
+				   int attrlen, int padattr);
 void *__nla_reserve_nohdr(struct sk_buff *skb, int attrlen);
 struct nlattr *nla_reserve(struct sk_buff *skb, int attrtype, int attrlen);
+struct nlattr *nla_reserve_64bit(struct sk_buff *skb, int attrtype,
+				 int attrlen, int padattr);
 void *nla_reserve_nohdr(struct sk_buff *skb, int attrlen);
 void __nla_put(struct sk_buff *skb, int attrtype, int attrlen,
 	       const void *data);
+void __nla_put_64bit(struct sk_buff *skb, int attrtype, int attrlen,
+		     const void *data, int padattr);
 void __nla_put_nohdr(struct sk_buff *skb, int attrlen, const void *data);
 int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data);
+int nla_put_64bit(struct sk_buff *skb, int attrtype, int attrlen,
+		  const void *data, int padattr);
 int nla_put_nohdr(struct sk_buff *skb, int attrlen, const void *data);
 int nla_append(struct sk_buff *skb, int attrlen, const void *data);
 
@@ -1231,6 +1239,27 @@ static inline int nla_validate_nested(const struct nlattr *start, int maxtype,
 }
 
 /**
+ * nla_need_padding_for_64bit - test 64-bit alignment of the next attribute
+ * @skb: socket buffer the message is stored in
+ *
+ * Return true if padding is needed to align the next attribute (nla_data()) to
+ * a 64-bit aligned area.
+ */
+static inline bool nla_need_padding_for_64bit(struct sk_buff *skb)
+{
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	/* The nlattr header is 4 bytes in size, that's why we test
+	 * if the skb->data _is_ aligned.  A NOP attribute, plus
+	 * nlattr header for next attribute, will make nla_data()
+	 * 8-byte aligned.
+	 */
+	if (IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8))
+		return true;
+#endif
+	return false;
+}
+
+/**
  * nla_align_64bit - 64-bit align the nla_data() of next attribute
  * @skb: socket buffer the message is stored in
  * @padattr: attribute type for the padding
@@ -1244,16 +1273,10 @@ static inline int nla_validate_nested(const struct nlattr *start, int maxtype,
  */
 static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
 {
-#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/* The nlattr header is 4 bytes in size, that's why we test
-	 * if the skb->data _is_ aligned.  This NOP attribute, plus
-	 * nlattr header for next attribute, will make nla_data()
-	 * 8-byte aligned.
-	 */
-	if (IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) &&
+	if (nla_need_padding_for_64bit(skb) &&
 	    !nla_reserve(skb, padattr, 0))
 		return -EMSGSIZE;
-#endif
+
 	return 0;
 }
 
diff --git a/lib/nlattr.c b/lib/nlattr.c
index f5907d23272d..2b82f1e2ebc2 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -355,6 +355,29 @@ struct nlattr *__nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
 EXPORT_SYMBOL(__nla_reserve);
 
 /**
+ * __nla_reserve_64bit - reserve room for attribute on the skb and align it
+ * @skb: socket buffer to reserve room on
+ * @attrtype: attribute type
+ * @attrlen: length of attribute payload
+ *
+ * Adds a netlink attribute header to a socket buffer and reserves
+ * room for the payload but does not copy it. It also ensure that this
+ * attribute will be 64-bit aign.
+ *
+ * The caller is responsible to ensure that the skb provides enough
+ * tailroom for the attribute header and payload.
+ */
+struct nlattr *__nla_reserve_64bit(struct sk_buff *skb, int attrtype,
+				   int attrlen, int padattr)
+{
+	if (nla_need_padding_for_64bit(skb))
+		nla_align_64bit(skb, padattr);
+
+	return __nla_reserve(skb, attrtype, attrlen);
+}
+EXPORT_SYMBOL(__nla_reserve_64bit);
+
+/**
  * __nla_reserve_nohdr - reserve room for attribute without header
  * @skb: socket buffer to reserve room on
  * @attrlen: length of attribute payload
@@ -397,6 +420,35 @@ struct nlattr *nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
 EXPORT_SYMBOL(nla_reserve);
 
 /**
+ * nla_reserve_64bit - reserve room for attribute on the skb and align it
+ * @skb: socket buffer to reserve room on
+ * @attrtype: attribute type
+ * @attrlen: length of attribute payload
+ *
+ * Adds a netlink attribute header to a socket buffer and reserves
+ * room for the payload but does not copy it. It also ensure that this
+ * attribute will be 64-bit aign.
+ *
+ * Returns NULL if the tailroom of the skb is insufficient to store
+ * the attribute header and payload.
+ */
+struct nlattr *nla_reserve_64bit(struct sk_buff *skb, int attrtype, int attrlen,
+				 int padattr)
+{
+	size_t len;
+
+	if (nla_need_padding_for_64bit(skb))
+		len = nla_total_size_64bit(attrlen);
+	else
+		len = nla_total_size(attrlen);
+	if (unlikely(skb_tailroom(skb) < len))
+		return NULL;
+
+	return __nla_reserve_64bit(skb, attrtype, attrlen, padattr);
+}
+EXPORT_SYMBOL(nla_reserve_64bit);
+
+/**
  * nla_reserve_nohdr - reserve room for attribute without header
  * @skb: socket buffer to reserve room on
  * @attrlen: length of attribute payload
@@ -436,6 +488,26 @@ void __nla_put(struct sk_buff *skb, int attrtype, int attrlen,
 EXPORT_SYMBOL(__nla_put);
 
 /**
+ * __nla_put_64bit - Add a netlink attribute to a socket buffer and align it
+ * @skb: socket buffer to add attribute to
+ * @attrtype: attribute type
+ * @attrlen: length of attribute payload
+ * @data: head of attribute payload
+ *
+ * The caller is responsible to ensure that the skb provides enough
+ * tailroom for the attribute header and payload.
+ */
+void __nla_put_64bit(struct sk_buff *skb, int attrtype, int attrlen,
+		     const void *data, int padattr)
+{
+	struct nlattr *nla;
+
+	nla = __nla_reserve_64bit(skb, attrtype, attrlen, padattr);
+	memcpy(nla_data(nla), data, attrlen);
+}
+EXPORT_SYMBOL(__nla_put_64bit);
+
+/**
  * __nla_put_nohdr - Add a netlink attribute without header
  * @skb: socket buffer to add attribute to
  * @attrlen: length of attribute payload
@@ -474,6 +546,33 @@ int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
 EXPORT_SYMBOL(nla_put);
 
 /**
+ * nla_put_64bit - Add a netlink attribute to a socket buffer and align it
+ * @skb: socket buffer to add attribute to
+ * @attrtype: attribute type
+ * @attrlen: length of attribute payload
+ * @data: head of attribute payload
+ *
+ * Returns -EMSGSIZE if the tailroom of the skb is insufficient to store
+ * the attribute header and payload.
+ */
+int nla_put_64bit(struct sk_buff *skb, int attrtype, int attrlen,
+		  const void *data, int padattr)
+{
+	size_t len;
+
+	if (nla_need_padding_for_64bit(skb))
+		len = nla_total_size_64bit(attrlen);
+	else
+		len = nla_total_size(attrlen);
+	if (unlikely(skb_tailroom(skb) < len))
+		return -EMSGSIZE;
+
+	__nla_put_64bit(skb, attrtype, attrlen, data, padattr);
+	return 0;
+}
+EXPORT_SYMBOL(nla_put_64bit);
+
+/**
  * nla_put_nohdr - Add a netlink attribute without header
  * @skb: socket buffer to add attribute to
  * @attrlen: length of attribute payload
-- 
2.4.2

^ permalink raw reply related

* [PATCH net-next v2 3/4] ipmr: align RTA_MFC_STATS on 64-bit
From: Nicolas Dichtel @ 2016-04-21 16:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, eric.dumazet, tgraf, jhs, Nicolas Dichtel
In-Reply-To: <1461257907-4458-1-git-send-email-nicolas.dichtel@6wind.com>

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/rtnetlink.h | 1 +
 net/ipv4/ipmr.c                | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index cc885c4e9065..a94e0b69c769 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -317,6 +317,7 @@ enum rtattr_type_t {
 	RTA_ENCAP_TYPE,
 	RTA_ENCAP,
 	RTA_EXPIRES,
+	RTA_PAD,
 	__RTA_MAX
 };
 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 395e2814a46d..21a38e296fe2 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2104,7 +2104,7 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
 	mfcs.mfcs_packets = c->mfc_un.res.pkt;
 	mfcs.mfcs_bytes = c->mfc_un.res.bytes;
 	mfcs.mfcs_wrong_if = c->mfc_un.res.wrong_if;
-	if (nla_put(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs) < 0)
+	if (nla_put_64bit(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs, RTA_PAD) < 0)
 		return -EMSGSIZE;
 
 	rtm->rtm_type = RTN_MULTICAST;
@@ -2237,7 +2237,7 @@ static size_t mroute_msgsize(bool unresolved, int maxvif)
 		      + nla_total_size(0)	/* RTA_MULTIPATH */
 		      + maxvif * NLA_ALIGN(sizeof(struct rtnexthop))
 						/* RTA_MFC_STATS */
-		      + nla_total_size(sizeof(struct rta_mfc_stats))
+		      + nla_total_size_64bit(sizeof(struct rta_mfc_stats))
 		;
 
 	return len;
-- 
2.4.2

^ permalink raw reply related

* [PATCH net-next v2 4/4] ip6mr: align RTA_MFC_STATS on 64-bit
From: Nicolas Dichtel @ 2016-04-21 16:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, eric.dumazet, tgraf, jhs, Nicolas Dichtel
In-Reply-To: <1461257907-4458-1-git-send-email-nicolas.dichtel@6wind.com>

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6mr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index a10e77103c88..bf678324fd52 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2268,7 +2268,7 @@ static int __ip6mr_fill_mroute(struct mr6_table *mrt, struct sk_buff *skb,
 	mfcs.mfcs_packets = c->mfc_un.res.pkt;
 	mfcs.mfcs_bytes = c->mfc_un.res.bytes;
 	mfcs.mfcs_wrong_if = c->mfc_un.res.wrong_if;
-	if (nla_put(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs) < 0)
+	if (nla_put_64bit(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs, RTA_PAD) < 0)
 		return -EMSGSIZE;
 
 	rtm->rtm_type = RTN_MULTICAST;
@@ -2411,7 +2411,7 @@ static int mr6_msgsize(bool unresolved, int maxvif)
 		      + nla_total_size(0)	/* RTA_MULTIPATH */
 		      + maxvif * NLA_ALIGN(sizeof(struct rtnexthop))
 						/* RTA_MFC_STATS */
-		      + nla_total_size(sizeof(struct rta_mfc_stats))
+		      + nla_total_size_64bit(sizeof(struct rta_mfc_stats))
 		;
 
 	return len;
-- 
2.4.2

^ permalink raw reply related

* [PATCH net-next v2 2/4] rtnl: use the new API to align IFLA_STATS*
From: Nicolas Dichtel @ 2016-04-21 16:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, eric.dumazet, tgraf, jhs, Nicolas Dichtel
In-Reply-To: <1461257907-4458-1-git-send-email-nicolas.dichtel@6wind.com>

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/core/rtnetlink.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4a47a9aceb1d..5ec059d52823 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1051,14 +1051,9 @@ static noinline_for_stack int rtnl_fill_stats(struct sk_buff *skb,
 {
 	struct rtnl_link_stats64 *sp;
 	struct nlattr *attr;
-	int err;
-
-	err = nla_align_64bit(skb, IFLA_PAD);
-	if (err)
-		return err;
 
-	attr = nla_reserve(skb, IFLA_STATS64,
-			   sizeof(struct rtnl_link_stats64));
+	attr = nla_reserve_64bit(skb, IFLA_STATS64,
+				 sizeof(struct rtnl_link_stats64), IFLA_PAD);
 	if (!attr)
 		return -EMSGSIZE;
 
@@ -3469,17 +3464,10 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
 
 	if (filter_mask & IFLA_STATS_FILTER_BIT(IFLA_STATS_LINK_64)) {
 		struct rtnl_link_stats64 *sp;
-		int err;
-
-		/* if necessary, add a zero length NOP attribute so that
-		 * IFLA_STATS_LINK_64 will be 64-bit aligned
-		 */
-		err = nla_align_64bit(skb, IFLA_STATS_UNSPEC);
-		if (err)
-			goto nla_put_failure;
 
-		attr = nla_reserve(skb, IFLA_STATS_LINK_64,
-				   sizeof(struct rtnl_link_stats64));
+		attr = nla_reserve_64bit(skb, IFLA_STATS_LINK_64,
+					 sizeof(struct rtnl_link_stats64),
+					 IFLA_STATS_UNSPEC);
 		if (!attr)
 			goto nla_put_failure;
 
-- 
2.4.2

^ permalink raw reply related

* [PATCH] ixgbevf: Fix relaxed order settings in VF driver
From: Babu Moger @ 2016-04-21 17:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jesse.brandeburg, shannon.nelson,
	carolyn.wyborny, donald.c.skidmore, bruce.w.allan, john.ronciak,
	mitch.a.williams
  Cc: intel-wired-lan, netdev, linux-kernel, babu.moger,
	sowmini.varadhan

Current code writes the tx/rx relaxed order without reading it first.
This can lead to unintended consequences as we are forcibly writing
other bits.

We noticed this problem while testing VF driver on sparc. Relaxed
order settings for rx queue were all messed up which was causing
performance drop with VF interface.

Fixed it by reading the registers first and setting the specific
bit of interest. With this change we are able to match the bandwidth
equivalent to PF interface.

Signed-off-by: Babu Moger <babu.moger@oracle.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 0ea14c0..51abff1 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1545,6 +1545,7 @@ static inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter)
 static void ixgbevf_configure_tx_ring(struct ixgbevf_adapter *adapter,
 				      struct ixgbevf_ring *ring)
 {
+	u32 regval;
 	struct ixgbe_hw *hw = &adapter->hw;
 	u64 tdba = ring->dma;
 	int wait_loop = 10;
@@ -1565,8 +1566,10 @@ static void ixgbevf_configure_tx_ring(struct ixgbevf_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_VFTDWBAL(reg_idx), 0);
 
 	/* enable relaxed ordering */
+	regval = IXGBE_READ_REG(hw, IXGBE_VFDCA_TXCTRL(reg_idx));
+
 	IXGBE_WRITE_REG(hw, IXGBE_VFDCA_TXCTRL(reg_idx),
-			(IXGBE_DCA_TXCTRL_DESC_RRO_EN |
+			(regval | IXGBE_DCA_TXCTRL_DESC_RRO_EN |
 			 IXGBE_DCA_TXCTRL_DATA_RRO_EN));
 
 	/* reset head and tail pointers */
@@ -1734,6 +1737,7 @@ static void ixgbevf_setup_vfmrqc(struct ixgbevf_adapter *adapter)
 static void ixgbevf_configure_rx_ring(struct ixgbevf_adapter *adapter,
 				      struct ixgbevf_ring *ring)
 {
+	u32 regval;
 	struct ixgbe_hw *hw = &adapter->hw;
 	u64 rdba = ring->dma;
 	u32 rxdctl;
@@ -1749,8 +1753,9 @@ static void ixgbevf_configure_rx_ring(struct ixgbevf_adapter *adapter,
 			ring->count * sizeof(union ixgbe_adv_rx_desc));
 
 	/* enable relaxed ordering */
+	regval = IXGBE_READ_REG(hw, IXGBE_VFDCA_RXCTRL(reg_idx));
 	IXGBE_WRITE_REG(hw, IXGBE_VFDCA_RXCTRL(reg_idx),
-			IXGBE_DCA_RXCTRL_DESC_RRO_EN);
+			regval | IXGBE_DCA_RXCTRL_DESC_RRO_EN);
 
 	/* reset head and tail pointers */
 	IXGBE_WRITE_REG(hw, IXGBE_VFRDH(reg_idx), 0);
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC PATCH net-next 7/8] net: ipv4: listified version of ip_rcv
From: Edward Cree @ 2016-04-21 17:24 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Jesper Dangaard Brouer, linux-net-drivers
In-Reply-To: <5716347D.3030808@solarflare.com>

On 19/04/16 14:37, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
Turns out, this breaks the build if netfilter is *disabled*, because I forgot to
add a stub in that case.  Next version of patch (if there is one) will have the
following under #ifndef CONFIG_NETFILTER:

static inline void
NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
	     struct sk_buff_head *list, struct sk_buff_head *sublist,
	     struct net_device *in, struct net_device *out,
	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
{
	__skb_queue_head_init(sublist);
	/* Move everything to the sublist */
	skb_queue_splice_init(list, sublist);
}

-Ed

^ permalink raw reply

* Re: [PATCHv2 net] openvswitch: Orphan skbs before IPv6 defrag
From: David Miller @ 2016-04-21 17:42 UTC (permalink / raw)
  To: joe; +Cc: netdev, fw, netfilter-devel, diproiettod, pablo
In-Reply-To: <1461016307-14098-1-git-send-email-joe@ovn.org>

From: Joe Stringer <joe@ovn.org>
Date: Mon, 18 Apr 2016 14:51:47 -0700

> This is the IPv6 counterpart to commit 8282f27449bf ("inet: frag: Always
> orphan skbs inside ip_defrag()").
> 
> Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
> clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
> cloned (implicitly orphaning) prior to queueing for reassembly. As such,
> when the IPv6 message is eventually reassembled, the skb->sk for all
> fragments would be NULL. After that commit was introduced, rather than
> cloning, the original skbs were queued directly without orphaning. The
> end result is that all frags except for the first and last may have a
> socket attached.
> 
> This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
> prevent BUG_ON(skb->sk) during a later call to ip6_fragment().
 ...
> Fixes: 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone
> operations")
> Reported-by: Daniele Di Proietto <diproiettod@vmware.com>
> Signed-off-by: Joe Stringer <joe@ovn.org>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: remove tag_protocol from dsa_switch
From: David Miller @ 2016-04-21 17:44 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <1461018244-5371-1-git-send-email-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Mon, 18 Apr 2016 18:24:04 -0400

> Having the tag protocol in dsa_switch_driver for setup time and in
> dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
From: David Miller @ 2016-04-21 17:46 UTC (permalink / raw)
  To: kafai
  Cc: netdev, kernel-team, edumazet, ncardwell, soheil.kdev, willemb,
	ycheng
In-Reply-To: <1461019193-3034571-1-git-send-email-kafai@fb.com>

From: Martin KaFai Lau <kafai@fb.com>
Date: Mon, 18 Apr 2016 15:39:53 -0700

> Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
> it could incorrectly think that a skb has already
> been acked and queue a SCM_TSTAMP_ACK cmsg to the
> sk->sk_error_queue.
> 
> In tcp_ack_tstamp(), it checks
> 'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
> If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
> script, between() returns true but the tskey is actually not acked.
> e.g. try between(3, 2, 1).
> 
> The fix is to replace between() with one before() and one !before().
> By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
> removed.
 ...
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied, thank you.

^ permalink raw reply

* Re: drop all fragments inside tx queue if one gets dropped
From: Michael Richardson @ 2016-04-21 17:48 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, linux-wpan, Alexander Aring
In-Reply-To: <5717EA7E.1020606@hpe.com>

[-- Attachment #1: Type: text/plain, Size: 1473 bytes --]

Rick Jones <rick.jones2@hpe.com> wrote:
    > I don't recall seeing similar poor behaviour in Linux; I would have
    > assumed
    > that the intra-stack flow-control "took care" of it.  Perhaps there is
    > something specific to wpan which precludes that?

The major user of big UDP packets in the 1990s was NFS.
I fondly recall deploying wsize=1024,rsize=1024 for NFS mounts between HPUX,
Apollo and Sun machines across the intersite Ottawa BNR "WAN"

NFS mounts now use TCP by default, and NFS is not a well used protocol
outside of clueful people (everyone else uses CIFS), and modern machines have
way DMA engines in their ethernet that can accomodate more than enough
xmit buffers to perhaps make this moot.  But, I dealt with this very problem
with a Linux NFS server that would get GbE XOFF's by a broken Cisco switch
that wouldn't always XON, and would seem to drop it's queue. (Turning QoS off
on the cisco switch made it tolerable.)

Still, Alex, it would be worth looking at whether the NFS UDP transmitter
does anything clueful to keep from overwhelming the ethernet layer.

wpan deals with sub-IP fragmentation of 1280 (or larger) IPv6 packets into
127 6lowpan fragments for transmission over 802.15.4 interfaces which run at
typically 250kb/s.   Those radios are typically SPI interfaced (often with
bit-banged SPI interfaces), and the radio MAC has only a single transmit
buffer (which is also the receive buffer!).

Packet trains would be nice to have.

^ permalink raw reply

* Re: [PATCH net-next] perf, bpf: minimize the size of perf_trace_() tracepoint handler
From: David Miller @ 2016-04-21 17:49 UTC (permalink / raw)
  To: ast
  Cc: rostedt, peterz, mingo, daniel, acme, wangnan0, jbacik,
	brendan.d.gregg, netdev, linux-kernel, kernel-team
In-Reply-To: <1461035510-2810305-1-git-send-email-ast@fb.com>

From: Alexei Starovoitov <ast@fb.com>
Date: Mon, 18 Apr 2016 20:11:50 -0700

> move trace_call_bpf() into helper function to minimize the size
> of perf_trace_*() tracepoint handlers.
>     text	   data	    bss	    dec	 	   hex	filename
> 10541679	5526646	2945024	19013349	1221ee5	vmlinux_before
> 10509422	5526646	2945024	18981092	121a0e4	vmlinux_after
> 
> It may seem that perf_fetch_caller_regs() can also be moved,
> but that is incorrect, since ip/sp will be wrong.
> 
> bpf+tracepoint performance is not affected, since
> perf_swevent_put_recursion_context() is now inlined.
> export_symbol_gpl can also be dropped.
> 
> No measurable change in normal perf tracepoints.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Applied, thanks Alexei.

^ permalink raw reply

* [PATCH net v2 0/3] drivers: net: cpsw: phy-handle fixes
From: David Rivshin (Allworx) @ 2016-04-21 17:50 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-omap-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Miller, Mugunthan V N,
	Grygorii Strashko, Andrew Goodbody, Markus Brunner,
	Nicolas Chauvet

From: David Rivshin <drivshin-5fOYsn7Fw8lBDgjK7y7TUQ@public.gmane.org>

The first patch fixes a bug that makes dual_emac mode break if
either slave uses the phy-handle property in the devicetree.

The second patch fixes some cosmetic problems with error messages,
and also makes the binding documentation more explicit.

The third patch cleans up the fixed-link case to work like
the now-fixed phy-handle case.

I have tested on the following hardware configurations:
 - (EVMSK) dual emac, phy_id property in both slaves
 - (EVMSK) dual emac, phy-handle property in both slaves
 - (BeagleBoneBlack) single emac, phy_id property
 - (custom) single emac, fixed-link subnode

Nicolas Chauvet reported testing on an HP t410 (dm8148).

Markus Brunner reported testing v1 on the following [1]:
 - emac0 with phy_id and emac1 with fixed phy
 - emac0 with phy-handle and emac1 with fixed phy
 - emac0 with fixed phy and emac1 with fixed phy


Changes since v1 [2]:
- Rebased
- Added Tested-by from Nicolas Chauvet on all patches
- Added Acked-by from Rob Herring for the binding change in patch 2 [3]

[1] http://www.spinics.net/lists/netdev/msg357890.html
[2] http://www.spinics.net/lists/netdev/msg357772.html
[3] http://www.spinics.net/lists/netdev/msg358254.html

David Rivshin (3):
  drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac
    config
  drivers: net: cpsw: fix error messages when using phy-handle DT
    property
  drivers: net: cpsw: use of_phy_connect() in fixed-link case

 Documentation/devicetree/bindings/net/cpsw.txt |  4 +--
 drivers/net/ethernet/ti/cpsw.c                 | 41 +++++++++++++-------------
 drivers/net/ethernet/ti/cpsw.h                 |  1 +
 3 files changed, 23 insertions(+), 23 deletions(-)

-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next] macvlan: fix failure during registration v2
From: Eric W. Biederman @ 2016-04-21 17:44 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <1461188301-4466-1-git-send-email-fruggeri@arista.com>

Francesco Ruggeri <fruggeri@arista.com> writes:

> If macvlan_common_newlink fails in register_netdevice after macvlan_init
> then it decrements port->count twice, first in macvlan_uninit (from
> register_netdevice or rollback_registered) and then again in
> macvlan_common_newlink.
> A similar problem may exist in the ipvlan driver.
> This patch consolidates modifications to port->count into macvlan_init
> and macvlan_uninit (thanks to Eric Biederman for suggesting this approach).
> In macvtap_device_event it also avoids cleaning up in NETDEV_UNREGISTER
> if NETDEV_REGISTER had previously failed.
>
> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>

The macvlan_init bits obviously corect.

The macvtap bits I don't really understand what is going on.

> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 95394ed..e770221 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -1303,6 +1303,8 @@ static int macvtap_device_event(struct notifier_block *unused,
>  		}
>  		break;
>  	case NETDEV_UNREGISTER:
> +		if (vlan->minor == 0)
> +			break;

I don't understand this bit.  A minor of 0 is never assigned.  That is
clear from the code.  On what code path can you get here without
assigning a minor?

My gut says this either needs a big fat comment explaining what is going
on, or a slight refactoring of the code (like moving port count
increments into macvlan_init) that makes it so this code path can't
happen.

>  		devt = MKDEV(MAJOR(macvtap_major), vlan->minor);
>  		device_destroy(macvtap_class, devt);
>  		macvtap_free_minor(vlan);

Eric

^ permalink raw reply

* [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time
From: Eric Dumazet @ 2016-04-21 17:55 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Eric Dumazet, Eric Dumazet

Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     |  4 ++--
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/tcp_output.c | 64 +++++++++++++++++++++++----------------------------
 net/ipv4/tcp_timer.c  |  4 ++--
 4 files changed, 34 insertions(+), 40 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c64d5f..0dc272dcd772 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -538,8 +538,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
 void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
 			       int nonagle);
 bool tcp_may_send_now(struct sock *sk);
-int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
-int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
 void tcp_retransmit_timer(struct sock *sk);
 void tcp_xmit_retransmit_queue(struct sock *);
 void tcp_simple_retransmit(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 90e0d9256b74..729e489b5608 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5543,7 +5543,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 	if (data) { /* Retransmit unacked data in SYN */
 		tcp_for_write_queue_from(data, sk) {
 			if (data == tcp_send_head(sk) ||
-			    __tcp_retransmit_skb(sk, data))
+			    __tcp_retransmit_skb(sk, data, 1))
 				break;
 		}
 		tcp_rearm_rto(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83d81e9..4876b256a70a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2266,7 +2266,7 @@ void tcp_send_loss_probe(struct sock *sk)
 	if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
 		goto rearm_timer;
 
-	if (__tcp_retransmit_skb(sk, skb))
+	if (__tcp_retransmit_skb(sk, skb, 1))
 		goto rearm_timer;
 
 	/* Record snd_nxt for loss detection. */
@@ -2551,17 +2551,17 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
  * state updates are done by the caller.  Returns non-zero if an
  * error occurred which prevented the send.
  */
-int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int cur_mss;
-	int err;
+	int diff, len, err;
+
 
-	/* Inconslusive MTU probe */
-	if (icsk->icsk_mtup.probe_size) {
+	/* Inconclusive MTU probe */
+	if (icsk->icsk_mtup.probe_size)
 		icsk->icsk_mtup.probe_size = 0;
-	}
 
 	/* Do not sent more than we queued. 1/4 is reserved for possible
 	 * copying overhead: fragmentation, tunneling, mangling etc.
@@ -2594,30 +2594,27 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	    TCP_SKB_CB(skb)->seq != tp->snd_una)
 		return -EAGAIN;
 
-	if (skb->len > cur_mss) {
-		if (tcp_fragment(sk, skb, cur_mss, cur_mss, GFP_ATOMIC))
+	len = cur_mss * segs;
+	if (skb->len > len) {
+		if (tcp_fragment(sk, skb, len, cur_mss, GFP_ATOMIC))
 			return -ENOMEM; /* We'll try again later. */
 	} else {
-		int oldpcount = tcp_skb_pcount(skb);
+		if (skb_unclone(skb, GFP_ATOMIC))
+			return -ENOMEM;
 
-		if (unlikely(oldpcount > 1)) {
-			if (skb_unclone(skb, GFP_ATOMIC))
-				return -ENOMEM;
-			tcp_init_tso_segs(skb, cur_mss);
-			tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
-		}
+		diff = tcp_skb_pcount(skb);
+		tcp_set_skb_tso_segs(skb, cur_mss);
+		diff -= tcp_skb_pcount(skb);
+		if (diff)
+			tcp_adjust_pcount(sk, skb, diff);
+		if (skb->len < cur_mss)
+			tcp_retrans_try_collapse(sk, skb, cur_mss);
 	}
 
 	/* RFC3168, section 6.1.1.1. ECN fallback */
 	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
 		tcp_ecn_clear_syn(sk, skb);
 
-	tcp_retrans_try_collapse(sk, skb, cur_mss);
-
-	/* Make a copy, if the first transmission SKB clone we made
-	 * is still in somebody's hands, else make a clone.
-	 */
-
 	/* make sure skb->data is aligned on arches that require it
 	 * and check if ack-trimming & collapsing extended the headroom
 	 * beyond what csum_start can cover.
@@ -2633,20 +2630,22 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	}
 
 	if (likely(!err)) {
+		segs = tcp_skb_pcount(skb);
+
 		TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
 		/* Update global TCP statistics. */
-		TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
+		TCP_ADD_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS, segs);
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-		tp->total_retrans++;
+		tp->total_retrans += segs;
 	}
 	return err;
 }
 
-int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int err = __tcp_retransmit_skb(sk, skb);
+	int err = __tcp_retransmit_skb(sk, skb, segs);
 
 	if (err == 0) {
 #if FASTRETRANS_DEBUG > 0
@@ -2737,6 +2736,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 
 	tcp_for_write_queue_from(skb, sk) {
 		__u8 sacked = TCP_SKB_CB(skb)->sacked;
+		int segs;
 
 		if (skb == tcp_send_head(sk))
 			break;
@@ -2744,14 +2744,8 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 		if (!hole)
 			tp->retransmit_skb_hint = skb;
 
-		/* Assume this retransmit will generate
-		 * only one packet for congestion window
-		 * calculation purposes.  This works because
-		 * tcp_retransmit_skb() will chop up the
-		 * packet to be MSS sized and all the
-		 * packet counting works out.
-		 */
-		if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
+		segs = tp->snd_cwnd - tcp_packets_in_flight(tp);
+		if (segs <= 0)
 			return;
 
 		if (fwd_rexmitting) {
@@ -2788,7 +2782,7 @@ begin_fwd:
 		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
 			continue;
 
-		if (tcp_retransmit_skb(sk, skb))
+		if (tcp_retransmit_skb(sk, skb, segs))
 			return;
 
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 49bc474f8e35..373b03e78aaa 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -404,7 +404,7 @@ void tcp_retransmit_timer(struct sock *sk)
 			goto out;
 		}
 		tcp_enter_loss(sk);
-		tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
+		tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1);
 		__sk_dst_reset(sk);
 		goto out_reset_timer;
 	}
@@ -436,7 +436,7 @@ void tcp_retransmit_timer(struct sock *sk)
 
 	tcp_enter_loss(sk);
 
-	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
+	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
 		/* Retransmission failed because of local congestion,
 		 * do not backoff.
 		 */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [PATCHv3 net-next] net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diag
From: David Miller @ 2016-04-21 17:56 UTC (permalink / raw)
  To: lucien.xin
  Cc: netdev, linux-sctp, marcelo.leitner, vyasevich, daniel,
	eric.dumazet, jsitnick
In-Reply-To: <87a0e83132ce64a3fe2266a5323fa6577a2e5c96.1461049801.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 19 Apr 2016 15:10:01 +0800

> EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
> back to before jiffies_to_msecs() has been introduced.
> 
> Now we can remove it and use jiffies_to_msecs().
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] NLA_BINARY misuse bug in HSR
From: David Miller @ 2016-04-21 17:59 UTC (permalink / raw)
  To: mail; +Cc: arvid.brodin, netdev, daniel, julia.lawall, peter.heise
In-Reply-To: <20160419113428.GA10532@aircraft-controller>

From: Peter Heise <mail@pheise.de>
Date: Tue, 19 Apr 2016 13:34:28 +0200

> Removed .type field from NLA to do proper length checking.
> Reported by Daniel Borkmann and Julia Lawall.
> 
> Signed-off-by: Peter Heise <peter.heise@airbus.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/2] [v4] net: emac: emac gigabit ethernet controller driver
From: Timur Tabi @ 2016-04-21 18:03 UTC (permalink / raw)
  To: Florian Fainelli, netdev, linux-kernel, devicetree, linux-arm-msm,
	sdharia, Shanker Donthineni, Greg Kroah-Hartman, vikrams, cov,
	gavidov, Rob Herring, andrew, bjorn.andersson, Mark Langsdorf,
	Jon Masters, Andy Gross, David S. Miller
In-Reply-To: <57100962.40404@gmail.com>

Florian Fainelli wrote:

> Well, PHYLIB does prefer using MDIO accesses to "speak" to PHYs,
> built-in or external, but there is always the option of investing into
> some custom development with the subsystem to make it play nicely with
> your HW.

So I've done some more research, and I believe that the internal phy is 
not a candidate for phylib, but the external phy (which is a real phy) 
might be.  There's no MDIO bus to the internal phy.

Does this mean that I will need to enable a PHY driver, and that driver 
will control the external phy?  If so, then does that mean that I would 
delete all to code in my driver that calls emac_phy_read() and 
emac_phy_write()?  For example, I wouldn't need emac_phy_link_check() 
any more?

>> The MDIO bus on these chips is not accessible as a separate entity.  It
>> is melded (for lack of a better word) into the EMAC itself.  That's why
>> there is a "qcom,no-external-phy" property.  You could, in theory, wire
>> the internal phy of one SOC directly to the internal phy of another SOC,
>> and use that as in interconnect between SOCs.  I don't know of any such
>> use-cases however.
>
> The fact the MDIO bus is built-into the MAC is really not a problem
> here, there are tons of drivers that deal with that just fine, yet, the
> DT binding needs to reflect that properly by having a sub-node of the
> Ethernet MAC which is a MDIO bus controller node. If external or
> internal PHYs are accessible through that MDIO bus, they also need to
> appear as child-nodes of that MDIO bus controller node.

Does the compatible property of the phy node (for the external phy) need 
to list the actual external phy?  That is, should it look like this:

	phy0: ethernet-phy@0 {
		compatible = "qcom,fsm9900-emac-phy";
		reg = <0>;
	}

or this:

	phy0: ethernet-phy@0 {
		compatible = "athr,whatever-phy";
		reg = <0>;
	}



> Can we just say that, an absence of PHY specified in the Device Tree (no
> phy-handle property and PHY not a child node of the MDIO bus), means
> that there is no external PHY?

Yes, that works.

>
> [snip]
>
>>> Do you need to maintain these flags when most, if not all of them
>>> already exist in dev->flags or dev->features?
>>
>> So you're saying that, for example, in emac_set_features() I should
>> remove this:
>>
>>      if (netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
>>          set_bit(EMAC_STATUS_VLANSTRIP_EN, &adpt->status);
>>      else
>>          clear_bit(EMAC_STATUS_VLANSTRIP_EN, &adpt->status);
>>
>> and then in emac_mac_mode_config(), I should do this instead:
>>
>> void emac_mac_mode_config(struct emac_adapter *adpt)
>> {
>>      struct net_device *netdev = adpt->netdev;
>>
>>      if (netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
>>          mac |= VLAN_STRIP;
>>      else
>>          mac &= ~VLAN_STRIP;
>>
>>
>> If so, then what do I do in emac_rx_mode_set()?  Should I delete this
>> entire block:
>>
>>      /* Check for Promiscuous and All Multicast modes */
>>      if (netdev->flags & IFF_PROMISC) {
>>          set_bit(EMAC_STATUS_PROMISC_EN, &adpt->status);
>>      } else if (netdev->flags & IFF_ALLMULTI) {
>>          set_bit(EMAC_STATUS_MULTIALL_EN, &adpt->status);
>>          clear_bit(EMAC_STATUS_PROMISC_EN, &adpt->status);
>>      } else {
>>          clear_bit(EMAC_STATUS_MULTIALL_EN, &adpt->status);
>>          clear_bit(EMAC_STATUS_PROMISC_EN, &adpt->status);
>>      }
>>
>> It does look like Gilad is just mirroring the flags/features variable
>> into adpt->status.  What I can't figure out is why.  It seems completely
>> redundant, but I have a nagging feeling that there is a good reason.
>
> Yes, I think your set_features and set_rx_mode functions would be
> greatly simplified, if each of them did take care of programming the HW
> immediately based on function arguments/flags. Unless absolutely
> required (e.g: suspend/resume, outside of the scope of the function
> etc..) having bookeeping variables is always something that can be out
> of sync, so better avoid them as much as possible.

Ok, I'll try to clean this up.

>> So I should move the netif_start_queue() to the end of this function?
>> Sorry if that's a stupid question, but I know little about the MAC side
>> of network drivers.
>
> That's fine, yes moving netif_start_queue() at the far end of the
> function is a good change.

Ok.

>>>> +/* Bring down the interface/HW */
>>>> +void emac_mac_down(struct emac_adapter *adpt, bool reset)
>>>> +{
>>>> +    struct net_device *netdev = adpt->netdev;
>>>> +    struct emac_phy *phy = &adpt->phy;
>>>> +    unsigned long flags;
>>>> +
>>>> +    set_bit(EMAC_STATUS_DOWN, &adpt->status);
>>>
>>> Do you need to maintain that? Would not netif_running() tell you what
>>> you want if you reflect the carrier state properly?
>>
>> I think that emac_work_thread_link_check() handles this.  It's a timer
>> thread that polls the link state and calls netif_carrier_off() if the
>> link is down.  Is that sufficient?
>>
>
> Probably, then again, with PHYLIB you have the option of either
> switching the PHY to interrupt mode (thsus saving the polling_), or it
> polls the PHY for link statuses every HZ.

I'll have to check and see if interrupt mode is even an option.  So 
phylib can do the polling for me?

>>> Since you have a producer index, you should consider checking
>>> skb->xmit_more to know whether you can update the register now, or
>>> later, which could save some expensive operation and batch TX.
>>
>> I'll have to figure out what means and get back to you.  When would
>> "later" be?
>
> After the driver gets accepted mainline for instance would seem fine.
> Considering how this seems to work, something like this is usally all
> that is needed:
>
> if (!skb->xmit_more || netif_xmit_stopped(txq)
> 	/* write producer index to get HW to transmit */

Oh, I thought you meant later in the code somewhere.  At a later date 
with another patch sounds great to me, though.

>>>> +irqreturn_t emac_isr(int _irq, void *data)
>>>> +{
>>>> +    struct emac_irq *irq = data;
>>>> +    struct emac_adapter *adpt = container_of(irq, struct
>>>> emac_adapter, irq);
>>>> +    struct emac_rx_queue *rx_q = &adpt->rx_q;
>>>> +
>>>> +    int max_ints = 1;
>>>> +    u32 isr, status;
>>>> +
>>>> +    /* disable the interrupt */
>>>> +    writel(0, adpt->base + EMAC_INT_MASK);
>>>> +
>>>> +    do {
>>>
>>> With max_ints = 1, this is essentially the same as no loop, so just
>>> inline it to reduce the indentation.
>>
>> In another internal version of this driver, max_ints is set to 5.  Could
>> this be some way of processing multiple packets in one interrupt?  Isn't
>> that something that NAPI already takes care of, anyway?
>
> Yes, NAPI is going to mitigate the cost of taking an interrupt and
> scheduling your bottom-half/soft IRQ for actual packet processing, it is
> the recommended way to mitigate the number of interrupts in the receive
> path (and transmit for that matter).

I'll clean up the code and remove max_ints.

>
>>
>>>> +        isr = readl_relaxed(adpt->base + EMAC_INT_STATUS);
>>>> +        status = isr & irq->mask;
>>>> +
>>>> +        if (status == 0)
>>>> +            break;
>>>> +
>>>> +        if (status & ISR_ERROR) {
>>>> +            netif_warn(adpt,  intr, adpt->netdev,
>>>> +                   "warning: error irq status 0x%lx\n",
>>>> +                   status & ISR_ERROR);
>>>> +            /* reset MAC */
>>>> +            set_bit(EMAC_STATUS_TASK_REINIT_REQ, &adpt->status);
>>>> +            emac_work_thread_reschedule(adpt);
>>>> +        }
>>>> +
>>>> +        /* Schedule the napi for receive queue with interrupt
>>>> +         * status bit set
>>>> +         */
>>>> +        if ((status & rx_q->intr)) {
>>>> +            if (napi_schedule_prep(&rx_q->napi)) {
>>>> +                irq->mask &= ~rx_q->intr;
>>>> +                __napi_schedule(&rx_q->napi);
>>>> +            }
>>>> +        }
>>>> +
>>>> +        if (status & TX_PKT_INT)
>>>> +            emac_mac_tx_process(adpt, &adpt->tx_q);
>>>
>>> You should consider using a NAPI instance for reclaiming TX buffers as
>>> well.
>>
>> I'll have to figure out what means and get back to you.
>
> drivers/net/ethernet/broadcom/bcmsysport.c is an example driver that
> reclaims transmitted buffers in NAPI. What that means is, take the TX
> completion interrupt, schedule a NAPI instance to run, and this NAPI
> instance cleans up the entire TX queue (it is not bounded, like the RX
> NAPI instance). It is really just moving the freeing of SKBs into
> softIRQ context vs. hardIRQ.

Thanks.  I don't think I'll get to any of the NAPI fixes in v5 of this 
driver.  I want to make sure I get the phylib conversion correct first.

>>>> +/* Configure VLAN tag strip/insert feature */
>>>> +static int emac_set_features(struct net_device *netdev,
>>>> +                 netdev_features_t features)
>>>> +{
>>>> +    struct emac_adapter *adpt = netdev_priv(netdev);
>>>> +
>>>> +    netdev_features_t changed = features ^ netdev->features;
>>>> +
>>>> +    if (!(changed & (NETIF_F_HW_VLAN_CTAG_TX |
>>>> NETIF_F_HW_VLAN_CTAG_RX)))
>>>> +        return 0;
>>>> +
>>>> +    netdev->features = features;
>>>> +    if (netdev->features & NETIF_F_HW_VLAN_CTAG_RX)
>>>> +        set_bit(EMAC_STATUS_VLANSTRIP_EN, &adpt->status);
>>>> +    else
>>>> +        clear_bit(EMAC_STATUS_VLANSTRIP_EN, &adpt->status);
>>>
>>> What about TX vlan offload?
>>
>> I don't know what that is.
>
> TX VLAN offload would be that you can specify the VLAN id somewhere in a
> packet's descriptor and have the HW automatically build an Ethernet
> frame with the correct VLAN id, and all the Ethernet frame payload
> appropriately placed at the correct offsets, with no cost for the CPU
> but indicating that information (and not having to do a memmove() to
> insert the 802.1Q tag).

I have no idea if our hardware supports that.  I'll make a note of TX 
VLAN offload and submit a separate patch if I can make it work.

>> and I've never understood why it's necessary to fall back to 32-bits if
>> 64 bits fails.  Isn't 64 bits a superset of 32 bits?  The driver is
>> saying that the hardware supports all of DDR.  How could fail, and how
>> could 32-bit succeed if 64-bits fails?
>
> I believe there could be cases where the HW is capable of addressing
> more physical memory than the CPU itself (usually unlikely, but it
> could),there could be cases where the HW is behind an IOMMMU which only
> has a window into the DDR, and that could prevent a higher DMA_BIT_MASK
> from being successfully configured.

So, so I'm going to add dma-ranges support (I posted another patch asked 
for feedback, but I haven't gotten it yet).

For ACPI, we're going to depend on IORT to set the DMA mask for us.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.

^ permalink raw reply

* Re: [PATCH net 9/9] net/mlx5e: Reset link modes upon setting speed to zero
From: David Miller @ 2016-04-21 18:04 UTC (permalink / raw)
  To: saeedm; +Cc: netdev, ogerlitz, talal, eranbe
In-Reply-To: <1461069222-27076-10-git-send-email-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 19 Apr 2016 15:33:42 +0300

> Upon ethtool request to set speed to 0 we handle it as a special request
> to reset link modes to Device's defaults.
> 
> Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4
> Ethernet functionality")
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Please don't try to sneak things like this into the patches you submit.

If you continue to add weird stuff like this, I will never _ever_ be
able to trust you guys and have a high degree of confidence in your
changes.  If you continue like this, I will always have to audit your
patches very strictly which is very time consuming for me.

Do not extend ethtool's semantics in a way which suits you specifically.

If we want to have this semantic, you must first propose it as a
global semantic which then in turn can be adopted by all drivers
supporting ethtool.

Thank you.

^ permalink raw reply

* [PATCH net 1/1] qlcnic: Update version to 5.3.64
From: Manish Chopra @ 2016-04-21 17:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, Dept-GELinuxNICDev

Just updating the version as many fixes got
accumulated over 5.3.63

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h b/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
index 55007f1..caf6ddb 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic.h
@@ -37,8 +37,8 @@
 
 #define _QLCNIC_LINUX_MAJOR 5
 #define _QLCNIC_LINUX_MINOR 3
-#define _QLCNIC_LINUX_SUBVERSION 63
-#define QLCNIC_LINUX_VERSIONID  "5.3.63"
+#define _QLCNIC_LINUX_SUBVERSION 64
+#define QLCNIC_LINUX_VERSIONID  "5.3.64"
 #define QLCNIC_DRV_IDC_VER  0x01
 #define QLCNIC_DRIVER_VERSION  ((_QLCNIC_LINUX_MAJOR << 16) |\
 		 (_QLCNIC_LINUX_MINOR << 8) | (_QLCNIC_LINUX_SUBVERSION))
-- 
2.7.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox