Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH net-next 06/10] net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap
From: David Laight @ 2016-11-28 15:28 UTC (permalink / raw)
  To: 'Tariq Toukan', David S. Miller
  Cc: netdev@vger.kernel.org, Eran Ben Elisha, Tal Alon
In-Reply-To: <1480261877-19720-7-git-send-email-tariqt@mellanox.com>

From: Tariq Toukan
> Sent: 27 November 2016 15:51
> From: Eran Ben Elisha <eranbe@mellanox.com>
> 
> In order to avoid temporary large structs on the stack,
> allocate them dynamically.
> 
> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> Signed-off-by: Tal Alon <talal@mellanox.com>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c | 244 +++++++++++++++++-------------
>  1 file changed, 142 insertions(+), 102 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 4a9497e9778d..65502df9fd96 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -799,40 +799,117 @@ static void slave_adjust_steering_mode(struct mlx4_dev *dev,
...
> +static int mlx4_slave_special_qp_cap(struct mlx4_dev *dev)
> +{
> +	struct mlx4_func_cap *func_cap = NULL;
> +	int i, err = 0;
> +
> +	func_cap = kzalloc(sizeof(*func_cap), GFP_KERNEL);
> +	dev->caps.qp0_qkey = kcalloc(dev->caps.num_ports,
> +				     sizeof(u32), GFP_KERNEL);
> +	dev->caps.qp0_tunnel = kcalloc(dev->caps.num_ports,
> +				       sizeof(u32), GFP_KERNEL);
> +	dev->caps.qp0_proxy = kcalloc(dev->caps.num_ports,
> +				      sizeof(u32), GFP_KERNEL);
> +	dev->caps.qp1_tunnel = kcalloc(dev->caps.num_ports,
> +				       sizeof(u32), GFP_KERNEL);
> +	dev->caps.qp1_proxy = kcalloc(dev->caps.num_ports,
> +				      sizeof(u32), GFP_KERNEL);
...
It has to be better to allocate a single piece of memory.
Potentially using a structure something like:

struct fubar {
	struct mlx4_func_cap func_cap;
	struct {
		u32	qp0_qkey;
		u32	qp0_tunnel;
		u32	qp0_proxy;
		u32	qp1_tunnel;
		u32	qp1_proxy;
	} port_info[];
};

	David

^ permalink raw reply

* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: Eric Dumazet @ 2016-11-28 15:31 UTC (permalink / raw)
  To: David Miller; +Cc: lsanfil, pavel, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161128.095431.856183735501262965.davem@davemloft.net>

On Mon, 2016-11-28 at 09:54 -0500, David Miller wrote:
> From: Lino Sanfilippo <lsanfil@marvell.com>
> Date: Mon, 28 Nov 2016 14:07:51 +0100
> 
> > Calling skb_orphan() in the xmit handler made this issue disappear.
> 
> This is not the way to handle this problem.
> 
> The solution is to free the SKBs in a timely manner after the
> chip has transmitted the frame.

Note that the 'pauses' described by Pavel are also caused by a too small
SO_SNDBUF value on the UDP socket.

An immediate fix, with no kernel change is to increase it.

echo 1000000 >/proc/sys/net/core/wmem_default

or

val = 1000000;
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));

^ permalink raw reply

* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: Lino Sanfilippo @ 2016-11-28 15:33 UTC (permalink / raw)
  To: David Miller; +Cc: pavel, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161128.095431.856183735501262965.davem@davemloft.net>

Hi,

On 28.11.2016 15:54, David Miller wrote:
> From: Lino Sanfilippo <lsanfil@marvell.com>
> Date: Mon, 28 Nov 2016 14:07:51 +0100
>
>> Calling skb_orphan() in the xmit handler made this issue disappear.
>
> This is not the way to handle this problem.
>

I agree, its not a clean solution. But if the use of skb_orphan makes the delays
disappear, it can at least be a hint that socket queue limits are involved.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net] net/dccp: fix use-after-free in dccp_invalid_packet
From: Arnaldo Carvalho de Melo @ 2016-11-28 15:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Arnaldo Carvalho de Melo, Andrey Konovalov,
	Gerrit Renker, David S. Miller, dccp, netdev, LKML, Dmitry Vyukov,
	Kostya Serebryany, Eric Dumazet, syzkaller
In-Reply-To: <1480346458.18162.54.camel@edumazet-glaptop3.roam.corp.google.com>

Em Mon, Nov 28, 2016 at 07:20:58AM -0800, Eric Dumazet escreveu:
> On Mon, 2016-11-28 at 12:05 -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Nov 28, 2016 at 06:47:14AM -0800, Eric Dumazet escreveu:
> > > On Mon, 2016-11-28 at 11:40 -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Mon, Nov 28, 2016 at 06:26:49AM -0800, Eric Dumazet escreveu:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > 
> > > > > pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
> > > > > in dccp_invalid_packet() or risk use after free.
> > > > > 
> > > > > Bug found by Andrey Konovalov using syzkaller.
> > > > > 
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > > > 
> > > > Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > 
> > > > I was about to send exactly this patch, and while looking at it I think
> > > > the patch below needs to go in as well, no? To follow the advice of that
> > > > Warning line there :-)
> > > > 
> > > > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > 
> > > > pskb_may_pull() can reallocate skb->head, so we can't access
> > > > iph->frag_off or risk use after free, save it to a variable and us that
> > > > later.
> > > > 
> > > > Cc: Andrey Konovalov <andreyknvl@google.com>
> > > > Cc: Eric Dumazet <edumazet@google.com>
> > > > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > 
> > > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > > > index 5ddf5cda07f4..9462070561a3 100644
> > > > --- a/net/ipv4/af_inet.c
> > > > +++ b/net/ipv4/af_inet.c
> > > > @@ -1198,6 +1198,7 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> > > >  	struct iphdr *iph;
> > > >  	int proto, tot_len;
> > > >  	int nhoff;
> > > > +	u16 frag_off;
> > > >  	int ihl;
> > > >  	int id;
> > > >  
> > > > @@ -1213,6 +1214,7 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> > > >  
> > > >  	id = ntohs(iph->id);
> > > >  	proto = iph->protocol;
> > > > +	frag_off = iph->frag_off;
> > > >  
> > > >  	/* Warning: after this point, iph might be no longer valid */
> > > >  	if (unlikely(!pskb_may_pull(skb, ihl)))
> > > > @@ -1233,7 +1235,7 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
> > > >  		fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);
> > > >  
> > > >  		/* fixed ID is invalid if DF bit is not set */
> > > > -		if (fixedid && !(iph->frag_off & htons(IP_DF)))
> > > > +		if (fixedid && !(frag_off & htons(IP_DF)))
> > > >  			goto out;
> > > >  	}
> > > >  
> > > 
> > > 
> > > I do not see why this patch would be needed ?
> > 
> > Where is iph being reloaded after that pskb_may_pull() and thus at line 1236 we
> > could use after free? The warning at line 1217?
> > 
> > 1209         iph = ip_hdr(skb);
> > 1210         ihl = iph->ihl * 4;
> > 1211         if (ihl < sizeof(*iph))
> > 1212                 goto out;
> > 1213 
> > 1214         id = ntohs(iph->id);
> > 1215         proto = iph->protocol;
> > 1216 
> > 1217         /* Warning: after this point, iph might be no longer valid */
> > 1218         if (unlikely(!pskb_may_pull(skb, ihl)))
> > 1219                 goto out;
> > 1220         __skb_pull(skb, ihl);
> > 1221 
> > 1222         encap = SKB_GSO_CB(skb)->encap_level > 0;
> > 1223         if (encap)
> > 1224                 features &= skb->dev->hw_enc_features;
> > 1225         SKB_GSO_CB(skb)->encap_level += ihl;
> > 1226 
> > 1227         skb_reset_transport_header(skb);
> > 1228 
> > 1229         segs = ERR_PTR(-EPROTONOSUPPORT);
> > 1230 
> > 1231         if (!skb->encapsulation || encap) {
> > 1232                 udpfrag = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
> > 1233                 fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);
> > 1234 
> > 1235                 /* fixed ID is invalid if DF bit is not set */
> > 1236                 if (fixedid && !(iph->frag_off & htons(IP_DF)))
> > 1237                         goto out;
> > 1238         }
> > 
> 
> Arg, I was looking at an old tree.
> 
> Please then add
> 
> Fixes: cbc53e08a793b ("GSO: Add GSO type for fixed IPv4 ID")
> 
> To ease stable backports.
> 
> Also, it looks comments are not read, we might kill this one and reload
> iph.
> 
> ( Saving 3 fields is now more expensive than simply reloading iph )

Ok, I instead used ip_hdr(skb)->frag_off at that place, as it is the
only use after the pskb_may_pull(), right after that use it will reload
iph in another fashion anyway, sending the patch in another message,
holler if you disagree, i.e. nacking that ack 8-)

- Arnaldo

^ permalink raw reply

* [PATCH 1/1] GSO: Reload iph after pskb_may_pull
From: Arnaldo Carvalho de Melo @ 2016-11-28 15:36 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, Alexander Duyck, Andrey Konovalov,
	Linux Networking Development Mailing List

As it may get stale and lead to use after free.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexander Duyck <aduyck@mirantis.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID")
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 net/ipv4/af_inet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cda07f4..215143246e4b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1233,7 +1233,7 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 		fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);
 
 		/* fixed ID is invalid if DF bit is not set */
-		if (fixedid && !(iph->frag_off & htons(IP_DF)))
+		if (fixedid && !(ip_hdr(skb)->frag_off & htons(IP_DF)))
 			goto out;
 	}
 
-- 
2.9.3

^ permalink raw reply related

* RE: [PATCH net-next 01/10] net/mlx4_core: Make each VF manage its own mac table
From: Mintz, Yuval @ 2016-11-28 15:37 UTC (permalink / raw)
  To: Tariq Toukan, David S. Miller
  Cc: netdev@vger.kernel.org, Eran Ben Elisha, Erez Shitrit,
	Gal Pressman
In-Reply-To: <1480261877-19720-2-git-send-email-tariqt@mellanox.com>

> Each VF can catch up to the max number of MACs in the MAC table (128) on
> the base of "first asks first gets".
> The VF should know the total free number of MACs from the PF.

While this might be a reasonable default policy given the alternatives
[or more precisely, the lack of ability for hyper-user to configure such],
it still sounds a bit crude - a user would have an available MAC only if
other users were kind enough not to consume all of them?

Have you thought of adding some configurable method through which
this could be controlled?

^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Uwe Kleine-König @ 2016-11-28 15:44 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Rob Herring, Frank Rowand, Andreas Färber,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Michal Hrusecki, Tomas Hlavacek, Bed??icha Ko??atu,
	Vivien Didelot, Florian Fainelli,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161128131735.GA4379-g2DYL2Zd6BY@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2968 bytes --]

On 11/28/2016 02:17 PM, Andrew Lunn wrote:
>> I still wonder (and didn't get an answer back when I asked about this)
>> why a comment is preferred here. For other devices I know it's usual and
>> requested by the maintainers to use:
>>
>> 	compatible = "exact name", "earlyer device to match driver";
>>
>> . This is more robust, documents the situation more formally and makes
>> it better greppable. The price to pay is only a few bytes in the dtb
>> which IMO is ok.
> 
> We did discuss this a while back. The information is useless and
> should to be ignored if present.

Who is "we"?

> The switch has a register which contains its model and revision. Each
> port has a set of registers, and register 3 contains the
> model/version. For all devices compatible with the 6085, the port
> registers start at address 0x10. For the 6190, the port registers
> start at 0x0. So given one of these two compatible strings, we can
> find the model of the device, from something which is burned into the
> silicon.
> 
> Now, say we did add per device compatible strings. We look up the
> model burned into the silicon, find it is different to what the device
> tree is and do what? Fail the probe? Or just keep going using the

I'd say fail to probe is the right thing to do. Of course that doesn't
work for already supported models because it will break compatibility.

I'd value the advantages (i.e. easily find machines with a given
hardware) higher than making broken dtbs work, so being a bit silly is
fine for me.

> value in the silicon? It seems silly to fail the probe if the driver
> does support the model, but that means the device tree is never
> verified and hence probably wrong. Why have wrong information in the
> device tree, especially wrong information which we never use. It is
> better to not have that information in the device tree.

At least we'd have a canonical way to specify the type of switch. If
it's not verified it's as good and bad as a dts comment. But the latter
isn't available in the dtb, which I consider a small disadvantage.

Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
hardware is really a "marvell,mv88e6176".

> Linus has said he does not like ARM devices because of all the busses
> which are not enumerable. Here we have a device which with a little
> bit of help we can enumerate. So we should. 

If you write

	compatible = "marvell,mv88e6176", "marvell,mv88e6085";

you can still enumerate in the same way as before.

There are several more instances where the device tree specifies
something that could be probed instead. Some examples:

	compatible = "ethernet-phy-id0141.0DD1", "ethernet-phy-ieee802.3-c22";
	compatible = "spansion,s25fl164k", "jedec,spi-nor";
	compatible = "fsl,imx25-flexcan", "fsl,p1010-flexcan";
	compatible = "arm,pl011", "arm,primecell";

So you think they are all doing it wrong?

Best regards
Uwe



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-28 15:48 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480348130-31354-1-git-send-email-dsa@cumulusnetworks.com>

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v3
- dropped the rename

v2
- fix bpf_prog_run_clear_cb to bpf_prog_run_save_cb as caught by Daniel

- rename BPF_PROG_TYPE_CGROUP_SKB and its cg_skb functions to
  BPF_PROG_TYPE_CGROUP and cgroup

 kernel/bpf/cgroup.c  | 27 ++++++++++++++++++++++-----
 kernel/bpf/syscall.c | 28 +++++++++++++++-------------
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f264b0..d5746aec8f34 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,6 +117,19 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
 	}
 }
 
+static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
+				       struct bpf_prog *prog)
+{
+	unsigned int offset = skb->data - skb_network_header(skb);
+	int ret;
+
+	__skb_push(skb, offset);
+	ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
+	__skb_pull(skb, offset);
+
+	return ret;
+}
+
 /**
  * __cgroup_bpf_run_filter() - Run a program for packet filtering
  * @sk: The socken sending or receiving traffic
@@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 
 	prog = rcu_dereference(cgrp->bpf.effective[type]);
 	if (prog) {
-		unsigned int offset = skb->data - skb_network_header(skb);
-
-		__skb_push(skb, offset);
-		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
-		__skb_pull(skb, offset);
+		switch (type) {
+		case BPF_CGROUP_INET_INGRESS:
+		case BPF_CGROUP_INET_EGRESS:
+			ret = __cgroup_bpf_run_filter_skb(skb, prog);
+			break;
+		/* make gcc happy else complains about missing enum value */
+		default:
+			return 0;
+		}
 	}
 
 	rcu_read_unlock();
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1090d16a31c1..c2bce596e842 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -843,6 +843,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
+	enum bpf_prog_type ptype;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -853,25 +854,26 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
-		prog = bpf_prog_get_type(attr->attach_bpf_fd,
-					 BPF_PROG_TYPE_CGROUP_SKB);
-		if (IS_ERR(prog))
-			return PTR_ERR(prog);
-
-		cgrp = cgroup_get_from_fd(attr->target_fd);
-		if (IS_ERR(cgrp)) {
-			bpf_prog_put(prog);
-			return PTR_ERR(cgrp);
-		}
-
-		cgroup_bpf_update(cgrp, prog, attr->attach_type);
-		cgroup_put(cgrp);
+		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
 
 	default:
 		return -EINVAL;
 	}
 
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(cgrp);
+	}
+
+	cgroup_bpf_update(cgrp, prog, attr->attach_type);
+	cgroup_put(cgrp);
+
 	return 0;
 }
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v3 0/3] net: Add bpf support to set sk_bound_dev_if
From: David Ahern @ 2016-11-28 15:48 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern

The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.

For example:
 1. configure vrf (e.g., using ifupdown2)
        auto eth0
        iface eth0 inet dhcp
            vrf mgmt

        auto mgmt
        iface mgmt
            vrf-table auto

 2. configure cgroup
        mount -t cgroup2 none /tmp/cgroupv2
        mkdir /tmp/cgroupv2/mgmt
        test_cgrp2_sock /tmp/cgroupv2/mgmt 15

 3. set shell into cgroup (e.g., can be done at login using pam)
        echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs

At this point all commands run in the shell (e.g, apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

David Ahern (3):
  bpf: Refactor cgroups code in prep for new type
  bpf: Add new cgroup attach type to enable sock modifications
  samples: bpf: add userspace example for modifying sk_bound_dev_if

 include/linux/bpf-cgroup.h    | 11 ++++++
 include/linux/filter.h        |  2 +-
 include/uapi/linux/bpf.h      |  6 ++++
 kernel/bpf/cgroup.c           | 36 ++++++++++++++++---
 kernel/bpf/syscall.c          | 33 +++++++++--------
 net/core/filter.c             | 65 +++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c            | 12 ++++++-
 net/ipv6/af_inet6.c           |  8 +++++
 samples/bpf/Makefile          |  2 ++
 samples/bpf/test_cgrp2_sock.c | 83 +++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 237 insertions(+), 21 deletions(-)
 create mode 100644 samples/bpf/test_cgrp2_sock.c

-- 
2.1.4

^ permalink raw reply

* [PATCH net-next v3 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: David Ahern @ 2016-11-28 15:48 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480348130-31354-1-git-send-email-dsa@cumulusnetworks.com>

Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v3
- reverted to new prog type BPF_PROG_TYPE_CGROUP_SOCK
- dropped the subtype

v2
- dropped the bpf_sock_store_u32 helper
- dropped the new prog type BPF_PROG_TYPE_CGROUP_SOCK
- moved valid access and context conversion to use subtype
- dropped CREATE from BPF_CGROUP_INET_SOCK and related function names
- moved running of filter from sk_alloc to inet{6}_create

 include/linux/bpf-cgroup.h | 11 ++++++++
 include/linux/filter.h     |  2 +-
 include/uapi/linux/bpf.h   |  6 +++++
 kernel/bpf/cgroup.c        |  9 +++++++
 kernel/bpf/syscall.c       |  5 +++-
 net/core/filter.c          | 65 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         | 12 ++++++++-
 net/ipv6/af_inet6.c        |  8 ++++++
 8 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index ec80d0c0953e..2ca529664c8b 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -64,6 +64,16 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 	__ret;								\
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				\
+({									\
+	int __ret = 0;							\
+	if (cgroup_bpf_enabled && sk) {					\
+		__ret = __cgroup_bpf_run_filter(sk, NULL,		\
+						BPF_CGROUP_INET_SOCK);	\
+	}								\
+	__ret;								\
+})
+
 #else
 
 struct cgroup_bpf {};
@@ -73,6 +83,7 @@ static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
 
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 
 #endif /* CONFIG_CGROUP_BPF */
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c521adfe..808e158742a2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -408,7 +408,7 @@ struct bpf_prog {
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
-	unsigned int		(*bpf_func)(const struct sk_buff *skb,
+	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *filter);
 	/* Instructions for interpreter */
 	union {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d1456f..8f410ecdaac4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,11 +101,13 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
 	BPF_PROG_TYPE_CGROUP_SKB,
+	BPF_PROG_TYPE_CGROUP_SOCK,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
+	BPF_CGROUP_INET_SOCK,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -537,6 +539,10 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+struct bpf_sock {
+	__u32 bound_dev_if;
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index d5746aec8f34..796e39aa28f5 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,6 +117,12 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
 	}
 }
 
+static int __cgroup_bpf_run_filter_sock(struct sock *sk,
+					struct bpf_prog *prog)
+{
+	return prog->bpf_func(sk, prog->insnsi) == 1 ? 0 : -EPERM;
+}
+
 static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
 				       struct bpf_prog *prog)
 {
@@ -171,6 +177,9 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 		case BPF_CGROUP_INET_EGRESS:
 			ret = __cgroup_bpf_run_filter_skb(skb, prog);
 			break;
+		case BPF_CGROUP_INET_SOCK:
+			ret = __cgroup_bpf_run_filter_sock(sk, prog);
+			break;
 		/* make gcc happy else complains about missing enum value */
 		default:
 			return 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c2bce596e842..f5247901a4cc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -856,7 +856,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_EGRESS:
 		ptype = BPF_PROG_TYPE_CGROUP_SKB;
 		break;
-
+	case BPF_CGROUP_INET_SOCK:
+		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -892,6 +894,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	switch (attr->attach_type) {
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
+	case BPF_CGROUP_INET_SOCK:
 		cgrp = cgroup_get_from_fd(attr->target_fd);
 		if (IS_ERR(cgrp))
 			return PTR_ERR(cgrp);
diff --git a/net/core/filter.c b/net/core/filter.c
index ea315af56511..593b6b664c0c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2645,6 +2645,12 @@ cg_skb_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+cg_sock_func_proto(enum bpf_func_id func_id)
+{
+	return NULL;
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2682,6 +2688,29 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool sock_filter_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sock, bound_dev_if):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	if (off < 0 || off + size > sizeof(struct bpf_sock))
+		return false;
+
+	/* The verifier guarantees that size > 0. */
+	if (off % size != 0)
+		return false;
+
+	return true;
+}
+
 static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write,
 			       const struct bpf_prog *prog)
 {
@@ -2940,6 +2969,30 @@ static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
+					  int dst_reg, int src_reg,
+					  int ctx_off,
+					  struct bpf_insn *insn_buf,
+					  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct bpf_sock, bound_dev_if):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_bound_dev_if) != 4);
+
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg,
+					offsetof(struct sock, sk_bound_dev_if));
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sock, sk_bound_dev_if));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 					 int src_reg, int ctx_off,
 					 struct bpf_insn *insn_buf,
@@ -3013,6 +3066,12 @@ static const struct bpf_verifier_ops cg_skb_ops = {
 	.convert_ctx_access	= sk_filter_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_sock_ops = {
+	.get_func_proto		= cg_sock_func_proto,
+	.is_valid_access	= sock_filter_is_valid_access,
+	.convert_ctx_access	= sock_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3038,6 +3097,11 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+static struct bpf_prog_type_list cg_sock_type __read_mostly = {
+	.ops	= &cg_sock_ops,
+	.type	= BPF_PROG_TYPE_CGROUP_SOCK
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
@@ -3045,6 +3109,7 @@ static int __init register_sk_filter_ops(void)
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
 	bpf_register_prog_type(&cg_skb_type);
+	bpf_register_prog_type(&cg_sock_type);
 
 	return 0;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5ddf5cda07f4..24d2550492ee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 
 	if (sk->sk_prot->init) {
 		err = sk->sk_prot->init(sk);
-		if (err)
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
 			sk_common_release(sk);
+			goto out;
+		}
 	}
 out:
 	return err;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d424f3a3737a..237e654ba717 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -258,6 +258,14 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 			goto out;
 		}
 	}
+
+	if (!kern) {
+		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
+		if (err) {
+			sk_common_release(sk);
+			goto out;
+		}
+	}
 out:
 	return err;
 out_rcu_unlock:
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v3 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if
From: David Ahern @ 2016-11-28 15:48 UTC (permalink / raw)
  To: netdev; +Cc: daniel, ast, daniel, maheshb, tgraf, David Ahern
In-Reply-To: <1480348130-31354-1-git-send-email-dsa@cumulusnetworks.com>

Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v3
- revert to BPF_PROG_TYPE_CGROUP_SOCK prog type

v2
- removed bpf_sock_store_u32 references
- changed BPF_CGROUP_INET_SOCK_CREATE to BPF_CGROUP_INET_SOCK
- remove BPF_PROG_TYPE_CGROUP_SOCK prog type and add prog_subtype

 samples/bpf/Makefile          |  2 ++
 samples/bpf/test_cgrp2_sock.c | 83 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_sock.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index fb17206ddb57..7de5cd9849c2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -23,6 +23,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += test_cgrp2_attach
+hostprogs-y += test_cgrp2_sock
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -51,6 +52,7 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
+test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
new file mode 100644
index 000000000000..2831e5f41f86
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -0,0 +1,83 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
+ *   sockets opened by processes in the cgroup.
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+static int prog_load(int idx)
+{
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_MOV64_IMM(BPF_REG_3, idx),
+		BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct bpf_sock, bound_dev_if)),
+		BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+		BPF_EXIT_INSN(),
+	};
+
+	return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK, prog, sizeof(prog),
+			     "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+	printf("Usage: %s cg-path device-index\n", argv0);
+	return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+	int cg_fd, prog_fd, ret;
+	int idx = 0;
+
+	if (argc < 2)
+		return usage(argv[0]);
+
+	idx = atoi(argv[2]);
+	if (!idx) {
+		printf("Invalid device index\n");
+		return EXIT_FAILURE;
+	}
+
+	cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+	if (cg_fd < 0) {
+		printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	prog_fd = prog_load(idx);
+	printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+	if (prog_fd < 0) {
+		printf("Failed to load prog: '%s'\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = bpf_prog_detach(cg_fd, BPF_CGROUP_INET_SOCK);
+	ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK);
+	if (ret < 0) {
+		printf("Failed to attach prog to cgroup: '%s'\n",
+		       strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	return EXIT_SUCCESS;
+}
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v4 1/4] net: phy: add an option to disable EEE advertisement
From: Jerome Brunet @ 2016-11-28 15:50 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Julia Lawall, Yegor Yefremov,
	Andreas Färber
In-Reply-To: <1480348229-25672-1-git-send-email-jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>


This patch adds an option to disable EEE advertisement in the generic PHY
by providing a mask of prohibited modes corresponding to the value found in
the MDIO_AN_EEE_ADV register.

On some platforms, PHY Low power idle seems to be causing issues, even
breaking the link some cases. The patch provides a convenient way for these
platforms to disable EEE advertisement and work around the issue.

Signed-off-by: Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
Tested-by: Yegor Yefremov <yegorslists-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org>
Tested-by: Andreas Färber <afaerber-l3A5Bk7waGM@public.gmane.org>
Tested-by: Neil Armstrong <narmstrong-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
---
 drivers/net/phy/phy.c        |  3 ++
 drivers/net/phy/phy_device.c | 80 +++++++++++++++++++++++++++++++++++++++-----
 include/linux/phy.h          |  3 ++
 3 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 73adbaa9ac86..a3981cc6448a 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1396,6 +1396,9 @@ int phy_ethtool_set_eee(struct phy_device *phydev, struct ethtool_eee *data)
 {
 	int val = ethtool_adv_to_mmd_eee_adv_t(data->advertised);
 
+	/* Mask prohibited EEE modes */
+	val &= ~phydev->eee_broken_modes;
+
 	phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV, MDIO_MMD_AN, val);
 
 	return 0;
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ba86c191a13e..cb4aca205cf8 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1121,6 +1121,43 @@ static int genphy_config_advert(struct phy_device *phydev)
 }
 
 /**
+ * genphy_config_eee_advert - disable unwanted eee mode advertisement
+ * @phydev: target phy_device struct
+ *
+ * Description: Writes MDIO_AN_EEE_ADV after disabling unsupported energy
+ *   efficent ethernet modes. Returns 0 if the PHY's advertisement hasn't
+ *   changed, and 1 if it has changed.
+ */
+static int genphy_config_eee_advert(struct phy_device *phydev)
+{
+	int broken = phydev->eee_broken_modes;
+	int old_adv, adv;
+
+	/* Nothing to disable */
+	if (!broken)
+		return 0;
+
+	/* If the following call fails, we assume that EEE is not
+	 * supported by the phy. If we read 0, EEE is not advertised
+	 * In both case, we don't need to continue
+	 */
+	adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV, MDIO_MMD_AN);
+	if (adv <= 0)
+		return 0;
+
+	old_adv = adv;
+	adv &= ~broken;
+
+	/* Advertising remains unchanged with the broken mask */
+	if (old_adv == adv)
+		return 0;
+
+	phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV, MDIO_MMD_AN, adv);
+
+	return 1;
+}
+
+/**
  * genphy_setup_forced - configures/forces speed/duplex from @phydev
  * @phydev: target phy_device struct
  *
@@ -1178,15 +1215,20 @@ EXPORT_SYMBOL(genphy_restart_aneg);
  */
 int genphy_config_aneg(struct phy_device *phydev)
 {
-	int result;
+	int err, changed;
+
+	changed = genphy_config_eee_advert(phydev);
 
 	if (AUTONEG_ENABLE != phydev->autoneg)
 		return genphy_setup_forced(phydev);
 
-	result = genphy_config_advert(phydev);
-	if (result < 0) /* error */
-		return result;
-	if (result == 0) {
+	err = genphy_config_advert(phydev);
+	if (err < 0) /* error */
+		return err;
+
+	changed |= err;
+
+	if (changed == 0) {
 		/* Advertisement hasn't changed, but maybe aneg was never on to
 		 * begin with?  Or maybe phy was isolated?
 		 */
@@ -1196,16 +1238,16 @@ int genphy_config_aneg(struct phy_device *phydev)
 			return ctl;
 
 		if (!(ctl & BMCR_ANENABLE) || (ctl & BMCR_ISOLATE))
-			result = 1; /* do restart aneg */
+			changed = 1; /* do restart aneg */
 	}
 
 	/* Only restart aneg if we are advertising something different
 	 * than we were before.
 	 */
-	if (result > 0)
-		result = genphy_restart_aneg(phydev);
+	if (changed > 0)
+		return genphy_restart_aneg(phydev);
 
-	return result;
+	return 0;
 }
 EXPORT_SYMBOL(genphy_config_aneg);
 
@@ -1563,6 +1605,21 @@ static void of_set_phy_supported(struct phy_device *phydev)
 		__set_phy_supported(phydev, max_speed);
 }
 
+static void of_set_phy_eee_broken(struct phy_device *phydev)
+{
+	struct device_node *node = phydev->mdio.dev.of_node;
+	u32 broken;
+
+	if (!IS_ENABLED(CONFIG_OF_MDIO))
+		return;
+
+	if (!node)
+		return;
+
+	if (!of_property_read_u32(node, "eee-broken-modes", &broken))
+		phydev->eee_broken_modes = broken;
+}
+
 /**
  * phy_probe - probe and init a PHY device
  * @dev: device to probe and init
@@ -1600,6 +1657,11 @@ static int phy_probe(struct device *dev)
 	of_set_phy_supported(phydev);
 	phydev->advertising = phydev->supported;
 
+	/* Get the EEE modes we want to prohibit. We will ask
+	 * the PHY stop advertising these mode later on
+	 */
+	of_set_phy_eee_broken(phydev);
+
 	/* Set the state to READY by default */
 	phydev->state = PHY_READY;
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index edde28ce163a..b53177fd38af 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -417,6 +417,9 @@ struct phy_device {
 	u32 advertising;
 	u32 lp_advertising;
 
+	/* Energy efficient ethernet modes which should be prohibited */
+	u32 eee_broken_modes;
+
 	int autoneg;
 
 	int link_timeout;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH net-next v4 3/4] dt: bindings: add ethernet phy eee-broken-modes option documentation
From: Jerome Brunet @ 2016-11-28 15:50 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic, linux-arm-kernel, linux-kernel,
	Julia Lawall, Yegor Yefremov, Andreas Färber
In-Reply-To: <1480348229-25672-1-git-send-email-jbrunet@baylibre.com>

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Tested-by: Neil Armstrong <narmstrong@baylibre.com>
---
 Documentation/devicetree/bindings/net/phy.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index 4627da3d52c4..54749b60a466 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -38,6 +38,8 @@ Optional Properties:
 - enet-phy-lane-swap: If set, indicates the PHY will swap the TX/RX lanes to
   compensate for the board being designed with the lanes swapped.
 
+- eee-broken-modes: Bits to clear in the MDIO_AN_EEE_ADV register to
+  disable EEE broken modes.
 
 Example:
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v4 4/4] ARM64: dts: meson: odroidc2: disable advertisement EEE for GbE.
From: Jerome Brunet @ 2016-11-28 15:50 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Julia Lawall, Yegor Yefremov,
	Andreas Färber
In-Reply-To: <1480348229-25672-1-git-send-email-jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>

Signed-off-by: Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
Tested-by: Neil Armstrong <narmstrong-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
---
 arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
index e6e3491d48a5..2036582ca0d6 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
@@ -46,6 +46,7 @@
 
 #include "meson-gxbb.dtsi"
 #include <dt-bindings/gpio/gpio.h>
+#include <dt-bindings/net/mdio.h>
 
 / {
 	compatible = "hardkernel,odroid-c2", "amlogic,meson-gxbb";
@@ -85,6 +86,19 @@
 	status = "okay";
 	pinctrl-0 = <&eth_pins>;
 	pinctrl-names = "default";
+
+	phy-handle = <&eth_phy0>;
+
+	mdio {
+		compatible = "snps,dwmac-mdio";
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		eth_phy0: ethernet-phy@0 {
+			reg = <0>;
+			eee-broken-modes = <MDIO_EEE_1000T>;
+		};
+	};
 };
 
 &ir {
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH net-next v4 0/4] Fix OdroidC2 Gigabit Tx link issue
From: Jerome Brunet @ 2016-11-28 15:50 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic, linux-arm-kernel, linux-kernel,
	Julia Lawall, Yegor Yefremov, Andreas Färber

This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
The platform seems to enter LPI on the Rx path too often while performing
relatively high TX transfer. This eventually break the link (both Tx and
Rx), and require to bring the interface down and up again to get the Rx
path working again.

The root cause of this issue is not fully understood yet but disabling EEE
advertisement on the PHY prevent this feature to be negotiated.
With this change, the link is stable and reliable, with the expected
throughput performance.

The patchset adds options in the generic phy driver to disable EEE
advertisement, through device tree. The way it is done is very similar
to the handling of the max-speed property.

Patch 4 is provided here for testing purpose only. Please don't merge
patch 4, this change will go through the amlogic's tree.

Chnages since V3: [3]
 - Fix signess error reported by kbuild test robot (Thx Julia)

Changes since V2: [2]
 - Rename "eee-advert-disable" to "eee-broken-modes" to make the intended
   purpose of this option clear (flag broken configuration, not a
   configuration option)
 - Add DT bindings constants so the DT configuration is more user friendly
 - Submit to net-next instead of net.

Changes since V1: [1]
 - Disable the advertisement of EEE in the generic code instead of the
   realtek driver.

[1] : http://lkml.kernel.org/r/1479220154-25851-1-git-send-email-jbrunet@baylibre.com
[2] : http://lkml.kernel.org/r/1479742524-30222-1-git-send-email-jbrunet@baylibre.com
[3] : http://lkml.kernel.org/r/1480326409-25419-1-git-send-email-jbrunet@baylibre.com

Jerome Brunet (4):
  net: phy: add an option to disable EEE advertisement
  dt-bindings: net: add EEE capability constants
  dt: bindings: add ethernet phy eee-broken-modes option documentation
  ARM64: dts: meson: odroidc2: disable advertisement EEE for GbE.

 Documentation/devicetree/bindings/net/phy.txt      |  2 +
 .../arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts | 14 ++++
 drivers/net/phy/phy.c                              |  3 +
 drivers/net/phy/phy_device.c                       | 80 +++++++++++++++++++---
 include/dt-bindings/net/mdio.h                     | 19 +++++
 include/linux/phy.h                                |  3 +
 6 files changed, 112 insertions(+), 9 deletions(-)
 create mode 100644 include/dt-bindings/net/mdio.h

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next v4 2/4] dt-bindings: net: add EEE capability constants
From: Jerome Brunet @ 2016-11-28 15:50 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Andrew Lunn,
	Neil Armstrong, linux-amlogic, linux-arm-kernel, linux-kernel,
	Julia Lawall, Yegor Yefremov, Andreas Färber
In-Reply-To: <1480348229-25672-1-git-send-email-jbrunet@baylibre.com>

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Tested-by: Neil Armstrong <narmstrong@baylibre.com>
---
 include/dt-bindings/net/mdio.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 include/dt-bindings/net/mdio.h

diff --git a/include/dt-bindings/net/mdio.h b/include/dt-bindings/net/mdio.h
new file mode 100644
index 000000000000..99c6d903d439
--- /dev/null
+++ b/include/dt-bindings/net/mdio.h
@@ -0,0 +1,19 @@
+/*
+ * This header provides generic constants for ethernet MDIO bindings
+ */
+
+#ifndef _DT_BINDINGS_NET_MDIO_H
+#define _DT_BINDINGS_NET_MDIO_H
+
+/*
+ * EEE capability Advertisement
+ */
+
+#define MDIO_EEE_100TX		0x0002	/* 100TX EEE cap */
+#define MDIO_EEE_1000T		0x0004	/* 1000T EEE cap */
+#define MDIO_EEE_10GT		0x0008	/* 10GT EEE cap */
+#define MDIO_EEE_1000KX		0x0010	/* 1000KX EEE cap */
+#define MDIO_EEE_10GKX4		0x0020	/* 10G KX4 EEE cap */
+#define MDIO_EEE_10GKR		0x0040	/* 10G KR EEE cap */
+
+#endif
-- 
2.7.4

^ permalink raw reply related

* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: Lino Sanfilippo @ 2016-11-28 15:57 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: pavel, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <1480347103.18162.58.camel@edumazet-glaptop3.roam.corp.google.com>



On 28.11.2016 16:31, Eric Dumazet wrote:
> On Mon, 2016-11-28 at 09:54 -0500, David Miller wrote:
>> From: Lino Sanfilippo <lsanfil@marvell.com>
>> Date: Mon, 28 Nov 2016 14:07:51 +0100
>>
>>> Calling skb_orphan() in the xmit handler made this issue disappear.
>>
>> This is not the way to handle this problem.
>>
>> The solution is to free the SKBs in a timely manner after the
>> chip has transmitted the frame.
>
> Note that the 'pauses' described by Pavel are also caused by a too small
> SO_SNDBUF value on the UDP socket.
>
> An immediate fix, with no kernel change is to increase it.
>
> echo 1000000 >/proc/sys/net/core/wmem_default
>
> or
>
> val = 1000000;
> setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));
>

I wonder if the best fix would be indeed to deactivate irq coalescing completely.
Does it make any sense at all to use it if a driver uses NAPI already?

Regards,
Lino

^ permalink raw reply

* Re: [[PATCH net-next RFC] 4/4] net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
From: Vivien Didelot @ 2016-11-28 16:00 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Andrew Lunn
In-Reply-To: <1479944598-20372-5-git-send-email-andrew@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

>  static const struct mv88e6xxx_ops mv88e6391_ops = {
> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> index a0d0f79a7de8..b846a33c024c 100644
> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> @@ -123,6 +123,10 @@
>  #define PORT_CONTROL_USE_TAG		BIT(4)
>  #define PORT_CONTROL_FORWARD_UNKNOWN_MC	BIT(3)
>  #define PORT_CONTROL_FORWARD_UNKNOWN	BIT(2)
> +#define PORT_CONTROL_NOT_EGREES_UNKNOWN_DA		(0x0 << 2)
> +#define PORT_CONTROL_NOT_EGREES_UNKNOWN_MULTICAST_DA	(0x1 << 2)
> +#define PORT_CONTROL_NOT_EGREES_UNKNOWN_UNITCAST_DA	(0x2 << 2)

s/EGREES/EGRESS/.

> +#define PORT_CONTROL_EGREES_ALL_UNKNOWN_DA		(0x3 << 2)
>  #define PORT_CONTROL_STATE_MASK		0x03
>  #define PORT_CONTROL_STATE_DISABLED	0x00
>  #define PORT_CONTROL_STATE_BLOCKING	0x01
> @@ -821,6 +825,8 @@ struct mv88e6xxx_ops {
>  				uint64_t *data);
>  	int (*tag_remap)(struct mv88e6xxx_chip *chip, int port);
>  	int (*monitor_ctrl)(struct mv88e6xxx_chip *chip, int upstream_port);
> +	int (*cpu_port_config)(struct mv88e6xxx_chip *chip, int port);
> +	int (*dsa_port_config)(struct mv88e6xxx_chip *chip, int port);
>  };

Hum, I dislike having config wrappers in the library. The functions
stored in the mv88e6xxx_ops structure should reflect the features
exposed by the SMI device (Port, Global X, PHY, etc.) registers.

They should not handle configuration logic, we should see them as an
MV88E6XXX API used to implement a chip (usually wrapped in chip.c).

>  
>  #define STATS_TYPE_PORT		BIT(0)
> diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
> index b7fab70f6cd7..a37d7d72df47 100644
> --- a/drivers/net/dsa/mv88e6xxx/port.c
> +++ b/drivers/net/dsa/mv88e6xxx/port.c
> @@ -553,3 +553,78 @@ int mv88e6390_tag_remap(struct mv88e6xxx_chip *chip, int port)
>  
>  	return 0;
>  }
> +
> +int mv88e6095_cpu_port_config(struct mv88e6xxx_chip *chip, int port)
> +{
> +	u16 reg;
> +	int err;
> +
> +	err = mv88e6xxx_port_read(chip, port, PORT_CONTROL, &reg);
> +	if (err)
> +		return err;
> +
> +	reg |= PORT_CONTROL_DSA_TAG |
> +		PORT_CONTROL_EGRESS_ADD_TAG |
> +		PORT_CONTROL_FORWARD_UNKNOWN;
> +
> +	return mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
> +}
> +
> +int mv88e6351_cpu_port_config(struct mv88e6xxx_chip *chip, int port)
> +{
> +	u16 reg;
> +	int err;
> +
> +	err = mv88e6xxx_port_read(chip, port, PORT_CONTROL, &reg);
> +	if (err)
> +		return err;
> +
> +	if (chip->info->tag_protocol == DSA_TAG_PROTO_EDSA) {
> +		reg |= PORT_CONTROL_FRAME_ETHER_TYPE_DSA |
> +			PORT_CONTROL_EGRESS_ADD_TAG;
> +
> +		/* Port Ethertype: use the Ethertype DSA Ethertype
> +		 * value.
> +		 */
> +		err = mv88e6xxx_port_write(chip, port, PORT_ETH_TYPE,
> +					   ETH_P_EDSA);
> +		if (err)
> +			return err;
> +	} else {
> +		reg |= PORT_CONTROL_FRAME_MODE_DSA;
> +	}
> +
> +	reg |= PORT_CONTROL_EGREES_ALL_UNKNOWN_DA;
> +
> +	return mv88e6xxx_port_write(chip, port, PORT_CONTROL, reg);
> +}

The device (Port, Global X, etc.) functions should reflect and implement
the feature described in the register(s) they read from/write to.

Please provide generic functions to configure a port's Frame mode and
Egress mode under the comment /* Offset 0x04: Port Control Register */

mv88e6xxx_port_set_{frame,egress}_mode() would work. The Frame mode bits
changed between chips, but the Egress mode bits are the same since 6065.

A frame mode can be "Normal Network mode", "DSA mode", "Provider mode",
or "Ether Type DSA mode". Maybe use an enum, or switching on the u8
frame_mode value might be enough, as you wish.

Please also provide a generic function to set the Port EType under an
ordered comment:

/* Offset 0x0F: Port E Type */

int mv88e6xxx_port_set_etype(struct mv88e6xxx_chip *chip, int port,
                             u16 etype)
{
    return mv88e6xxx_port_write(chip, port, PORT_ETH_TYPE, etype);
}

Then, some nice wrappers in chip.c can use them depending on the chip's
tag_protocol to configure what net/dsa calls cpu, dsa, or user port.

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH] stmmac ethernet: remove cut & paste code
From: Joe Perches @ 2016-11-28 16:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: peppe.cavallaro, netdev, kernel list, ezequiel, sonic.zhang,
	fabrice.gasnier
In-Reply-To: <20161128143539.GB31224@amd>

On Mon, 2016-11-28 at 15:35 +0100, Pavel Machek wrote:
> On Mon 2016-11-28 06:24:28, Joe Perches wrote:
> > On Mon, 2016-11-28 at 12:50 +0100, Pavel Machek wrote:
> > > On Thu 2016-11-24 14:27:13, Joe Perches wrote:
> > > > On Thu, 2016-11-24 at 22:44 +0100, Pavel Machek wrote:
> > > > > On Thu 2016-11-24 12:05:25, Joe Perches wrote:
> > > > > > On Thu, 2016-11-24 at 12:05 +0100, Pavel Machek wrote:
> > > > > > > Remove duplicate code from _tx routines.
> > > > > > 
> > > > > > trivia:
> > > > > > 
> > > > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > > 
> > > > > > []
> > > > > > > @@ -1960,6 +1960,38 @@ static void stmmac_tso_allocator(struct stmmac_priv *priv, unsigned int des,
> > > > > > >  	}
> > > > > > >  }
> > > > > > >  
> > > > > > > +static void stmmac_xmit_common(struct sk_buff *skb, struct net_device *dev, int nfrags, struct dma_desc *desc)
> > > > > > > +{
> > > > > > > +	struct stmmac_priv *priv = netdev_priv(dev);
> > > > > > > +
> > > > > > > +	if (unlikely(stmmac_tx_avail(priv) <= (MAX_SKB_FRAGS + 1))) {
> > > > > > > +		if (netif_msg_hw(priv))
> > > > > > > +			pr_debug("%s: stop transmitted packets\n", __func__);
> > > > > > 
> > > > > > 		netif_dbg(priv, hw, dev, "%s: stop transmitted packets\n",
> > > > > > 			  __func__);
> > > > > 
> > > > > Not now. Modifying the code while de-duplicating would be bad idea.
> > > > 
> > > > Too many people think overly granular patches are the
> > > > best and only way to make changes.
> > > > Deduplication and consolidation can happen simultaneously.
> > > 
> > > Can, but should not at this point. Please take a look at the driver in
> > > question before commenting on trivial printk style.
> > 
> > I had.
> > 
> > It's perfectly acceptable and already uses netif_<level> properly.
> > 
> > This consolidation now introduces the _only_ instance where it is
> > now improperly using a netif_msg_<type> then single pr_<level>
> > function sequence that should be consolidated into netif_dbg.
> > Every other use of netif_msg_<level> then either emits multiple
> > lines or is used in an if/else.
> 
> Are you looking at right driver?

Yes and I think you should make changes against -next
and not Linus' where this is:

b3e51069627e2 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c (LABBE Corentin            2016-11-16 20:09:41 +0100  755)                              netif_warn(priv, link, priv->dev,
b3e51069627e2 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c (LABBE Corentin            2016-11-16 20:09:41 +0100  756)                                         "Speed (%d) not 10/100\n",
b3e51069627e2 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c (LABBE Corentin            2016-11-16 20:09:41 +0100  757)                                         phydev->speed);

>  I don't see single use of
> netif_msg_<level>, but see this at stmmac_main.c:756. Code is actually
> pretty consistent using pr_*.
> 
>                                 if (netif_msg_link(priv))
>                                         pr_warn("%s: Speed (%d) not 10/100\n",
>                                                 dev->name, phydev->speed);
> 
> Anyway, I'm moving code around, if you want to do trivial cleanups, do
> them yourself.

cheers, Joe

^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Andrew Lunn @ 2016-11-28 16:14 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Rob Herring, Frank Rowand, Andreas Färber,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Michal Hrusecki, Tomas Hlavacek, Bed??icha Ko??atu,
	Vivien Didelot, Florian Fainelli,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <2c59cc79-b6dc-9920-1725-a7785ff3b6bf-rXY34ruvC2xidJT2blvkqNi2O/JbrIOy@public.gmane.org>

On Mon, Nov 28, 2016 at 04:44:47PM +0100, Uwe Kleine-König wrote:
> On 11/28/2016 02:17 PM, Andrew Lunn wrote:
> >> I still wonder (and didn't get an answer back when I asked about this)
> >> why a comment is preferred here. For other devices I know it's usual and
> >> requested by the maintainers to use:
> >>
> >> 	compatible = "exact name", "earlyer device to match driver";
> >>
> >> . This is more robust, documents the situation more formally and makes
> >> it better greppable. The price to pay is only a few bytes in the dtb
> >> which IMO is ok.
> > 
> > We did discuss this a while back. The information is useless and
> > should to be ignored if present.
> 
> Who is "we"?

Anybody on netdev a while back, but mostly Vivien and myself.

> I'd say fail to probe is the right thing to do. Of course that doesn't
> work for already supported models because it will break compatibility.

And that is the first rule of device tree, never break backwards
compatibility.

> Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
> hardware is really a "marvell,mv88e6176".

Why? Remember, the property name is called "compatible", not "is".

> > Linus has said he does not like ARM devices because of all the busses
> > which are not enumerable. Here we have a device which with a little
> > bit of help we can enumerate. So we should. 
> 
> If you write
> 
> 	compatible = "marvell,mv88e6176", "marvell,mv88e6085";
> 
> you can still enumerate in the same way as before.

Sure it would. But if the driver decides to ignore it, it is likely to
be wrong. Developers are used to comments being wrong. We are
suspicious of comments, we don't trust them 100%. But if somebody sees
a property in a device tree, they put a higher 'trustability' value on
it. But it actually has less trustable than a comment.
 
> There are several more instances where the device tree specifies
> something that could be probed instead. Some examples:
> 
> 	compatible = "ethernet-phy-id0141.0DD1", "ethernet-phy-ieee802.3-c22";

There are examples of phys which don't implemented register 2 and
3. In that case, you do need to specify the ID, if you want the
correct driver to load.

> 	compatible = "spansion,s25fl164k", "jedec,spi-nor";

And there was a push recently to add "jedec,spi-nor" everywhere and
deprecate a specific compatible because it is also not needed.

> 	compatible = "fsl,imx25-flexcan", "fsl,p1010-flexcan";
> 	compatible = "arm,pl011", "arm,primecell";

I don't know these hardwares, so cannot comment.

  Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-11-28 16:22 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <1480017556-25988-1-git-send-email-david.lebrun@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Thu, 24 Nov 2016 20:59:16 +0100

> When multiple nexthops are available for a given route, the routing engine
> chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> and by taking that value modulo the number of nexthops. The resulting value
> indexes the nexthop to select. This method causes issues when a new nexthop
> is added or one is removed (e.g. link failure). In that case, the number
> of nexthops changes and potentially all the flows get re-routed to another
> nexthop.
> 
> This patch implements a consistent hash method to select the nexthop in
> case of ECMP. The idea is to generate N random numbers (__u32) for each
> nexthop, where N is configurable. In order to select a nexthop, we find the
> number that is directly higher than the flow hash (which is also __u32).
> The nexthop associated to the number is then selected. The lookup method
> performs a binary search over the sorted array of numbers, which yields a
> time complexity of O(log n), where n is the number of nexthops times the
> number of random values generated for each number.
> 
> This feature can be enabled through the CONFIG_IPV6_MULTIPATH_CONSISTENT
> option and the number of random values generated for each nexthop is
> defined through CONFIG_IPV6_MPCONSIST_BUCKETSIZE.
> 
> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>

Thanks for trying to solve this problem.

But we really don't want this to be Kconfig gated.  If we decide to
support this it should be a run-time selectable option.  Every
distribution on the planet is going to turn your Kconfig option on, so
whatever you think we might be saving by putting this behind Kconfig
checks won't be realized for large swaths of the userbase.

I also question how necessary %100 consistent hashing of ECMP really
is.

If we can do something at extremely low cost and arrive at a very low
disruption rate, honestly that's probably good enough.

I assume you've looked over RFC2992.  A low disrutpion algorithm is
described there and if we could just get rid of the divide it might
be extremely desirable.

^ permalink raw reply

* Re: [PATCH] net/iucv: use explicit clean up labels in iucv_init()
From: David Miller @ 2016-11-28 16:24 UTC (permalink / raw)
  To: bigeasy; +Cc: ubraun, tglx, linux-kernel, rt, linux-s390, netdev
In-Reply-To: <20161124161013.dukr42y2nwscosk6@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Thu, 24 Nov 2016 17:10:13 +0100

> Ursula suggested to use explicit labels for clean up in the error path
> instead of one `out_free' label which handles multiple exits.
> Since the previous patch got already applied, here is a follow up patch.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

"Previous patch" doesn't tell readers anything specific enough to identify
the change you are referring to.  This will be even more true years down
the line if someone tries to read this commit and figure out what you
are referring to.

We have a standard mechanism to refer to commits, via SHA1_ID and commit
header line text, please use it.

Thank you.

^ permalink raw reply

* Re: [PATCH] rtlwifi: Add updates for RTL8723BE and RTL8821AE
From: Kyle McMartin @ 2016-11-28 16:24 UTC (permalink / raw)
  To: Larry Finger
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-firmware-DgEjT+Ai2ygdnm+yROfE0A
In-Reply-To: <20161127192834.19907-1-Larry.Finger-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>

On Sun, Nov 27, 2016 at 01:28:34PM -0600, Larry Finger wrote:
> The new versions will only work with new versions of the drivers. For
> that reason, they are given new names and the old versions are retained.
> 
> Signed-off-by: Larry Finger <Larry.Finger-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
> ---
>  WHENCE                     |   4 ++++

applied, thanks Larry.

regards, --kyle

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: enable multiqueue by default
From: David Miller @ 2016-11-28 16:28 UTC (permalink / raw)
  To: mst
  Cc: nhorman, jeder, hannes, netdev, linux-kernel, virtualization,
	myllynen, maxime.coquelin
In-Reply-To: <20161125064201-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 25 Nov 2016 06:43:08 +0200

> On Fri, Nov 25, 2016 at 12:37:26PM +0800, Jason Wang wrote:
>> We use single queue even if multiqueue is enabled and let admin to
>> enable it through ethtool later. This is used to avoid possible
>> regression (small packet TCP stream transmission). But looks like an
>> overkill since:
>> 
>> - single queue user can disable multiqueue when launching qemu
>> - brings extra troubles for the management since it needs extra admin
>>   tool in guest to enable multiqueue
>> - multiqueue performs much better than single queue in most of the
>>   cases
>> 
>> So this patch enables multiqueue by default: if #queues is less than or
>> equal to #vcpu, enable as much as queue pairs; if #queues is greater
>> than #vcpu, enable #vcpu queue pairs.
>> 
>> Cc: Hannes Frederic Sowa <hannes@redhat.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Neil Horman <nhorman@redhat.com>
>> Cc: Jeremy Eder <jeder@redhat.com>
>> Cc: Marko Myllynen <myllynen@redhat.com>
>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> OK at some level but all uses of num_online_cpus()
> like this are racy versus hotplug.
> I know we already have this bug but shouldn't we fix it
> before we add more?

This is more being used like a heuristic in this scenerio, and in
fact I would say one would keep the code this way even once proper
hotplug handlers are installed to adjust the queued dynamically if
there is a desired (which is also not necessarily the case).

I really don't think this change should be held on up on this issue.
So can we please make some forward progress here?

Thanks.

^ permalink raw reply

* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: David Miller @ 2016-11-28 16:30 UTC (permalink / raw)
  To: lsanfil; +Cc: eric.dumazet, pavel, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <920b1148-a77d-0b69-01b0-b14bcbdf8648@marvell.com>

From: Lino Sanfilippo <lsanfil@marvell.com>
Date: Mon, 28 Nov 2016 16:57:35 +0100

> I wonder if the best fix would be indeed to deactivate irq coalescing
> completely.
> Does it make any sense at all to use it if a driver uses NAPI already?

It absolutely does make sense, when it is implemented and functions
properly.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox