Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: net: GPF in eth_header
From: Dmitry Vyukov @ 2016-11-28 19:34 UTC (permalink / raw)
  To: syzkaller
  Cc: Hannes Frederic Sowa, David Miller, Tom Herbert, Alexander Duyck,
	Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <CAAeHK+wWVp46ej0tk5mDnOHpH5Ky0+HiH9_SUqGcT0iTppLaGw@mail.gmail.com>

On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller
<syzkaller@googlegroups.com> wrote:
> On Mon, Nov 28, 2016 at 7:50 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sat, 2016-11-26 at 20:07 +0100, Andrey Konovalov wrote:
>>> On Sat, Nov 26, 2016 at 7:28 PM, 'Eric Dumazet' via syzkaller
>>> <syzkaller@googlegroups.com> wrote:
>>> > On Sat, Nov 26, 2016 at 9:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> >> Hello,
>>> >>
>>> >> The following program triggers GPF in eth_header:
>>> >>
>>> >> https://gist.githubusercontent.com/dvyukov/613cadf05543b55a419f237e419cd495/raw/5471231523d1a07c3de55f11f87472c2816ee06c/gistfile1.txt
>>> >>
>>> >> On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).
>>> >>
>>> >> BUG: unable to handle kernel paging request at ffffed002d14d74a
>>> >> IP: [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >> PGD 7fff6067 [   50.787819] PUD 7fff5067
>>> >> PMD 0 [   50.787819]
>>> >> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>>> >> Modules linked in:
>>> >> CPU: 2 PID: 6712 Comm: a.out Not tainted 4.9.0-rc6+ #55
>>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> >> task: ffff88003a1841c0 task.stack: ffff880034d08000
>>> >> RIP: 0010:[<ffffffff86be3295>]  [<ffffffff86be3295>]
>>> >> eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >> RSP: 0018:ffff880034d0eb68  EFLAGS: 00010a03
>>> >> RAX: 1ffff1002d14d74a RBX: ffff880168a6ba4a RCX: ffff88006a9c7858
>>> >> RDX: 000000000000dd86 RSI: dffffc0000000000 RDI: ffff880168a6ba56
>>> >> RBP: ffff880034d0eb98 R08: 0000000000000000 R09: 0000000000000031
>>> >> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
>>> >> R13: ffff88006c208d80 R14: 00000000000086dd R15: ffff88006a9c7858
>>> >> FS:  0000000001a02940(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
>>> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> >> CR2: ffffed002d14d74a CR3: 0000000037373000 CR4: 00000000000006e0
>>> >> Stack:
>>> >>  000000316881ab40 ffff88006a9c76c0 ffff88006881ab40 ffff88006a9c77f8
>>> >>  0000000000000000 dffffc0000000000 ffff880034d0ee98 ffffffff86b31af9
>>> >>  ffffffff8719605c ffff880034d0f0f8 ffffffff000086dd ffffffff86be3220
>>> >> Call Trace:
>>> >>  [<     inline     >] dev_hard_header ./include/linux/netdevice.h:2762
>>> >>  [<ffffffff86b31af9>] neigh_resolve_output+0x659/0xb20 net/core/neighbour.c:1302
>>> >>  [<     inline     >] dst_neigh_output ./include/net/dst.h:464
>>> >>  [<ffffffff8719605c>] ip6_finish_output2+0xb3c/0x2500 net/ipv6/ip6_output.c:121
>>> >>  [<ffffffff871a0b0b>] ip6_finish_output+0x2eb/0x760 net/ipv6/ip6_output.c:139
>>> >>  [<     inline     >] NF_HOOK_COND ./include/linux/netfilter.h:246
>>> >>  [<ffffffff871a1157>] ip6_output+0x1d7/0x9a0 net/ipv6/ip6_output.c:153
>>> >>  [<     inline     >] dst_output ./include/net/dst.h:501
>>> >>  [<ffffffff873312ea>] ip6_local_out+0x9a/0x180 net/ipv6/output_core.c:170
>>> >>  [<ffffffff871a3886>] ip6_send_skb+0xa6/0x340 net/ipv6/ip6_output.c:1712
>>> >>  [<ffffffff871a3bd8>] ip6_push_pending_frames+0xb8/0xe0
>>> >> net/ipv6/ip6_output.c:1732
>>> >>  [<     inline     >] rawv6_push_pending_frames net/ipv6/raw.c:607
>>> >>  [<ffffffff8722acfb>] rawv6_sendmsg+0x250b/0x2c20 net/ipv6/raw.c:920
>>> >>  [<ffffffff8701c4f5>] inet_sendmsg+0x385/0x590 net/ipv4/af_inet.c:734
>>> >>  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
>>> >>  [<ffffffff86a6ea9f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
>>> >>  [<ffffffff86a6ee0b>] sock_write_iter+0x32b/0x620 net/socket.c:829
>>> >>  [<ffffffff81a6f153>] do_iter_readv_writev+0x363/0x670 fs/read_write.c:695
>>> >>  [<ffffffff81a71ba1>] do_readv_writev+0x431/0x9b0 fs/read_write.c:872
>>> >>  [<ffffffff81a726dc>] vfs_writev+0x8c/0xc0 fs/read_write.c:911
>>> >>  [<ffffffff81a72825>] do_writev+0x115/0x2d0 fs/read_write.c:944
>>> >>  [<     inline     >] SYSC_writev fs/read_write.c:1017
>>> >>  [<ffffffff81a75fdc>] SyS_writev+0x2c/0x40 fs/read_write.c:1014
>>> >>  [<ffffffff8814cf85>] entry_SYSCALL_64_fastpath+0x23/0xc6
>>> >> arch/x86/entry/entry_64.S:209
>>> >> Code: 41 83 fe 04 0f 84 aa 00 00 00 e8 17 4e b0 fa 48 8d 7b 0c 48 be
>>> >> 00 00 00 00 00 fc ff df 44 89 f2 66 c1 c2 08 48 89 f8 48 c1 e8 03 <0f>
>>> >> b6 0c 30 48 8d 43 0d 49 89 c0 49 c1 e8 03 41 0f b6 34 30 49
>>> >> RIP  [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >>  RSP <ffff880034d0eb68>
>>> >> CR2: ffffed002d14d74a
>>> >> ---[ end trace a73fedfdc11bd60c ]---
>>> >
>>> >
>>> > Hi Dmitry
>>> >
>>> > I could not reproduce the issue. Might need some specific configuration...
>>> >
>>> > loopback device has proper ethernet header (all 0)
>>> >
>>> > Fault happens in :
>>> >
>>> > 0f b6 0c 30             movzbl (%rax,%rsi,1),%ecx
>>> >
>>> > RAX=1ffff1002d14d74a  which is RDI>>3, and RSI=dffffc0000000000
>>> >
>>> > Could this be a KASAN problem ?
>>>
>>> Hi Eric,
>>>
>>> The crash happens when the kernel tries to access shadow for nonmapped memory.
>>>
>>> The issue here is an integer overflow which happens in neigh_resolve_output().
>>> skb_network_offset(skb) can return negative number, but __skb_pull()
>>> accepts unsigned int as len.
>>> As a result, the least significat bit in higher 32 bits of skb->data
>>> gets set and we get an out-of-bounds with offset of 4 GB.
>>>
>>> I've attached a short reproducer, but you either need KASAN or to add
>>> a BUG_ON to see the crash.
>>> In this reproducer skb_network_offset() becomes negative after merging
>>> two ipv6 fragments.
>>>
>>> I actually see multiple places where skb_network_offset() is used as
>>> an argument to skb_pull().
>>> So I guess every place can potentially be buggy.
>>>
>>> Thanks!
>>
>> I can not reproduce the bug on my hosts.
>> Quite hard to debug for me.
>>
>> skb_network_offset() can not be negative at this point, unless there is
>> a bug upper in the stack.
>
> Hi Eric,
>
> As far as I can see, skb_network_offset() becomes negative after
> pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> At least I'm able to detect that with a BUG_ON().
>
> Also it seems that the issue is only reproducible (at least with the
> poc I provided) for a short time after boot.


Eric,

Is it enough to debug? Or maybe Andrey can trace some values for you.

^ permalink raw reply

* Re: [PATCH ethtool 2/2] Ethtool: Implements PHY Loopback
From: Allan W. Nielsen @ 2016-11-28 19:34 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, linville, andrew, raju.lakkaraju
In-Reply-To: <73aa42bc-b476-ba5a-cf22-115b409a40e3@gmail.com>

On 28/11/16 09:56, Florian Fainelli wrote:
> On 11/28/2016 05:25 AM, Allan W. Nielsen wrote:
> > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> >
> > Add loopback in ethtool tunables to access PHY drivers.
> >
> > Ethtool Help: ethtool -h for PHY tunables
> >     ethtool --set-phy-tunable DEVNAME      Set PHY tunable
> >                 [ loopback off|near|far|extn ]
> >     ethtool --get-phy-tunable DEVNAME      Get PHY tunable
> >                 [ loopback ]
> >
> > Ethtool ex:
> >   ethtool --set-phy-tunable eth0 loopback near
> >   ethtool --set-phy-tunable eth0 loopback far
> >   ethtool --set-phy-tunable eth0 loopback extn
> >   ethtool --set-phy-tunable eth0 loopback off
> >
> >   ethtool --get-phy-tunable eth0 loopback
> >
> > Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> > Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
> > ---
> 
> > +Near-End Loopback:
> > +Transmitted data (TXD) is looped back in the PCS block onto the receive data
> > +signal (RXD). When Near-End loopback enable, no data is transmitted over
> > +the network. no data receive from the network.
> 
> This is also known as the local loopback test mode, right?
Yes - Traffic transmitted/generated by the host list loopback to the
host instead of transmitting it on the wire.

> > +
> > +Far-End Loopback:
> > +This loopback is a special test mode to allow testing the PHY from link
> > +partner side. In this mode data that is received from the link partner pass
> > +through the PHY's receiver, looped back on the MII and transmitted back to
> > +the link partner.
> 
> And this is the remote loopback mode.
Yes - Traffic receiwed on the "wire" is transmitted back on the wire.

> > +
> > +External Loopback:
> > +An RJ45 loopback cable can be used to route the transmit signals an the
> > +output of the trnsformer back to the receiver inputs and this loopback will
> > +work at either 10 or 100 or 1000 Mbps speed.
> > +RJ45 Loopback cable created by conncting pin 1 to pin 3 and connecting pin
> > +2 to pin 6.
> 
> OK, this name makes sense to me, but for the two other names, we need to
> use a terminology that is clearer to the reader and/or people familiar
> and targeted at using this feature (e.g: in a lab or during manufacturing).
Sure, we can find better names and/or improve the documentation. But
before jumping to that, then it is properly a good idea to agree on
the overall concept.

We will get back to the naming when we agree on the other parts.

/Allan

^ permalink raw reply

* Re: [PATCH net-next] bpf: samples: Fix compile of test_lru_dist.c
From: Martin Lau @ 2016-11-28 19:36 UTC (permalink / raw)
  To: Daniel Borkmann, David Ahern, netdev@vger.kernel.org; +Cc: Alexei Starovoitov
In-Reply-To: <583BF623.7030103@iogearbox.net>


On Mon, Nov 28, 2016 at 10:17:23AM +0100, Daniel Borkmann wrote:
> On 11/28/2016 05:32 AM, David Ahern wrote:
> >Build of samples/bpf on debian/jessie fails with:
> >
> >   HOSTCC  /home/dsa/kernel-3.git/samples/bpf/test_lru_dist.o
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c: In function ‘main’:
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: error: variable ‘r’ has initializer but incomplete type
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >          ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:21: error: ‘RLIM_INFINITY’ undeclared (first use in this function)
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >                      ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:21: note: each undeclared identifier is reported only once for each function it appears in
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: excess elements in struct initializer
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >          ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: (near initialization for ‘r’)
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: excess elements in struct initializer
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: (near initialization for ‘r’)
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:16: error: storage size of ‘r’ isn’t known
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >
> >Add sys/resource.h to the include list
> >
> >Fixes: 5db58faf989f ("bpf: Add tests for the LRU bpf_htab")
> >Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> >Cc: Martin KaFai Lau <kafai@fb.com>
>
> Ran into the same issue, fixed here already:
>
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e00c7b216f34444252f3771f7d4ed48d4f032636

Thanks for David's report/patch and the earlier Daniel's fix.  Sorry that I
missed it since it compiles fine in my current dev setup (CentOS6).  I
can also reproduce in my newer arch-linux setup.

--Martin

^ permalink raw reply

* [PATCH net 0/5] l2tp: fixes for l2tp_ip and l2tp_ip6 socket handling
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston

This series addresses problems found while working on commit 32c231164b76
("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").

The first three patches fix races in socket's connect, recv and bind
operations. The last two ones fix scenarios where l2tp fails to
correctly lookup its userspace sockets.

Apart from the last patch, which is l2tp_ip6 specific, every patch fixes
the same problem in the L2TP IPv4 and IPv6 code.

Guillaume Nault (5):
  l2tp: lock socket before checking flags in connect()
  l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
  l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
  l2tp: fix lookup for sockets not bound to a device in l2tp_ip
  l2tp: fix address test in __l2tp_ip6_bind_lookup()

 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 ++-
 net/l2tp/l2tp_ip.c  | 61 +++++++++++++++++++++--------------------
 net/l2tp/l2tp_ip6.c | 79 ++++++++++++++++++++++++++++-------------------------
 4 files changed, 79 insertions(+), 67 deletions(-)

-- 
2.10.2

^ permalink raw reply

* [PATCH net 2/5] l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 11 ++++++-----
 net/l2tp/l2tp_ip6.c | 11 ++++++-----
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 1f57094..4d1c942 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -183,14 +183,15 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 
 		read_lock_bh(&l2tp_ip_lock);
 		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index af9abff..e3fc778 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -198,14 +198,15 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		read_lock_bh(&l2tp_ip6_lock);
 		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
 					    0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip6_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip6_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 1/5] l2tp: lock socket before checking flags in connect()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 +++-
 net/l2tp/l2tp_ip.c  | 19 ++++++++++++-------
 net/l2tp/l2tp_ip6.c | 16 +++++++++++-----
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8fed1cd..f11ca83 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -970,6 +970,8 @@ int compat_ipv6_setsockopt(struct sock *sk, int level, int optname,
 int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen);
 
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *addr,
+			   int addr_len);
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 37874e2..ccf4055 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -139,7 +139,8 @@ void ip6_datagram_release_cb(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
 
-static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
+			   int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock	*inet = inet_sk(sk);
@@ -252,6 +253,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(__ip6_datagram_connect);
 
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 982f6c4..1f57094 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -308,21 +308,24 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	struct sockaddr_l2tpip *lsa = (struct sockaddr_l2tpip *) uaddr;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
 	if (ipv4_is_multicast(lsa->l2tp_addr.s_addr))
 		return -EINVAL;
 
-	rc = ip4_datagram_connect(sk, uaddr, addr_len);
-	if (rc < 0)
-		return rc;
-
 	lock_sock(sk);
 
+	/* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip4_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip_lock);
@@ -330,7 +333,9 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	write_unlock_bh(&l2tp_ip_lock);
 
+out_sk:
 	release_sock(sk);
+
 	return rc;
 }
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9978d01..af9abff 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -371,9 +371,6 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	int	addr_type;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
@@ -390,10 +387,18 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 			return -EINVAL;
 	}
 
-	rc = ip6_datagram_connect(sk, uaddr, addr_len);
-
 	lock_sock(sk);
 
+	 /* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip6_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip6_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip6_lock);
@@ -401,6 +406,7 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	write_unlock_bh(&l2tp_ip6_lock);
 
+out_sk:
 	release_sock(sk);
 
 	return rc;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 3/5] l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

It's not enough to check for sockets bound to same address at the
beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
time, a socket with the same address could be bound before we take
the l2tp lock again.

This patch moves the lookup right before inserting the new socket, so
that no change can ever happen to the list between address lookup and
socket insertion.

Care is taken to avoid side effects on the socket in case of failure.
That is, modifications of the socket are done after the lookup, when
binding is guaranteed to succeed, and before releasing the l2tp lock,
so that concurrent lookups will always see fully initialised sockets.

For l2tp_ip6, the lookup is now always done with the correct bound
device. Before this patch, when binding to a link-local address, the
lookup was done with the original sk->sk_bound_dev_if, which was later
overwritten with addr->l2tp_scope_id. Lookup is now performed with the
final sk->sk_bound_dev_if value.

Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
already been checked at this point (this part of the code seems to have
been copy-pasted from net/ipv6/raw.c).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 25 ++++++++++---------------
 net/l2tp/l2tp_ip6.c | 43 ++++++++++++++++++++-----------------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4d1c942..ea818f3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -257,14 +257,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr->l2tp_family != AF_INET)
 		return -EINVAL;
 
-	ret = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip_lock);
-	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
-				  sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-
-	read_unlock_bh(&l2tp_ip_lock);
-
 	lock_sock(sk);
 	if (!sock_flag(sk, SOCK_ZAPPED))
 		goto out;
@@ -282,14 +274,22 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		inet->inet_rcv_saddr = inet->inet_saddr = addr->l2tp_addr.s_addr;
 	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
 		inet->inet_saddr = 0;  /* Use device */
-	sk_dst_reset(sk);
 
+	write_lock_bh(&l2tp_ip_lock);
+	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
+				  sk->sk_bound_dev_if, addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip_lock);
+		ret = -EADDRINUSE;
+		goto out;
+	}
+
+	sk_dst_reset(sk);
 	l2tp_ip_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip_lock);
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip_lock);
+
 	ret = 0;
 	sock_reset_flag(sk, SOCK_ZAPPED);
 
@@ -297,11 +297,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	release_sock(sk);
 
 	return ret;
-
-out_in_use:
-	read_unlock_bh(&l2tp_ip_lock);
-
-	return ret;
 }
 
 static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index e3fc778..5f2ae61 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -267,6 +267,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_l2tpip6 *addr = (struct sockaddr_l2tpip6 *) uaddr;
 	struct net *net = sock_net(sk);
 	__be32 v4addr = 0;
+	int bound_dev_if;
 	int addr_type;
 	int err;
 
@@ -285,13 +286,6 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr_type & IPV6_ADDR_MULTICAST)
 		return -EADDRNOTAVAIL;
 
-	err = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip6_lock);
-	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr,
-				   sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-	read_unlock_bh(&l2tp_ip6_lock);
-
 	lock_sock(sk);
 
 	err = -EINVAL;
@@ -301,28 +295,25 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (sk->sk_state != TCP_CLOSE)
 		goto out_unlock;
 
+	bound_dev_if = sk->sk_bound_dev_if;
+
 	/* Check if the address belongs to the host. */
 	rcu_read_lock();
 	if (addr_type != IPV6_ADDR_ANY) {
 		struct net_device *dev = NULL;
 
 		if (addr_type & IPV6_ADDR_LINKLOCAL) {
-			if (addr_len >= sizeof(struct sockaddr_in6) &&
-			    addr->l2tp_scope_id) {
-				/* Override any existing binding, if another
-				 * one is supplied by user.
-				 */
-				sk->sk_bound_dev_if = addr->l2tp_scope_id;
-			}
+			if (addr->l2tp_scope_id)
+				bound_dev_if = addr->l2tp_scope_id;
 
 			/* Binding to link-local address requires an
-			   interface */
-			if (!sk->sk_bound_dev_if)
+			 * interface.
+			 */
+			if (!bound_dev_if)
 				goto out_unlock_rcu;
 
 			err = -ENODEV;
-			dev = dev_get_by_index_rcu(sock_net(sk),
-						   sk->sk_bound_dev_if);
+			dev = dev_get_by_index_rcu(sock_net(sk), bound_dev_if);
 			if (!dev)
 				goto out_unlock_rcu;
 		}
@@ -337,13 +328,22 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	rcu_read_unlock();
 
-	inet->inet_rcv_saddr = inet->inet_saddr = v4addr;
+	write_lock_bh(&l2tp_ip6_lock);
+	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr, bound_dev_if,
+				   addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip6_lock);
+		err = -EADDRINUSE;
+		goto out_unlock;
+	}
+
+	inet->inet_saddr = v4addr;
+	inet->inet_rcv_saddr = v4addr;
+	sk->sk_bound_dev_if = bound_dev_if;
 	sk->sk_v6_rcv_saddr = addr->l2tp_addr;
 	np->saddr = addr->l2tp_addr;
 
 	l2tp_ip6_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip6_lock);
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip6_lock);
@@ -356,10 +356,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rcu_read_unlock();
 out_unlock:
 	release_sock(sk);
-	return err;
 
-out_in_use:
-	read_unlock_bh(&l2tp_ip6_lock);
 	return err;
 }
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 4/5] l2tp: fix lookup for sockets not bound to a device in l2tp_ip
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

When looking up an l2tp socket, we must consider a null netdevice id as
wild card. There are currently two problems caused by
__l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0:

  * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
    never receives any packet. Since __l2tp_ip_bind_lookup() is called
    with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
    different from 'dif' so the socket doesn't match.

  * Two sockets, one bound to a device but not the other, can be bound
    to the same address. If the first socket binding to the address is
    the one that is also bound to a device, the second socket can bind
    to the same address without __l2tp_ip_bind_lookup() noticing the
    overlap.

To fix this issue, we need to consider that any null device index, be
it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
We also need to pass the input device index to __l2tp_ip_bind_lookup()
on reception so that sockets bound to a device never receive packets
from other devices.

This patch fixes l2tp_ip6 in the same way.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 6 ++++--
 net/l2tp/l2tp_ip6.c | 7 ++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index ea818f3..aa42e10 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -61,7 +61,8 @@ static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -182,7 +183,8 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 		struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
 
 		read_lock_bh(&l2tp_ip_lock);
-		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		sk = __l2tp_ip_bind_lookup(net, iph->daddr, inet_iif(skb),
+					   tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip_lock);
 			goto discard;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 5f2ae61..4a86440 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -73,7 +73,8 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(addr && ipv6_addr_equal(addr, laddr)) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -196,8 +197,8 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		struct ipv6hdr *iph = ipv6_hdr(skb);
 
 		read_lock_bh(&l2tp_ip6_lock);
-		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
-					    0, tunnel_id);
+		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, inet6_iif(skb),
+					    tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip6_lock);
 			goto discard;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 5/5] l2tp: fix address test in __l2tp_ip6_bind_lookup()
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
matches if addr is NULL or if addr != laddr.
But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with
the same address, so the ipv6_addr_equal() condition needs to be
inverted.

For better clarity and consistency with the rest of the expression, the
(!X || X == Y) notation is used instead of !(X && X != Y).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 4a86440..aa821cb 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -72,7 +72,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,

 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
-		    !(addr && ipv6_addr_equal(addr, laddr)) &&
+		    (!addr || ipv6_addr_equal(addr, laddr)) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net] bpf: fix states equal logic for varlen access
From: Josef Bacik @ 2016-11-28 19:44 UTC (permalink / raw)
  To: davem, netdev, daniel, ast, jannh

If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 kernel/bpf/verifier.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 89f787c..2c8a688 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 {
 	struct bpf_reg_state *rold, *rcur;
 	int i;
+	bool map_access = env->varlen_map_value_access;
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		rold = &old->regs[i];
@@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
 		/* If the ranges were not the same, but everything else was and
 		 * we didn't do a variable access into a map then we are a-ok.
 		 */
-		if (!env->varlen_map_value_access &&
+		if (!map_access &&
 		    rold->type == rcur->type && rold->imm == rcur->imm)
 			continue;
 
+		/* If we didn't map access then again we don't care about the
+		 * mismatched range values and it's ok if our old type was
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 */
 		if (rold->type == NOT_INIT ||
-		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
+		    (!map_access && (rold->type == UNKNOWN_VALUE &&
+				     rcur->type != NOT_INIT)))
 			continue;
 
 		if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next v2 3/4] Documentation: net: phy: Add blurb about RGMII
From: Florian Fainelli @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Mason, Sebastian Frias, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <b6a19ca6-8bda-8573-eb7b-9ed7f80ca480@free.fr>

On 11/28/2016 11:15 AM, Mason wrote:
> On 28/11/2016 18:43, Florian Fainelli wrote:
> 
>> On 11/28/2016 02:34 AM, Sebastian Frias wrote:
>>
>>> For what is worth, the Atheros at803x driver comes with RX delay enabled
>>> as per HW reset.
>>
>> Always, or is this a behavior possibly defined via a stra-pin that can
>> be changed?
> 
> Here's the data sheet:
> 
>   http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf
> 
> 4.2.25 rgmii rx clock delay control
> Offset: 0x00
> bit 15: Control bit for rgmii interface rx clock delay:
> 1 = rgmii rx clock delay enable
> 0 = rgmii rx clock delay disable
> HW Rst. 1
> SW Rst. 1
> 
> As far as I can tell, rx delay is enabled by default, always.
> 
> The "Features" list mentions
> "RGMII timing modes support internal delay and external delay on Rx path"
> (Not sure about the internal vs external distinction.)
> 
> Table 3-6. RGMII AC Characteristics — no Internal Delay
> Table 3-7. RGMII AC Characteristics — with internal delay added (Default)
> 
> Delay is 2 ns apparently.
> 
> There's also
> 4.2.27 Hib ctrl and rgmii gtx clock delay register
> Offset: 0x0B
> 
> bits 6:5 Gtx_dly_val
> Select the delay of gtx_clk.
> 00:0.25ns
> 01:1.3ns
> 10:2.4ns
> 11:3.4ns
> 
> I don't know what that is used for.
> Maybe this is the external vs internal delay?

First, this has little to do with the initial patch series being
discussed now, and second, this still looks like an internal delay
programming, just that it applies to the received transmit clock from
the MAC, that's how I read it though.
-- 
Florian

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Hannes Frederic Sowa, David Miller, Tom Herbert,
	Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <CACT4Y+Yr6UJxXosjx2FO=KtxGr40X07_UBiscs1BP1s8cTaqAA@mail.gmail.com>

On Mon, 2016-11-28 at 20:34 +0100, Dmitry Vyukov wrote:
> On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller

> > Hi Eric,
> >
> > As far as I can see, skb_network_offset() becomes negative after
> > pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> > At least I'm able to detect that with a BUG_ON().
> >
> > Also it seems that the issue is only reproducible (at least with the
> > poc I provided) for a short time after boot.
> 
> 
> Eric,
> 
> Is it enough to debug? Or maybe Andrey can trace some values for you.

Well, now we are talking, if you tell me how many modules you load, it
might help ;)

nf_ct_frag6_queue is nowhere to be seen in my kernels, that might
explain why I could not reproduce the bug.

Let me try ;)

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Lance Richardson @ 2016-11-28 19:49 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Sven-Haegar Koch, Eli Cooper, netdev, Eric Dumazet
In-Reply-To: <1428298236.4223309.1480355647336.JavaMail.zimbra@redhat.com>

> From: "Lance Richardson" <lrichard@redhat.com>
> To: "Stephen Rothwell" <sfr@canb.auug.org.au>
> Cc: "Sven-Haegar Koch" <haegar@sdinet.de>, "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric Dumazet"
> <eric.dumazet@gmail.com>
> Sent: Monday, November 28, 2016 12:54:07 PM
> Subject: Re: Large performance regression with 6in4 tunnel (sit)
> 
> > From: "Stephen Rothwell" <sfr@canb.auug.org.au>
> > To: "Sven-Haegar Koch" <haegar@sdinet.de>
> > Cc: "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric
> > Dumazet" <eric.dumazet@gmail.com>
> > Sent: Saturday, November 26, 2016 10:23:40 PM
> > Subject: Re: Large performance regression with 6in4 tunnel (sit)
> > 
> > Hi Sven-Haegar,
> > 
> > On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch
> > <haegar@sdinet.de>
> > wrote:
> > >
> > > Somehow this problem description really reminds me of a report on
> > > netdev a bit ago, which the following patch fixed:
> > > 
> > > commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
> > > Author: Lance Richardson <lrichard@redhat.com>
> > > Date:   Wed Nov 2 16:36:17 2016 -0400
> > > 
> > >     ipv4: allow local fragmentation in ip_finish_output_gso()
> > >     
> > >     Some configurations (e.g. geneve interface with default
> > >     MTU of 1500 over an ethernet interface with 1500 MTU) result
> > >     in the transmission of packets that exceed the configured MTU.
> > >     While this should be considered to be a "bad" configuration,
> > >     it is still allowed and should not result in the sending
> > >     of packets that exceed the configured MTU.
> > > 
> > > Could this be related?
> > > 
> > > I suppose it would be difficult to test this patch on this machine?
> > 
> > The kernel I am running on is based on 4.7.8, so the above patch
> > doesn't come close to applying. Most fo what it is reverting was
> > introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> > bit to inet_skb_parm.flags") in v4.8-rc1.
> > 
> > --
> > Cheers,
> > Stephen Rothwell
> > 
> 
> This should be equivalent for 4.7.x:
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4bd4921..8a253e2 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -224,8 +224,7 @@ static int ip_finish_output_gso(struct net *net, struct
> sock *sk,
>         int ret = 0;
>  
>         /* common case: locally created skb or seglen is <= mtu */
> -       if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -             skb_gso_network_seglen(skb) <= mtu)
> +       if (skb_gso_network_seglen(skb) <= mtu)
>                 return ip_finish_output2(net, sk, skb);
>  
>         /* Slowpath -  GSO segment length is exceeding the dst MTU.
> 

BTW, I do think this would be worth trying. For the geneve case, I
measured on the order of a 10X-100X performance hit without this
patch, traces were similar to what you describe (too-large gso packets
were dropped, corresponding TCP segments were retransmitted later via
a non-gso code path).

Regards,

   Lance

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357935.18162.76.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:32:15 -0800

> On Mon, 2016-11-28 at 10:17 -0800, Eric Dumazet wrote:
> 
>> I was referring to Stephen bug report.
>> 
>> It appears that Eli changelog was not very precise, because his bug was
>> because of XFRM being involved.
>> 
>> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
>> 
>> So XFRM calls skb_gso_segment() before ip_output() or
>> ip6_finish_output2() had a chance to change skb->protocol
> 
> So maybe the real fix would be to set skb->protocol at the right place,
> before xfrm can be called, instead of chasing all skb producers ;)

And the key for this would be dst_output() invocations.

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: Alexei Starovoitov @ 2016-11-28 20:06 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-2-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
> Code move only; no functional change intended.

not quite...

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
>   * @sk: The socken sending or receiving traffic
> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>  
>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>  	if (prog) {
> -		unsigned int offset = skb->data - skb_network_header(skb);
> -
> -		__skb_push(skb, offset);
> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
> -		__skb_pull(skb, offset);
> +		switch (type) {
> +		case BPF_CGROUP_INET_INGRESS:
> +		case BPF_CGROUP_INET_EGRESS:
> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
> +			break;

hmm. what's a point of double jump table? It's only burning cycles
in the fast path. We already have
prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
Could you do a variant of __cgroup_bpf_run_filter() instead ?
That doesnt't take 'skb' as an argument.
It will also solve scary looking NULL skb from patch 2:
__cgroup_bpf_run_filter(sk, NULL, ...

and to avoid copy-pasting first dozen lines of current
__cgroup_bpf_run_filter can be moved into a helper that
__cgroup_bpf_run_filter_skb and
__cgroup_bpf_run_filter_sk will call.
Or some other way to rearrange that code.

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357043.18162.70.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:17:23 -0800

> It appears that Eli changelog was not very precise, because his bug was
> because of XFRM being involved.
> 
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> So XFRM calls skb_gso_segment() before ip_output() or
> ip6_finish_output2() had a chance to change skb->protocol

I think if we do it in ip6_local_out and ip_local_out, then none of
these hacks will be necessary at all.

^ permalink raw reply

* Re: [PATCH net-next 1/1] driver: ipvlan: Add the sanity check for ipvlan mode
From: David Miller @ 2016-11-28 20:08 UTC (permalink / raw)
  To: maheshb; +Cc: fgao, edumazet, netdev, gfree.wind
In-Reply-To: <CAF2d9ji2_JhiNv9P9aUt7wV2Vfjv+z_q-MWSKKH8tzO_1THobg@mail.gmail.com>

From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com>
Date: Mon, 28 Nov 2016 11:02:45 -0800

> On Mon, Nov 28, 2016 at 5:23 AM, <fgao@ikuai8.com> wrote:
> 
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> The ipvlan mode variable "nval" is from userspace, so the ipvlan codes
>> should check if the mode variable "nval" is valid.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>>  drivers/net/ipvlan/ipvlan_main.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_
>> main.c
>> index ab90b22..537b5a9 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -65,6 +65,9 @@ static int ipvlan_set_port_mode(struct ipvl_port *port,
>> u16 nval)
>>         struct net_device *mdev = port->dev;
>>         int err = 0;
>>
>> +       if (nval >= IPVLAN_MODE_MAX)
>> +               return -EINVAL;
>> +
>>
> I'm curious to know how you encountered this issue? The values are
> validated in ipvlan_nl_validate() and it should fail at that time itself.

I'm not applying this without at least a better explanation.

^ permalink raw reply

* [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Woojung.Huh @ 2016-11-28 20:03 UTC (permalink / raw)
  To: davem, f.fainelli, andrew; +Cc: netdev, UNGLinuxDriver

From: Woojung Huh <woojung.huh@microchip.com>

Add LAN7801 MAC-only configuration support.

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
---
 drivers/net/usb/Kconfig   |   5 +++
 drivers/net/usb/lan78xx.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/net/usb/lan78xx.h |  14 ++++++
 3 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index cdde590..3dd490f5 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -114,6 +114,11 @@ config USB_LAN78XX
 	help
 	  This option adds support for Microchip LAN78XX based USB 2
 	  & USB 3 10/100/1000 Ethernet adapters.
+	  LAN7800 : USB 3 to 10/100/1000 Ethernet adapter
+	  LAN7850 : USB 2 to 10/100/1000 Ethernet adapter
+	  LAN7801 : USB 3 to 10/100/1000 Ethernet adapter (MAC only)
+
+	  Proper PHY driver is required for LAN7801.
 
 	  To compile this driver as a module, choose M here: the
 	  module will be called lan78xx.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0c459e9..08f2895 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -40,7 +40,7 @@
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
 #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME	"lan78xx"
-#define DRIVER_VERSION	"1.0.5"
+#define DRIVER_VERSION	"1.0.6"
 
 #define TX_TIMEOUT_JIFFIES		(5 * HZ)
 #define THROTTLE_JIFFIES		(HZ / 8)
@@ -67,6 +67,7 @@
 #define LAN78XX_USB_VENDOR_ID		(0x0424)
 #define LAN7800_USB_PRODUCT_ID		(0x7800)
 #define LAN7850_USB_PRODUCT_ID		(0x7850)
+#define LAN7801_USB_PRODUCT_ID		(0x7801)
 #define LAN78XX_EEPROM_MAGIC		(0x78A5)
 #define LAN78XX_OTP_MAGIC		(0x78F3)
 
@@ -400,6 +401,21 @@ struct lan78xx_net {
 	struct irq_domain_data	domain_data;
 };
 
+/* define external phy id */
+#define	PHY_LAN8835			(0x0007C130)
+#define	PHY_KSZ9031RNX			(0x00221620)
+
+/* phyid : masked external phy id
+ * pre_config : if needed, configure MAC and/or external PHY
+ *		such as irq pin mux and RGMII timing.
+ */
+struct ext_phy_config_table {
+	int phyid;
+	void (*pre_config)(struct lan78xx_net *dev,
+			   struct phy_device *phydev,
+			   phy_interface_t *interface);
+};
+
 /* use ethtool to change the level for any given device */
 static int msg_level = -1;
 module_param(msg_level, int, 0);
@@ -1697,6 +1713,7 @@ static int lan78xx_mdiobus_read(struct mii_bus *bus, int phy_id, int idx)
 done:
 	mutex_unlock(&dev->phy_mutex);
 	usb_autopm_put_interface(dev->intf);
+
 	return ret;
 }
 
@@ -1759,6 +1776,10 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
 		/* set to internal PHY id */
 		dev->mdiobus->phy_mask = ~(1 << 1);
 		break;
+	case ID_REV_CHIP_ID_7801_:
+		/* scan thru PHYAD[2..0] */
+		dev->mdiobus->phy_mask = ~(0xFF);
+		break;
 	}
 
 	ret = mdiobus_register(dev->mdiobus);
@@ -1933,11 +1954,58 @@ static void lan78xx_remove_irq_domain(struct lan78xx_net *dev)
 	dev->domain_data.irqdomain = NULL;
 }
 
+static void lan8835_pre_config(struct lan78xx_net *dev,
+			       struct phy_device *phydev,
+			       phy_interface_t *interface)
+{
+	int buf;
+	int ret;
+
+	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
+	buf = phy_read_mmd_indirect(phydev, 32784, 3);
+	buf &= ~0x1800;
+	buf |= 0x0800;
+	phy_write_mmd_indirect(phydev, 32784, 3, buf);
+
+	/* RGMII MAC TXC Delay Enable */
+	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+				MAC_RGMII_ID_TXC_DELAY_EN_);
+
+	/* RGMII TX DLL Tune Adjust */
+	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
+}
+
+static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
+				  struct phy_device *phydev,
+				  phy_interface_t *interface)
+{
+	/* Micrel9301RNX PHY configuration */
+	/* RGMII Control Signal Pad Skew */
+	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
+	/* RGMII RX Data Pad Skew */
+	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
+	/* RGMII RX Clock Pad Skew */
+	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
+}
+
+/* external phyid & configure routine for LAN7801 */
+static struct ext_phy_config_table ext_phy_table[] = {
+	{ PHY_LAN8835, lan8835_pre_config },
+	{ PHY_KSZ9031RNX, ksz9031rnx_pre_config },
+	{}
+};
+
 static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
 	int ret;
 	u32 mii_adv;
+	u32 masked_phyid;
 	struct phy_device *phydev = dev->net->phydev;
+	phy_interface_t interface;
 
 	phydev = phy_find_first(dev->mdiobus);
 	if (!phydev) {
@@ -1945,6 +2013,37 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 		return -EIO;
 	}
 
+	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
+	    (dev->chipid == ID_REV_CHIP_ID_7850_)) {
+		phydev->is_internal = true;
+		interface = PHY_INTERFACE_MODE_GMII;
+
+	} else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+		int i;
+
+		if (!phydev->drv) {
+			netdev_err(dev->net, "no PHY driver found\n");
+			return -EIO;
+		}
+
+		masked_phyid = phydev->phy_id & phydev->drv->phy_id_mask;
+		interface = PHY_INTERFACE_MODE_RGMII;
+
+		for (i = 0; ext_phy_table[i].phyid != 0; i++) {
+			if ((masked_phyid == ext_phy_table[i].phyid) &&
+			    (ext_phy_table[i].pre_config)) {
+				ext_phy_table[i].pre_config(dev,
+							    phydev,
+							    &interface);
+			}
+		}
+
+		phydev->is_internal = false;
+	} else {
+		netdev_err(dev->net, "unknown ID found\n");
+		return -EIO;
+	}
+
 	/* if phyirq is not set, use polling mode in phylib */
 	if (dev->domain_data.phyirq > 0)
 		phydev->irq = dev->domain_data.phyirq;
@@ -1957,7 +2056,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 
 	ret = phy_connect_direct(dev->net, phydev,
 				 lan78xx_link_status_change,
-				 PHY_INTERFACE_MODE_GMII);
+				 interface);
 	if (ret) {
 		netdev_err(dev->net, "can't attach PHY to %s\n",
 			   dev->mdiobus->id);
@@ -2338,6 +2437,9 @@ static int lan78xx_reset(struct lan78xx_net *dev)
 	} while ((buf & PMT_CTL_PHY_RST_) || !(buf & PMT_CTL_READY_));
 
 	ret = lan78xx_read_reg(dev, MAC_CR, &buf);
+	/* LAN7801 only has RGMII mode */
+	if (dev->chipid == ID_REV_CHIP_ID_7801_)
+		buf &= ~MAC_CR_GMII_EN_;
 	buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
 	ret = lan78xx_write_reg(dev, MAC_CR, buf);
 
@@ -2466,6 +2568,7 @@ static int lan78xx_stop(struct net_device *net)
 
 	phy_stop(net->phydev);
 	phy_disconnect(net->phydev);
+
 	net->phydev = NULL;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
@@ -3887,6 +3990,10 @@ static const struct usb_device_id products[] = {
 	/* LAN7850 USB Gigabit Ethernet Device */
 	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7850_USB_PRODUCT_ID),
 	},
+	{
+	/* LAN7801 USB Gigabit Ethernet Device */
+	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7801_USB_PRODUCT_ID),
+	},
 	{},
 };
 MODULE_DEVICE_TABLE(usb, products);
diff --git a/drivers/net/usb/lan78xx.h b/drivers/net/usb/lan78xx.h
index 4092790..25aa546 100644
--- a/drivers/net/usb/lan78xx.h
+++ b/drivers/net/usb/lan78xx.h
@@ -108,6 +108,7 @@
 #define ID_REV_CHIP_REV_MASK_		(0x0000FFFF)
 #define ID_REV_CHIP_ID_7800_		(0x7800)
 #define ID_REV_CHIP_ID_7850_		(0x7850)
+#define ID_REV_CHIP_ID_7801_		(0x7801)
 
 #define FPGA_REV			(0x04)
 #define FPGA_REV_MINOR_MASK_		(0x0000FF00)
@@ -550,6 +551,7 @@
 #define LTM_INACTIVE1_TIMER10_		(0x0000FFFF)
 
 #define MAC_CR				(0x100)
+#define MAC_CR_GMII_EN_			(0x00080000)
 #define MAC_CR_EEE_TX_CLK_STOP_EN_	(0x00040000)
 #define MAC_CR_EEE_EN_			(0x00020000)
 #define MAC_CR_EEE_TLAR_EN_		(0x00010000)
@@ -787,6 +789,18 @@
 #define PHY_DEV_ID_MODEL_MASK_		(0x0FC00000)
 #define PHY_DEV_ID_OUI_MASK_		(0x003FFFFF)
 
+#define RGMII_TX_BYP_DLL		(0x708)
+#define RGMII_TX_BYP_DLL_TX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_TX_BYP_DLL_TX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_TX_BYP_DLL_TX_DLL_RESET_		(0x0000002)
+#define RGMII_TX_BYP_DLL_TX_DLL_BYPASS_		(0x0000001)
+
+#define RGMII_RX_BYP_DLL		(0x70C)
+#define RGMII_RX_BYP_DLL_RX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_RX_BYP_DLL_RX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_RX_BYP_DLL_RX_DLL_RESET_		(0x0000002)
+#define RGMII_RX_BYP_DLL_RX_DLL_BYPASS_		(0x0000001)
+
 #define OTP_BASE_ADDR			(0x00001000)
 #define OTP_ADDR_RANGE_			(0x1FF)
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net 11/16] net: ethernet: marvell: mvneta: fix fixed-link phydev leaks
From: Thomas Petazzoni @ 2016-11-28 20:10 UTC (permalink / raw)
  To: Johan Hovold
  Cc: David S. Miller, Vince Bridgers, Florian Fainelli, Fugang Duan,
	Pantelis Antoniou, Vitaly Bordug, Claudiu Manoil, Li Yang,
	Felix Fietkau, John Crispin, Matthias Brugger, Sergei Shtylyov,
	Lars Persson, Mugunthan V N, Grygorii Strashko, Rob Herring,
	Frank Rowand, Andrew Lunn, Vivien Didelot
In-Reply-To: <1480357509-28074-12-git-send-email-johan@kernel.org>

Hello,

On Mon, 28 Nov 2016 19:25:04 +0100, Johan Hovold wrote:
> Make sure to deregister and free any fixed-link PHY registered using
> of_phy_register_fixed_link() on probe errors and on driver unbind.
> 
> Fixes: 83895bedeee6 ("net: mvneta: add support for fixed links")
> Signed-off-by: Johan Hovold <johan@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 5 +++++
>  1 file changed, 5 insertions(+)

Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>

Thanks!

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net-next V2 00/10] Mellanox 100G mlx5 DCBX and ethtool updates
From: David Miller @ 2016-11-28 20:10 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <1480258932-10302-1-git-send-email-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Sun, 27 Nov 2016 17:02:02 +0200

> This series provides the following mlx5 updates:
 ...

Series applied, thank you.

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-28 20:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161128.112248.2198395724649783009.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

On 11/28/2016 05:22 PM, David Miller wrote:
> Thanks for trying to solve this problem.
> 
> But we really don't want this to be Kconfig gated.  If we decide to
> support this it should be a run-time selectable option.  Every
> distribution on the planet is going to turn your Kconfig option on, so
> whatever you think we might be saving by putting this behind Kconfig
> checks won't be realized for large swaths of the userbase.
> 

OK for that

> I also question how necessary %100 consistent hashing of ECMP really
> is.
> 
> If we can do something at extremely low cost and arrive at a very low
> disruption rate, honestly that's probably good enough.
> 
> I assume you've looked over RFC2992.  A low disrutpion algorithm is
> described there and if we could just get rid of the divide it might
> be extremely desirable.

I wanted a solution that has very low disruption for both scale up and
scale down. When a next-hop is added, only 1/N flows should be disrupted
(i.e. in this case, reaffected to the new next-hop). When a next-hop is
removed, only the flows that were handled by this particular next-hop
should be disrupted, and those flows should be equally rebalanced across
the remaining next-hops. Although the solution presented in RFC2992 has
a "reasonably" low disruption (looks like it can get up to 1/2
disruption though), I'm not sure if the equal-rebalancing part holds.

The advantage of my solution over RFC2992 is lowest possible disruption
and equal rebalancing of affected flows. The disadvantage is the lookup
complexity of O(log n) vs O(1). Although from a theoretical viewpoint
O(1) is obviously better, would O(log n) have an effectively measurable
negative impact on scalability ? If we consider 32 next-hops for a route
and 100 pseudo-random numbers generated per next-hop, the lookup
algorithm would have to perform in the worst case log2 3200 = 11
comparisons to select a next-hop for that route.

For my part I prefer the minimal disruption over constant time lookup.
I'll try to come up with some measurements to see the actual impact.

David

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-28 20:20 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20161128191832.46191e1c@jkicinski-Precision-T1700>

> Any particular reason not to allow the XDP prog being set while the
> device is closed?  You seem to preserve the program across
> close()/open() cycles so the edev->xdp_prog is alive and valid while
> device is closed IIUC.  

> I think other drivers are allowing setting XDP while closed and it
> would be cool to keep the behaviour identical across drivers :)

You're right; No reason to prevent this.
I'll fix it in v2.

^ permalink raw reply

* Re: [PATCH net-next 1/3] ethtool: (uapi) Add ETHTOOL_PHY_LOOPBACK to PHY tunables
From: Andrew Lunn @ 2016-11-28 20:21 UTC (permalink / raw)
  To: Allan W. Nielsen; +Cc: netdev, f.fainelli, raju.lakkaraju
In-Reply-To: <20161128192306.GA11474@microsemi.com>

On Mon, Nov 28, 2016 at 08:23:06PM +0100, Allan W. Nielsen wrote:
> Hi Andrew and Florian,
> 
> On 28/11/16 15:14, Andrew Lunn wrote:
> > On Mon, Nov 28, 2016 at 02:24:30PM +0100, Allan W. Nielsen wrote:
> > > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> > >
> > > 3 types of PHY loopback are supported.
> > > i.e. Near-End Loopback, Far-End Loopback and External Loopback.
> > Is this integrated with ethtool --test? You only want the PHY to go
> > into loopback mode when running ethtool --test external_lb or maybe
> > ethtool --test offline.
> There are other use-cases for enabling PHY loopback:
> 
> 1) If the PHY is connected to a switch then a loop-port is sometime
>    used to "force/enable" one or more extra pass through the switch
>    core. This "hack" can sometime be used to achieve new functionality
>    with existing silicon.

With Linux, switches are managed via switchdev, or DSA. You will have
to teach this infrastructure that something really odd is going on
with one of its ports before you do anything like this in the PHY
layer. I suggest you leave this use case alone until somebody
really-really wants it. From my knowledge of the Marvell DSA driver,
this is not easy.

> 2) Existing user-space application may expect to be able to do the
>    testing on its own (generate/validate the test traffic).

Please can you reference some existing user-space application and the
kernel API it uses to put the PHY into loopback mode?

> We are always happy to integrate with any existing functionality, but
> as I understand "ethtool --test" then intention is to perform a test
> and then bring back the PHY in to a "normal" state (I may be
> wrong...).

Correct.

> The idea with this patch is to allow configuring loopback more
> "permanently" (userspace can decide when to activate and when to
> de-activate). I should properly have made that clear in the cover
> letter.

Leaving it in loopback is a really bad idea. I've spent days once
working out why an Ethernet did not work. Turned out the PHY powered
up in loopback mode, and the embedded OS running on it did not
initialise it to sensible defaults on probe. Packets we going out,
dhcp server was replying but all incoming packets were discarded.

It is really not obvious when everything looks O.K, but nothing works,
because the PHY is in loopback. There needs to be a big red flag to
warn you.

If you really do what to do this, you should look at NETIF_F_LOOPBACK
and consider how this concept can be applied at the PHY, not the MAC.
But you need the full concept, so you see things like:

2: eth0: <PHY_LOOPBACK,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 80:ee:73:83:60:27 brd ff:ff:ff:ff:ff:ff

Humm, i've no idea how you actually enable the MAC loopback
NETIF_F_LOOPBACK represents. I don't see anything with ip link set.

     Andrew

^ permalink raw reply

* Re: Checking for MDIO phy address 0
From: Florian Fainelli @ 2016-11-28 20:25 UTC (permalink / raw)
  To: Phil Endecott, netdev
In-Reply-To: <1480359230293@dmwebmail.dmwebmail.chezphil.org>

On 11/28/2016 10:53 AM, Phil Endecott wrote:
> Dear Experts,
> 
> Is it true that phy address 0 is special, and should not be used?

It is special in that it can be made special, or not (very helpful, I
know). AFAIR, address 0 was defined a while ago (by Cisco) to be a
special broadcast address which could be used to e.g: blast the same
write to several devices. As you may imagine this can be helpful in a
switch environment where you may have to reset e.g: 48-ports.

In practice, the behavior varies a lot depending on:

- whether address 0 is potentially used to access a built-in PHY within
e.g: an Ethernet switch

- what kind of strapping options/configurations are selected for the
PHY, and whether address 0 is going to be snooped by the PHY devices's
MDIO block and how it

> I have this in my device tree (edited for brevity):
> 
>         mdio@17020000 {
>             compatible = "apm,xgene-mdio-rgmii";
>             #address-cells = <0x00000001>;
>             #size-cells = <0x00000000>;
>             phy@4 {
>                 reg = <0x00000000>;

Nit: why not make the unit address and the actual reg property match, e.g:

	phy@0 {
		reg = <0>;
	};

	phy@1 {
		reg = <1>;
	};

>                 linux,phandle = <0x00000021>;
>                 phandle = <0x00000021>;
>             };
>             phy@5 {
>                 reg = <0x00000001>;
>                 linux,phandle = <0x00000022>;
>                 phandle = <0x00000022>;
>             };
>         };
>         ethernet@1f210000 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000021>;
>         };
>         ethernet@1f210030 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000022>;
>         };
> 
> (This is on a Gigabyte MP30-AR1, which has an X-Gene ARM64 processor.)
> 
> I've spent a long time trying to get the two gigabit ethernet ports to 
> correctly auto-negotiate down to 100 Mbit, and it's possible that the 
> underlying problem is that one of the phys is using address 0.  Does 
> that make sense?

Maybe, but you are not exactly describing the problems you are seeing.

If the issue is that both PHYs are configured to respond to MDIO address
0, what could happen is that the Ethernet instance which references this
address 0 may end-up clobbering the other PHY's settings as well because
of the broadcast properties applied to this special address.

Few things for you to check:

- can you scan the entire bus range and see which addresses are valid,
outside of address 0 and 1, there is possibly another address at which
one of the two PHYs should respond, if not, check the next recommendation

- if you have access to the MV88E1512 datasheet, can you check if the
PHY is configured to respond to MDIO address 0, and what this implies
for reads and writes towards that address? Can you disable this and just
have address 0 be a normal (non-broadcast) address?

> 
> Anyway, my reason for this message is to suggest a runtime diagnostic 
> message somewhere if address 0 is used - this could have saved me a lot of 
> work! If someone can suggest the right place to add this I can prepare a 
> patch.

Address 0 can be made special or not, but there is no programmatic way
to tell without scanning the MDIO bus for devices, so I don't think a
warning is going to help at all to warn about that. There are also tons
of cases where the address is just treated like any other address, which
is typical with MDIO connected Ethernet switches where PHY@0 is just the
switch's port 0 built-in PHY for instance.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-28 20:31 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161128200641.GA7634@ast-mbp.thefacebook.com>

On 11/28/16 1:06 PM, Alexei Starovoitov wrote:
> On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
>> Code move only; no functional change intended.
> 
> not quite...
> 
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ...
>>   * @sk: The socken sending or receiving traffic
>> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>>  
>>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>>  	if (prog) {
>> -		unsigned int offset = skb->data - skb_network_header(skb);
>> -
>> -		__skb_push(skb, offset);
>> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
>> -		__skb_pull(skb, offset);
>> +		switch (type) {
>> +		case BPF_CGROUP_INET_INGRESS:
>> +		case BPF_CGROUP_INET_EGRESS:
>> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
>> +			break;
> 
> hmm. what's a point of double jump table? It's only burning cycles
> in the fast path. We already have
> prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
> Could you do a variant of __cgroup_bpf_run_filter() instead ?
> That doesnt't take 'skb' as an argument.
> It will also solve scary looking NULL skb from patch 2:
> __cgroup_bpf_run_filter(sk, NULL, ...
> 
> and to avoid copy-pasting first dozen lines of current
> __cgroup_bpf_run_filter can be moved into a helper that
> __cgroup_bpf_run_filter_skb and
> __cgroup_bpf_run_filter_sk will call.
> Or some other way to rearrange that code.
> 

sure
1. rename the existing __cgroup_bpf_run_filter to __cgroup_bpf_run_filter_skb

2. create new __cgroup_bpf_run_filter_sk for this new program type. the new run_filter does not need the family or full sock checks for this use case so the common code is only the sock_cgroup_ptr and prog lookups.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox