Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net 0/2] mlx4 bug fixes for 4.9
From: David Miller @ 2016-11-28 20:34 UTC (permalink / raw)
  To: tariqt; +Cc: netdev, eranbe, sebott, swise
In-Reply-To: <1480267252-26146-1-git-send-email-tariqt@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>
Date: Sun, 27 Nov 2016 19:20:50 +0200

> This patchset includes 2 bug fixes:
> * In patch 1 we revert the commit that avoids invoking unregister_netdev
> in shutdown flow, as it introduces netdev presence issues where
> it can be accessed unsafely by ndo operations during the flow.
> * Patch 2 is a simple fix for a variable uninitialization issue.
> 
> Series generated against net commit:
> 6998cc6ec237 tipc: resolve connection flow control compatibility problem

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v3 2/3] bpf: Add new cgroup attach type to enable sock modifications
From: Alexei Starovoitov @ 2016-11-28 20:32 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-3-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:49AM -0800, David Ahern wrote:
> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> Currently only sk_bound_dev_if is exported to userspace for modification
> by a bpf program.
> 
> This allows a cgroup to be configured such that AF_INET{6} sockets opened
> by processes are automatically bound to a specific device. In turn, this
> enables the running of programs that do not support SO_BINDTODEVICE in a
> specific VRF context / L3 domain.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1f09c521adfe..808e158742a2 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -408,7 +408,7 @@ struct bpf_prog {
>  	enum bpf_prog_type	type;		/* Type of BPF program */
>  	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
>  	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
> -	unsigned int		(*bpf_func)(const struct sk_buff *skb,
> +	unsigned int		(*bpf_func)(const void *ctx,
>  					    const struct bpf_insn *filter);

Daniel already tweaked it. pls rebase.

> +static const struct bpf_func_proto *
> +cg_sock_func_proto(enum bpf_func_id func_id)
> +{
> +	return NULL;
> +}

if you don't want any helpers, just don't set .get_func_proto.
See check_call() in verifier.
Though why not allow socket filter like helpers that
sk_filter_func_proto() provides?
tail call, bpf_trace_printk, maps are useful things that you get for free.
Developing programs without bpf_trace_printk is pretty hard.

> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 5ddf5cda07f4..24d2550492ee 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -374,8 +374,18 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
>  
>  	if (sk->sk_prot->init) {
>  		err = sk->sk_prot->init(sk);
> -		if (err)
> +		if (err) {
> +			sk_common_release(sk);
> +			goto out;
> +		}
> +	}
> +
> +	if (!kern) {
> +		err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);

i guess from vrf use case point of view this is the best place,
since so_bindtodevice can still override it,
but thinking little bit into other use case like port binding
restrictions and port rewrites can we move it into inet_bind ?
My understanding nothing will be using bound_dev_if until that
time, so we can set it there?
And at that point we can extend 'struct bpf_sock' with other
fields like port and sockaddr...
and single BPF_PROG_TYPE_CGROUP_SOCK type will be used for
vrf and port binding use cases...
More users, more testing of that code path...

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-11-28 20:32 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <583C9093.7060404@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Mon, 28 Nov 2016 21:16:19 +0100

> The advantage of my solution over RFC2992 is lowest possible disruption
> and equal rebalancing of affected flows. The disadvantage is the lookup
> complexity of O(log n) vs O(1). Although from a theoretical viewpoint
> O(1) is obviously better, would O(log n) have an effectively measurable
> negative impact on scalability ? If we consider 32 next-hops for a route
> and 100 pseudo-random numbers generated per next-hop, the lookup
> algorithm would have to perform in the worst case log2 3200 = 11
> comparisons to select a next-hop for that route.

When I was working on the routing cache removal in ipv4 I compared
using a stupid O(1) hash lookup of the FIB entries vs. the O(log n)
fib_trie stuff actually in use.

It did make a difference.

This is a lookup that can be invoked 20 million times per second or
more.

Every cycle matters.

We already have a lot of trouble getting under the cycle budget one
has for routing at wire speed for very high link rates, please don't
make it worse.

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: David Ahern @ 2016-11-28 20:31 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <20161128200641.GA7634@ast-mbp.thefacebook.com>

On 11/28/16 1:06 PM, Alexei Starovoitov wrote:
> On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
>> Code move only; no functional change intended.
> 
> not quite...
> 
>> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ...
>>   * @sk: The socken sending or receiving traffic
>> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>>  
>>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>>  	if (prog) {
>> -		unsigned int offset = skb->data - skb_network_header(skb);
>> -
>> -		__skb_push(skb, offset);
>> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
>> -		__skb_pull(skb, offset);
>> +		switch (type) {
>> +		case BPF_CGROUP_INET_INGRESS:
>> +		case BPF_CGROUP_INET_EGRESS:
>> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
>> +			break;
> 
> hmm. what's a point of double jump table? It's only burning cycles
> in the fast path. We already have
> prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
> Could you do a variant of __cgroup_bpf_run_filter() instead ?
> That doesnt't take 'skb' as an argument.
> It will also solve scary looking NULL skb from patch 2:
> __cgroup_bpf_run_filter(sk, NULL, ...
> 
> and to avoid copy-pasting first dozen lines of current
> __cgroup_bpf_run_filter can be moved into a helper that
> __cgroup_bpf_run_filter_skb and
> __cgroup_bpf_run_filter_sk will call.
> Or some other way to rearrange that code.
> 

sure
1. rename the existing __cgroup_bpf_run_filter to __cgroup_bpf_run_filter_skb

2. create new __cgroup_bpf_run_filter_sk for this new program type. the new run_filter does not need the family or full sock checks for this use case so the common code is only the sock_cgroup_ptr and prog lookups.

^ permalink raw reply

* Re: Checking for MDIO phy address 0
From: Florian Fainelli @ 2016-11-28 20:25 UTC (permalink / raw)
  To: Phil Endecott, netdev
In-Reply-To: <1480359230293@dmwebmail.dmwebmail.chezphil.org>

On 11/28/2016 10:53 AM, Phil Endecott wrote:
> Dear Experts,
> 
> Is it true that phy address 0 is special, and should not be used?

It is special in that it can be made special, or not (very helpful, I
know). AFAIR, address 0 was defined a while ago (by Cisco) to be a
special broadcast address which could be used to e.g: blast the same
write to several devices. As you may imagine this can be helpful in a
switch environment where you may have to reset e.g: 48-ports.

In practice, the behavior varies a lot depending on:

- whether address 0 is potentially used to access a built-in PHY within
e.g: an Ethernet switch

- what kind of strapping options/configurations are selected for the
PHY, and whether address 0 is going to be snooped by the PHY devices's
MDIO block and how it

> I have this in my device tree (edited for brevity):
> 
>         mdio@17020000 {
>             compatible = "apm,xgene-mdio-rgmii";
>             #address-cells = <0x00000001>;
>             #size-cells = <0x00000000>;
>             phy@4 {
>                 reg = <0x00000000>;

Nit: why not make the unit address and the actual reg property match, e.g:

	phy@0 {
		reg = <0>;
	};

	phy@1 {
		reg = <1>;
	};

>                 linux,phandle = <0x00000021>;
>                 phandle = <0x00000021>;
>             };
>             phy@5 {
>                 reg = <0x00000001>;
>                 linux,phandle = <0x00000022>;
>                 phandle = <0x00000022>;
>             };
>         };
>         ethernet@1f210000 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000021>;
>         };
>         ethernet@1f210030 {
>             compatible = "apm,xgene1-sgenet";
>             phy-connection-type = "sgmii";
>             phy-handle = <0x00000022>;
>         };
> 
> (This is on a Gigabyte MP30-AR1, which has an X-Gene ARM64 processor.)
> 
> I've spent a long time trying to get the two gigabit ethernet ports to 
> correctly auto-negotiate down to 100 Mbit, and it's possible that the 
> underlying problem is that one of the phys is using address 0.  Does 
> that make sense?

Maybe, but you are not exactly describing the problems you are seeing.

If the issue is that both PHYs are configured to respond to MDIO address
0, what could happen is that the Ethernet instance which references this
address 0 may end-up clobbering the other PHY's settings as well because
of the broadcast properties applied to this special address.

Few things for you to check:

- can you scan the entire bus range and see which addresses are valid,
outside of address 0 and 1, there is possibly another address at which
one of the two PHYs should respond, if not, check the next recommendation

- if you have access to the MV88E1512 datasheet, can you check if the
PHY is configured to respond to MDIO address 0, and what this implies
for reads and writes towards that address? Can you disable this and just
have address 0 be a normal (non-broadcast) address?

> 
> Anyway, my reason for this message is to suggest a runtime diagnostic 
> message somewhere if address 0 is used - this could have saved me a lot of 
> work! If someone can suggest the right place to add this I can prepare a 
> patch.

Address 0 can be made special or not, but there is no programmatic way
to tell without scanning the MDIO bus for devices, so I don't think a
warning is going to help at all to warn about that. There are also tons
of cases where the address is just treated like any other address, which
is typical with MDIO connected Ethernet switches where PHY@0 is just the
switch's port 0 built-in PHY for instance.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 1/3] ethtool: (uapi) Add ETHTOOL_PHY_LOOPBACK to PHY tunables
From: Andrew Lunn @ 2016-11-28 20:21 UTC (permalink / raw)
  To: Allan W. Nielsen; +Cc: netdev, f.fainelli, raju.lakkaraju
In-Reply-To: <20161128192306.GA11474@microsemi.com>

On Mon, Nov 28, 2016 at 08:23:06PM +0100, Allan W. Nielsen wrote:
> Hi Andrew and Florian,
> 
> On 28/11/16 15:14, Andrew Lunn wrote:
> > On Mon, Nov 28, 2016 at 02:24:30PM +0100, Allan W. Nielsen wrote:
> > > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> > >
> > > 3 types of PHY loopback are supported.
> > > i.e. Near-End Loopback, Far-End Loopback and External Loopback.
> > Is this integrated with ethtool --test? You only want the PHY to go
> > into loopback mode when running ethtool --test external_lb or maybe
> > ethtool --test offline.
> There are other use-cases for enabling PHY loopback:
> 
> 1) If the PHY is connected to a switch then a loop-port is sometime
>    used to "force/enable" one or more extra pass through the switch
>    core. This "hack" can sometime be used to achieve new functionality
>    with existing silicon.

With Linux, switches are managed via switchdev, or DSA. You will have
to teach this infrastructure that something really odd is going on
with one of its ports before you do anything like this in the PHY
layer. I suggest you leave this use case alone until somebody
really-really wants it. From my knowledge of the Marvell DSA driver,
this is not easy.

> 2) Existing user-space application may expect to be able to do the
>    testing on its own (generate/validate the test traffic).

Please can you reference some existing user-space application and the
kernel API it uses to put the PHY into loopback mode?

> We are always happy to integrate with any existing functionality, but
> as I understand "ethtool --test" then intention is to perform a test
> and then bring back the PHY in to a "normal" state (I may be
> wrong...).

Correct.

> The idea with this patch is to allow configuring loopback more
> "permanently" (userspace can decide when to activate and when to
> de-activate). I should properly have made that clear in the cover
> letter.

Leaving it in loopback is a really bad idea. I've spent days once
working out why an Ethernet did not work. Turned out the PHY powered
up in loopback mode, and the embedded OS running on it did not
initialise it to sensible defaults on probe. Packets we going out,
dhcp server was replying but all incoming packets were discarded.

It is really not obvious when everything looks O.K, but nothing works,
because the PHY is in loopback. There needs to be a big red flag to
warn you.

If you really do what to do this, you should look at NETIF_F_LOOPBACK
and consider how this concept can be applied at the PHY, not the MAC.
But you need the full concept, so you see things like:

2: eth0: <PHY_LOOPBACK,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 80:ee:73:83:60:27 brd ff:ff:ff:ff:ff:ff

Humm, i've no idea how you actually enable the MAC loopback
NETIF_F_LOOPBACK represents. I don't see anything with ip link set.

     Andrew

^ permalink raw reply

* Re: [PATCH net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-28 20:20 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20161128191832.46191e1c@jkicinski-Precision-T1700>

> Any particular reason not to allow the XDP prog being set while the
> device is closed?  You seem to preserve the program across
> close()/open() cycles so the edev->xdp_prog is alive and valid while
> device is closed IIUC.  

> I think other drivers are allowing setting XDP while closed and it
> would be cool to keep the behaviour identical across drivers :)

You're right; No reason to prevent this.
I'll fix it in v2.

^ permalink raw reply

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-28 20:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161128.112248.2198395724649783009.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]

On 11/28/2016 05:22 PM, David Miller wrote:
> Thanks for trying to solve this problem.
> 
> But we really don't want this to be Kconfig gated.  If we decide to
> support this it should be a run-time selectable option.  Every
> distribution on the planet is going to turn your Kconfig option on, so
> whatever you think we might be saving by putting this behind Kconfig
> checks won't be realized for large swaths of the userbase.
> 

OK for that

> I also question how necessary %100 consistent hashing of ECMP really
> is.
> 
> If we can do something at extremely low cost and arrive at a very low
> disruption rate, honestly that's probably good enough.
> 
> I assume you've looked over RFC2992.  A low disrutpion algorithm is
> described there and if we could just get rid of the divide it might
> be extremely desirable.

I wanted a solution that has very low disruption for both scale up and
scale down. When a next-hop is added, only 1/N flows should be disrupted
(i.e. in this case, reaffected to the new next-hop). When a next-hop is
removed, only the flows that were handled by this particular next-hop
should be disrupted, and those flows should be equally rebalanced across
the remaining next-hops. Although the solution presented in RFC2992 has
a "reasonably" low disruption (looks like it can get up to 1/2
disruption though), I'm not sure if the equal-rebalancing part holds.

The advantage of my solution over RFC2992 is lowest possible disruption
and equal rebalancing of affected flows. The disadvantage is the lookup
complexity of O(log n) vs O(1). Although from a theoretical viewpoint
O(1) is obviously better, would O(log n) have an effectively measurable
negative impact on scalability ? If we consider 32 next-hops for a route
and 100 pseudo-random numbers generated per next-hop, the lookup
algorithm would have to perform in the worst case log2 3200 = 11
comparisons to select a next-hop for that route.

For my part I prefer the minimal disruption over constant time lookup.
I'll try to come up with some measurements to see the actual impact.

David

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply

* Re: [PATCH net-next V2 00/10] Mellanox 100G mlx5 DCBX and ethtool updates
From: David Miller @ 2016-11-28 20:10 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <1480258932-10302-1-git-send-email-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Sun, 27 Nov 2016 17:02:02 +0200

> This series provides the following mlx5 updates:
 ...

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net 11/16] net: ethernet: marvell: mvneta: fix fixed-link phydev leaks
From: Thomas Petazzoni @ 2016-11-28 20:10 UTC (permalink / raw)
  To: Johan Hovold
  Cc: David S. Miller, Vince Bridgers, Florian Fainelli, Fugang Duan,
	Pantelis Antoniou, Vitaly Bordug, Claudiu Manoil, Li Yang,
	Felix Fietkau, John Crispin, Matthias Brugger, Sergei Shtylyov,
	Lars Persson, Mugunthan V N, Grygorii Strashko, Rob Herring,
	Frank Rowand, Andrew Lunn, Vivien Didelot
In-Reply-To: <1480357509-28074-12-git-send-email-johan@kernel.org>

Hello,

On Mon, 28 Nov 2016 19:25:04 +0100, Johan Hovold wrote:
> Make sure to deregister and free any fixed-link PHY registered using
> of_phy_register_fixed_link() on probe errors and on driver unbind.
> 
> Fixes: 83895bedeee6 ("net: mvneta: add support for fixed links")
> Signed-off-by: Johan Hovold <johan@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 5 +++++
>  1 file changed, 5 insertions(+)

Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>

Thanks!

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Woojung.Huh @ 2016-11-28 20:03 UTC (permalink / raw)
  To: davem, f.fainelli, andrew; +Cc: netdev, UNGLinuxDriver

From: Woojung Huh <woojung.huh@microchip.com>

Add LAN7801 MAC-only configuration support.

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
---
 drivers/net/usb/Kconfig   |   5 +++
 drivers/net/usb/lan78xx.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-
 drivers/net/usb/lan78xx.h |  14 ++++++
 3 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index cdde590..3dd490f5 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -114,6 +114,11 @@ config USB_LAN78XX
 	help
 	  This option adds support for Microchip LAN78XX based USB 2
 	  & USB 3 10/100/1000 Ethernet adapters.
+	  LAN7800 : USB 3 to 10/100/1000 Ethernet adapter
+	  LAN7850 : USB 2 to 10/100/1000 Ethernet adapter
+	  LAN7801 : USB 3 to 10/100/1000 Ethernet adapter (MAC only)
+
+	  Proper PHY driver is required for LAN7801.
 
 	  To compile this driver as a module, choose M here: the
 	  module will be called lan78xx.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0c459e9..08f2895 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -40,7 +40,7 @@
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
 #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME	"lan78xx"
-#define DRIVER_VERSION	"1.0.5"
+#define DRIVER_VERSION	"1.0.6"
 
 #define TX_TIMEOUT_JIFFIES		(5 * HZ)
 #define THROTTLE_JIFFIES		(HZ / 8)
@@ -67,6 +67,7 @@
 #define LAN78XX_USB_VENDOR_ID		(0x0424)
 #define LAN7800_USB_PRODUCT_ID		(0x7800)
 #define LAN7850_USB_PRODUCT_ID		(0x7850)
+#define LAN7801_USB_PRODUCT_ID		(0x7801)
 #define LAN78XX_EEPROM_MAGIC		(0x78A5)
 #define LAN78XX_OTP_MAGIC		(0x78F3)
 
@@ -400,6 +401,21 @@ struct lan78xx_net {
 	struct irq_domain_data	domain_data;
 };
 
+/* define external phy id */
+#define	PHY_LAN8835			(0x0007C130)
+#define	PHY_KSZ9031RNX			(0x00221620)
+
+/* phyid : masked external phy id
+ * pre_config : if needed, configure MAC and/or external PHY
+ *		such as irq pin mux and RGMII timing.
+ */
+struct ext_phy_config_table {
+	int phyid;
+	void (*pre_config)(struct lan78xx_net *dev,
+			   struct phy_device *phydev,
+			   phy_interface_t *interface);
+};
+
 /* use ethtool to change the level for any given device */
 static int msg_level = -1;
 module_param(msg_level, int, 0);
@@ -1697,6 +1713,7 @@ static int lan78xx_mdiobus_read(struct mii_bus *bus, int phy_id, int idx)
 done:
 	mutex_unlock(&dev->phy_mutex);
 	usb_autopm_put_interface(dev->intf);
+
 	return ret;
 }
 
@@ -1759,6 +1776,10 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
 		/* set to internal PHY id */
 		dev->mdiobus->phy_mask = ~(1 << 1);
 		break;
+	case ID_REV_CHIP_ID_7801_:
+		/* scan thru PHYAD[2..0] */
+		dev->mdiobus->phy_mask = ~(0xFF);
+		break;
 	}
 
 	ret = mdiobus_register(dev->mdiobus);
@@ -1933,11 +1954,58 @@ static void lan78xx_remove_irq_domain(struct lan78xx_net *dev)
 	dev->domain_data.irqdomain = NULL;
 }
 
+static void lan8835_pre_config(struct lan78xx_net *dev,
+			       struct phy_device *phydev,
+			       phy_interface_t *interface)
+{
+	int buf;
+	int ret;
+
+	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
+	buf = phy_read_mmd_indirect(phydev, 32784, 3);
+	buf &= ~0x1800;
+	buf |= 0x0800;
+	phy_write_mmd_indirect(phydev, 32784, 3, buf);
+
+	/* RGMII MAC TXC Delay Enable */
+	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+				MAC_RGMII_ID_TXC_DELAY_EN_);
+
+	/* RGMII TX DLL Tune Adjust */
+	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
+}
+
+static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
+				  struct phy_device *phydev,
+				  phy_interface_t *interface)
+{
+	/* Micrel9301RNX PHY configuration */
+	/* RGMII Control Signal Pad Skew */
+	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
+	/* RGMII RX Data Pad Skew */
+	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
+	/* RGMII RX Clock Pad Skew */
+	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
+
+	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
+}
+
+/* external phyid & configure routine for LAN7801 */
+static struct ext_phy_config_table ext_phy_table[] = {
+	{ PHY_LAN8835, lan8835_pre_config },
+	{ PHY_KSZ9031RNX, ksz9031rnx_pre_config },
+	{}
+};
+
 static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
 	int ret;
 	u32 mii_adv;
+	u32 masked_phyid;
 	struct phy_device *phydev = dev->net->phydev;
+	phy_interface_t interface;
 
 	phydev = phy_find_first(dev->mdiobus);
 	if (!phydev) {
@@ -1945,6 +2013,37 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 		return -EIO;
 	}
 
+	if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
+	    (dev->chipid == ID_REV_CHIP_ID_7850_)) {
+		phydev->is_internal = true;
+		interface = PHY_INTERFACE_MODE_GMII;
+
+	} else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+		int i;
+
+		if (!phydev->drv) {
+			netdev_err(dev->net, "no PHY driver found\n");
+			return -EIO;
+		}
+
+		masked_phyid = phydev->phy_id & phydev->drv->phy_id_mask;
+		interface = PHY_INTERFACE_MODE_RGMII;
+
+		for (i = 0; ext_phy_table[i].phyid != 0; i++) {
+			if ((masked_phyid == ext_phy_table[i].phyid) &&
+			    (ext_phy_table[i].pre_config)) {
+				ext_phy_table[i].pre_config(dev,
+							    phydev,
+							    &interface);
+			}
+		}
+
+		phydev->is_internal = false;
+	} else {
+		netdev_err(dev->net, "unknown ID found\n");
+		return -EIO;
+	}
+
 	/* if phyirq is not set, use polling mode in phylib */
 	if (dev->domain_data.phyirq > 0)
 		phydev->irq = dev->domain_data.phyirq;
@@ -1957,7 +2056,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 
 	ret = phy_connect_direct(dev->net, phydev,
 				 lan78xx_link_status_change,
-				 PHY_INTERFACE_MODE_GMII);
+				 interface);
 	if (ret) {
 		netdev_err(dev->net, "can't attach PHY to %s\n",
 			   dev->mdiobus->id);
@@ -2338,6 +2437,9 @@ static int lan78xx_reset(struct lan78xx_net *dev)
 	} while ((buf & PMT_CTL_PHY_RST_) || !(buf & PMT_CTL_READY_));
 
 	ret = lan78xx_read_reg(dev, MAC_CR, &buf);
+	/* LAN7801 only has RGMII mode */
+	if (dev->chipid == ID_REV_CHIP_ID_7801_)
+		buf &= ~MAC_CR_GMII_EN_;
 	buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
 	ret = lan78xx_write_reg(dev, MAC_CR, buf);
 
@@ -2466,6 +2568,7 @@ static int lan78xx_stop(struct net_device *net)
 
 	phy_stop(net->phydev);
 	phy_disconnect(net->phydev);
+
 	net->phydev = NULL;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
@@ -3887,6 +3990,10 @@ static const struct usb_device_id products[] = {
 	/* LAN7850 USB Gigabit Ethernet Device */
 	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7850_USB_PRODUCT_ID),
 	},
+	{
+	/* LAN7801 USB Gigabit Ethernet Device */
+	USB_DEVICE(LAN78XX_USB_VENDOR_ID, LAN7801_USB_PRODUCT_ID),
+	},
 	{},
 };
 MODULE_DEVICE_TABLE(usb, products);
diff --git a/drivers/net/usb/lan78xx.h b/drivers/net/usb/lan78xx.h
index 4092790..25aa546 100644
--- a/drivers/net/usb/lan78xx.h
+++ b/drivers/net/usb/lan78xx.h
@@ -108,6 +108,7 @@
 #define ID_REV_CHIP_REV_MASK_		(0x0000FFFF)
 #define ID_REV_CHIP_ID_7800_		(0x7800)
 #define ID_REV_CHIP_ID_7850_		(0x7850)
+#define ID_REV_CHIP_ID_7801_		(0x7801)
 
 #define FPGA_REV			(0x04)
 #define FPGA_REV_MINOR_MASK_		(0x0000FF00)
@@ -550,6 +551,7 @@
 #define LTM_INACTIVE1_TIMER10_		(0x0000FFFF)
 
 #define MAC_CR				(0x100)
+#define MAC_CR_GMII_EN_			(0x00080000)
 #define MAC_CR_EEE_TX_CLK_STOP_EN_	(0x00040000)
 #define MAC_CR_EEE_EN_			(0x00020000)
 #define MAC_CR_EEE_TLAR_EN_		(0x00010000)
@@ -787,6 +789,18 @@
 #define PHY_DEV_ID_MODEL_MASK_		(0x0FC00000)
 #define PHY_DEV_ID_OUI_MASK_		(0x003FFFFF)
 
+#define RGMII_TX_BYP_DLL		(0x708)
+#define RGMII_TX_BYP_DLL_TX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_TX_BYP_DLL_TX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_TX_BYP_DLL_TX_DLL_RESET_		(0x0000002)
+#define RGMII_TX_BYP_DLL_TX_DLL_BYPASS_		(0x0000001)
+
+#define RGMII_RX_BYP_DLL		(0x70C)
+#define RGMII_RX_BYP_DLL_RX_TUNE_ADJ_MASK_	(0x000FC00)
+#define RGMII_RX_BYP_DLL_RX_TUNE_SEL_MASK_	(0x00003F0)
+#define RGMII_RX_BYP_DLL_RX_DLL_RESET_		(0x0000002)
+#define RGMII_RX_BYP_DLL_RX_DLL_BYPASS_		(0x0000001)
+
 #define OTP_BASE_ADDR			(0x00001000)
 #define OTP_ADDR_RANGE_			(0x1FF)
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next 1/1] driver: ipvlan: Add the sanity check for ipvlan mode
From: David Miller @ 2016-11-28 20:08 UTC (permalink / raw)
  To: maheshb; +Cc: fgao, edumazet, netdev, gfree.wind
In-Reply-To: <CAF2d9ji2_JhiNv9P9aUt7wV2Vfjv+z_q-MWSKKH8tzO_1THobg@mail.gmail.com>

From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com>
Date: Mon, 28 Nov 2016 11:02:45 -0800

> On Mon, Nov 28, 2016 at 5:23 AM, <fgao@ikuai8.com> wrote:
> 
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> The ipvlan mode variable "nval" is from userspace, so the ipvlan codes
>> should check if the mode variable "nval" is valid.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>>  drivers/net/ipvlan/ipvlan_main.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_
>> main.c
>> index ab90b22..537b5a9 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -65,6 +65,9 @@ static int ipvlan_set_port_mode(struct ipvl_port *port,
>> u16 nval)
>>         struct net_device *mdev = port->dev;
>>         int err = 0;
>>
>> +       if (nval >= IPVLAN_MODE_MAX)
>> +               return -EINVAL;
>> +
>>
> I'm curious to know how you encountered this issue? The values are
> validated in ipvlan_nl_validate() and it should fail at that time itself.

I'm not applying this without at least a better explanation.

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357043.18162.70.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:17:23 -0800

> It appears that Eli changelog was not very precise, because his bug was
> because of XFRM being involved.
> 
> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
> 
> So XFRM calls skb_gso_segment() before ip_output() or
> ip6_finish_output2() had a chance to change skb->protocol

I think if we do it in ip6_local_out and ip_local_out, then none of
these hacks will be necessary at all.

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: Alexei Starovoitov @ 2016-11-28 20:06 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-2-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
> Code move only; no functional change intended.

not quite...

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
>   * @sk: The socken sending or receiving traffic
> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>  
>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>  	if (prog) {
> -		unsigned int offset = skb->data - skb_network_header(skb);
> -
> -		__skb_push(skb, offset);
> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
> -		__skb_pull(skb, offset);
> +		switch (type) {
> +		case BPF_CGROUP_INET_INGRESS:
> +		case BPF_CGROUP_INET_EGRESS:
> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
> +			break;

hmm. what's a point of double jump table? It's only burning cycles
in the fast path. We already have
prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
Could you do a variant of __cgroup_bpf_run_filter() instead ?
That doesnt't take 'skb' as an argument.
It will also solve scary looking NULL skb from patch 2:
__cgroup_bpf_run_filter(sk, NULL, ...

and to avoid copy-pasting first dozen lines of current
__cgroup_bpf_run_filter can be moved into a helper that
__cgroup_bpf_run_filter_skb and
__cgroup_bpf_run_filter_sk will call.
Or some other way to rearrange that code.

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357935.18162.76.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:32:15 -0800

> On Mon, 2016-11-28 at 10:17 -0800, Eric Dumazet wrote:
> 
>> I was referring to Stephen bug report.
>> 
>> It appears that Eli changelog was not very precise, because his bug was
>> because of XFRM being involved.
>> 
>> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
>> 
>> So XFRM calls skb_gso_segment() before ip_output() or
>> ip6_finish_output2() had a chance to change skb->protocol
> 
> So maybe the real fix would be to set skb->protocol at the right place,
> before xfrm can be called, instead of chasing all skb producers ;)

And the key for this would be dst_output() invocations.

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Lance Richardson @ 2016-11-28 19:49 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Sven-Haegar Koch, Eli Cooper, netdev, Eric Dumazet
In-Reply-To: <1428298236.4223309.1480355647336.JavaMail.zimbra@redhat.com>

> From: "Lance Richardson" <lrichard@redhat.com>
> To: "Stephen Rothwell" <sfr@canb.auug.org.au>
> Cc: "Sven-Haegar Koch" <haegar@sdinet.de>, "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric Dumazet"
> <eric.dumazet@gmail.com>
> Sent: Monday, November 28, 2016 12:54:07 PM
> Subject: Re: Large performance regression with 6in4 tunnel (sit)
> 
> > From: "Stephen Rothwell" <sfr@canb.auug.org.au>
> > To: "Sven-Haegar Koch" <haegar@sdinet.de>
> > Cc: "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric
> > Dumazet" <eric.dumazet@gmail.com>
> > Sent: Saturday, November 26, 2016 10:23:40 PM
> > Subject: Re: Large performance regression with 6in4 tunnel (sit)
> > 
> > Hi Sven-Haegar,
> > 
> > On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch
> > <haegar@sdinet.de>
> > wrote:
> > >
> > > Somehow this problem description really reminds me of a report on
> > > netdev a bit ago, which the following patch fixed:
> > > 
> > > commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
> > > Author: Lance Richardson <lrichard@redhat.com>
> > > Date:   Wed Nov 2 16:36:17 2016 -0400
> > > 
> > >     ipv4: allow local fragmentation in ip_finish_output_gso()
> > >     
> > >     Some configurations (e.g. geneve interface with default
> > >     MTU of 1500 over an ethernet interface with 1500 MTU) result
> > >     in the transmission of packets that exceed the configured MTU.
> > >     While this should be considered to be a "bad" configuration,
> > >     it is still allowed and should not result in the sending
> > >     of packets that exceed the configured MTU.
> > > 
> > > Could this be related?
> > > 
> > > I suppose it would be difficult to test this patch on this machine?
> > 
> > The kernel I am running on is based on 4.7.8, so the above patch
> > doesn't come close to applying. Most fo what it is reverting was
> > introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> > bit to inet_skb_parm.flags") in v4.8-rc1.
> > 
> > --
> > Cheers,
> > Stephen Rothwell
> > 
> 
> This should be equivalent for 4.7.x:
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4bd4921..8a253e2 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -224,8 +224,7 @@ static int ip_finish_output_gso(struct net *net, struct
> sock *sk,
>         int ret = 0;
>  
>         /* common case: locally created skb or seglen is <= mtu */
> -       if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -             skb_gso_network_seglen(skb) <= mtu)
> +       if (skb_gso_network_seglen(skb) <= mtu)
>                 return ip_finish_output2(net, sk, skb);
>  
>         /* Slowpath -  GSO segment length is exceeding the dst MTU.
> 

BTW, I do think this would be worth trying. For the geneve case, I
measured on the order of a 10X-100X performance hit without this
patch, traces were similar to what you describe (too-large gso packets
were dropped, corresponding TCP segments were retransmitted later via
a non-gso code path).

Regards,

   Lance

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Hannes Frederic Sowa, David Miller, Tom Herbert,
	Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <CACT4Y+Yr6UJxXosjx2FO=KtxGr40X07_UBiscs1BP1s8cTaqAA@mail.gmail.com>

On Mon, 2016-11-28 at 20:34 +0100, Dmitry Vyukov wrote:
> On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller

> > Hi Eric,
> >
> > As far as I can see, skb_network_offset() becomes negative after
> > pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> > At least I'm able to detect that with a BUG_ON().
> >
> > Also it seems that the issue is only reproducible (at least with the
> > poc I provided) for a short time after boot.
> 
> 
> Eric,
> 
> Is it enough to debug? Or maybe Andrey can trace some values for you.

Well, now we are talking, if you tell me how many modules you load, it
might help ;)

nf_ct_frag6_queue is nowhere to be seen in my kernels, that might
explain why I could not reproduce the bug.

Let me try ;)

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] Documentation: net: phy: Add blurb about RGMII
From: Florian Fainelli @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Mason, Sebastian Frias, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <b6a19ca6-8bda-8573-eb7b-9ed7f80ca480@free.fr>

On 11/28/2016 11:15 AM, Mason wrote:
> On 28/11/2016 18:43, Florian Fainelli wrote:
> 
>> On 11/28/2016 02:34 AM, Sebastian Frias wrote:
>>
>>> For what is worth, the Atheros at803x driver comes with RX delay enabled
>>> as per HW reset.
>>
>> Always, or is this a behavior possibly defined via a stra-pin that can
>> be changed?
> 
> Here's the data sheet:
> 
>   http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf
> 
> 4.2.25 rgmii rx clock delay control
> Offset: 0x00
> bit 15: Control bit for rgmii interface rx clock delay:
> 1 = rgmii rx clock delay enable
> 0 = rgmii rx clock delay disable
> HW Rst. 1
> SW Rst. 1
> 
> As far as I can tell, rx delay is enabled by default, always.
> 
> The "Features" list mentions
> "RGMII timing modes support internal delay and external delay on Rx path"
> (Not sure about the internal vs external distinction.)
> 
> Table 3-6. RGMII AC Characteristics — no Internal Delay
> Table 3-7. RGMII AC Characteristics — with internal delay added (Default)
> 
> Delay is 2 ns apparently.
> 
> There's also
> 4.2.27 Hib ctrl and rgmii gtx clock delay register
> Offset: 0x0B
> 
> bits 6:5 Gtx_dly_val
> Select the delay of gtx_clk.
> 00:0.25ns
> 01:1.3ns
> 10:2.4ns
> 11:3.4ns
> 
> I don't know what that is used for.
> Maybe this is the external vs internal delay?

First, this has little to do with the initial patch series being
discussed now, and second, this still looks like an internal delay
programming, just that it applies to the received transmit clock from
the MAC, that's how I read it though.
-- 
Florian

^ permalink raw reply

* [PATCH net] bpf: fix states equal logic for varlen access
From: Josef Bacik @ 2016-11-28 19:44 UTC (permalink / raw)
  To: davem, netdev, daniel, ast, jannh

If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 kernel/bpf/verifier.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 89f787c..2c8a688 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 {
 	struct bpf_reg_state *rold, *rcur;
 	int i;
+	bool map_access = env->varlen_map_value_access;
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		rold = &old->regs[i];
@@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
 		/* If the ranges were not the same, but everything else was and
 		 * we didn't do a variable access into a map then we are a-ok.
 		 */
-		if (!env->varlen_map_value_access &&
+		if (!map_access &&
 		    rold->type == rcur->type && rold->imm == rcur->imm)
 			continue;
 
+		/* If we didn't map access then again we don't care about the
+		 * mismatched range values and it's ok if our old type was
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 */
 		if (rold->type == NOT_INIT ||
-		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
+		    (!map_access && (rold->type == UNKNOWN_VALUE &&
+				     rcur->type != NOT_INIT)))
 			continue;
 
 		if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
-- 
2.7.4

^ permalink raw reply related

* [PATCH net 5/5] l2tp: fix address test in __l2tp_ip6_bind_lookup()
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
matches if addr is NULL or if addr != laddr.
But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with
the same address, so the ipv6_addr_equal() condition needs to be
inverted.

For better clarity and consistency with the rest of the expression, the
(!X || X == Y) notation is used instead of !(X && X != Y).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 4a86440..aa821cb 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -72,7 +72,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,

 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
-		    !(addr && ipv6_addr_equal(addr, laddr)) &&
+		    (!addr || ipv6_addr_equal(addr, laddr)) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 4/5] l2tp: fix lookup for sockets not bound to a device in l2tp_ip
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

When looking up an l2tp socket, we must consider a null netdevice id as
wild card. There are currently two problems caused by
__l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0:

  * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
    never receives any packet. Since __l2tp_ip_bind_lookup() is called
    with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
    different from 'dif' so the socket doesn't match.

  * Two sockets, one bound to a device but not the other, can be bound
    to the same address. If the first socket binding to the address is
    the one that is also bound to a device, the second socket can bind
    to the same address without __l2tp_ip_bind_lookup() noticing the
    overlap.

To fix this issue, we need to consider that any null device index, be
it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
We also need to pass the input device index to __l2tp_ip_bind_lookup()
on reception so that sockets bound to a device never receive packets
from other devices.

This patch fixes l2tp_ip6 in the same way.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 6 ++++--
 net/l2tp/l2tp_ip6.c | 7 ++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index ea818f3..aa42e10 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -61,7 +61,8 @@ static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -182,7 +183,8 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 		struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
 
 		read_lock_bh(&l2tp_ip_lock);
-		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		sk = __l2tp_ip_bind_lookup(net, iph->daddr, inet_iif(skb),
+					   tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip_lock);
 			goto discard;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 5f2ae61..4a86440 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -73,7 +73,8 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(addr && ipv6_addr_equal(addr, laddr)) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -196,8 +197,8 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		struct ipv6hdr *iph = ipv6_hdr(skb);
 
 		read_lock_bh(&l2tp_ip6_lock);
-		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
-					    0, tunnel_id);
+		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, inet6_iif(skb),
+					    tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip6_lock);
 			goto discard;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 3/5] l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

It's not enough to check for sockets bound to same address at the
beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
time, a socket with the same address could be bound before we take
the l2tp lock again.

This patch moves the lookup right before inserting the new socket, so
that no change can ever happen to the list between address lookup and
socket insertion.

Care is taken to avoid side effects on the socket in case of failure.
That is, modifications of the socket are done after the lookup, when
binding is guaranteed to succeed, and before releasing the l2tp lock,
so that concurrent lookups will always see fully initialised sockets.

For l2tp_ip6, the lookup is now always done with the correct bound
device. Before this patch, when binding to a link-local address, the
lookup was done with the original sk->sk_bound_dev_if, which was later
overwritten with addr->l2tp_scope_id. Lookup is now performed with the
final sk->sk_bound_dev_if value.

Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
already been checked at this point (this part of the code seems to have
been copy-pasted from net/ipv6/raw.c).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 25 ++++++++++---------------
 net/l2tp/l2tp_ip6.c | 43 ++++++++++++++++++++-----------------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4d1c942..ea818f3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -257,14 +257,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr->l2tp_family != AF_INET)
 		return -EINVAL;
 
-	ret = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip_lock);
-	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
-				  sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-
-	read_unlock_bh(&l2tp_ip_lock);
-
 	lock_sock(sk);
 	if (!sock_flag(sk, SOCK_ZAPPED))
 		goto out;
@@ -282,14 +274,22 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		inet->inet_rcv_saddr = inet->inet_saddr = addr->l2tp_addr.s_addr;
 	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
 		inet->inet_saddr = 0;  /* Use device */
-	sk_dst_reset(sk);
 
+	write_lock_bh(&l2tp_ip_lock);
+	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
+				  sk->sk_bound_dev_if, addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip_lock);
+		ret = -EADDRINUSE;
+		goto out;
+	}
+
+	sk_dst_reset(sk);
 	l2tp_ip_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip_lock);
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip_lock);
+
 	ret = 0;
 	sock_reset_flag(sk, SOCK_ZAPPED);
 
@@ -297,11 +297,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	release_sock(sk);
 
 	return ret;
-
-out_in_use:
-	read_unlock_bh(&l2tp_ip_lock);
-
-	return ret;
 }
 
 static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index e3fc778..5f2ae61 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -267,6 +267,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_l2tpip6 *addr = (struct sockaddr_l2tpip6 *) uaddr;
 	struct net *net = sock_net(sk);
 	__be32 v4addr = 0;
+	int bound_dev_if;
 	int addr_type;
 	int err;
 
@@ -285,13 +286,6 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr_type & IPV6_ADDR_MULTICAST)
 		return -EADDRNOTAVAIL;
 
-	err = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip6_lock);
-	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr,
-				   sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-	read_unlock_bh(&l2tp_ip6_lock);
-
 	lock_sock(sk);
 
 	err = -EINVAL;
@@ -301,28 +295,25 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (sk->sk_state != TCP_CLOSE)
 		goto out_unlock;
 
+	bound_dev_if = sk->sk_bound_dev_if;
+
 	/* Check if the address belongs to the host. */
 	rcu_read_lock();
 	if (addr_type != IPV6_ADDR_ANY) {
 		struct net_device *dev = NULL;
 
 		if (addr_type & IPV6_ADDR_LINKLOCAL) {
-			if (addr_len >= sizeof(struct sockaddr_in6) &&
-			    addr->l2tp_scope_id) {
-				/* Override any existing binding, if another
-				 * one is supplied by user.
-				 */
-				sk->sk_bound_dev_if = addr->l2tp_scope_id;
-			}
+			if (addr->l2tp_scope_id)
+				bound_dev_if = addr->l2tp_scope_id;
 
 			/* Binding to link-local address requires an
-			   interface */
-			if (!sk->sk_bound_dev_if)
+			 * interface.
+			 */
+			if (!bound_dev_if)
 				goto out_unlock_rcu;
 
 			err = -ENODEV;
-			dev = dev_get_by_index_rcu(sock_net(sk),
-						   sk->sk_bound_dev_if);
+			dev = dev_get_by_index_rcu(sock_net(sk), bound_dev_if);
 			if (!dev)
 				goto out_unlock_rcu;
 		}
@@ -337,13 +328,22 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	rcu_read_unlock();
 
-	inet->inet_rcv_saddr = inet->inet_saddr = v4addr;
+	write_lock_bh(&l2tp_ip6_lock);
+	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr, bound_dev_if,
+				   addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip6_lock);
+		err = -EADDRINUSE;
+		goto out_unlock;
+	}
+
+	inet->inet_saddr = v4addr;
+	inet->inet_rcv_saddr = v4addr;
+	sk->sk_bound_dev_if = bound_dev_if;
 	sk->sk_v6_rcv_saddr = addr->l2tp_addr;
 	np->saddr = addr->l2tp_addr;
 
 	l2tp_ip6_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip6_lock);
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip6_lock);
@@ -356,10 +356,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rcu_read_unlock();
 out_unlock:
 	release_sock(sk);
-	return err;
 
-out_in_use:
-	read_unlock_bh(&l2tp_ip6_lock);
 	return err;
 }
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 1/5] l2tp: lock socket before checking flags in connect()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 +++-
 net/l2tp/l2tp_ip.c  | 19 ++++++++++++-------
 net/l2tp/l2tp_ip6.c | 16 +++++++++++-----
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8fed1cd..f11ca83 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -970,6 +970,8 @@ int compat_ipv6_setsockopt(struct sock *sk, int level, int optname,
 int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen);
 
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *addr,
+			   int addr_len);
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 37874e2..ccf4055 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -139,7 +139,8 @@ void ip6_datagram_release_cb(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
 
-static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
+			   int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock	*inet = inet_sk(sk);
@@ -252,6 +253,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(__ip6_datagram_connect);
 
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 982f6c4..1f57094 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -308,21 +308,24 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	struct sockaddr_l2tpip *lsa = (struct sockaddr_l2tpip *) uaddr;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
 	if (ipv4_is_multicast(lsa->l2tp_addr.s_addr))
 		return -EINVAL;
 
-	rc = ip4_datagram_connect(sk, uaddr, addr_len);
-	if (rc < 0)
-		return rc;
-
 	lock_sock(sk);
 
+	/* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip4_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip_lock);
@@ -330,7 +333,9 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	write_unlock_bh(&l2tp_ip_lock);
 
+out_sk:
 	release_sock(sk);
+
 	return rc;
 }
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9978d01..af9abff 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -371,9 +371,6 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	int	addr_type;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
@@ -390,10 +387,18 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 			return -EINVAL;
 	}
 
-	rc = ip6_datagram_connect(sk, uaddr, addr_len);
-
 	lock_sock(sk);
 
+	 /* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip6_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip6_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip6_lock);
@@ -401,6 +406,7 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	write_unlock_bh(&l2tp_ip6_lock);
 
+out_sk:
 	release_sock(sk);
 
 	return rc;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 2/5] l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 11 ++++++-----
 net/l2tp/l2tp_ip6.c | 11 ++++++-----
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 1f57094..4d1c942 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -183,14 +183,15 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 
 		read_lock_bh(&l2tp_ip_lock);
 		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index af9abff..e3fc778 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -198,14 +198,15 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		read_lock_bh(&l2tp_ip6_lock);
 		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
 					    0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip6_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip6_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 0/5] l2tp: fixes for l2tp_ip and l2tp_ip6 socket handling
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston

This series addresses problems found while working on commit 32c231164b76
("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").

The first three patches fix races in socket's connect, recv and bind
operations. The last two ones fix scenarios where l2tp fails to
correctly lookup its userspace sockets.

Apart from the last patch, which is l2tp_ip6 specific, every patch fixes
the same problem in the L2TP IPv4 and IPv6 code.

Guillaume Nault (5):
  l2tp: lock socket before checking flags in connect()
  l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
  l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
  l2tp: fix lookup for sockets not bound to a device in l2tp_ip
  l2tp: fix address test in __l2tp_ip6_bind_lookup()

 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 ++-
 net/l2tp/l2tp_ip.c  | 61 +++++++++++++++++++++--------------------
 net/l2tp/l2tp_ip6.c | 79 ++++++++++++++++++++++++++++-------------------------
 4 files changed, 79 insertions(+), 67 deletions(-)

-- 
2.10.2

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox