Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: net: GPF in eth_header
From: Andrey Konovalov @ 2016-11-28 19:04 UTC (permalink / raw)
  To: syzkaller
  Cc: Hannes Frederic Sowa, Dmitry Vyukov, David Miller, Tom Herbert,
	Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <1480359019.18162.79.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Nov 28, 2016 at 7:50 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2016-11-26 at 20:07 +0100, Andrey Konovalov wrote:
>> On Sat, Nov 26, 2016 at 7:28 PM, 'Eric Dumazet' via syzkaller
>> <syzkaller@googlegroups.com> wrote:
>> > On Sat, Nov 26, 2016 at 9:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> >> Hello,
>> >>
>> >> The following program triggers GPF in eth_header:
>> >>
>> >> https://gist.githubusercontent.com/dvyukov/613cadf05543b55a419f237e419cd495/raw/5471231523d1a07c3de55f11f87472c2816ee06c/gistfile1.txt
>> >>
>> >> On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).
>> >>
>> >> BUG: unable to handle kernel paging request at ffffed002d14d74a
>> >> IP: [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>> >> PGD 7fff6067 [   50.787819] PUD 7fff5067
>> >> PMD 0 [   50.787819]
>> >> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> >> Modules linked in:
>> >> CPU: 2 PID: 6712 Comm: a.out Not tainted 4.9.0-rc6+ #55
>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >> task: ffff88003a1841c0 task.stack: ffff880034d08000
>> >> RIP: 0010:[<ffffffff86be3295>]  [<ffffffff86be3295>]
>> >> eth_header+0x75/0x260 net/ethernet/eth.c:88
>> >> RSP: 0018:ffff880034d0eb68  EFLAGS: 00010a03
>> >> RAX: 1ffff1002d14d74a RBX: ffff880168a6ba4a RCX: ffff88006a9c7858
>> >> RDX: 000000000000dd86 RSI: dffffc0000000000 RDI: ffff880168a6ba56
>> >> RBP: ffff880034d0eb98 R08: 0000000000000000 R09: 0000000000000031
>> >> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
>> >> R13: ffff88006c208d80 R14: 00000000000086dd R15: ffff88006a9c7858
>> >> FS:  0000000001a02940(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
>> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> CR2: ffffed002d14d74a CR3: 0000000037373000 CR4: 00000000000006e0
>> >> Stack:
>> >>  000000316881ab40 ffff88006a9c76c0 ffff88006881ab40 ffff88006a9c77f8
>> >>  0000000000000000 dffffc0000000000 ffff880034d0ee98 ffffffff86b31af9
>> >>  ffffffff8719605c ffff880034d0f0f8 ffffffff000086dd ffffffff86be3220
>> >> Call Trace:
>> >>  [<     inline     >] dev_hard_header ./include/linux/netdevice.h:2762
>> >>  [<ffffffff86b31af9>] neigh_resolve_output+0x659/0xb20 net/core/neighbour.c:1302
>> >>  [<     inline     >] dst_neigh_output ./include/net/dst.h:464
>> >>  [<ffffffff8719605c>] ip6_finish_output2+0xb3c/0x2500 net/ipv6/ip6_output.c:121
>> >>  [<ffffffff871a0b0b>] ip6_finish_output+0x2eb/0x760 net/ipv6/ip6_output.c:139
>> >>  [<     inline     >] NF_HOOK_COND ./include/linux/netfilter.h:246
>> >>  [<ffffffff871a1157>] ip6_output+0x1d7/0x9a0 net/ipv6/ip6_output.c:153
>> >>  [<     inline     >] dst_output ./include/net/dst.h:501
>> >>  [<ffffffff873312ea>] ip6_local_out+0x9a/0x180 net/ipv6/output_core.c:170
>> >>  [<ffffffff871a3886>] ip6_send_skb+0xa6/0x340 net/ipv6/ip6_output.c:1712
>> >>  [<ffffffff871a3bd8>] ip6_push_pending_frames+0xb8/0xe0
>> >> net/ipv6/ip6_output.c:1732
>> >>  [<     inline     >] rawv6_push_pending_frames net/ipv6/raw.c:607
>> >>  [<ffffffff8722acfb>] rawv6_sendmsg+0x250b/0x2c20 net/ipv6/raw.c:920
>> >>  [<ffffffff8701c4f5>] inet_sendmsg+0x385/0x590 net/ipv4/af_inet.c:734
>> >>  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
>> >>  [<ffffffff86a6ea9f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
>> >>  [<ffffffff86a6ee0b>] sock_write_iter+0x32b/0x620 net/socket.c:829
>> >>  [<ffffffff81a6f153>] do_iter_readv_writev+0x363/0x670 fs/read_write.c:695
>> >>  [<ffffffff81a71ba1>] do_readv_writev+0x431/0x9b0 fs/read_write.c:872
>> >>  [<ffffffff81a726dc>] vfs_writev+0x8c/0xc0 fs/read_write.c:911
>> >>  [<ffffffff81a72825>] do_writev+0x115/0x2d0 fs/read_write.c:944
>> >>  [<     inline     >] SYSC_writev fs/read_write.c:1017
>> >>  [<ffffffff81a75fdc>] SyS_writev+0x2c/0x40 fs/read_write.c:1014
>> >>  [<ffffffff8814cf85>] entry_SYSCALL_64_fastpath+0x23/0xc6
>> >> arch/x86/entry/entry_64.S:209
>> >> Code: 41 83 fe 04 0f 84 aa 00 00 00 e8 17 4e b0 fa 48 8d 7b 0c 48 be
>> >> 00 00 00 00 00 fc ff df 44 89 f2 66 c1 c2 08 48 89 f8 48 c1 e8 03 <0f>
>> >> b6 0c 30 48 8d 43 0d 49 89 c0 49 c1 e8 03 41 0f b6 34 30 49
>> >> RIP  [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>> >>  RSP <ffff880034d0eb68>
>> >> CR2: ffffed002d14d74a
>> >> ---[ end trace a73fedfdc11bd60c ]---
>> >
>> >
>> > Hi Dmitry
>> >
>> > I could not reproduce the issue. Might need some specific configuration...
>> >
>> > loopback device has proper ethernet header (all 0)
>> >
>> > Fault happens in :
>> >
>> > 0f b6 0c 30             movzbl (%rax,%rsi,1),%ecx
>> >
>> > RAX=1ffff1002d14d74a  which is RDI>>3, and RSI=dffffc0000000000
>> >
>> > Could this be a KASAN problem ?
>>
>> Hi Eric,
>>
>> The crash happens when the kernel tries to access shadow for nonmapped memory.
>>
>> The issue here is an integer overflow which happens in neigh_resolve_output().
>> skb_network_offset(skb) can return negative number, but __skb_pull()
>> accepts unsigned int as len.
>> As a result, the least significat bit in higher 32 bits of skb->data
>> gets set and we get an out-of-bounds with offset of 4 GB.
>>
>> I've attached a short reproducer, but you either need KASAN or to add
>> a BUG_ON to see the crash.
>> In this reproducer skb_network_offset() becomes negative after merging
>> two ipv6 fragments.
>>
>> I actually see multiple places where skb_network_offset() is used as
>> an argument to skb_pull().
>> So I guess every place can potentially be buggy.
>>
>> Thanks!
>
> I can not reproduce the bug on my hosts.
> Quite hard to debug for me.
>
> skb_network_offset() can not be negative at this point, unless there is
> a bug upper in the stack.

Hi Eric,

As far as I can see, skb_network_offset() becomes negative after
pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
At least I'm able to detect that with a BUG_ON().

Also it seems that the issue is only reproducible (at least with the
poc I provided) for a short time after boot.

I hope that helps.

>
> Hannes, do you have an idea of what could be wrong in IPv6 stack ?
>
> Thanks.
>
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Sergei Shtylyov @ 2016-11-28 19:06 UTC (permalink / raw)
  To: Souptick Joarder, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161128132819.GA500@gnr743-HP-ZBook-15>

Hello.

On 11/28/2016 04:28 PM, Souptick Joarder wrote:

> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
>
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx4/cmd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index e36bebc..ee3bd76 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -2679,14 +2679,13 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
>  	if (!mailbox)
>  		return ERR_PTR(-ENOMEM);
>
> -	mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> +	mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
>  				      &mailbox->dma);

    You need to realign he continuation line now, the way it was aligned in 
the original code.

[...]

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 1/1] driver: ipvlan: Add the sanity check for ipvlan mode
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-11-28 19:08 UTC (permalink / raw)
  To: fgao; +Cc: David Miller, Eric Dumazet, linux-netdev, gfree.wind
In-Reply-To: <1480339435-15551-1-git-send-email-fgao@ikuai8.com>

On Mon, Nov 28, 2016 at 5:23 AM,  <fgao@ikuai8.com> wrote:
> From: Gao Feng <fgao@ikuai8.com>
>
> The ipvlan mode variable "nval" is from userspace, so the ipvlan codes
> should check if the mode variable "nval" is valid.
>
> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index ab90b22..537b5a9 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -65,6 +65,9 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>         struct net_device *mdev = port->dev;
>         int err = 0;
>
> +       if (nval >= IPVLAN_MODE_MAX)
> +               return -EINVAL;
> +
I'm curious to know how you encountered this issue? The values are
validated in ipvlan_nl_validate() and it should fail at that time
itself.
>         ASSERT_RTNL();
>         if (port->mode != nval) {
>                 if (nval == IPVLAN_MODE_L3S) {
> --
> 1.9.1
>
>

^ permalink raw reply

* Re: [PATCH] ethernet :mellanox :mlx5: Replace pci_pool_alloc by pci_pool_zalloc
From: Sergei Shtylyov @ 2016-11-28 19:08 UTC (permalink / raw)
  To: Souptick Joarder, saeedm-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, leonro-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161128133740.GA688@gnr743-HP-ZBook-15>

On 11/28/2016 04:37 PM, Souptick Joarder wrote:

> In alloc_cmd_box(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
>
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> index 1e639f8..d96ebd4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> @@ -1063,14 +1063,13 @@ static struct mlx5_cmd_mailbox *alloc_cmd_box(struct mlx5_core_dev *dev,
>  	if (!mailbox)
>  		return ERR_PTR(-ENOMEM);
>
> -	mailbox->buf = pci_pool_alloc(dev->cmd.pool, flags,
> +	mailbox->buf = pci_pool_zalloc(dev->cmd.pool, flags,
>  				      &mailbox->dma);

    Same here, the & needs to start under 'dev' on the broken up line.

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] Documentation: net: phy: Add blurb about RGMII
From: Mason @ 2016-11-28 19:15 UTC (permalink / raw)
  To: Florian Fainelli, Sebastian Frias, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <60473447-423c-7dff-6647-cce7cf2d4ab2@gmail.com>

On 28/11/2016 18:43, Florian Fainelli wrote:

> On 11/28/2016 02:34 AM, Sebastian Frias wrote:
>
>> For what is worth, the Atheros at803x driver comes with RX delay enabled
>> as per HW reset.
> 
> Always, or is this a behavior possibly defined via a stra-pin that can
> be changed?

Here's the data sheet:

  http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf

4.2.25 rgmii rx clock delay control
Offset: 0x00
bit 15: Control bit for rgmii interface rx clock delay:
1 = rgmii rx clock delay enable
0 = rgmii rx clock delay disable
HW Rst. 1
SW Rst. 1

As far as I can tell, rx delay is enabled by default, always.

The "Features" list mentions
"RGMII timing modes support internal delay and external delay on Rx path"
(Not sure about the internal vs external distinction.)

Table 3-6. RGMII AC Characteristics — no Internal Delay
Table 3-7. RGMII AC Characteristics — with internal delay added (Default)

Delay is 2 ns apparently.

There's also
4.2.27 Hib ctrl and rgmii gtx clock delay register
Offset: 0x0B

bits 6:5 Gtx_dly_val
Select the delay of gtx_clk.
00:0.25ns
01:1.3ns
10:2.4ns
11:3.4ns

I don't know what that is used for.
Maybe this is the external vs internal delay?

Regards.

^ permalink raw reply

* Checking for MDIO phy address 0
From: Phil Endecott @ 2016-11-28 18:53 UTC (permalink / raw)
  To: netdev

Dear Experts,

Is it true that phy address 0 is special, and should not be used?
I have this in my device tree (edited for brevity):

        mdio@17020000 {
            compatible = "apm,xgene-mdio-rgmii";
            #address-cells = <0x00000001>;
            #size-cells = <0x00000000>;
            phy@4 {
                reg = <0x00000000>;
                linux,phandle = <0x00000021>;
                phandle = <0x00000021>;
            };
            phy@5 {
                reg = <0x00000001>;
                linux,phandle = <0x00000022>;
                phandle = <0x00000022>;
            };
        };
        ethernet@1f210000 {
            compatible = "apm,xgene1-sgenet";
            phy-connection-type = "sgmii";
            phy-handle = <0x00000021>;
        };
        ethernet@1f210030 {
            compatible = "apm,xgene1-sgenet";
            phy-connection-type = "sgmii";
            phy-handle = <0x00000022>;
        };

(This is on a Gigabyte MP30-AR1, which has an X-Gene ARM64 processor.)

I've spent a long time trying to get the two gigabit ethernet ports to 
correctly auto-negotiate down to 100 Mbit, and it's possible that the 
underlying problem is that one of the phys is using address 0.  Does 
that make sense?

Anyway, my reason for this message is to suggest a runtime diagnostic 
message somewhere if address 0 is used - this could have saved me a lot of 
work! If someone can suggest the right place to add this I can prepare a 
patch.

Thanks,  Phil.

^ permalink raw reply

* Re: [PATCH net-next 10/11] qede: Add basic XDP support
From: Jakub Kicinski @ 2016-11-28 19:18 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: davem, netdev
In-Reply-To: <1480258273-24973-11-git-send-email-Yuval.Mintz@cavium.com>

On Sun, 27 Nov 2016 16:51:12 +0200, Yuval Mintz wrote:
> Add support for the ndo_xdp callback. This patch would support XDP_PASS,
> XDP_DROP and XDP_ABORTED commands.
> 
> This also adds a per Rx queue statistic which counts number of packets
> which didn't reach the stack [due to XDP].
> 
> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
> ---
> [...]
> +static int qede_xdp_set(struct qede_dev *edev, struct bpf_prog *prog)
> +{
> +	bool reload_required = true;
> +	struct qede_reload_args args;
> +	int rc = 0;
> +
> +	/* Protect against various other internal-reload flows */
> +	__qede_lock(edev);
> +	if (edev->state != QEDE_STATE_OPEN) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* If we're called, there was already a bpf reference increment */
> +	args.func = &qede_xdp_reload_func;
> +	args.u.new_prog = prog;
> +	if (reload_required)
> +		qede_reload(edev, &args, true);
> +	else
> +		args.func(edev, &args);
> +
> +out:
> +	__qede_unlock(edev);
> +	return rc;
> +}

Any particular reason not to allow the XDP prog being set while the
device is closed?  You seem to preserve the program across
close()/open() cycles so the edev->xdp_prog is alive and valid while
device is closed IIUC.  

I think other drivers are allowing setting XDP while closed and it
would be cool to keep the behaviour identical across drivers :)

^ permalink raw reply

* Re: [PATCH net-next 1/3] ethtool: (uapi) Add ETHTOOL_PHY_LOOPBACK to PHY tunables
From: Allan W. Nielsen @ 2016-11-28 19:23 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, f.fainelli, raju.lakkaraju
In-Reply-To: <20161128141410.GF4379@lunn.ch>

Hi Andrew and Florian,

On 28/11/16 15:14, Andrew Lunn wrote:
> On Mon, Nov 28, 2016 at 02:24:30PM +0100, Allan W. Nielsen wrote:
> > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> >
> > 3 types of PHY loopback are supported.
> > i.e. Near-End Loopback, Far-End Loopback and External Loopback.
> Is this integrated with ethtool --test? You only want the PHY to go
> into loopback mode when running ethtool --test external_lb or maybe
> ethtool --test offline.
There are other use-cases for enabling PHY loopback:

1) If the PHY is connected to a switch then a loop-port is sometime
   used to "force/enable" one or more extra pass through the switch
   core. This "hack" can sometime be used to achieve new functionality
   with existing silicon.

2) Existing user-space application may expect to be able to do the
   testing on its own (generate/validate the test traffic).

> What i think should happen is that this tunable sets the mode the
> PHY will go into when loopback is enabled. It does not however
> enable loopback.
That does not make a lot of sense with the "FAR" loopback (it is
looping received traffic back into the wire).

> It is running ethtool --test which actually enables
> the loopback, probably by setting BMCR_LOOPBACK. Once the test is
> finished, the bit is cleared and the PHY goes back into normal
> operation.
We are always happy to integrate with any existing functionality, but
as I understand "ethtool --test" then intention is to perform a test
and then bring back the PHY in to a "normal" state (I may be
wrong...).

The idea with this patch is to allow configuring loopback more
"permanently" (userspace can decide when to activate and when to
de-activate). I should properly have made that clear in the cover
letter.

Please let me know what you think.

/Allan

^ permalink raw reply

* Re: [PATCH net-next v2 1/1] driver: ipvlan: Use NF_IP_PRI_LAST as hook priority instead of INT_MAX
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-11-28 19:26 UTC (permalink / raw)
  To: fgao; +Cc: David Miller, Eric Dumazet, linux-netdev, gfree.wind
In-Reply-To: <1480245482-10357-1-git-send-email-fgao@ikuai8.com>

On Sun, Nov 27, 2016 at 3:18 AM,  <fgao@ikuai8.com> wrote:
> From: Gao Feng <fgao@ikuai8.com>
>
> It is better to use NF_IP_PRI_LAST instead of INT_MAX as hook priority.
> The former is good at readability and easier to maintain.
>
This IPvlan hook has to be "absolute" last hook and at this moment
NF_IP_PRI_LAST is set as INT_MAX so it's not altering anything.

If for whatever reasons the value of NF_IP_PRI_LAST changes, there
could be random IPvlan failure. Since that possibility cannot be
denied and there are several places INT_MAX is still used as hook
priority, I don't see any gain in having this patch in fact there
could be future (possible) downside.

> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> ---
>  v2: Add the lost header file. It is added in local but not in v1 patch
>  v1: Inital patch
>
>  drivers/net/ipvlan/ipvlan_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index ab90b22..01c7446 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -7,6 +7,7 @@
>   *
>   */
>
> +#include "linux/netfilter_ipv4.h"
>  #include "ipvlan.h"
>
>  static u32 ipvl_nf_hook_refcnt = 0;
> @@ -16,13 +17,13 @@
>                 .hook     = ipvlan_nf_input,
>                 .pf       = NFPROTO_IPV4,
>                 .hooknum  = NF_INET_LOCAL_IN,
> -               .priority = INT_MAX,
> +               .priority = NF_IP_PRI_LAST,
>         },
>         {
>                 .hook     = ipvlan_nf_input,
>                 .pf       = NFPROTO_IPV6,
>                 .hooknum  = NF_INET_LOCAL_IN,
> -               .priority = INT_MAX,
> +               .priority = NF_IP_PRI_LAST,
>         },
>  };
>
> --
> 1.9.1
>
>

^ permalink raw reply

* Re: [PATCH net-next v2] bpf: cgroup: fix documentation of __cgroup_bpf_update()
From: Alexei Starovoitov @ 2016-11-28 19:34 UTC (permalink / raw)
  To: Daniel Mack
  Cc: ast-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, netdev-u79uwXL29TY76Z2rM5mHXA,
	roszenrami-Re5JQEeQqe8AvxtiuMwx3w, cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1480338664-22616-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

On Mon, Nov 28, 2016 at 02:11:04PM +0100, Daniel Mack wrote:
> There's a 'not' missing in one paragraph. Add it.
> 
> Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
> Reported-by: Rami Rosen <roszenrami-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Fixes: 3007098494be ("cgroup: add support for eBPF programs")

Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

btw more canonical patch format is to put Fixes tag before SOB.

^ permalink raw reply

* Re: net: GPF in eth_header
From: Dmitry Vyukov @ 2016-11-28 19:34 UTC (permalink / raw)
  To: syzkaller
  Cc: Hannes Frederic Sowa, David Miller, Tom Herbert, Alexander Duyck,
	Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <CAAeHK+wWVp46ej0tk5mDnOHpH5Ky0+HiH9_SUqGcT0iTppLaGw@mail.gmail.com>

On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller
<syzkaller@googlegroups.com> wrote:
> On Mon, Nov 28, 2016 at 7:50 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Sat, 2016-11-26 at 20:07 +0100, Andrey Konovalov wrote:
>>> On Sat, Nov 26, 2016 at 7:28 PM, 'Eric Dumazet' via syzkaller
>>> <syzkaller@googlegroups.com> wrote:
>>> > On Sat, Nov 26, 2016 at 9:30 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> >> Hello,
>>> >>
>>> >> The following program triggers GPF in eth_header:
>>> >>
>>> >> https://gist.githubusercontent.com/dvyukov/613cadf05543b55a419f237e419cd495/raw/5471231523d1a07c3de55f11f87472c2816ee06c/gistfile1.txt
>>> >>
>>> >> On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).
>>> >>
>>> >> BUG: unable to handle kernel paging request at ffffed002d14d74a
>>> >> IP: [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >> PGD 7fff6067 [   50.787819] PUD 7fff5067
>>> >> PMD 0 [   50.787819]
>>> >> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>>> >> Modules linked in:
>>> >> CPU: 2 PID: 6712 Comm: a.out Not tainted 4.9.0-rc6+ #55
>>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> >> task: ffff88003a1841c0 task.stack: ffff880034d08000
>>> >> RIP: 0010:[<ffffffff86be3295>]  [<ffffffff86be3295>]
>>> >> eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >> RSP: 0018:ffff880034d0eb68  EFLAGS: 00010a03
>>> >> RAX: 1ffff1002d14d74a RBX: ffff880168a6ba4a RCX: ffff88006a9c7858
>>> >> RDX: 000000000000dd86 RSI: dffffc0000000000 RDI: ffff880168a6ba56
>>> >> RBP: ffff880034d0eb98 R08: 0000000000000000 R09: 0000000000000031
>>> >> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
>>> >> R13: ffff88006c208d80 R14: 00000000000086dd R15: ffff88006a9c7858
>>> >> FS:  0000000001a02940(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
>>> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> >> CR2: ffffed002d14d74a CR3: 0000000037373000 CR4: 00000000000006e0
>>> >> Stack:
>>> >>  000000316881ab40 ffff88006a9c76c0 ffff88006881ab40 ffff88006a9c77f8
>>> >>  0000000000000000 dffffc0000000000 ffff880034d0ee98 ffffffff86b31af9
>>> >>  ffffffff8719605c ffff880034d0f0f8 ffffffff000086dd ffffffff86be3220
>>> >> Call Trace:
>>> >>  [<     inline     >] dev_hard_header ./include/linux/netdevice.h:2762
>>> >>  [<ffffffff86b31af9>] neigh_resolve_output+0x659/0xb20 net/core/neighbour.c:1302
>>> >>  [<     inline     >] dst_neigh_output ./include/net/dst.h:464
>>> >>  [<ffffffff8719605c>] ip6_finish_output2+0xb3c/0x2500 net/ipv6/ip6_output.c:121
>>> >>  [<ffffffff871a0b0b>] ip6_finish_output+0x2eb/0x760 net/ipv6/ip6_output.c:139
>>> >>  [<     inline     >] NF_HOOK_COND ./include/linux/netfilter.h:246
>>> >>  [<ffffffff871a1157>] ip6_output+0x1d7/0x9a0 net/ipv6/ip6_output.c:153
>>> >>  [<     inline     >] dst_output ./include/net/dst.h:501
>>> >>  [<ffffffff873312ea>] ip6_local_out+0x9a/0x180 net/ipv6/output_core.c:170
>>> >>  [<ffffffff871a3886>] ip6_send_skb+0xa6/0x340 net/ipv6/ip6_output.c:1712
>>> >>  [<ffffffff871a3bd8>] ip6_push_pending_frames+0xb8/0xe0
>>> >> net/ipv6/ip6_output.c:1732
>>> >>  [<     inline     >] rawv6_push_pending_frames net/ipv6/raw.c:607
>>> >>  [<ffffffff8722acfb>] rawv6_sendmsg+0x250b/0x2c20 net/ipv6/raw.c:920
>>> >>  [<ffffffff8701c4f5>] inet_sendmsg+0x385/0x590 net/ipv4/af_inet.c:734
>>> >>  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
>>> >>  [<ffffffff86a6ea9f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
>>> >>  [<ffffffff86a6ee0b>] sock_write_iter+0x32b/0x620 net/socket.c:829
>>> >>  [<ffffffff81a6f153>] do_iter_readv_writev+0x363/0x670 fs/read_write.c:695
>>> >>  [<ffffffff81a71ba1>] do_readv_writev+0x431/0x9b0 fs/read_write.c:872
>>> >>  [<ffffffff81a726dc>] vfs_writev+0x8c/0xc0 fs/read_write.c:911
>>> >>  [<ffffffff81a72825>] do_writev+0x115/0x2d0 fs/read_write.c:944
>>> >>  [<     inline     >] SYSC_writev fs/read_write.c:1017
>>> >>  [<ffffffff81a75fdc>] SyS_writev+0x2c/0x40 fs/read_write.c:1014
>>> >>  [<ffffffff8814cf85>] entry_SYSCALL_64_fastpath+0x23/0xc6
>>> >> arch/x86/entry/entry_64.S:209
>>> >> Code: 41 83 fe 04 0f 84 aa 00 00 00 e8 17 4e b0 fa 48 8d 7b 0c 48 be
>>> >> 00 00 00 00 00 fc ff df 44 89 f2 66 c1 c2 08 48 89 f8 48 c1 e8 03 <0f>
>>> >> b6 0c 30 48 8d 43 0d 49 89 c0 49 c1 e8 03 41 0f b6 34 30 49
>>> >> RIP  [<ffffffff86be3295>] eth_header+0x75/0x260 net/ethernet/eth.c:88
>>> >>  RSP <ffff880034d0eb68>
>>> >> CR2: ffffed002d14d74a
>>> >> ---[ end trace a73fedfdc11bd60c ]---
>>> >
>>> >
>>> > Hi Dmitry
>>> >
>>> > I could not reproduce the issue. Might need some specific configuration...
>>> >
>>> > loopback device has proper ethernet header (all 0)
>>> >
>>> > Fault happens in :
>>> >
>>> > 0f b6 0c 30             movzbl (%rax,%rsi,1),%ecx
>>> >
>>> > RAX=1ffff1002d14d74a  which is RDI>>3, and RSI=dffffc0000000000
>>> >
>>> > Could this be a KASAN problem ?
>>>
>>> Hi Eric,
>>>
>>> The crash happens when the kernel tries to access shadow for nonmapped memory.
>>>
>>> The issue here is an integer overflow which happens in neigh_resolve_output().
>>> skb_network_offset(skb) can return negative number, but __skb_pull()
>>> accepts unsigned int as len.
>>> As a result, the least significat bit in higher 32 bits of skb->data
>>> gets set and we get an out-of-bounds with offset of 4 GB.
>>>
>>> I've attached a short reproducer, but you either need KASAN or to add
>>> a BUG_ON to see the crash.
>>> In this reproducer skb_network_offset() becomes negative after merging
>>> two ipv6 fragments.
>>>
>>> I actually see multiple places where skb_network_offset() is used as
>>> an argument to skb_pull().
>>> So I guess every place can potentially be buggy.
>>>
>>> Thanks!
>>
>> I can not reproduce the bug on my hosts.
>> Quite hard to debug for me.
>>
>> skb_network_offset() can not be negative at this point, unless there is
>> a bug upper in the stack.
>
> Hi Eric,
>
> As far as I can see, skb_network_offset() becomes negative after
> pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> At least I'm able to detect that with a BUG_ON().
>
> Also it seems that the issue is only reproducible (at least with the
> poc I provided) for a short time after boot.


Eric,

Is it enough to debug? Or maybe Andrey can trace some values for you.

^ permalink raw reply

* Re: [PATCH ethtool 2/2] Ethtool: Implements PHY Loopback
From: Allan W. Nielsen @ 2016-11-28 19:34 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, linville, andrew, raju.lakkaraju
In-Reply-To: <73aa42bc-b476-ba5a-cf22-115b409a40e3@gmail.com>

On 28/11/16 09:56, Florian Fainelli wrote:
> On 11/28/2016 05:25 AM, Allan W. Nielsen wrote:
> > From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> >
> > Add loopback in ethtool tunables to access PHY drivers.
> >
> > Ethtool Help: ethtool -h for PHY tunables
> >     ethtool --set-phy-tunable DEVNAME      Set PHY tunable
> >                 [ loopback off|near|far|extn ]
> >     ethtool --get-phy-tunable DEVNAME      Get PHY tunable
> >                 [ loopback ]
> >
> > Ethtool ex:
> >   ethtool --set-phy-tunable eth0 loopback near
> >   ethtool --set-phy-tunable eth0 loopback far
> >   ethtool --set-phy-tunable eth0 loopback extn
> >   ethtool --set-phy-tunable eth0 loopback off
> >
> >   ethtool --get-phy-tunable eth0 loopback
> >
> > Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> > Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
> > ---
> 
> > +Near-End Loopback:
> > +Transmitted data (TXD) is looped back in the PCS block onto the receive data
> > +signal (RXD). When Near-End loopback enable, no data is transmitted over
> > +the network. no data receive from the network.
> 
> This is also known as the local loopback test mode, right?
Yes - Traffic transmitted/generated by the host list loopback to the
host instead of transmitting it on the wire.

> > +
> > +Far-End Loopback:
> > +This loopback is a special test mode to allow testing the PHY from link
> > +partner side. In this mode data that is received from the link partner pass
> > +through the PHY's receiver, looped back on the MII and transmitted back to
> > +the link partner.
> 
> And this is the remote loopback mode.
Yes - Traffic receiwed on the "wire" is transmitted back on the wire.

> > +
> > +External Loopback:
> > +An RJ45 loopback cable can be used to route the transmit signals an the
> > +output of the trnsformer back to the receiver inputs and this loopback will
> > +work at either 10 or 100 or 1000 Mbps speed.
> > +RJ45 Loopback cable created by conncting pin 1 to pin 3 and connecting pin
> > +2 to pin 6.
> 
> OK, this name makes sense to me, but for the two other names, we need to
> use a terminology that is clearer to the reader and/or people familiar
> and targeted at using this feature (e.g: in a lab or during manufacturing).
Sure, we can find better names and/or improve the documentation. But
before jumping to that, then it is properly a good idea to agree on
the overall concept.

We will get back to the naming when we agree on the other parts.

/Allan

^ permalink raw reply

* Re: [PATCH net-next] bpf: samples: Fix compile of test_lru_dist.c
From: Martin Lau @ 2016-11-28 19:36 UTC (permalink / raw)
  To: Daniel Borkmann, David Ahern, netdev@vger.kernel.org; +Cc: Alexei Starovoitov
In-Reply-To: <583BF623.7030103@iogearbox.net>


On Mon, Nov 28, 2016 at 10:17:23AM +0100, Daniel Borkmann wrote:
> On 11/28/2016 05:32 AM, David Ahern wrote:
> >Build of samples/bpf on debian/jessie fails with:
> >
> >   HOSTCC  /home/dsa/kernel-3.git/samples/bpf/test_lru_dist.o
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c: In function ‘main’:
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: error: variable ‘r’ has initializer but incomplete type
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >          ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:21: error: ‘RLIM_INFINITY’ undeclared (first use in this function)
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >                      ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:21: note: each undeclared identifier is reported only once for each function it appears in
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: excess elements in struct initializer
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >          ^
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: (near initialization for ‘r’)
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: excess elements in struct initializer
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:9: warning: (near initialization for ‘r’)
> >/home/dsa/kernel-3.git/samples/bpf/test_lru_dist.c:490:16: error: storage size of ‘r’ isn’t known
> >   struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> >
> >Add sys/resource.h to the include list
> >
> >Fixes: 5db58faf989f ("bpf: Add tests for the LRU bpf_htab")
> >Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> >Cc: Martin KaFai Lau <kafai@fb.com>
>
> Ran into the same issue, fixed here already:
>
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e00c7b216f34444252f3771f7d4ed48d4f032636

Thanks for David's report/patch and the earlier Daniel's fix.  Sorry that I
missed it since it compiles fine in my current dev setup (CentOS6).  I
can also reproduce in my newer arch-linux setup.

--Martin

^ permalink raw reply

* [PATCH net 0/5] l2tp: fixes for l2tp_ip and l2tp_ip6 socket handling
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston

This series addresses problems found while working on commit 32c231164b76
("l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()").

The first three patches fix races in socket's connect, recv and bind
operations. The last two ones fix scenarios where l2tp fails to
correctly lookup its userspace sockets.

Apart from the last patch, which is l2tp_ip6 specific, every patch fixes
the same problem in the L2TP IPv4 and IPv6 code.

Guillaume Nault (5):
  l2tp: lock socket before checking flags in connect()
  l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
  l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
  l2tp: fix lookup for sockets not bound to a device in l2tp_ip
  l2tp: fix address test in __l2tp_ip6_bind_lookup()

 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 ++-
 net/l2tp/l2tp_ip.c  | 61 +++++++++++++++++++++--------------------
 net/l2tp/l2tp_ip6.c | 79 ++++++++++++++++++++++++++++-------------------------
 4 files changed, 79 insertions(+), 67 deletions(-)

-- 
2.10.2

^ permalink raw reply

* [PATCH net 2/5] l2tp: hold socket before dropping lock in l2tp_ip{,6}_recv()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 11 ++++++-----
 net/l2tp/l2tp_ip6.c | 11 ++++++-----
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 1f57094..4d1c942 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -183,14 +183,15 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 
 		read_lock_bh(&l2tp_ip_lock);
 		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index af9abff..e3fc778 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -198,14 +198,15 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		read_lock_bh(&l2tp_ip6_lock);
 		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
 					    0, tunnel_id);
+		if (!sk) {
+			read_unlock_bh(&l2tp_ip6_lock);
+			goto discard;
+		}
+
+		sock_hold(sk);
 		read_unlock_bh(&l2tp_ip6_lock);
 	}
 
-	if (sk == NULL)
-		goto discard;
-
-	sock_hold(sk);
-
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_put;
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 1/5] l2tp: lock socket before checking flags in connect()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.

This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 include/net/ipv6.h  |  2 ++
 net/ipv6/datagram.c |  4 +++-
 net/l2tp/l2tp_ip.c  | 19 ++++++++++++-------
 net/l2tp/l2tp_ip6.c | 16 +++++++++++-----
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8fed1cd..f11ca83 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -970,6 +970,8 @@ int compat_ipv6_setsockopt(struct sock *sk, int level, int optname,
 int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen);
 
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *addr,
+			   int addr_len);
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 37874e2..ccf4055 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -139,7 +139,8 @@ void ip6_datagram_release_cb(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
 
-static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr,
+			   int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock	*inet = inet_sk(sk);
@@ -252,6 +253,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(__ip6_datagram_connect);
 
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 982f6c4..1f57094 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -308,21 +308,24 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	struct sockaddr_l2tpip *lsa = (struct sockaddr_l2tpip *) uaddr;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
 	if (ipv4_is_multicast(lsa->l2tp_addr.s_addr))
 		return -EINVAL;
 
-	rc = ip4_datagram_connect(sk, uaddr, addr_len);
-	if (rc < 0)
-		return rc;
-
 	lock_sock(sk);
 
+	/* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip4_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip_lock);
@@ -330,7 +333,9 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	write_unlock_bh(&l2tp_ip_lock);
 
+out_sk:
 	release_sock(sk);
+
 	return rc;
 }
 
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9978d01..af9abff 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -371,9 +371,6 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	int	addr_type;
 	int rc;
 
-	if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
-		return -EINVAL;
-
 	if (addr_len < sizeof(*lsa))
 		return -EINVAL;
 
@@ -390,10 +387,18 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 			return -EINVAL;
 	}
 
-	rc = ip6_datagram_connect(sk, uaddr, addr_len);
-
 	lock_sock(sk);
 
+	 /* Must bind first - autobinding does not work */
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		rc = -EINVAL;
+		goto out_sk;
+	}
+
+	rc = __ip6_datagram_connect(sk, uaddr, addr_len);
+	if (rc < 0)
+		goto out_sk;
+
 	l2tp_ip6_sk(sk)->peer_conn_id = lsa->l2tp_conn_id;
 
 	write_lock_bh(&l2tp_ip6_lock);
@@ -401,6 +406,7 @@ static int l2tp_ip6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	write_unlock_bh(&l2tp_ip6_lock);
 
+out_sk:
 	release_sock(sk);
 
 	return rc;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 3/5] l2tp: fix racy socket lookup in l2tp_ip and l2tp_ip6 bind()
From: Guillaume Nault @ 2016-11-28 19:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

It's not enough to check for sockets bound to same address at the
beginning of l2tp_ip{,6}_bind(): even if no socket is found at that
time, a socket with the same address could be bound before we take
the l2tp lock again.

This patch moves the lookup right before inserting the new socket, so
that no change can ever happen to the list between address lookup and
socket insertion.

Care is taken to avoid side effects on the socket in case of failure.
That is, modifications of the socket are done after the lookup, when
binding is guaranteed to succeed, and before releasing the l2tp lock,
so that concurrent lookups will always see fully initialised sockets.

For l2tp_ip6, the lookup is now always done with the correct bound
device. Before this patch, when binding to a link-local address, the
lookup was done with the original sk->sk_bound_dev_if, which was later
overwritten with addr->l2tp_scope_id. Lookup is now performed with the
final sk->sk_bound_dev_if value.

Finally, the (addr_len >= sizeof(struct sockaddr_in6)) check has been
dropped: addr is a sockaddr_l2tpip6 not sockaddr_in6 and addr_len has
already been checked at this point (this part of the code seems to have
been copy-pasted from net/ipv6/raw.c).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 25 ++++++++++---------------
 net/l2tp/l2tp_ip6.c | 43 ++++++++++++++++++++-----------------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4d1c942..ea818f3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -257,14 +257,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr->l2tp_family != AF_INET)
 		return -EINVAL;
 
-	ret = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip_lock);
-	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
-				  sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-
-	read_unlock_bh(&l2tp_ip_lock);
-
 	lock_sock(sk);
 	if (!sock_flag(sk, SOCK_ZAPPED))
 		goto out;
@@ -282,14 +274,22 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		inet->inet_rcv_saddr = inet->inet_saddr = addr->l2tp_addr.s_addr;
 	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
 		inet->inet_saddr = 0;  /* Use device */
-	sk_dst_reset(sk);
 
+	write_lock_bh(&l2tp_ip_lock);
+	if (__l2tp_ip_bind_lookup(net, addr->l2tp_addr.s_addr,
+				  sk->sk_bound_dev_if, addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip_lock);
+		ret = -EADDRINUSE;
+		goto out;
+	}
+
+	sk_dst_reset(sk);
 	l2tp_ip_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip_lock);
 	sk_add_bind_node(sk, &l2tp_ip_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip_lock);
+
 	ret = 0;
 	sock_reset_flag(sk, SOCK_ZAPPED);
 
@@ -297,11 +297,6 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	release_sock(sk);
 
 	return ret;
-
-out_in_use:
-	read_unlock_bh(&l2tp_ip_lock);
-
-	return ret;
 }
 
 static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index e3fc778..5f2ae61 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -267,6 +267,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_l2tpip6 *addr = (struct sockaddr_l2tpip6 *) uaddr;
 	struct net *net = sock_net(sk);
 	__be32 v4addr = 0;
+	int bound_dev_if;
 	int addr_type;
 	int err;
 
@@ -285,13 +286,6 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (addr_type & IPV6_ADDR_MULTICAST)
 		return -EADDRNOTAVAIL;
 
-	err = -EADDRINUSE;
-	read_lock_bh(&l2tp_ip6_lock);
-	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr,
-				   sk->sk_bound_dev_if, addr->l2tp_conn_id))
-		goto out_in_use;
-	read_unlock_bh(&l2tp_ip6_lock);
-
 	lock_sock(sk);
 
 	err = -EINVAL;
@@ -301,28 +295,25 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (sk->sk_state != TCP_CLOSE)
 		goto out_unlock;
 
+	bound_dev_if = sk->sk_bound_dev_if;
+
 	/* Check if the address belongs to the host. */
 	rcu_read_lock();
 	if (addr_type != IPV6_ADDR_ANY) {
 		struct net_device *dev = NULL;
 
 		if (addr_type & IPV6_ADDR_LINKLOCAL) {
-			if (addr_len >= sizeof(struct sockaddr_in6) &&
-			    addr->l2tp_scope_id) {
-				/* Override any existing binding, if another
-				 * one is supplied by user.
-				 */
-				sk->sk_bound_dev_if = addr->l2tp_scope_id;
-			}
+			if (addr->l2tp_scope_id)
+				bound_dev_if = addr->l2tp_scope_id;
 
 			/* Binding to link-local address requires an
-			   interface */
-			if (!sk->sk_bound_dev_if)
+			 * interface.
+			 */
+			if (!bound_dev_if)
 				goto out_unlock_rcu;
 
 			err = -ENODEV;
-			dev = dev_get_by_index_rcu(sock_net(sk),
-						   sk->sk_bound_dev_if);
+			dev = dev_get_by_index_rcu(sock_net(sk), bound_dev_if);
 			if (!dev)
 				goto out_unlock_rcu;
 		}
@@ -337,13 +328,22 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 	rcu_read_unlock();
 
-	inet->inet_rcv_saddr = inet->inet_saddr = v4addr;
+	write_lock_bh(&l2tp_ip6_lock);
+	if (__l2tp_ip6_bind_lookup(net, &addr->l2tp_addr, bound_dev_if,
+				   addr->l2tp_conn_id)) {
+		write_unlock_bh(&l2tp_ip6_lock);
+		err = -EADDRINUSE;
+		goto out_unlock;
+	}
+
+	inet->inet_saddr = v4addr;
+	inet->inet_rcv_saddr = v4addr;
+	sk->sk_bound_dev_if = bound_dev_if;
 	sk->sk_v6_rcv_saddr = addr->l2tp_addr;
 	np->saddr = addr->l2tp_addr;
 
 	l2tp_ip6_sk(sk)->conn_id = addr->l2tp_conn_id;
 
-	write_lock_bh(&l2tp_ip6_lock);
 	sk_add_bind_node(sk, &l2tp_ip6_bind_table);
 	sk_del_node_init(sk);
 	write_unlock_bh(&l2tp_ip6_lock);
@@ -356,10 +356,7 @@ static int l2tp_ip6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rcu_read_unlock();
 out_unlock:
 	release_sock(sk);
-	return err;
 
-out_in_use:
-	read_unlock_bh(&l2tp_ip6_lock);
 	return err;
 }
 
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 4/5] l2tp: fix lookup for sockets not bound to a device in l2tp_ip
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

When looking up an l2tp socket, we must consider a null netdevice id as
wild card. There are currently two problems caused by
__l2tp_ip_bind_lookup() not considering 'dif' as wild card when set to 0:

  * A socket bound to a device (i.e. with sk->sk_bound_dev_if != 0)
    never receives any packet. Since __l2tp_ip_bind_lookup() is called
    with dif == 0 in l2tp_ip_recv(), sk->sk_bound_dev_if is always
    different from 'dif' so the socket doesn't match.

  * Two sockets, one bound to a device but not the other, can be bound
    to the same address. If the first socket binding to the address is
    the one that is also bound to a device, the second socket can bind
    to the same address without __l2tp_ip_bind_lookup() noticing the
    overlap.

To fix this issue, we need to consider that any null device index, be
it 'sk->sk_bound_dev_if' or 'dif', matches with any other value.
We also need to pass the input device index to __l2tp_ip_bind_lookup()
on reception so that sockets bound to a device never receive packets
from other devices.

This patch fixes l2tp_ip6 in the same way.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip.c  | 6 ++++--
 net/l2tp/l2tp_ip6.c | 7 ++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index ea818f3..aa42e10 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -61,7 +61,8 @@ static struct sock *__l2tp_ip_bind_lookup(struct net *net, __be32 laddr, int dif
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -182,7 +183,8 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 		struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
 
 		read_lock_bh(&l2tp_ip_lock);
-		sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
+		sk = __l2tp_ip_bind_lookup(net, iph->daddr, inet_iif(skb),
+					   tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip_lock);
 			goto discard;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 5f2ae61..4a86440 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -73,7 +73,8 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,
 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
 		    !(addr && ipv6_addr_equal(addr, laddr)) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+		    (!sk->sk_bound_dev_if || !dif ||
+		     sk->sk_bound_dev_if == dif))
 			goto found;
 	}
 
@@ -196,8 +197,8 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
 		struct ipv6hdr *iph = ipv6_hdr(skb);
 
 		read_lock_bh(&l2tp_ip6_lock);
-		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr,
-					    0, tunnel_id);
+		sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, inet6_iif(skb),
+					    tunnel_id);
 		if (!sk) {
 			read_unlock_bh(&l2tp_ip6_lock);
 			goto discard;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net 5/5] l2tp: fix address test in __l2tp_ip6_bind_lookup()
From: Guillaume Nault @ 2016-11-28 19:40 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Chris Elston
In-Reply-To: <cover.1480360512.git.g.nault@alphalink.fr>

The '!(addr && ipv6_addr_equal(addr, laddr))' part of the conditional
matches if addr is NULL or if addr != laddr.
But the intend of __l2tp_ip6_bind_lookup() is to find a sockets with
the same address, so the ipv6_addr_equal() condition needs to be
inverted.

For better clarity and consistency with the rest of the expression, the
(!X || X == Y) notation is used instead of !(X && X != Y).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_ip6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 4a86440..aa821cb 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -72,7 +72,7 @@ static struct sock *__l2tp_ip6_bind_lookup(struct net *net,

 		if ((l2tp->conn_id == tunnel_id) &&
 		    net_eq(sock_net(sk), net) &&
-		    !(addr && ipv6_addr_equal(addr, laddr)) &&
+		    (!addr || ipv6_addr_equal(addr, laddr)) &&
 		    (!sk->sk_bound_dev_if || !dif ||
 		     sk->sk_bound_dev_if == dif))
 			goto found;
-- 
2.10.2

^ permalink raw reply related

* [PATCH net] bpf: fix states equal logic for varlen access
From: Josef Bacik @ 2016-11-28 19:44 UTC (permalink / raw)
  To: davem, netdev, daniel, ast, jannh

If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 kernel/bpf/verifier.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 89f787c..2c8a688 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 {
 	struct bpf_reg_state *rold, *rcur;
 	int i;
+	bool map_access = env->varlen_map_value_access;
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		rold = &old->regs[i];
@@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
 		/* If the ranges were not the same, but everything else was and
 		 * we didn't do a variable access into a map then we are a-ok.
 		 */
-		if (!env->varlen_map_value_access &&
+		if (!map_access &&
 		    rold->type == rcur->type && rold->imm == rcur->imm)
 			continue;
 
+		/* If we didn't map access then again we don't care about the
+		 * mismatched range values and it's ok if our old type was
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 */
 		if (rold->type == NOT_INIT ||
-		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
+		    (!map_access && (rold->type == UNKNOWN_VALUE &&
+				     rcur->type != NOT_INIT)))
 			continue;
 
 		if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next v2 3/4] Documentation: net: phy: Add blurb about RGMII
From: Florian Fainelli @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Mason, Sebastian Frias, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <b6a19ca6-8bda-8573-eb7b-9ed7f80ca480@free.fr>

On 11/28/2016 11:15 AM, Mason wrote:
> On 28/11/2016 18:43, Florian Fainelli wrote:
> 
>> On 11/28/2016 02:34 AM, Sebastian Frias wrote:
>>
>>> For what is worth, the Atheros at803x driver comes with RX delay enabled
>>> as per HW reset.
>>
>> Always, or is this a behavior possibly defined via a stra-pin that can
>> be changed?
> 
> Here's the data sheet:
> 
>   http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf
> 
> 4.2.25 rgmii rx clock delay control
> Offset: 0x00
> bit 15: Control bit for rgmii interface rx clock delay:
> 1 = rgmii rx clock delay enable
> 0 = rgmii rx clock delay disable
> HW Rst. 1
> SW Rst. 1
> 
> As far as I can tell, rx delay is enabled by default, always.
> 
> The "Features" list mentions
> "RGMII timing modes support internal delay and external delay on Rx path"
> (Not sure about the internal vs external distinction.)
> 
> Table 3-6. RGMII AC Characteristics — no Internal Delay
> Table 3-7. RGMII AC Characteristics — with internal delay added (Default)
> 
> Delay is 2 ns apparently.
> 
> There's also
> 4.2.27 Hib ctrl and rgmii gtx clock delay register
> Offset: 0x0B
> 
> bits 6:5 Gtx_dly_val
> Select the delay of gtx_clk.
> 00:0.25ns
> 01:1.3ns
> 10:2.4ns
> 11:3.4ns
> 
> I don't know what that is used for.
> Maybe this is the external vs internal delay?

First, this has little to do with the initial patch series being
discussed now, and second, this still looks like an internal delay
programming, just that it applies to the received transmit clock from
the MAC, that's how I read it though.
-- 
Florian

^ permalink raw reply

* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-28 19:47 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzkaller, Hannes Frederic Sowa, David Miller, Tom Herbert,
	Alexander Duyck, Jiri Benc, Sabrina Dubroca, netdev, LKML
In-Reply-To: <CACT4Y+Yr6UJxXosjx2FO=KtxGr40X07_UBiscs1BP1s8cTaqAA@mail.gmail.com>

On Mon, 2016-11-28 at 20:34 +0100, Dmitry Vyukov wrote:
> On Mon, Nov 28, 2016 at 8:04 PM, 'Andrey Konovalov' via syzkaller

> > Hi Eric,
> >
> > As far as I can see, skb_network_offset() becomes negative after
> > pskb_pull(skb, (u8 *) (fhdr + 1) - skb->data) in nf_ct_frag6_queue().
> > At least I'm able to detect that with a BUG_ON().
> >
> > Also it seems that the issue is only reproducible (at least with the
> > poc I provided) for a short time after boot.
> 
> 
> Eric,
> 
> Is it enough to debug? Or maybe Andrey can trace some values for you.

Well, now we are talking, if you tell me how many modules you load, it
might help ;)

nf_ct_frag6_queue is nowhere to be seen in my kernels, that might
explain why I could not reproduce the bug.

Let me try ;)

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Lance Richardson @ 2016-11-28 19:49 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Sven-Haegar Koch, Eli Cooper, netdev, Eric Dumazet
In-Reply-To: <1428298236.4223309.1480355647336.JavaMail.zimbra@redhat.com>

> From: "Lance Richardson" <lrichard@redhat.com>
> To: "Stephen Rothwell" <sfr@canb.auug.org.au>
> Cc: "Sven-Haegar Koch" <haegar@sdinet.de>, "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric Dumazet"
> <eric.dumazet@gmail.com>
> Sent: Monday, November 28, 2016 12:54:07 PM
> Subject: Re: Large performance regression with 6in4 tunnel (sit)
> 
> > From: "Stephen Rothwell" <sfr@canb.auug.org.au>
> > To: "Sven-Haegar Koch" <haegar@sdinet.de>
> > Cc: "Eli Cooper" <elicooper@gmx.com>, netdev@vger.kernel.org, "Eric
> > Dumazet" <eric.dumazet@gmail.com>
> > Sent: Saturday, November 26, 2016 10:23:40 PM
> > Subject: Re: Large performance regression with 6in4 tunnel (sit)
> > 
> > Hi Sven-Haegar,
> > 
> > On Fri, 25 Nov 2016 05:06:53 +0100 (CET) Sven-Haegar Koch
> > <haegar@sdinet.de>
> > wrote:
> > >
> > > Somehow this problem description really reminds me of a report on
> > > netdev a bit ago, which the following patch fixed:
> > > 
> > > commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
> > > Author: Lance Richardson <lrichard@redhat.com>
> > > Date:   Wed Nov 2 16:36:17 2016 -0400
> > > 
> > >     ipv4: allow local fragmentation in ip_finish_output_gso()
> > >     
> > >     Some configurations (e.g. geneve interface with default
> > >     MTU of 1500 over an ethernet interface with 1500 MTU) result
> > >     in the transmission of packets that exceed the configured MTU.
> > >     While this should be considered to be a "bad" configuration,
> > >     it is still allowed and should not result in the sending
> > >     of packets that exceed the configured MTU.
> > > 
> > > Could this be related?
> > > 
> > > I suppose it would be difficult to test this patch on this machine?
> > 
> > The kernel I am running on is based on 4.7.8, so the above patch
> > doesn't come close to applying. Most fo what it is reverting was
> > introduced in commit 359ebda25aa0 ("net/ipv4: Introduce IPSKB_FRAG_SEGS
> > bit to inet_skb_parm.flags") in v4.8-rc1.
> > 
> > --
> > Cheers,
> > Stephen Rothwell
> > 
> 
> This should be equivalent for 4.7.x:
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4bd4921..8a253e2 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -224,8 +224,7 @@ static int ip_finish_output_gso(struct net *net, struct
> sock *sk,
>         int ret = 0;
>  
>         /* common case: locally created skb or seglen is <= mtu */
> -       if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -             skb_gso_network_seglen(skb) <= mtu)
> +       if (skb_gso_network_seglen(skb) <= mtu)
>                 return ip_finish_output2(net, sk, skb);
>  
>         /* Slowpath -  GSO segment length is exceeding the dst MTU.
> 

BTW, I do think this would be worth trying. For the geneve case, I
measured on the order of a 10X-100X performance hit without this
patch, traces were similar to what you describe (too-large gso packets
were dropped, corresponding TCP segments were retransmitted later via
a non-gso code path).

Regards,

   Lance

^ permalink raw reply

* Re: [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: David Miller @ 2016-11-28 20:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, sfr, elicooper, netdev
In-Reply-To: <1480357935.18162.76.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2016 10:32:15 -0800

> On Mon, 2016-11-28 at 10:17 -0800, Eric Dumazet wrote:
> 
>> I was referring to Stephen bug report.
>> 
>> It appears that Eli changelog was not very precise, because his bug was
>> because of XFRM being involved.
>> 
>> xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
>> 
>> So XFRM calls skb_gso_segment() before ip_output() or
>> ip6_finish_output2() had a chance to change skb->protocol
> 
> So maybe the real fix would be to set skb->protocol at the right place,
> before xfrm can be called, instead of chasing all skb producers ;)

And the key for this would be dst_output() invocations.

^ permalink raw reply

* Re: [PATCH net-next v3 1/3] bpf: Refactor cgroups code in prep for new type
From: Alexei Starovoitov @ 2016-11-28 20:06 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480348130-31354-2-git-send-email-dsa@cumulusnetworks.com>

On Mon, Nov 28, 2016 at 07:48:48AM -0800, David Ahern wrote:
> Code move only; no functional change intended.

not quite...

> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
>   * @sk: The socken sending or receiving traffic
> @@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
>  
>  	prog = rcu_dereference(cgrp->bpf.effective[type]);
>  	if (prog) {
> -		unsigned int offset = skb->data - skb_network_header(skb);
> -
> -		__skb_push(skb, offset);
> -		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
> -		__skb_pull(skb, offset);
> +		switch (type) {
> +		case BPF_CGROUP_INET_INGRESS:
> +		case BPF_CGROUP_INET_EGRESS:
> +			ret = __cgroup_bpf_run_filter_skb(skb, prog);
> +			break;

hmm. what's a point of double jump table? It's only burning cycles
in the fast path. We already have
prog = rcu_dereference(cgrp->bpf.effective[type]); if (prog) {...}
Could you do a variant of __cgroup_bpf_run_filter() instead ?
That doesnt't take 'skb' as an argument.
It will also solve scary looking NULL skb from patch 2:
__cgroup_bpf_run_filter(sk, NULL, ...

and to avoid copy-pasting first dozen lines of current
__cgroup_bpf_run_filter can be moved into a helper that
__cgroup_bpf_run_filter_skb and
__cgroup_bpf_run_filter_sk will call.
Or some other way to rearrange that code.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox