Netdev List
 help / color / mirror / Atom feed
* Re: Large performance regression with 6in4 tunnel (sit)
From: Stephen Rothwell @ 2016-11-25  3:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eli Cooper, netdev
In-Reply-To: <1480042888.8455.527.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Thu, 24 Nov 2016 19:01:28 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Since I do not have this problem at all on my hosts, it could be a buggy
> ethernet driver.
> 
> Could you share what NIC card and driver you are using ?


# uname -a
Linux bilbo 4.7.0-1-amd64 #1 SMP Debian 4.7.8-1 (2016-10-19) x86_64 GNU/Linux

# lspci | grep -i net
03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
04:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

from boot dmesg:

[    7.573725] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.3.0-k
[    7.573726] igb: Copyright (c) 2007-2014 Intel Corporation.
[    7.752918] igb 0000:03:00.0: added PHC on eth0
[    7.752925] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[    7.752927] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:1e:67:9f:d4:24
[    7.753460] igb 0000:03:00.0: eth0: PBA No: 102100-000
[    7.753460] igb 0000:03:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[    7.902433] igb 0000:04:00.0: added PHC on eth1
[    7.902434] igb 0000:04:00.0: Intel(R) Gigabit Ethernet Network Connection
[    7.902435] igb 0000:04:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 00:1e:67:9f:d4:25
[    7.902484] igb 0000:04:00.0: eth1: PBA No: 102100-000
[    7.902485] igb 0000:04:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[   19.753325] igb 0000:03:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Mark Lord @ 2016-11-25  3:49 UTC (permalink / raw)
  To: Francois Romieu
  Cc: David Miller, hayeswang, netdev, nic_swsd, linux-kernel,
	linux-usb
In-Reply-To: <20161125002702.GA14085@electric-eye.fr.zoreil.com>

On 16-11-24 07:27 PM, Francois Romieu wrote:
>
> Through aliasing the URB was given a page that contains said (previously)
> received file. The ethernet chip/usb host does not write anything in it.

I don't see how that could be possible.  Please elaborate.

The URB buffers are statically allocated by the driver at probe time,
ten of them in all, allocated with usb_alloc_coherent() in the copy of
the driver I am testing with.

There is no possibility for them to be used for anything other than
USB receive buffers, for this driver only.  Nothing in the driver
or kernel ever writes to those buffers after initial allocation,
and only the driver and USB host controller ever have pointers to the buffers.
-- 
Mark Lord

^ permalink raw reply

* Re: [scr265482] ip_tunnel.c
From: Liyang Yu (于立洋1) @ 2016-11-25  3:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, netdev@vger.kernel.org

 I accept that the issue is not a CVE candidate. But it's a bug isn't it
 
 Thank you for your suggestions, about the format of mail, and not next time. 

 Everything has its meaning. If sequence number is a joke, why the guys put it into RFC , even implemented the feature.

 And if you means that SMP ( Symmetric Multi Processing )? 



On Thu, Nov 24, 2016 at 5:39 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>
>
>
>
>
>
>
> BTW:
>
>    Which RFC suggests UINT_MAX as GRE sequence number?  Can you show me?
>
>
>
>
>
>


RFC 2890

In any cases, this is absolutely not a security issue nor a CVE candidate.
Please remove security@kernel.org from CC, no need to spam security guys, they have enough on their plate.

Please send text messages, no HTML is allowed on netdev

Nobody sane uses GRE sequence numbers, precisely because GRE has no documented way to synchronize the source and destination of the tunnel.
Basically, if you use GRE sequence numbers, you must re-start other side of the tunnel if one side had to restart, or risk dropping up to 2^31 packets.

Really, this is not something that can be solved by using 'a different initial sequence number'

linux GRE sequence number support is a joke, it does not support SMP for a start.



>
>
>
> On Wed, Nov 23, 2016 at 11:45 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>
> > Yeah,I means that recreate the tunnel again, But I don’t think the
>
> > patch can fix the bug. It only can make the first packet received successed. And the follow packet will droped also.
>
> > In function __gre_xmit  line 366
>
> >   tunnel->o_seqno++;
>
> >
>
> > If you restart from UINT_MAX, the 'o_seqno' of second packet will return to 0 again.
>
>
>
> The first packet after restart: o_seqno == UINT_MAX, the other end: 
> i_seqno = 0 The second packet after restart: o_seqno == 0, the other 
> end: i_seqno = 1
>
>
>
> So traffic should be back to normal.
>
>
>
> UINT_MAX is also what RFC suggests.

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Eric Dumazet @ 2016-11-25  3:54 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Eli Cooper, netdev
In-Reply-To: <20161125140938.24538f97@canb.auug.org.au>

On Fri, 2016-11-25 at 14:09 +1100, Stephen Rothwell wrote:
> Hi Eric,
> 
> On Thu, 24 Nov 2016 19:01:28 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Since I do not have this problem at all on my hosts, it could be a buggy
> > ethernet driver.
> > 
> > Could you share what NIC card and driver you are using ?
> 
> 
> # uname -a
> Linux bilbo 4.7.0-1-amd64 #1 SMP Debian 4.7.8-1 (2016-10-19) x86_64 GNU/Linux
> 
> # lspci | grep -i net
> 03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> 04:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
> 
> from boot dmesg:
> 
> [    7.573725] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.3.0-k
> [    7.573726] igb: Copyright (c) 2007-2014 Intel Corporation.
> [    7.752918] igb 0000:03:00.0: added PHC on eth0
> [    7.752925] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
> [    7.752927] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:1e:67:9f:d4:24
> [    7.753460] igb 0000:03:00.0: eth0: PBA No: 102100-000
> [    7.753460] igb 0000:03:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [    7.902433] igb 0000:04:00.0: added PHC on eth1
> [    7.902434] igb 0000:04:00.0: Intel(R) Gigabit Ethernet Network Connection
> [    7.902435] igb 0000:04:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 00:1e:67:9f:d4:25
> [    7.902484] igb 0000:04:00.0: eth1: PBA No: 102100-000
> [    7.902485] igb 0000:04:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [   19.753325] igb 0000:03:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
> 

Could you now report : 

ethtool -k eth0

Thanks

^ permalink raw reply

* Re: [scr265482] ip_tunnel.c
From: Cong Wang @ 2016-11-25  4:12 UTC (permalink / raw)
  To: Liyang Yu (于立洋1)
  Cc: security@kernel.org, netdev@vger.kernel.org,
	cve-request@mitre.org
In-Reply-To: <F41877A268BCDE49BAEE2286FA8C269F6FECFC5E@LETV-MBX-IDC09.letv.local>

On Thu, Nov 24, 2016 at 5:39 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
> BTW:
>
>    Which RFC suggests UINT_MAX as GRE sequence number?  Can you show me?


https://tools.ietf.org/html/rfc2890

"The receiver maintains the sequence number value of the last successfully
decapsulated packet. Upon establishment of the GRE tunnel, this value
should be set to (2**32)-1."


I agree with Eric here, this is a minor issue, not a serious bug at all.
Please test my patch since you can reproduce it.

BTW, please don't top-post here.

Thanks.

^ permalink raw reply

* Re: [scr265482] ip_tunnel.c
From: Eric Dumazet @ 2016-11-25  4:15 UTC (permalink / raw)
  To: Liyang Yu (于立洋1); +Cc: Cong Wang, netdev@vger.kernel.org
In-Reply-To: <F41877A268BCDE49BAEE2286FA8C269F6FECFD83@LETV-MBX-IDC09.letv.local>

On Thu, Nov 24, 2016 at 7:52 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>  I accept that the issue is not a CVE candidate. But it's a bug isn't it


Bug is in the RFC for not describing how both ends would magically synchronize,
with a sequence protocol which is unidirectional, with no ACK packets
or whatever going back.

With no description, it means that RFC author(s) never considered one
side of the tunnel could die and restart.

Contact the author(s) to get his thoughts maybe ?

>
>  Thank you for your suggestions, about the format of mail, and not next time.
>
>  Everything has its meaning. If sequence number is a joke, why the guys put it into RFC , even implemented the feature.

You will learn that not everything put in RFC or other piece of paper
makes sense.
And the implementation is exactly implementing the RFC bug, because, why not ?

Now if you believe you can make this work, please send patches.


>
>  And if you means that SMP ( Symmetric Multi Processing )?

Yes.

If you enable sequences, GRE performance is abysmal on SMP,
 because linux uses an extra lock, instead of allowing multiple cpus
using GRE tunnel at the same time,
and GSO/TSO are disabled .



>
>
>
> On Thu, Nov 24, 2016 at 5:39 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>>
>>
>>
>>
>>
>>
>>
>> BTW:
>>
>>    Which RFC suggests UINT_MAX as GRE sequence number?  Can you show me?
>>
>>
>>
>>
>>
>>
>
>
> RFC 2890
>
> In any cases, this is absolutely not a security issue nor a CVE candidate.
> Please remove security@kernel.org from CC, no need to spam security guys, they have enough on their plate.
>
> Please send text messages, no HTML is allowed on netdev
>
> Nobody sane uses GRE sequence numbers, precisely because GRE has no documented way to synchronize the source and destination of the tunnel.
> Basically, if you use GRE sequence numbers, you must re-start other side of the tunnel if one side had to restart, or risk dropping up to 2^31 packets.
>
> Really, this is not something that can be solved by using 'a different initial sequence number'
>
> linux GRE sequence number support is a joke, it does not support SMP for a start.
>
>
>
>>
>>
>>
>> On Wed, Nov 23, 2016 at 11:45 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>>
>> > Yeah,I means that recreate the tunnel again, But I don’t think the
>>
>> > patch can fix the bug. It only can make the first packet received successed. And the follow packet will droped also.
>>
>> > In function __gre_xmit  line 366
>>
>> >   tunnel->o_seqno++;
>>
>> >
>>
>> > If you restart from UINT_MAX, the 'o_seqno' of second packet will return to 0 again.
>>
>>
>>
>> The first packet after restart: o_seqno == UINT_MAX, the other end:
>> i_seqno = 0 The second packet after restart: o_seqno == 0, the other
>> end: i_seqno = 1
>>
>>
>>
>> So traffic should be back to normal.
>>
>>
>>
>> UINT_MAX is also what RFC suggests.

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Sven-Haegar Koch @ 2016-11-25  4:06 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Eli Cooper, netdev
In-Reply-To: <20161125134520.68ca0969@canb.auug.org.au>

On Fri, 25 Nov 2016, Stephen Rothwell wrote:

> On Fri, 25 Nov 2016 10:18:12 +0800 Eli Cooper <elicooper@gmx.com> wrote:
> >
> > Sounds like TSO/GSO packets are not properly segmented and therefore
> > dropped.
> > 
> > Could you first try turning off segmentation offloading for the tunnel
> > interface?
> >     ethtool -K sit0 tso off gso off
> 
> On Thu, 24 Nov 2016 18:30:14 -0800 Eric Dumazet <eric.dumazet@gmail.com>
> >
> > You also could try to disable TSO and see if this makes a difference
> > 
> > ethtool -K sixtofour0 tso off
> 
> So turning off tso brings performance up to IPv4 levels ...
> 
> Thanks for that, it solves my immediate problem.

Somehow this problem description really reminds me of a report on 
netdev a bit ago, which the following patch fixed:

commit 9ee6c5dc816aa8256257f2cd4008a9291ec7e985
Author: Lance Richardson <lrichard@redhat.com>
Date:   Wed Nov 2 16:36:17 2016 -0400

    ipv4: allow local fragmentation in ip_finish_output_gso()
    
    Some configurations (e.g. geneve interface with default
    MTU of 1500 over an ethernet interface with 1500 MTU) result
    in the transmission of packets that exceed the configured MTU.
    While this should be considered to be a "bad" configuration,
    it is still allowed and should not result in the sending
    of packets that exceed the configured MTU.

Could this be related?

I suppose it would be difficult to test this patch on this machine?

c'ya
sven-haegar

-- 
Three may keep a secret, if two of them are dead.
- Ben F.

^ permalink raw reply

* [PATCH net-next] virtio-net: enable multiqueue by default
From: Jason Wang @ 2016-11-25  4:37 UTC (permalink / raw)
  To: virtualization, netdev, linux-kernel
  Cc: Neil Horman, Jeremy Eder, Michael S . Tsirkin,
	Hannes Frederic Sowa, Marko Myllynen, Maxime Coquelin

We use single queue even if multiqueue is enabled and let admin to
enable it through ethtool later. This is used to avoid possible
regression (small packet TCP stream transmission). But looks like an
overkill since:

- single queue user can disable multiqueue when launching qemu
- brings extra troubles for the management since it needs extra admin
  tool in guest to enable multiqueue
- multiqueue performs much better than single queue in most of the
  cases

So this patch enables multiqueue by default: if #queues is less than or
equal to #vcpu, enable as much as queue pairs; if #queues is greater
than #vcpu, enable #vcpu queue pairs.

Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: Jeremy Eder <jeder@redhat.com>
Cc: Marko Myllynen <myllynen@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d4ac7a6..a21d93a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1886,8 +1886,11 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->any_header_sg)
 		dev->needed_headroom = vi->hdr_len;
 
-	/* Use single tx/rx queue pair as default */
-	vi->curr_queue_pairs = 1;
+	/* Enable multiqueue by default */
+	if (num_online_cpus() >= max_queue_pairs)
+		vi->curr_queue_pairs = max_queue_pairs;
+	else
+		vi->curr_queue_pairs = num_online_cpus();
 	vi->max_queue_pairs = max_queue_pairs;
 
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
@@ -1918,6 +1921,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		goto free_unregister_netdev;
 	}
 
+	virtnet_set_affinity(vi);
+
 	/* Assume link up if device can't report link status,
 	   otherwise get link status from config. */
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next] virtio-net: enable multiqueue by default
From: Michael S. Tsirkin @ 2016-11-25  4:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Neil Horman, Jeremy Eder, Hannes Frederic Sowa, netdev,
	linux-kernel, virtualization, Marko Myllynen, Maxime Coquelin
In-Reply-To: <1480048646-17536-1-git-send-email-jasowang@redhat.com>

On Fri, Nov 25, 2016 at 12:37:26PM +0800, Jason Wang wrote:
> We use single queue even if multiqueue is enabled and let admin to
> enable it through ethtool later. This is used to avoid possible
> regression (small packet TCP stream transmission). But looks like an
> overkill since:
> 
> - single queue user can disable multiqueue when launching qemu
> - brings extra troubles for the management since it needs extra admin
>   tool in guest to enable multiqueue
> - multiqueue performs much better than single queue in most of the
>   cases
> 
> So this patch enables multiqueue by default: if #queues is less than or
> equal to #vcpu, enable as much as queue pairs; if #queues is greater
> than #vcpu, enable #vcpu queue pairs.
> 
> Cc: Hannes Frederic Sowa <hannes@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Neil Horman <nhorman@redhat.com>
> Cc: Jeremy Eder <jeder@redhat.com>
> Cc: Marko Myllynen <myllynen@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

OK at some level but all uses of num_online_cpus()
like this are racy versus hotplug.
I know we already have this bug but shouldn't we fix it
before we add more?


> ---
>  drivers/net/virtio_net.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d4ac7a6..a21d93a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1886,8 +1886,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (vi->any_header_sg)
>  		dev->needed_headroom = vi->hdr_len;
>  
> -	/* Use single tx/rx queue pair as default */
> -	vi->curr_queue_pairs = 1;
> +	/* Enable multiqueue by default */
> +	if (num_online_cpus() >= max_queue_pairs)
> +		vi->curr_queue_pairs = max_queue_pairs;
> +	else
> +		vi->curr_queue_pairs = num_online_cpus();
>  	vi->max_queue_pairs = max_queue_pairs;
>  
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> @@ -1918,6 +1921,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		goto free_unregister_netdev;
>  	}
>  
> +	virtnet_set_affinity(vi);
> +
>  	/* Assume link up if device can't report link status,
>  	   otherwise get link status from config. */
>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: enable multiqueue by default
From: Jason Wang @ 2016-11-25  5:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Neil Horman, Jeremy Eder, Hannes Frederic Sowa, netdev,
	linux-kernel, virtualization, Marko Myllynen, Maxime Coquelin
In-Reply-To: <20161125064201-mutt-send-email-mst@kernel.org>



On 2016年11月25日 12:43, Michael S. Tsirkin wrote:
> On Fri, Nov 25, 2016 at 12:37:26PM +0800, Jason Wang wrote:
>> >We use single queue even if multiqueue is enabled and let admin to
>> >enable it through ethtool later. This is used to avoid possible
>> >regression (small packet TCP stream transmission). But looks like an
>> >overkill since:
>> >
>> >- single queue user can disable multiqueue when launching qemu
>> >- brings extra troubles for the management since it needs extra admin
>> >   tool in guest to enable multiqueue
>> >- multiqueue performs much better than single queue in most of the
>> >   cases
>> >
>> >So this patch enables multiqueue by default: if #queues is less than or
>> >equal to #vcpu, enable as much as queue pairs; if #queues is greater
>> >than #vcpu, enable #vcpu queue pairs.
>> >
>> >Cc: Hannes Frederic Sowa<hannes@redhat.com>
>> >Cc: Michael S. Tsirkin<mst@redhat.com>
>> >Cc: Neil Horman<nhorman@redhat.com>
>> >Cc: Jeremy Eder<jeder@redhat.com>
>> >Cc: Marko Myllynen<myllynen@redhat.com>
>> >Cc: Maxime Coquelin<maxime.coquelin@redhat.com>
>> >Signed-off-by: Jason Wang<jasowang@redhat.com>
> OK at some level but all uses of num_online_cpus()
> like this are racy versus hotplug.
> I know we already have this bug but shouldn't we fix it
> before we add more?

Not sure I get the point, do you mean adding get/put_online_cpus()? But 
is it a real bug? We don't do any cpu specific things so I believe it's 
not necessary (unless we want to keep #queues == #vcpus magically but I 
don't think so). Admin need to re-configure #queues after cpu hotplug if 
they wish.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: 答复: [scr265482] ip_tunnel.c
From: Cong Wang @ 2016-11-25  5:41 UTC (permalink / raw)
  To: Liyang Yu (于立洋1)
  Cc: Eric Dumazet, Linux Kernel Network Developers
In-Reply-To: <F41877A268BCDE49BAEE2286FA8C269F6FECFDBC@LETV-MBX-IDC09.letv.local>

On Thu, Nov 24, 2016 at 9:11 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>> Please test my patch since you can reproduce it.
>
>         Thanks cong, but Eric had said: "Really, this is not something that can be solved by using 'a different initial sequence number'"
>         So I don't think it's nessary to test you patch, anyway thank you very much.

That is correct, I missed the signed cast, it is similar to the jiffies test,
for the overflow case, which means no matter which value we pick,
we could drop packet as it is.

^ permalink raw reply

* Re: [scr265482] ip_tunnel.c
From: Liyang Yu (于立洋1) @ 2016-11-25  5:43 UTC (permalink / raw)
  To: Cong Wang; +Cc: Eric Dumazet, Linux Kernel Network Developers

Yeah you got it. Great!!!

On Thu, Nov 24, 2016 at 9:11 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>> Please test my patch since you can reproduce it.
>
>         Thanks cong, but Eric had said: "Really, this is not something that can be solved by using 'a different initial sequence number'"
>         So I don't think it's nessary to test you patch, anyway thank you very much.

That is correct, I missed the signed cast, it is similar to the jiffies test, for the overflow case, which means no matter which value we pick, we could drop packet as it is.

^ permalink raw reply

* Re: [scr265482] ip_tunnel.c
From: Liyang Yu (于立洋1) @ 2016-11-25  5:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, netdev@vger.kernel.org

>Bug is in the RFC for not describing how both ends would magically synchronize, with a sequence protocol which is unidirectional, with no ACK packets or whatever going back.
>With no description, it means that RFC author(s) never considered one side of the tunnel could die and restart.
>Contact the author(s) to get his thoughts maybe ?

 	I agree with you, but I don't know how to contacted the author(s). If any news about this, it will be grateful let me know.  

>You will learn that not everything put in RFC or other piece of paper makes sense.
>And the implementation is exactly implementing the RFC bug, because, why not ?

>Now if you believe you can make this work, please send patches.

	Now I needs to regain a sense of the RFC. Thank you.
    I had thought about this issue, but GRE didn't encrypt the SEQ field, so it's difficult to let the other end to update the local SEQ, When the peer has restarted.
    Because the attacker can hack the packet easily.
    Maybe do not use the SEQ is a good choice.


In other words If the attacker reply the packets with sequence number bigger than the normal packets, can the tunnel work well?  haw-haw, is this condition able to meet
a CVE. 




On Thu, Nov 24, 2016 at 7:52 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>  I accept that the issue is not a CVE candidate. But it's a bug isn't 
> it


Bug is in the RFC for not describing how both ends would magically synchronize, with a sequence protocol which is unidirectional, with no ACK packets or whatever going back.

With no description, it means that RFC author(s) never considered one side of the tunnel could die and restart.

Contact the author(s) to get his thoughts maybe ?

>
>  Thank you for your suggestions, about the format of mail, and not next time.
>
>  Everything has its meaning. If sequence number is a joke, why the guys put it into RFC , even implemented the feature.

You will learn that not everything put in RFC or other piece of paper makes sense.
And the implementation is exactly implementing the RFC bug, because, why not ?

Now if you believe you can make this work, please send patches.


>
>  And if you means that SMP ( Symmetric Multi Processing )?

Yes.

If you enable sequences, GRE performance is abysmal on SMP,  because linux uses an extra lock, instead of allowing multiple cpus using GRE tunnel at the same time, and GSO/TSO are disabled .



>
>
>
> On Thu, Nov 24, 2016 at 5:39 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>>
>>
>>
>>
>>
>>
>>
>> BTW:
>>
>>    Which RFC suggests UINT_MAX as GRE sequence number?  Can you show me?
>>
>>
>>
>>
>>
>>
>
>
> RFC 2890
>
> In any cases, this is absolutely not a security issue nor a CVE candidate.
> Please remove security@kernel.org from CC, no need to spam security guys, they have enough on their plate.
>
> Please send text messages, no HTML is allowed on netdev
>
> Nobody sane uses GRE sequence numbers, precisely because GRE has no documented way to synchronize the source and destination of the tunnel.
> Basically, if you use GRE sequence numbers, you must re-start other side of the tunnel if one side had to restart, or risk dropping up to 2^31 packets.
>
> Really, this is not something that can be solved by using 'a different initial sequence number'
>
> linux GRE sequence number support is a joke, it does not support SMP for a start.
>
>
>
>>
>>
>>
>> On Wed, Nov 23, 2016 at 11:45 PM, Liyang Yu (于立洋1) <yuliyang1@le.com> wrote:
>>
>> > Yeah,I means that recreate the tunnel again, But I don’t think the
>>
>> > patch can fix the bug. It only can make the first packet received successed. And the follow packet will droped also.
>>
>> > In function __gre_xmit  line 366
>>
>> >   tunnel->o_seqno++;
>>
>> >
>>
>> > If you restart from UINT_MAX, the 'o_seqno' of second packet will return to 0 again.
>>
>>
>>
>> The first packet after restart: o_seqno == UINT_MAX, the other end:
>> i_seqno = 0 The second packet after restart: o_seqno == 0, the other
>> end: i_seqno = 1
>>
>>
>>
>> So traffic should be back to normal.
>>
>>
>>
>> UINT_MAX is also what RFC suggests.

^ permalink raw reply

* [PATCH net] sit: Set skb->protocol properly in ipip6_tunnel_xmit()
From: Eli Cooper @ 2016-11-25  5:50 UTC (permalink / raw)
  To: netdev, David S . Miller, Stephen Rothwell

Similar to commit ae148b085876
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"),
sit tunnels also need to update skb->protocol; otherwise, TSO/GSO packets
might not be properly segmented, which causes the packets being dropped.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Eli Cooper <elicooper@gmx.com>
Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
---
 net/ipv6/sit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index b1cdf80..a05dceb 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -972,6 +972,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 	}
 
 	skb_set_inner_ipproto(skb, IPPROTO_IPV6);
+	skb->protocol = htons(ETH_P_IP);
 
 	iptunnel_xmit(NULL, rt, skb, fl4.saddr, fl4.daddr, protocol, tos, ttl,
 		      df, !net_eq(tunnel->net, dev_net(dev)));
-- 
2.10.2

^ permalink raw reply related

* Re: Large performance regression with 6in4 tunnel (sit)
From: Eli Cooper @ 2016-11-25  6:05 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: netdev
In-Reply-To: <20161125134520.68ca0969@canb.auug.org.au>

Hi Stephen,

On 2016/11/25 10:45, Stephen Rothwell wrote:
> Hi Eli,
>
> On Fri, 25 Nov 2016 10:18:12 +0800 Eli Cooper <elicooper@gmx.com> wrote:
>> Sounds like TSO/GSO packets are not properly segmented and therefore
>> dropped.
>>
>> Could you first try turning off segmentation offloading for the tunnel
>> interface?
>>     ethtool -K sit0 tso off gso off
> On Thu, 24 Nov 2016 18:30:14 -0800 Eric Dumazet <eric.dumazet@gmail.com>
>> You also could try to disable TSO and see if this makes a difference
>>
>> ethtool -K sixtofour0 tso off
> So turning off tso brings performance up to IPv4 levels ...
>
> Thanks for that, it solves my immediate problem.

I think this is similar to the bug I fixed in commit ae148b085876
("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").

I can reproduce a similar problem by applying xfrm to sit traffic.
TSO/GSO packets are dropped when IPSec is enabled, and IPv6 throughput
drops to 10s of Kbps. I am not sure if this is the same issue you
experienced, but I wrote a patch that fixed at least the issue I had.

Could you test the patch I sent to the mailing list just now?

Thanks,
Eli

^ permalink raw reply

* RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Hayes Wang @ 2016-11-25  6:11 UTC (permalink / raw)
  To: Mark Lord, netdev@vger.kernel.org
  Cc: nic_swsd, linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <cdb64b21-a590-8c4f-50b1-757302d7b528@pobox.com>

Mark Lord [mailto:mlord@pobox.com]
> Sent: Thursday, November 24, 2016 11:25 PM
[...]
> x86 has near fully-coherent memory, so it is the "easy" platform
> to get things working on.  But Linux supports a very diverse number
> of platforms, with varying degrees of cache/memory coherency,
> and it can be tricky for things to work correctly on all of them.

However, I have test iperf on raspberry pi v1 which you suggest
for more than one day. I still couldn't reproduce your issue.

> If you are testing with the driver as currently in 4.4.34,
> then you won't even notice when things are screwing up,
> because the driver just silently drops packets.
> Or it passes them on without noticing that they have bad data.

I only drop the packet silently when the rx descriptor outside
the urb buffer. Then, I check the rx descriptor before checking
the length of the packet.

> Here (attached) is the instrumented driver I am using here now.
> I suggest you use it or something similar when testing,
> and not the stock driver.

I would test it again with your driver.

[...]
> Also, unrelated, but inside r8152_submit_rx() there is this code:
> 
>          /* The rx would be stopped, so skip submitting */
>          if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
>              !test_bit(WORK_ENABLE, &tp->flags)
> || !netif_carrier_ok(tp->netdev))
>                 return 0;
> 
> If that "return 0" statement is ever executed, doesn't it result
> in the loss/leak of a buffer?

They would be found back by calling rtl_start_rx(), when the rx
is restarted.

Best Regards,
Hayes

^ permalink raw reply

* Re: Large performance regression with 6in4 tunnel (sit)
From: Stephen Rothwell @ 2016-11-25  6:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eli Cooper, netdev
In-Reply-To: <1480046044.8455.529.camel@edumazet-glaptop3.roam.corp.google.com>

Hi Eric,

On Thu, 24 Nov 2016 19:54:04 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> Could you now report : 
> 
> ethtool -k eth0

Features for eth0:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: on
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off [fixed]


-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH net-next] net/sched: cls_flower: verify root pointer before dereferncing it
From: Cong Wang @ 2016-11-25  6:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David Miller, Jiri Pirko, Roi Dayan,
	Linux Kernel Network Developers, Jiri Pirko, Or Gerlitz,
	Cong Wang, John Fastabend, Alexei Starovoitov
In-Reply-To: <58378309.2090101@iogearbox.net>

On Thu, Nov 24, 2016 at 4:17 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>
> I'm not sure if setting a dummy object for each affected classifier is
> making things better. Are you having this in mind as a target for -net?
>
> We do kfree_rcu() the head (tp->root) and likewise do we kfree_rcu() the
> tp immediately after the callback. The head object's lifetime for such
> classifiers is thus always bound to the tp lifetime. And besides that,
> nothing apart from get() checks whether it's actually really NULL (and
> that check in get() is odd anyway; some cls do it, some don't).
>

Excellent point.

I thought we should exclude any parallel readers when we call destroy(),
you are taking a different approach by observing we only have to exclude
readers when we really free them, this seems fine to me after a second
thought, because the RCU API should take care of races with readers so
as long as we free everything in RCU callback we are good. Hmm...

But I may miss something since I am not an RCU expert.

[...]
>
> (Btw, matchall is still broken besides this fix. mall_delete() sets the
>  RCU_INIT_POINTER(head->filter, NULL), so that a mall_delete() plus
>  mall_destroy() combo doesn't free head->filter twice, but doing that is
>  racy with mall_classify() where head->filter is dereferenced unchecked.
>  Requires additional fix.)

This seems due to matchall only has one filter per tp. But you don't need
to worry since readers never read a freed pointer, right?

^ permalink raw reply

* RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Hayes Wang @ 2016-11-25  6:31 UTC (permalink / raw)
  To: Mark Lord, David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, nic_swsd,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <23e0c132-8844-0a34-3e0b-e412f76493ba-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Mark Lord [mailto:mlord-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org]
> Sent: Friday, November 25, 2016 12:44 AM
[...]
> The bad data in this case is ASCII:
> 
>          "SRC=m3400:/ TARGET=/m340"
> 
> This data is what is seen in /run/mount/utab, a file that is read/written over NFS on
> each boot.
> 
>          "SRC=m3400:/ TARGET=/m3400 ROOT=/
> ATTRS=nolock,addr=192.168.8.1\n"
> 
> But how does this ASCII data end up at offset zero of the rx buffer??
> Not possible -- this isn't even stale data, because only an rx_desc could
> be at that offset in that buffer.
> 
> So even if this were a platform memory coherency issue, one should still
> never see ASCII data at the beginning of an rx buffer.  The driver NEVER
> writes anything to the rx buffers.  Only the USB hardware ever does.
> 
> And only the r8152 dongle/driver exhibits this issue.
> Other USB dongles do not.  They *might* still have such issues,
> but because they use software checksums, the bad packets are caught/rejected.

Do you test it by rebooting? Maybe you could try a patch
commit 93fe9b183840 ("r8152: reset the bmu"). However, it should
only occur for the first urb buffer after rx is reset. I don't
think you would reset the rx frequently, so the situation seems
to be different.

Best Regards,
Hayes

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
From: Hayes Wang @ 2016-11-25  6:51 UTC (permalink / raw)
  To: Mark Lord, David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, nic_swsd,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB2010562B1-JIZ+AM9kKNzuvTFwvkocLypo8c9IxeqyAjHCUHv49ws@public.gmane.org>

> Mark Lord [mailto:mlord-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org]
> > Sent: Friday, November 25, 2016 12:44 AM
> [...]
> > The bad data in this case is ASCII:
> >
> >          "SRC=m3400:/ TARGET=/m340"
> >
> > This data is what is seen in /run/mount/utab, a file that is read/written over NFS
> on
> > each boot.
> >
> >          "SRC=m3400:/ TARGET=/m3400 ROOT=/
> > ATTRS=nolock,addr=192.168.8.1\n"
> >
> > But how does this ASCII data end up at offset zero of the rx buffer??
> > Not possible -- this isn't even stale data, because only an rx_desc could
> > be at that offset in that buffer.
> >
> > So even if this were a platform memory coherency issue, one should still
> > never see ASCII data at the beginning of an rx buffer.  The driver NEVER
> > writes anything to the rx buffers.  Only the USB hardware ever does.
> >
> > And only the r8152 dongle/driver exhibits this issue.
> > Other USB dongles do not.  They *might* still have such issues,
> > but because they use software checksums, the bad packets are caught/rejected.
> 
> Do you test it by rebooting? Maybe you could try a patch
> commit 93fe9b183840 ("r8152: reset the bmu"). However, it should
> only occur for the first urb buffer after rx is reset. I don't
> think you would reset the rx frequently, so the situation seems
> to be different.

Forgive me. I provide wrong information. This is about RTL8153,
not RTL8152.

Best Regards,
Hayes

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 3/3] flowcache: Increase threshold for refusing new allocations
From: Steffen Klassert @ 2016-11-25  6:58 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1480057080-20177-1-git-send-email-steffen.klassert@secunet.com>

From: Miroslav Urbanek <mu@miroslavurbanek.com>

The threshold for OOM protection is too small for systems with large
number of CPUs. Applications report ENOBUFs on connect() every 10
minutes.

The problem is that the variable net->xfrm.flow_cache_gc_count is a
global counter while the variable fc->high_watermark is a per-CPU
constant. Take the number of CPUs into account as well.

Fixes: 6ad3122a08e3 ("flowcache: Avoid OOM condition under preasure")
Reported-by: Lukáš Koldrt <lk@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/core/flow.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/core/flow.c b/net/core/flow.c
index 3937b1b..18e8893 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -95,7 +95,6 @@ static void flow_cache_gc_task(struct work_struct *work)
 	list_for_each_entry_safe(fce, n, &gc_list, u.gc_list) {
 		flow_entry_kill(fce, xfrm);
 		atomic_dec(&xfrm->flow_cache_gc_count);
-		WARN_ON(atomic_read(&xfrm->flow_cache_gc_count) < 0);
 	}
 }
 
@@ -236,9 +235,8 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 		if (fcp->hash_count > fc->high_watermark)
 			flow_cache_shrink(fc, fcp);
 
-		if (fcp->hash_count > 2 * fc->high_watermark ||
-		    atomic_read(&net->xfrm.flow_cache_gc_count) > fc->high_watermark) {
-			atomic_inc(&net->xfrm.flow_cache_genid);
+		if (atomic_read(&net->xfrm.flow_cache_gc_count) >
+		    2 * num_online_cpus() * fc->high_watermark) {
 			flo = ERR_PTR(-ENOBUFS);
 			goto ret_object;
 		}
-- 
1.9.1

^ permalink raw reply related

* [PATCH 2/3] xfrm: unbreak xfrm_sk_policy_lookup
From: Steffen Klassert @ 2016-11-25  6:57 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1480057080-20177-1-git-send-email-steffen.klassert@secunet.com>

From: Florian Westphal <fw@strlen.de>

if we succeed grabbing the refcount, then
  if (err && !xfrm_pol_hold_rcu)

will evaluate to false so this hits last else branch which then
sets policy to ERR_PTR(0).

Fixes: ae33786f73a7ce ("xfrm: policy: only use rcu in xfrm_sk_policy_lookup")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_policy.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index fd69866..5bf7e1bf 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1268,12 +1268,14 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir,
 			err = security_xfrm_policy_lookup(pol->security,
 						      fl->flowi_secid,
 						      policy_to_flow_dir(dir));
-			if (!err && !xfrm_pol_hold_rcu(pol))
-				goto again;
-			else if (err == -ESRCH)
+			if (!err) {
+				if (!xfrm_pol_hold_rcu(pol))
+					goto again;
+			} else if (err == -ESRCH) {
 				pol = NULL;
-			else
+			} else {
 				pol = ERR_PTR(err);
+			}
 		} else
 			pol = NULL;
 	}
-- 
1.9.1

^ permalink raw reply related

* pull request (net): ipsec 2016-11-25
From: Steffen Klassert @ 2016-11-25  6:57 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev

1) Fix a refcount leak in vti6.
   From Nicolas Dichtel.

2) Fix a wrong if statement in xfrm_sk_policy_lookup.
   From Florian Westphal.

3) The flowcache watermarks are per cpu. Take this into
   account when comparing to the threshold where we
   refusing new allocations. From Miroslav Urbanek.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit d24cd733bae8fc6c121c437b3197ab7f3930ca66:

  Merge branch 'be2net-fixes' (2016-10-09 09:30:45 -0400)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master

for you to fetch changes up to 6b226487815574193c1da864f2eac274781a2b0c:

  flowcache: Increase threshold for refusing new allocations (2016-11-23 06:37:09 +0100)

----------------------------------------------------------------
Florian Westphal (1):
      xfrm: unbreak xfrm_sk_policy_lookup

Miroslav Urbanek (1):
      flowcache: Increase threshold for refusing new allocations

Nicolas Dichtel (1):
      vti6: flush x-netns xfrm cache when vti interface is removed

 net/core/flow.c        |  6 ++----
 net/ipv6/ip6_vti.c     | 31 +++++++++++++++++++++++++++++++
 net/xfrm/xfrm_policy.c | 10 ++++++----
 3 files changed, 39 insertions(+), 8 deletions(-)

^ permalink raw reply

* [PATCH 1/3] vti6: flush x-netns xfrm cache when vti interface is removed
From: Steffen Klassert @ 2016-11-25  6:57 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <1480057080-20177-1-git-send-email-steffen.klassert@secunet.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>

This is the same fix than commit a5d0dc810abf ("vti: flush x-netns xfrm
cache when vti interface is removed")

This patch fixes a refcnt problem when a x-netns vti6 interface is removed:
unregister_netdevice: waiting for vti6_test to become free. Usage count = 1

Here is a script to reproduce the problem:

ip link set dev ntfp2 up
ip addr add dev ntfp2 2001::1/64
ip link add vti6_test type vti6 local 2001::1 remote 2001::2 key 1
ip netns add secure
ip link set vti6_test netns secure
ip netns exec secure ip link set vti6_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev vti6_test 2003::1/64
ip -6 xfrm policy add dir out tmpl src 2001::1 dst 2001::2 proto esp \
	   mode tunnel mark 1
ip -6 xfrm policy add dir in tmpl src 2001::2 dst 2001::1 proto esp \
	   mode tunnel mark 1
ip xfrm state add src 2001::1 dst 2001::2 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip xfrm state add src 2001::2 dst 2001::1 proto esp spi 1 mode tunnel \
	   enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1
ip netns exec secure  ping6 -c 4 2003::2
ip netns del secure

CC: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv6/ip6_vti.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 8a02ca8..c299c1e 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -1138,6 +1138,33 @@ static struct xfrm6_protocol vti_ipcomp6_protocol __read_mostly = {
 	.priority	=	100,
 };
 
+static bool is_vti6_tunnel(const struct net_device *dev)
+{
+	return dev->netdev_ops == &vti6_netdev_ops;
+}
+
+static int vti6_device_event(struct notifier_block *unused,
+			     unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct ip6_tnl *t = netdev_priv(dev);
+
+	if (!is_vti6_tunnel(dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_DOWN:
+		if (!net_eq(t->net, dev_net(dev)))
+			xfrm_garbage_collect(t->net);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block vti6_notifier_block __read_mostly = {
+	.notifier_call = vti6_device_event,
+};
+
 /**
  * vti6_tunnel_init - register protocol and reserve needed resources
  *
@@ -1148,6 +1175,8 @@ static int __init vti6_tunnel_init(void)
 	const char *msg;
 	int err;
 
+	register_netdevice_notifier(&vti6_notifier_block);
+
 	msg = "tunnel device";
 	err = register_pernet_device(&vti6_net_ops);
 	if (err < 0)
@@ -1180,6 +1209,7 @@ xfrm_proto_ah_failed:
 xfrm_proto_esp_failed:
 	unregister_pernet_device(&vti6_net_ops);
 pernet_dev_failed:
+	unregister_netdevice_notifier(&vti6_notifier_block);
 	pr_err("vti6 init: failed to register %s\n", msg);
 	return err;
 }
@@ -1194,6 +1224,7 @@ static void __exit vti6_tunnel_cleanup(void)
 	xfrm6_protocol_deregister(&vti_ah6_protocol, IPPROTO_AH);
 	xfrm6_protocol_deregister(&vti_esp6_protocol, IPPROTO_ESP);
 	unregister_pernet_device(&vti6_net_ops);
+	unregister_netdevice_notifier(&vti6_notifier_block);
 }
 
 module_init(vti6_tunnel_init);
-- 
1.9.1

^ permalink raw reply related

* Svare.//..
From: Santander Bank Plc @ 2016-11-25  6:40 UTC (permalink / raw)
  To: Recipients

Mr. Steve Bhatti,
Drift / regionsjef
Santander Bank Plc,
47-48 Piccadilly
PICCADILLY
W1J0DT
London, Storbritannia
 
Good Day Kjære,

Hvordan har du og familien det? Jeg håper mitt brev møter deg i ditt beste humør i dag. Jeg er Dr. Steve Bhatti, fra Harlesden North West London, leder for regnskap Revisjonen og formell senior programmerer sjef hos Deutsche bank, her i England (Santander Bank Plc). Jeg har også jobbet for Nyeste ministrene Bank Plc.
 
Jeg trenger din haster hjelp til å overføre summen av (£ 21,5 GBP) Twenty One Million, fem tusen britiske pund på kontoen din innen 11 eller 15 bankdager. Disse pengene har vært sovende i mange år i vår Bank uten et krav. Jeg ønsker banken å frigjøre penger til deg som nærmeste person til vår avdøde kunden (eieren av kontoen) som dessverre mistet livet i februar 2003 gjennom det sørlige USA romfergen Columbia, døde han en enkelt mann. Jeg oppdaget at han ikke spesifiserer noen pårørende eller vilje mottaker til hans konto etter å ha gått gjennom hans fil i vår bank, og jeg ønsker ikke penger til å gå inn i Bank treasury konto som en forlatt fondet. Så dette er grunnen til at jeg kontakter deg slik at banken kan frigjøre penger til deg som skal pårørende / Will mottaker til den avdøde kunde
 n (Late David McDowell Brown) en amerikansk statsborger av Arlington Virginia. Vær så snill, jeg vil også gjerne at du skal holde Forslaget som en topp hemmelig og slette den hvis du ikke er interessert.
 
Her er deling ratio; 50% til meg og 40% til deg mens 10% er for eventuelle utgifter i løpet av transaksjonen. Hvis du er interessert, ta kontakt med meg umiddelbart via min private e-post: steve.s.bhatti@gmail.com

Gi meg følgende under, som vi har 7 dager for å kjøre den gjennom. Dette er veldig HASTER.
 
1. Fullt navn:
2. Direkte Mobilnummer:
3. Kontakt Adresse:
4. Yrke:
5. Alder:
6. Kjønn:
7. Nasjonalitet:

Dette er for å aktivere meg laste opp informasjon i vår bank database for å reflektere i banken nettverk system du den navngitte
pårørende / vilje mottaker av denne kontoen, så vil jeg lede deg på kommunikasjon med banken for videreformidling av dette fondet til deg.
MERK: Vi har noen bankdager for å utføre denne transaksjonen avtale.
Takk i påvente av umiddelbar respons.
 
Vennlig hilsen,
Dr. Steve Bhatti.
Tel: +447042043211

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox