Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] net/sched: cls_api: add missing validation of netlink attributes
From: David Ahern @ 2018-10-09 17:45 UTC (permalink / raw)
  To: Davide Caratti, David S. Miller, Jamal Hadi Salim; +Cc: netdev
In-Reply-To: <a40c29064d9e1546b190254432b82b6a19430568.camel@redhat.com>

On 10/9/18 10:12 AM, Davide Caratti wrote:
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -37,6 +37,11 @@ static LIST_HEAD(tcf_proto_base);
>>>  /* Protects list of registered TC modules. It is pure SMP lock. */
>>>  static DEFINE_RWLOCK(cls_mod_lock);
>>>  
>>> +const struct nla_policy cls_tca_policy[TCA_MAX + 1] = {
>>> +	[TCA_KIND]	= { .type = NLA_STRING },
>>> +	[TCA_CHAIN]	= { .type = NLA_U32 },
>>> +};
>>> +
>>
> 
>> it be nice to have a tc_common module so this stuff does not have to be
>> defined multiple times.
> 
> it makes sense to avoid duplicating the declaration of that array. But I
> don't think we can put it in a module, because CONFIG_NET_SCHED is 'bool'
> and
> 
> obj-$(CONFIG_NET_SCHED)         += sch_api.o
> 
> I can try a v2 where 'rtm_tca_policy' symbol in sch_api is exported and
> used by cls_api.c code. WDYT?

since NET_SCHED is a bool, that should work.

^ permalink raw reply

* [sky2 driver] 88E8056 PCI-E Gigabit Ethernet Controller not working after suspend
From: Laurent Bigonville @ 2018-10-09 17:30 UTC (permalink / raw)
  To: netdev

Hello,

On my desktop (Asus MB with dual Ethernet port), when waking up after 
suspend, the network card is not detecting the link.

I have to rmmod the sky2 driver and then modprobing it again.

lspci shows me:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
Gigabit Ethernet Controller (rev 12)
05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
Gigabit Ethernet Controller (rev 12)

An idea what's wrong here?

Kind regards,

Laurent Bigonville

^ permalink raw reply

* Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Eric Dumazet @ 2018-10-09 17:26 UTC (permalink / raw)
  To: zelo.zejn; +Cc: Eric Dumazet, Kevin Yang, netdev
In-Reply-To: <1988670d-0d87-2b9e-24cb-a9a610ea33fa@gmail.com>

On Tue, Oct 9, 2018 at 10:22 AM Gasper Zejn <zelo.zejn@gmail.com> wrote:
>
> On 09. 10. 2018 19:00, Eric Dumazet wrote:
> >
> > On 10/09/2018 09:38 AM, Gasper Zejn wrote:
> >> Hello,
> >>
> >> I am seeing interrupt storms of over 100k-900k local timer interrupts
> >> when changing between network devices or networks with open TCP
> >> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
> >> makes the bug with interrupt storm go away.
> >>
> > That is for what kind of traffic ?
> >
> > If your TCP flows send 100k-3M packets per second, then yes, the pacing timers
> > could be setup in the 100k-900k range.
> >
> Traffic is nowhere in that range, think of having a few browser tabs of
> javascript rich
> web pages open, mostly idle, for example slack, gmail or tweetdeck. No
> significant
> packet rate is needed, just open connections.

No idea of what is going on really. A repro would be nice.

^ permalink raw reply

* Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Gasper Zejn @ 2018-10-09 17:22 UTC (permalink / raw)
  To: Eric Dumazet, Kevin Yang, Eric Dumazet, netdev
In-Reply-To: <cb3f5cd7-dc71-6004-9e68-748616ad7871@gmail.com>

On 09. 10. 2018 19:00, Eric Dumazet wrote:
>
> On 10/09/2018 09:38 AM, Gasper Zejn wrote:
>> Hello,
>>
>> I am seeing interrupt storms of over 100k-900k local timer interrupts
>> when changing between network devices or networks with open TCP
>> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
>> makes the bug with interrupt storm go away.
>>
> That is for what kind of traffic ?
>
> If your TCP flows send 100k-3M packets per second, then yes, the pacing timers
> could be setup in the 100k-900k range.
>
Traffic is nowhere in that range, think of having a few browser tabs of
javascript rich
web pages open, mostly idle, for example slack, gmail or tweetdeck. No
significant
packet rate is needed, just open connections.

>> The interrupts all called tcp_pace_kick (according to perf), which seems
>> to return HRTIMER_NORESTART, but apparently somewhere calls another
>> function, that does restart the timer.
>>
>> The bug is fairly easy to reproduce. Congestion control needs to be BBR,
>> network scheduler was pfifo_fast, and there need to be open TCP
>> connections when changing network in such a way that TCP connections
>> cannot continue to work (eg. different client IP addresses). The more
>> connections the more interrupts. The connection handling code will cause
>> interrupt storm, which eventually sets down as the connections time out.
>> It is a bit annoying as high interrupt rate does not show as load. I
>> successfully reproduced this with 4.18.12, but this has been happening
>> for some time, with previous versions of kernel too.
>>
>>
>> I'd like to thank you for the comment regarding use of sch_fq with BBR
>> above the tcp_needs_internal_pacing function. It has pointed me in the
>> direction to find the workaround.
>>
> Well, BBR has been very clear about sch_fq being the best packet scheduler
>
> net/ipv4/tcp_bbr.c currently says :
>
> /* ...
>  *
>  * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
>  * otherwise TCP stack falls back to an internal pacing using one high
>  * resolution timer per TCP socket and may use more resources.
>  */
>
I am not disputing FQ being the best packet packet scheduler, it does
seem however
that some effort has been made to make BBR work without FQ too. Using more
resources in that case is perfectly fine. But going from ~ thousand
interrupts to few
hundred thousand interrupts (and in the process consuming most of the cpu)
seems to indicate that a corner case was somehow hit as this happens the
moment the network is changed and not before.

^ permalink raw reply

* Re: general protection fault in __handle_mm_fault
From: Eric Dumazet @ 2018-10-09 17:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Willem de Bruijn, Eric Dumazet, aneesh.kumar,
	syzbot+1577fbe983d20fe2e88f, David Miller, Alexey Kuznetsov, LKML,
	netdev, syzkaller-bugs, Hideaki YOSHIFUJI, Andrew Morton,
	kirill.shutemov
In-Reply-To: <20181009170517.GA2821@splinter>

On Tue, Oct 9, 2018 at 10:05 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Tue, Oct 09, 2018 at 12:02:58PM -0400, Willem de Bruijn wrote:
> > On Tue, Oct 9, 2018 at 11:00 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Well, this is a C repro, you can test this yourself instead of asking Willem who
> > > already did a painful bisection.
> >
> > Thanks Eric. It does take some effort to get the syzkaller environment
> > up and running [1] and I happen to have it ready, so I don't mind testing
> > a few patches.
>
> It is possible to ask syzbot to test the patch. I sent a couple of
> patches and got a response in 20-30 minutes. Very convenient.
>
> It is mentioned at the bottom of the report:
>
> "
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches
> "

One day syzbot will do the bisection as well maybe, once it has a
repro, this could be really nice.

^ permalink raw reply

* Re: general protection fault in __handle_mm_fault
From: Ido Schimmel @ 2018-10-09 17:05 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, aneesh.kumar, syzbot+1577fbe983d20fe2e88f,
	David Miller, Eric Dumazet, Alexey Kuznetsov, LKML,
	Network Development, syzkaller-bugs, Hideaki YOSHIFUJI,
	Andrew Morton, kirill.shutemov
In-Reply-To: <CAF=yD-K3BeuUWf_2RhWoC6fx=+A49Nt1TOFOfzdwKfOdsv=TiQ@mail.gmail.com>

On Tue, Oct 09, 2018 at 12:02:58PM -0400, Willem de Bruijn wrote:
> On Tue, Oct 9, 2018 at 11:00 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Well, this is a C repro, you can test this yourself instead of asking Willem who
> > already did a painful bisection.
> 
> Thanks Eric. It does take some effort to get the syzkaller environment
> up and running [1] and I happen to have it ready, so I don't mind testing
> a few patches.

It is possible to ask syzbot to test the patch. I sent a couple of
patches and got a response in 20-30 minutes. Very convenient.

It is mentioned at the bottom of the report:

"
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches
"

^ permalink raw reply

* Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Eric Dumazet @ 2018-10-09 17:00 UTC (permalink / raw)
  To: Gasper Zejn, Kevin Yang, Eric Dumazet, netdev
In-Reply-To: <183b7fe0-0757-cf63-555c-925ea840c67f@gmail.com>



On 10/09/2018 09:38 AM, Gasper Zejn wrote:
> Hello,
> 
> I am seeing interrupt storms of over 100k-900k local timer interrupts
> when changing between network devices or networks with open TCP
> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
> makes the bug with interrupt storm go away.
> 

That is for what kind of traffic ?

If your TCP flows send 100k-3M packets per second, then yes, the pacing timers
could be setup in the 100k-900k range.

> The interrupts all called tcp_pace_kick (according to perf), which seems
> to return HRTIMER_NORESTART, but apparently somewhere calls another
> function, that does restart the timer.
> 
> The bug is fairly easy to reproduce. Congestion control needs to be BBR,
> network scheduler was pfifo_fast, and there need to be open TCP
> connections when changing network in such a way that TCP connections
> cannot continue to work (eg. different client IP addresses). The more
> connections the more interrupts. The connection handling code will cause
> interrupt storm, which eventually sets down as the connections time out.
> It is a bit annoying as high interrupt rate does not show as load. I
> successfully reproduced this with 4.18.12, but this has been happening
> for some time, with previous versions of kernel too.
> 
> 
> I'd like to thank you for the comment regarding use of sch_fq with BBR
> above the tcp_needs_internal_pacing function. It has pointed me in the
> direction to find the workaround.
>

Well, BBR has been very clear about sch_fq being the best packet scheduler

net/ipv4/tcp_bbr.c currently says :

/* ...
 *
 * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
 * otherwise TCP stack falls back to an internal pacing using one high
 * resolution timer per TCP socket and may use more resources.
 */

^ permalink raw reply

* [PATCH net] net/xfrm: fix out-of-bounds packet access
From: Alexei Starovoitov @ 2018-10-09 16:59 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, edumazet, netdev, kernel-team

BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
net/ipv6/xfrm6_policy.c:161
Read of size 1 at addr ffff8801d882eec7 by task syz-executor1/6667
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
  __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
  xfrm_decode_session include/net/xfrm.h:1232 [inline]
  vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
  __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
  netdev_start_xmit include/linux/netdevice.h:4322 [inline]
  xmit_one net/core/dev.c:3217 [inline]
  dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
  __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3836

Reported-by: syzbot+acffccec848dc13fe459@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 net/ipv6/xfrm6_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ef3defaf43b9..d35bcf92969c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -146,8 +146,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
 	fl6->daddr = reverse ? hdr->saddr : hdr->daddr;
 	fl6->saddr = reverse ? hdr->daddr : hdr->saddr;
 
-	while (nh + offset + 1 < skb->data ||
-	       pskb_may_pull(skb, nh + offset + 1 - skb->data)) {
+	while (nh + offset + sizeof(*exthdr) < skb->data ||
+	       pskb_may_pull(skb, nh + offset + sizeof(*exthdr) - skb->data)) {
 		nh = skb_network_header(skb);
 		exthdr = (struct ipv6_opt_hdr *)(nh + offset);
 
-- 
2.17.1

^ permalink raw reply related

* enquiry 09-10-2018
From: Daniel Murray @ 2018-10-09 16:09 UTC (permalink / raw)
  To: netdev

Hi,friend,

This is Daniel Murray and i am from Sinara Group Co.Ltd Group Co.,LTD in Russia.
We are glad to know about your company from the web and we are interested in your products.
Could you kindly send us your Latest catalog and price list for our trial order.

Best Regards,

Daniel Murray
Purchasing Manager

^ permalink raw reply

* BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Gasper Zejn @ 2018-10-09 16:38 UTC (permalink / raw)
  To: Kevin Yang, Eric Dumazet, netdev

Hello,

I am seeing interrupt storms of over 100k-900k local timer interrupts
when changing between network devices or networks with open TCP
connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
makes the bug with interrupt storm go away.

The interrupts all called tcp_pace_kick (according to perf), which seems
to return HRTIMER_NORESTART, but apparently somewhere calls another
function, that does restart the timer.

The bug is fairly easy to reproduce. Congestion control needs to be BBR,
network scheduler was pfifo_fast, and there need to be open TCP
connections when changing network in such a way that TCP connections
cannot continue to work (eg. different client IP addresses). The more
connections the more interrupts. The connection handling code will cause
interrupt storm, which eventually sets down as the connections time out.
It is a bit annoying as high interrupt rate does not show as load. I
successfully reproduced this with 4.18.12, but this has been happening
for some time, with previous versions of kernel too.

I'd like to thank you for the comment regarding use of sch_fq with BBR
above the tcp_needs_internal_pacing function. It has pointed me in the
direction to find the workaround.

Kind regards,

Gasper Zejn

^ permalink raw reply

* Re: PMTU discovery broken in Linux for UDP/raw application if the socket is not bound to a device
From: David Ahern @ 2018-10-09 16:23 UTC (permalink / raw)
  To: Preethi Ramachandra; +Cc: Reji Thomas, Yogesh Ankolekar, netdev@vger.kernel.org
In-Reply-To: <4138E416-D846-47B2-AA9F-593313041E6F@juniper.net>

[ adding netdev so others know ]

On 10/9/18 3:38 AM, Preethi Ramachandra wrote:
> Hi David,
> 
> I tested your fix, Linux is updating PMTU successfully. 

ok, I'll send a formal patch


> 
> Thanks,
> Preethi
> 
> On 10/7/18, 8:59 AM, "David Ahern" <dsahern@gmail.com> wrote:
> 
>     The correct mailing list is netdev@vger.kernel.org (added)
>     
>     non-text emails will be rejected.
>     
>     
>     On 10/3/18 10:15 PM, Preethi Ramachandra wrote:
>     > Hi,
>     > 
>     >  
>     > 
>     > While testing the PMTU discovery for UDP/raw applications, Linux is not
>     > doing PMTU discovery if the UDP server socket is not bound to a device.
>     >  In the scenario we are testing there could be multiple VRF devices
>     > created and an application like UDP/RAW can use a common socket for all
>     > vrf devices. While sending packet IP_PKTINFO socket option can be used
>     > to specify the vrf interface through which packet will be sent out. In
>     > this case, when packet too big icmp6 error message comes back to Linux
>     > on a vrf device, a route lookup is done on default routing-table(0) for
>     > src/dst address which case, the route will not be found and packet is
>     > dropped. If the route lookup happened with proper VRF device (packet’s
>     > incoming index), the route lookup succeeds,  PMTU discovery is successful.
>     > 
>     >  
>     > 
>     > This might need a fix, please take a look.
>     > 
>     >  
>     > 
>     > *Linux version *
>     > 
>     >  
>     > 
>     > Linux  4.8.24
>     > 
>     >  
>     > 
>     > *Code flow *
>     > 
>     >  
>     > 
>     > Linux code where it expects socket’s bound device in order for PMTU
>     > discovery to happen.
>     > 
>     > *void ip6_sk_update_pmtu*(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>     > 
>     > {
>     > 
>     >                     struct dst_entry *dst;
>     > 
>     >  
>     > 
>     >                     ip6_update_pmtu(skb, sock_net(sk), mtu,
>     > 
>     >                                                             
>     > sk->sk_bound_dev_if, sk->sk_mark, sk->sk_uid);*<<<<< This is the point
>     > where it expects socket’s sk_bound_dev_if to be set. In our testing this
>     > is actually 0, since the socket is not really bound to a vrf device.*
>     
>     Try this based on top of tree for 4.19-next (whitespace damaged on paste
>     so you'll need to manually apply and handle differences with 4.8):
>     
>     diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>     index 6c1d817151ca..50b95b48b911 100644
>     --- a/net/ipv6/route.c
>     +++ b/net/ipv6/route.c
>     @@ -2360,10 +2360,13 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
>     
>      void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>      {
>     +       int oif = sk->sk_bound_dev_if;
>             struct dst_entry *dst;
>     
>     -       ip6_update_pmtu(skb, sock_net(sk), mtu,
>     -                       sk->sk_bound_dev_if, sk->sk_mark, sk->sk_uid);
>     +       if (!oif && skb->dev)
>     +               oif = l3mdev_master_ifindex(skb->dev);
>     +
>     +       ip6_update_pmtu(skb, sock_net(sk), mtu, oif, sk->sk_mark,
>     sk->sk_uid);
>     
>             dst = __sk_dst_get(sk);
>             if (!dst || !dst->obsolete ||
>     
>     
>     
> 

^ permalink raw reply

* Re: [PATCH net] net/sched: cls_api: add missing validation of netlink attributes
From: Davide Caratti @ 2018-10-09 16:12 UTC (permalink / raw)
  To: David Ahern, David S. Miller, Jamal Hadi Salim; +Cc: netdev
In-Reply-To: <72e8eeea-a4e7-e80d-217d-7ccf4cd71e0d@gmail.com>

On Tue, 2018-10-09 at 08:46 -0600, David Ahern wrote:
> On 10/9/18 7:10 AM, Davide Caratti wrote:
> > Similarly to what has been done in 8b4c3cdd9dd8 ("net: sched: Add policy
> > validation for tc attributes"), add validation for TCA_CHAIN and TCA_KIND
> > netlink attributes.
> > 
> > tested with:
> >  # ./tdc.py -c filter
> > 
> > Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
> > Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> > ---
> >  net/sched/cls_api.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index 0a75cb2e5e7b..fb1afc0e130d 100644

hi David,
thanks for looking at this.

> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -37,6 +37,11 @@ static LIST_HEAD(tcf_proto_base);
> >  /* Protects list of registered TC modules. It is pure SMP lock. */
> >  static DEFINE_RWLOCK(cls_mod_lock);
> >  
> > +const struct nla_policy cls_tca_policy[TCA_MAX + 1] = {
> > +	[TCA_KIND]	= { .type = NLA_STRING },
> > +	[TCA_CHAIN]	= { .type = NLA_U32 },
> > +};
> > +
> 

> it be nice to have a tc_common module so this stuff does not have to be
> defined multiple times.

it makes sense to avoid duplicating the declaration of that array. But I
don't think we can put it in a module, because CONFIG_NET_SCHED is 'bool'
and

obj-$(CONFIG_NET_SCHED)         += sch_api.o

I can try a v2 where 'rtm_tca_policy' symbol in sch_api is exported and
used by cls_api.c code. WDYT?

thanks,
-- 
davide

^ permalink raw reply

* Possible bug in traffic control?
From: Josh Coombs @ 2018-10-09 15:58 UTC (permalink / raw)
  To: netdev

Hello all, I'm looking for some guidance in chasing what I believe to
be a bug in kernel traffic control filters.  If I'm pinging the wrong
list let me know.

I have a homebrew MACSec bridge setup using two pairs of PCs.  I
establish a MACSec link between them, and then use TC to bridge a
second ethernet interface over the MACSec link. The second interface
is connected to a Juniper switch at each end, and I'm using LACP over
the links to bond them up for redundancy.  It turns out I need that
redundancy as after awhile one pair of bridges will stop flowing
packets in one direction.  I've since replicated this failure with a
group of VMs as well.

My test setup to replicate the failure inside ESXi:
- Two MACSec bridge VMs, A and Z
- Two IPerf VMs, A and Z
My VMs are currently built using Ubuntu Server 18.04 to be quick, no
additional packages are required outside of iperf3.  Kernel ver as
shipped currently is 4.15.0-36.  I highly advise using a CPU with AES
instruction support as MACSec eats CPU without it and will take longer
to reproduce the symptoms.

- A 'MACSec Bridge' network
- A 'A Side link' network
- A 'Z Side link' network
In ESXi I used a dedicated vSwitch, 9000 MTU (to allow full 1500 eth
packets + MACSec to pass on the bridge) and the security policy is
full open (allow promiscuous, allow forged, allow mac changes) as
we're abusing the networks as direct point to point links.  If using
physical machines, just cable up, my example script bumps the MTU as
required.

The MACSec boxes have two ethernet interfaces each.  One pair is on
the MACSec Bridge network.  The other interfaces go to the A and Z
IPerf boxes respectively via their dedicated networks.  A and Z need
their interfaces configured with IPs in a common subnet, such as
192.168.0.1/30 and 192.168.0.2/30.

My script sets up MACSec, tweaks MTUs, and touches a few sysctls to
turn the involved interfaces into silent actors.  It then uses TC to
start the actual bridging.  From there I've been firing up iperf 3
sessions in both directions between A and Z to hammer the bridge until
it fails.  When it does, I can see packets stop being bridged in one
direction on one MACSec host, but not the other.  The second host
continues to flow packets in both directions.  Nothing is logged to
dmesg when this fault occurs.  The fault seems to occur at roughly the
same packet / traffic amount each time.  On my main application it's
after approximately 2.5TB of traffic (random mix of sizes) and with my
test bed it was after 5.5TB of 1500 byte packets.

On the impacted MACSec node, watching interface packet counters via
ifconfig and actual traffic with tcpdump I can see packets coming in
MACSec and going out the host interface, the host reply coming in but
not showing up on the MACSec interface to cross the bridge.  Clearing
out the tc filter and qdisc and re-adding does not restore traffic
flow.

There is a PPA with 4.18 available for Ubuntu that I'm going to test
with next to see if that makes a difference in behavior.  In the mean
time I'd appreciate any suggestions on how to diagnose this.

My MACSec bridge setup script, update sif, dif, the keys and rxmac to
match your setup.  The rxmac is the mac addy of the remote bridge
interface.  Keys need to be flipped between systems.
-----------------------
#!/bin/bash

# Interfaces:
# sif = Ingress physical interface (Source)
# dif = Egress physical interface (Dest)
# eif = Encrypted interface
sif=eno2
dif=enp1s0f0
eif=macsec0

# MACSec Keys:
# txkey = Transmit (Local) key
# rxkey = Receive (Remote) key
# rxmac = Receive (Remote) MAC addy
txkey=00000000000000000000000000000000
rxkey=99999999999999999999999999999999
rxmac=00:11:22:33:44:55

# Use jumbo frames for macsec to allow full 1500 MTU passthrough:
echo "* MTU update"
ip link set "$sif" mtu 9000
ip link set "$dif" mtu 9000

# Bring up macsec:
echo "* Enable MACSec"
modprobe macsec
ip link add link "$dif" "$eif" type macsec
ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"
ip macsec add "$eif" rx address "$rxmac" port 1
ip macsec add "$eif" rx address "$rxmac" port 1 sa 0 pn 1 on key 01 "$rxkey"
ip link set "$eif" type macsec encrypt on
#ip link set "$eif" type macsec replay on window 64

# Keep system from trying to respond to observed traffic:
echo "* Clamp the system so bridge ports NEVER respond to traffic"
sysctl -w net.ipv4.conf.default.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_filter=1
ip link set "$sif" down promisc on arp off multicast off
sysctl -w net.ipv6.conf."$sif".autoconf=0
sysctl -w net.ipv6.conf."$sif".accept_ra=0
sysctl -w net.ipv4.conf."$sif".arp_ignore=8
sysctl -w net.ipv4.conf."$sif".rp_filter=0
ip link set "$dif" down promisc on arp off multicast off
sysctl -w net.ipv6.conf."$dif".autoconf=0
sysctl -w net.ipv6.conf."$dif".accept_ra=0
sysctl -w net.ipv4.conf."$dif".arp_ignore=8
sysctl -w net.ipv4.conf."$dif".rp_filter=0
ip link set "$eif" down promisc on arp off multicast off
sysctl -w net.ipv6.conf."$eif".autoconf=0
sysctl -w net.ipv6.conf."$eif".accept_ra=0
sysctl -w net.ipv4.conf."$eif".arp_ignore=8
sysctl -w net.ipv4.conf."$eif".rp_filter=0

# Set up traffic mirroring:
echo "* Start Port Mirror"
# sif to eif
tc qdisc add dev "$sif" ingress
tc filter add dev "$sif" parent ffff: \
          protocol all \
          u32 match u8 0 0 \
          action mirred egress mirror dev "$eif"

# eif to sif
tc qdisc add dev "$eif" ingress
tc filter add dev "$eif" parent ffff: \
          protocol all \
          u32 match u8 0 0 \
          action mirred egress mirror dev "$sif"

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$sif" up
ip link set "$dif" up
ip link set "$eif" up

echo " --=[ MACSec Up ]=--"
-----------------------

Josh Coombs

^ permalink raw reply

* (unknown), 
From: Oliver Carter @ 2018-10-09 15:55 UTC (permalink / raw)
  To: netdev

Netdev     https://goo.gl/Gf1b7B  Oliver

^ permalink raw reply

* [PATCH net v2 2/2] net: ipv4: don't let PMTU updates increase route MTU
From: Sabrina Dubroca @ 2018-10-09 15:48 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, Stefano Brivio, Sabrina Dubroca
In-Reply-To: <cover.1539073548.git.sd@queasysnail.net>

When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.

To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.

Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.

Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: unchanged

 net/ipv4/route.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b678466da451..8501554e96a4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1001,21 +1001,22 @@ out:	kfree_skb(skb);
 static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 {
 	struct dst_entry *dst = &rt->dst;
+	u32 old_mtu = ipv4_mtu(dst);
 	struct fib_result res;
 	bool lock = false;
 
 	if (ip_mtu_locked(dst))
 		return;
 
-	if (ipv4_mtu(dst) < mtu)
+	if (old_mtu < mtu)
 		return;
 
 	if (mtu < ip_rt_min_pmtu) {
 		lock = true;
-		mtu = ip_rt_min_pmtu;
+		mtu = min(old_mtu, ip_rt_min_pmtu);
 	}
 
-	if (rt->rt_pmtu == mtu &&
+	if (rt->rt_pmtu == mtu && !lock &&
 	    time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
 		return;
 
-- 
2.19.0

^ permalink raw reply related

* [PATCH net v2 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes
From: Sabrina Dubroca @ 2018-10-09 15:48 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, Stefano Brivio, Sabrina Dubroca
In-Reply-To: <cover.1539073548.git.sd@queasysnail.net>

Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
cases.

If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.

Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
---
v2:
 - s/u32/mtu/ in netdev_notifier_info_ext and call_netdevice_notifiers_
   helper, suggested by David Ahern
 - don't EXPORT_SYMBOL the helper, it's only used in net/core/dev.c
 - fix typo in commit message
 - fix kerneldoc comment, spotted by kbuild bot

 include/linux/netdevice.h |  7 ++++++
 include/net/ip_fib.h      |  1 +
 net/core/dev.c            | 28 ++++++++++++++++++++--
 net/ipv4/fib_frontend.c   | 12 ++++++----
 net/ipv4/fib_semantics.c  | 50 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c7861e4b402c..d837dad24b4c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
 	struct netlink_ext_ack	*extack;
 };
 
+struct netdev_notifier_info_ext {
+	struct netdev_notifier_info info; /* must be first */
+	union {
+		u32 mtu;
+	} ext;
+};
+
 struct netdev_notifier_change_info {
 	struct netdev_notifier_info info; /* must be first */
 	unsigned int flags_changed;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 69c91d1934c1..c9b7b136939d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -394,6 +394,7 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
+void fib_sync_mtu(struct net_device *dev, u32 orig_mtu);
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
diff --git a/net/core/dev.c b/net/core/dev.c
index 82114e1111e6..93243479085f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1752,6 +1752,28 @@ int call_netdevice_notifiers(unsigned long val, struct net_device *dev)
 }
 EXPORT_SYMBOL(call_netdevice_notifiers);
 
+/**
+ *	call_netdevice_notifiers_mtu - call all network notifier blocks
+ *	@val: value passed unmodified to notifier function
+ *	@dev: net_device pointer passed unmodified to notifier function
+ *	@arg: additional u32 argument passed to the notifier function
+ *
+ *	Call all network notifier blocks.  Parameters and return value
+ *	are as for raw_notifier_call_chain().
+ */
+static int call_netdevice_notifiers_mtu(unsigned long val,
+					struct net_device *dev, u32 arg)
+{
+	struct netdev_notifier_info_ext info = {
+		.info.dev = dev,
+		.ext.mtu = arg,
+	};
+
+	BUILD_BUG_ON(offsetof(struct netdev_notifier_info_ext, info) != 0);
+
+	return call_netdevice_notifiers_info(val, &info.info);
+}
+
 #ifdef CONFIG_NET_INGRESS
 static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
 
@@ -7574,14 +7596,16 @@ int dev_set_mtu_ext(struct net_device *dev, int new_mtu,
 	err = __dev_set_mtu(dev, new_mtu);
 
 	if (!err) {
-		err = call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+		err = call_netdevice_notifiers_mtu(NETDEV_CHANGEMTU, dev,
+						   orig_mtu);
 		err = notifier_to_errno(err);
 		if (err) {
 			/* setting mtu back and notifying everyone again,
 			 * so that they have a chance to revert changes.
 			 */
 			__dev_set_mtu(dev, orig_mtu);
-			call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+			call_netdevice_notifiers_mtu(NETDEV_CHANGEMTU, dev,
+						     new_mtu);
 		}
 	}
 	return err;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2998b0e47d4b..0113993e9b2c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1243,7 +1243,8 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 static int fib_netdev_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-	struct netdev_notifier_changeupper_info *info;
+	struct netdev_notifier_changeupper_info *upper_info = ptr;
+	struct netdev_notifier_info_ext *info_ext = ptr;
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
 	unsigned int flags;
@@ -1278,16 +1279,19 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 			fib_sync_up(dev, RTNH_F_LINKDOWN);
 		else
 			fib_sync_down_dev(dev, event, false);
-		/* fall through */
+		rt_cache_flush(net);
+		break;
 	case NETDEV_CHANGEMTU:
+		fib_sync_mtu(dev, info_ext->ext.mtu);
 		rt_cache_flush(net);
 		break;
 	case NETDEV_CHANGEUPPER:
-		info = ptr;
+		upper_info = ptr;
 		/* flush all routes if dev is linked to or unlinked from
 		 * an L3 master device (e.g., VRF)
 		 */
-		if (info->upper_dev && netif_is_l3_master(info->upper_dev))
+		if (upper_info->upper_dev &&
+		    netif_is_l3_master(upper_info->upper_dev))
 			fib_disable_ip(dev, NETDEV_DOWN, true);
 		break;
 	}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f3c89ccf14c5..446204ca7406 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1470,6 +1470,56 @@ static int call_fib_nh_notifiers(struct fib_nh *fib_nh,
 	return NOTIFY_DONE;
 }
 
+/* Update the PMTU of exceptions when:
+ * - the new MTU of the first hop becomes smaller than the PMTU
+ * - the old MTU was the same as the PMTU, and it limited discovery of
+ *   larger MTUs on the path. With that limit raised, we can now
+ *   discover larger MTUs
+ * A special case is locked exceptions, for which the PMTU is smaller
+ * than the minimal accepted PMTU:
+ * - if the new MTU is greater than the PMTU, don't make any change
+ * - otherwise, unlock and set PMTU
+ */
+static void nh_update_mtu(struct fib_nh *nh, u32 new, u32 orig)
+{
+	struct fnhe_hash_bucket *bucket;
+	int i;
+
+	bucket = rcu_dereference_protected(nh->nh_exceptions, 1);
+	if (!bucket)
+		return;
+
+	for (i = 0; i < FNHE_HASH_SIZE; i++) {
+		struct fib_nh_exception *fnhe;
+
+		for (fnhe = rcu_dereference_protected(bucket[i].chain, 1);
+		     fnhe;
+		     fnhe = rcu_dereference_protected(fnhe->fnhe_next, 1)) {
+			if (fnhe->fnhe_mtu_locked) {
+				if (new <= fnhe->fnhe_pmtu) {
+					fnhe->fnhe_pmtu = new;
+					fnhe->fnhe_mtu_locked = false;
+				}
+			} else if (new < fnhe->fnhe_pmtu ||
+				   orig == fnhe->fnhe_pmtu) {
+				fnhe->fnhe_pmtu = new;
+			}
+		}
+	}
+}
+
+void fib_sync_mtu(struct net_device *dev, u32 orig_mtu)
+{
+	unsigned int hash = fib_devindex_hashfn(dev->ifindex);
+	struct hlist_head *head = &fib_info_devhash[hash];
+	struct fib_nh *nh;
+
+	hlist_for_each_entry(nh, head, nh_hash) {
+		if (nh->nh_dev == dev)
+			nh_update_mtu(nh, dev->mtu, orig_mtu);
+	}
+}
+
 /* Event              force Flags           Description
  * NETDEV_CHANGE      0     LINKDOWN        Carrier OFF, not for scope host
  * NETDEV_DOWN        0     LINKDOWN|DEAD   Link down, not for scope host
-- 
2.19.0

^ permalink raw reply related

* [PATCH net v2 0/2] net: ipv4: fixes for PMTU when link MTU changes
From: Sabrina Dubroca @ 2018-10-09 15:48 UTC (permalink / raw)
  To: netdev; +Cc: David Ahern, Stefano Brivio, Sabrina Dubroca

The first patch adapts the changes that commit e9fa1495d738 ("ipv6:
Reflect MTU changes on PMTU of exceptions for MTU-less routes") did in
IPv6 to IPv4: lower PMTU when the first hop's MTU drops below it, and
raise PMTU when the first hop was limiting PMTU discovery and its MTU
is increased.

The second patch fixes bugs introduced in commit d52e5a7e7ca4 ("ipv4:
lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") that
only appear once the first patch is applied.

Selftests for these cases were introduced in net-next commit
e44e428f59e4 ("selftests: pmtu: add basic IPv4 and IPv6 PMTU tests")

v2: add cover letter, and fix a few small things in patch 1

Sabrina Dubroca (2):
  net: ipv4: update fnhe_pmtu when first hop's MTU changes
  net: ipv4: don't let PMTU updates increase route MTU

 include/linux/netdevice.h |  7 ++++++
 include/net/ip_fib.h      |  1 +
 net/core/dev.c            | 28 ++++++++++++++++++++--
 net/ipv4/fib_frontend.c   | 12 ++++++----
 net/ipv4/fib_semantics.c  | 50 +++++++++++++++++++++++++++++++++++++++
 net/ipv4/route.c          |  7 +++---
 6 files changed, 96 insertions(+), 9 deletions(-)

-- 
2.19.0

^ permalink raw reply

* Re: [RFC PATCH net 0/3] net/smc: move some definitions to UAPI
From: Ursula Braun @ 2018-10-09 15:43 UTC (permalink / raw)
  To: Eugene Syromiatnikov, netdev
  Cc: David S. Miller, linux-kernel, Karsten Graul, Hans Wippel,
	linux-s390
In-Reply-To: <cover.1538929504.git.esyr@redhat.com>



On 10/07/2018 06:34 PM, Eugene Syromiatnikov wrote:
> Hello.
> 
> As of now, it's a bit difficult to use SMC protocol, as significant part
> of definitions related to it are defined in private headers and are not
> part of UAPI. The following commits move some definitions to UAPI,
> making them readily available to user space.
> 
> 
> Eugene Syromiatnikov (3):
>   uapi, net/smc: move protocol constant definitions to UAPI
>   uapi, net/smc: provide fallback diagnosis codes in UAPI
>   uapi, net/smc: provide socket state constants in UAPI
> 
>  include/uapi/linux/smc.h      | 26 +++++++++++++++++++++++++-
>  include/uapi/linux/smc_diag.h | 17 +++++++++++++++++
>  net/smc/smc.h                 | 22 ++--------------------
>  net/smc/smc_clc.h             | 16 ----------------
>  4 files changed, 44 insertions(+), 37 deletions(-)
> 

Hello Eugene,

I am OK with your 3 patches.

Thanks, Ursula

^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: rename stack trace map operations
From: Mauricio Vasquez @ 2018-10-09 15:34 UTC (permalink / raw)
  To: Song Liu; +Cc: Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <CAPhsuW5jKw1cWyZBFMuOpwY1NFcR+kjauovQbY2yx4JJWW4qJA@mail.gmail.com>



On 10/08/2018 08:40 PM, Song Liu wrote:
> Actually, does it make sense to implement a list_map that supports
> both pop_head()
> and pop_tail()? Maybe gate one of the pop operations with options?

The main issue with this approach is the mapping to the system calls.
Adding a flag would complicate things a bit because lookup nor 
lookup_and_delete have a flag argument.

> I am asking because mixing stack with stack trace is still confusing
> after this patch.
>
> Thanks,
> Song
>
> On Mon, Oct 8, 2018 at 12:11 PM Mauricio Vasquez B
> <mauricio.vasquez@polito.it> wrote:
>> In the following patches queue and stack maps (FIFO and LIFO
>> datastructures) will be implemented.  In order to avoid confusion and
>> a possible name clash rename stack_map_ops to stack_trace_map_ops
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>> ---
>>   include/linux/bpf_types.h |    2 +-
>>   kernel/bpf/stackmap.c     |    2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 5432f4c9f50e..658509daacd4 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -51,7 +51,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_LPM_TRIE, trie_map_ops)
>>   #ifdef CONFIG_PERF_EVENTS
>> -BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_map_ops)
>> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops)
>>   #endif
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
>>   BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 8061a439ef18..bb41e293418d 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -600,7 +600,7 @@ static void stack_map_free(struct bpf_map *map)
>>          put_callchain_buffers();
>>   }
>>
>> -const struct bpf_map_ops stack_map_ops = {
>> +const struct bpf_map_ops stack_trace_map_ops = {
>>          .map_alloc = stack_map_alloc,
>>          .map_free = stack_map_free,
>>          .map_get_next_key = stack_map_get_next_key,
>>

^ permalink raw reply

* Re: [PATCH v2] rxrpc: use correct kvec num while send response packet in rxrpc_reject_packets
From: Sergei Shtylyov @ 2018-10-09 15:34 UTC (permalink / raw)
  To: YueHaibing, David Howells, davem; +Cc: linux-afs, netdev, kernel-janitors
In-Reply-To: <1539094525-174123-1-git-send-email-yuehaibing@huawei.com>

On 10/09/2018 05:15 PM, YueHaibing wrote:

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> net/rxrpc/output.c: In function 'rxrpc_reject_packets':
> net/rxrpc/output.c:527:11: warning:
>  variable 'ioc' set but not used [-Wunused-but-set-variable]
> 
> 'ioc' is the correct kvec num while send response packet.
> 
> Fixes: commit ece64fec164f ("rxrpc: Emit BUSY packets when supposed to rather than 
ABORTs")

   "commit" not needed here.

> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[...]

MBR, Sergei

^ permalink raw reply

* [PATCH stable 4.9 23/29] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
From: Florian Fainelli @ 2018-10-09 22:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, gregkh, stable, edumazet, sthemmin
In-Reply-To: <20181009224924.30151-1-f.fainelli@gmail.com>

From: Eric Dumazet <edumazet@google.com>

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5)
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 724c6abdb9e6..11974e63a41d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2954,6 +2954,7 @@ static inline unsigned char *skb_push_rcsum(struct sk_buff *skb,
 	return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  *	pskb_trim_rcsum - trim received skb and update checksum
  *	@skb: buffer to trim
@@ -2967,9 +2968,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 {
 	if (likely(len >= skb->len))
 		return 0;
-	if (skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->ip_summed = CHECKSUM_NONE;
-	return __pskb_trim(skb, len);
+	return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e545e4432eb..038ec74fa131 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1578,6 +1578,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE) {
+		int delta = skb->len - len;
+
+		skb->csum = csum_sub(skb->csum,
+				     skb_checksum(skb, len, delta, 0));
+	}
+	return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  *	__pskb_pull_tail - advance tail of skb header
  *	@skb: buffer to reallocate
-- 
2.17.1

^ permalink raw reply related

* [PATCH stable 4.9 21/29] net: modify skb_rbtree_purge to return the truesize of all purged skbs.
From: Florian Fainelli @ 2018-10-09 22:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, gregkh, stable, edumazet, sthemmin, Peter Oskolkov,
	Florian Westphal
In-Reply-To: <20181009224924.30151-1-f.fainelli@gmail.com>

From: Peter Oskolkov <posk@google.com>

Tested: see the next patch is the series.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c      | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7cca35fcf6d..724c6abdb9e6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2418,7 +2418,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 96a553da1518..4e545e4432eb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2425,23 +2425,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  *	skb_rbtree_purge - empty a skb rbtree
  *	@root: root of the rbtree to empty
+ *	Return value: the sum of truesizes of all purged skbs.
  *
  *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  *	the list and one reference dropped. This function does not take
  *	any lock. Synchronization should be handled by the caller (e.g., TCP
  *	out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
 	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
 
 	while (p) {
 		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
 		p = rb_next(p);
 		rb_erase(&skb->rbnode, root);
+		sum += skb->truesize;
 		kfree_skb(skb);
 	}
+	return sum;
 }
 
 /**
-- 
2.17.1

^ permalink raw reply related

* [PATCH stable 4.9 19/29] ip: discard IPv4 datagrams with overlapping segments.
From: Florian Fainelli @ 2018-10-09 22:49 UTC (permalink / raw)
  To: netdev
  Cc: davem, gregkh, stable, edumazet, sthemmin, Peter Oskolkov,
	Florian Westphal
In-Reply-To: <20181009224924.30151-1-f.fainelli@gmail.com>

From: Peter Oskolkov <posk@google.com>

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c    | 75 ++++++++++-----------------------------
 net/ipv4/proc.c           |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e7a31f830690..3442a26d36d9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -55,6 +55,7 @@ enum
 	IPSTATS_MIB_ECT1PKTS,			/* InECT1Pkts */
 	IPSTATS_MIB_ECT0PKTS,			/* InECT0Pkts */
 	IPSTATS_MIB_CEPKTS,			/* InCEPkts */
+	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
 	__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 73c0adc61a65..8bfb34e9ea32 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -275,6 +275,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
 	unsigned int fragsize;
@@ -355,65 +356,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 found:
-	/* We found where to put this one.  Check for overlap with
-	 * preceding fragment, and, if needed, align things so that
-	 * any overlaps are eliminated.
+	/* RFC5722, Section 4, amended by Errata ID : 3089
+	 *                          When reassembling an IPv6 datagram, if
+	 *   one or more its constituent fragments is determined to be an
+	 *   overlapping fragment, the entire datagram (and any constituent
+	 *   fragments) MUST be silently discarded.
+	 *
+	 * We do the same here for IPv4.
 	 */
-	if (prev) {
-		int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-		if (i > 0) {
-			offset += i;
-			err = -EINVAL;
-			if (end <= offset)
-				goto err;
-			err = -ENOMEM;
-			if (!pskb_pull(skb, i))
-				goto err;
-			if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-				skb->ip_summed = CHECKSUM_NONE;
-		}
-	}
+	/* Is there an overlap with the previous fragment? */
+	if (prev &&
+	    (prev->ip_defrag_offset + prev->len) > offset)
+		goto discard_qp;
 
-	err = -ENOMEM;
-
-	while (next && next->ip_defrag_offset < end) {
-		int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-		if (i < next->len) {
-			int delta = -next->truesize;
-
-			/* Eat head of the next overlapped fragment
-			 * and leave the loop. The next ones cannot overlap.
-			 */
-			if (!pskb_pull(next, i))
-				goto err;
-			delta += next->truesize;
-			if (delta)
-				add_frag_mem_limit(qp->q.net, delta);
-			next->ip_defrag_offset += i;
-			qp->q.meat -= i;
-			if (next->ip_summed != CHECKSUM_UNNECESSARY)
-				next->ip_summed = CHECKSUM_NONE;
-			break;
-		} else {
-			struct sk_buff *free_it = next;
-
-			/* Old fragment is completely overridden with
-			 * new one drop it.
-			 */
-			next = next->next;
-
-			if (prev)
-				prev->next = next;
-			else
-				qp->q.fragments = next;
-
-			qp->q.meat -= free_it->len;
-			sub_frag_mem_limit(qp->q.net, free_it->truesize);
-			kfree_skb(free_it);
-		}
-	}
+	/* Is there an overlap with the next fragment? */
+	if (next && next->ip_defrag_offset < end)
+		goto discard_qp;
 
 	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
 	dev = skb->dev;
@@ -461,6 +420,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+discard_qp:
+	inet_frag_kill(&qp->q);
+	err = -EINVAL;
+	__IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index aa1e52587bf5..ec48d8eafc7e 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -134,6 +134,7 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_ITEM("InECT1Pkts", IPSTATS_MIB_ECT1PKTS),
 	SNMP_MIB_ITEM("InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("InCEPkts", IPSTATS_MIB_CEPKTS),
+	SNMP_MIB_ITEM("ReasmOverlaps", IPSTATS_MIB_REASM_OVERLAPS),
 	SNMP_MIB_SENTINEL
 };
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH stable 4.9 16/29] inet: frags: reorganize struct netns_frags
From: Florian Fainelli @ 2018-10-09 22:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, gregkh, stable, edumazet, sthemmin
In-Reply-To: <20181009224924.30151-1-f.fainelli@gmail.com>

From: Eric Dumazet <edumazet@google.com>

Put the read-mostly fields in a separate cache line
at the beginning of struct netns_frags, to reduce
false sharing noticed in inet_frag_kill()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0)
---
 include/net/inet_frag.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index dea175f3418a..f47678d2ccc2 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -4,16 +4,17 @@
 #include <linux/rhashtable.h>
 
 struct netns_frags {
-	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
-
-	/* Keep atomic mem on separate cachelines in structs that include it */
-	atomic_long_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
 	long			high_thresh;
 	long			low_thresh;
 	int			timeout;
 	int			max_dist;
 	struct inet_frags	*f;
+
+	struct rhashtable       rhashtable ____cacheline_aligned_in_smp;
+
+	/* Keep atomic mem on separate cachelines in structs that include it */
+	atomic_long_t		mem ____cacheline_aligned_in_smp;
 };
 
 /**
-- 
2.17.1

^ permalink raw reply related

* [PATCH stable 4.9 13/29] inet: frags: do not clone skb in ip_expire()
From: Florian Fainelli @ 2018-10-09 22:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, gregkh, stable, edumazet, sthemmin
In-Reply-To: <20181009224924.30151-1-f.fainelli@gmail.com>

From: Eric Dumazet <edumazet@google.com>

An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
---
 net/ipv4/ip_fragment.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 3dd19bebeb55..e235f62dab58 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -141,8 +141,8 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-	struct sk_buff *clone, *head;
 	const struct iphdr *iph;
+	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -185,16 +185,12 @@ static void ip_expire(unsigned long arg)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	clone = skb_clone(head, GFP_ATOMIC);
+	skb_get(head);
+	spin_unlock(&qp->q.lock);
+	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+	kfree_skb(head);
+	goto out_rcu_unlock;
 
-	/* Send an ICMP "Fragment Reassembly Timeout" message. */
-	if (clone) {
-		spin_unlock(&qp->q.lock);
-		icmp_send(clone, ICMP_TIME_EXCEEDED,
-			  ICMP_EXC_FRAGTIME, 0);
-		consume_skb(clone);
-		goto out_rcu_unlock;
-	}
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
-- 
2.17.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox