Netdev List
* Re: [PATCH net] net: hns: fix LED configuration for marvell phy
From: Pavel Machek @ 2019-07-28 13:24 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: liuyonglong, David Miller, netdev, linux-kernel, linuxarm,
	salil.mehta, yisen.zhuang, shiju.jose
In-Reply-To: <20190725042829.GB14276@lunn.ch>

On Thu 2019-07-25 06:28:29, Andrew Lunn wrote:
> On Thu, Jul 25, 2019 at 11:00:08AM +0800, liuyonglong wrote:
> > > Revert "net: hns: fix LED configuration for marvell phy"
> > > This reverts commit f4e5f775db5a4631300dccd0de5eafb50a77c131.
> > >
> > > Andrew Lunn says this should be handled another way.
> > >
> > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > 
> > Hi Andrew:
> > 
> > I see this patch has been reverted. Can you tell me a better way to do this?
> > Thanks very much!
> 
> Please take a look at the work Matthias Kaehlcke is doing. It has not
> got too far yet, but when it is complete, it should define a generic
> way to configure PHY LEDs.

I don't remember a PHY LED discussion on the LED mailing list. Would you have a pointer?
Would it make sense to coordinate with the LED subsystem?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Eric Dumazet @ 2019-07-28 13:54 UTC (permalink / raw)
  To: Josh Hunt; +Cc: netdev, David Miller
In-Reply-To: <a9ec9cfd-c381-c02e-7d67-e24373c693d6@akamai.com>

On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>
> On 7/27/19 12:05 AM, Eric Dumazet wrote:
> > On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
> >>
> >> The current implementation of TCP MTU probing can considerably
> >> underestimate the MTU on lossy connections allowing the MSS to get down to
> >> 48. We have found that in almost all of these cases on our networks these
> >> paths can handle much larger MTUs meaning the connections are being
> >> artificially limited. Even though TCP MTU probing can raise the MSS back up
> >> we have seen this not to be the case causing connections to be "stuck" with
> >> an MSS of 48 when heavy loss is present.
> >>
> >> Prior to pushing out this change we could not keep TCP MTU probing enabled
> >> b/c of the above reasons. Now with a reasonable floor set we've had it
> >> enabled for the past 6 months.
> >
> > And what reasonable value have you used ???
>
> Reasonable for some may not be reasonable for others hence the new
> sysctl :) We're currently running with a fairly high value based off of
> the v6 min MTU minus headers and options, etc. We went conservative with
> our setting initially as it seemed a reasonable first step when
> re-enabling TCP MTU probing since with no configurable floor we saw a #
> of cases where connections were using severely reduced mss b/c of loss
> and not b/c of actual path restriction. I plan to reevaluate the setting
> at some point, but since the probing method is still the same it means
> the same clients who got stuck with mss of 48 before will land at
> whatever floor we set. Looking forward we are interested in trying to
> improve TCP MTU probing so it does not penalize clients like this.
>
> A suggestion for a more reasonable floor default would be 512, which is
> the same as the min_pmtu. Given both mechanisms are trying to achieve
> the same goal it seems like they should have a similar min/floor.
>
> >
> >>
> >> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
> >> administrators the ability to control the floor of MSS probing.
> >>
> >> Signed-off-by: Josh Hunt <johunt@akamai.com>
> >> ---
> >>   Documentation/networking/ip-sysctl.txt | 6 ++++++
> >>   include/net/netns/ipv4.h               | 1 +
> >>   net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
> >>   net/ipv4/tcp_ipv4.c                    | 1 +
> >>   net/ipv4/tcp_timer.c                   | 2 +-
> >>   5 files changed, 18 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> >> index df33674799b5..49e95f438ed7 100644
> >> --- a/Documentation/networking/ip-sysctl.txt
> >> +++ b/Documentation/networking/ip-sysctl.txt
> >> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
> >>          Path MTU discovery (MTU probing).  If MTU probing is enabled,
> >>          this is the initial MSS used by the connection.
> >>
> >> +tcp_mtu_probe_floor - INTEGER
> >> +       If MTU probing is enabled this caps the minimum MSS used for search_low
> >> +       for the connection.
> >> +
> >> +       Default : 48
> >> +
> >>   tcp_min_snd_mss - INTEGER
> >>          TCP SYN and SYNACK messages usually advertise an ADVMSS option,
> >>          as described in RFC 1122 and RFC 6691.
> >> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> >> index bc24a8ec1ce5..c0c0791b1912 100644
> >> --- a/include/net/netns/ipv4.h
> >> +++ b/include/net/netns/ipv4.h
> >> @@ -116,6 +116,7 @@ struct netns_ipv4 {
> >>          int sysctl_tcp_l3mdev_accept;
> >>   #endif
> >>          int sysctl_tcp_mtu_probing;
> >> +       int sysctl_tcp_mtu_probe_floor;
> >>          int sysctl_tcp_base_mss;
> >>          int sysctl_tcp_min_snd_mss;
> >>          int sysctl_tcp_probe_threshold;
> >> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> >> index 0b980e841927..59ded25acd04 100644
> >> --- a/net/ipv4/sysctl_net_ipv4.c
> >> +++ b/net/ipv4/sysctl_net_ipv4.c
> >> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
> >>                  .extra2         = &tcp_min_snd_mss_max,
> >>          },
> >>          {
> >> +               .procname       = "tcp_mtu_probe_floor",
> >> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
> >> +               .maxlen         = sizeof(int),
> >> +               .mode           = 0644,
> >> +               .proc_handler   = proc_dointvec_minmax,
> >> +               .extra1         = &tcp_min_snd_mss_min,
> >> +               .extra2         = &tcp_min_snd_mss_max,
> >> +       },
> >> +       {
> >>                  .procname       = "tcp_probe_threshold",
> >>                  .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
> >>                  .maxlen         = sizeof(int),
> >> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> >> index d57641cb3477..e0a372676329 100644
> >> --- a/net/ipv4/tcp_ipv4.c
> >> +++ b/net/ipv4/tcp_ipv4.c
> >> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
> >>          net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
> >>          net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
> >>          net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
> >> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
> >>
> >>          net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
> >>          net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
> >> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >> index c801cd37cc2a..dbd9d2d0ee63 100644
> >> --- a/net/ipv4/tcp_timer.c
> >> +++ b/net/ipv4/tcp_timer.c
> >> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
> >>          } else {
> >>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> >>                  mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
> >> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
> >> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
> >>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
> >>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> >>          }
> >
> >
> > Existing sysctl should be enough ?
>
> I don't think so. Changing tcp_min_snd_mss could impact clients that
> really want/need a small mss. When you added the new sysctl I tried to
> analyze the mss values we're seeing to understand what we could possibly
> raise it to. While not a huge amount, we see more clients than I
> expected announcing mss values in the 180-512 range. Given that I would
> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
> above.

If these clients need MSS values in the 180-512 range, how would MTU
probing work for them if you set the floor to 512?

Are we sure the intent of tcp_base_mss was not to act as a floor ?

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
                icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
        } else {
                mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
-               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
+               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
                mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
                mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
                icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);



>
> >
> > tcp_min_snd_mss  documentation could be slightly updated.
> >
> > And maybe its default value could be raised a bit.
> >
>
> Thanks
> Josh
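
To make the arithmetic in this exchange easier to follow, here is a minimal
userspace sketch of the clamping being discussed. It is not the kernel code:
the helper names are made up for this note, and the header conversions done
by tcp_mtu_to_mss()/tcp_mss_to_mtu() as well as the "68 - tcp_header_len"
term are deliberately left out. It assumes the else branch runs once per
probe timeout.

```c
/* Hypothetical userspace helpers; names are not from the kernel. */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* One probe timeout with the proposed floor: halve the current value,
 * cap it at base_mss, then apply the floor and tcp_min_snd_mss. */
int step_with_floor(int mss, int base_mss, int floor, int min_snd_mss)
{
	mss >>= 1;
	mss = imin(base_mss, mss);
	mss = imax(mss, floor);
	return imax(mss, min_snd_mss);
}

/* One probe timeout with base_mss acting as the floor (min -> max),
 * as in the one-line diff suggested in the reply. */
int step_base_as_floor(int mss, int base_mss, int min_snd_mss)
{
	mss >>= 1;
	mss = imax(base_mss, mss);
	return imax(mss, min_snd_mss);
}

/* Apply step_with_floor() n times, modelling n consecutive timeouts. */
int after_timeouts(int mss, int n, int base_mss, int floor, int min_snd_mss)
{
	while (n-- > 0)
		mss = step_with_floor(mss, base_mss, floor, min_snd_mss);
	return mss;
}
```

Under these assumptions, starting from 1460 with base_mss 1024, five
consecutive timeouts reach 48 with the default floor
(after_timeouts(1460, 5, 1024, 48, 48)), while a floor of 512 stops the
descent at 512. With base_mss treated as the floor, the value never drops
below 1024.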


* [PATCH net-next] rt2800usb: Add new rt2800usb device PLANEX GW-USMicroN
From: Masanari Iida @ 2019-07-28 14:07 UTC (permalink / raw)
  To: sgruszka, helmut.schaa, kvalo, davem, linux-wireless, netdev,
	linux-kernel
  Cc: Masanari Iida

This patch adds a device ID for PLANEX GW-USMicroN.
Without this patch, I had to echo the device ID manually for the
device to be recognized.

# lsusb |grep PLANEX
Bus 002 Device 005: ID 2019:ed14 PLANEX GW-USMicroN

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
 drivers/net/wireless/ralink/rt2x00/rt2800usb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
index fdf0504b5f1d..0dfb55c69b73 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
@@ -1086,6 +1086,7 @@ static const struct usb_device_id rt2800usb_device_table[] = {
 	{ USB_DEVICE(0x0846, 0x9013) },
 	{ USB_DEVICE(0x0846, 0x9019) },
 	/* Planex */
+	{ USB_DEVICE(0x2019, 0xed14) },
 	{ USB_DEVICE(0x2019, 0xed19) },
 	/* Ralink */
 	{ USB_DEVICE(0x148f, 0x3573) },
-- 
2.22.0.545.g9c9b961d7eb1



* Re: [PATCH] gigaset: stop maintaining seperately
From: Tilman Schmidt @ 2019-07-28 14:17 UTC (permalink / raw)
  To: Paul Bolle
  Cc: David Miller, Hansjoerg Lipp, Arnd Bergmann, Karsten Keil, netdev,
	linux-kernel
In-Reply-To: <20190726220541.28783-1-pebolle@tiscali.nl>

Thanks to you, Paul, for all your contributions, and specifically for
keeping the driver maintained for four more years after I had to abandon
it for the same reason.

I had a lot of fun working on that driver and I learned a lot in the
course. Now it's time to move on without regrets.

All the best,
Tilman

Am 27.07.2019 um 00:05 schrieb Paul Bolle:
> The Dutch consumer grade ISDN network will be shut down on September 1,
> 2019. This means I'll be converted to some sort of VOIP shortly. At that
> point it would be unwise to try to maintain the gigaset driver, even for
> odd fixes as I do. So I'll stop maintaining it as a separate driver and
> bump support to CAPI in staging. De facto this means the driver will be
> unmaintained, since no-one seems to be working on CAPI.
> 
> I've lightly tested the hardware specific modules of this driver (bas-gigaset,
> ser-gigaset, and usb-gigaset) for v5.3-rc1. The basic functionality appears to
> be working. It's unclear whether anyone still cares. I'm aware of only one
> person sort of using the driver a few years ago.
> 
> Thanks to Karsten Keil for the ISDN subsystems gigaset was using (I4L and
> CAPI). And many thanks to Hansjoerg Lipp and Tilman Schmidt for writing and
> upstreaming this driver.
> 
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
>  MAINTAINERS | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 783569e3c4b4..e99afbd13355 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6822,13 +6822,6 @@ F:	Documentation/filesystems/gfs2*.txt
>  F:	fs/gfs2/
>  F:	include/uapi/linux/gfs2_ondisk.h
>  
> -GIGASET ISDN DRIVERS
> -M:	Paul Bolle <pebolle@tiscali.nl>
> -L:	gigaset307x-common@lists.sourceforge.net
> -W:	http://gigaset307x.sourceforge.net/
> -S:	Odd Fixes
> -F:	drivers/staging/isdn/gigaset/
> -
>  GNSS SUBSYSTEM
>  M:	Johan Hovold <johan@kernel.org>
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/johan/gnss.git
> 


* Re: memory leak in fdb_create
From: syzbot @ 2019-07-28 14:20 UTC (permalink / raw)
  To: bridge, bsingharora, coreteam, davem, duwe, kaber, kadlec,
	linux-kernel, mingo, mpe, netdev, netfilter-devel, nikolay, pablo,
	roopa, rostedt, syzkaller-bugs
In-Reply-To: <0000000000005e6124058c0cbdbe@google.com>

syzbot has bisected this bug to:

commit 04cf31a759ef575f750a63777cee95500e410994
Author: Michael Ellerman <mpe@ellerman.id.au>
Date:   Thu Mar 24 11:04:01 2016 +0000

     ftrace: Make ftrace_location_range() global

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1538c778600000
start commit:   abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=1738c778600000
console output: https://syzkaller.appspot.com/x/log.txt?x=1338c778600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=56f1da14935c3cce
dashboard link: https://syzkaller.appspot.com/bug?extid=88533dc8b582309bf3ee
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16de5c06a00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10546026a00000

Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
Fixes: 04cf31a759ef ("ftrace: Make ftrace_location_range() global")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection


* Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Matteo Croce @ 2019-07-28 14:30 UTC (permalink / raw)
  To: Antoine Tenart, Marcin Wojtas, Stefan Chulski, Maxime Chevallier
  Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhycOc8mvqeQDBcnXueUjrFQMC7hdfAOkxr5k0+xc_tnDw@mail.gmail.com>

On Sun, Jul 28, 2019 at 3:36 AM Matteo Croce <mcroce@redhat.com> wrote:
>
> On Fri, Jul 26, 2019 at 2:57 PM Antoine Tenart
> <antoine.tenart@bootlin.com> wrote:
> >
> > Hi Matteo,
> >
> > On Fri, Jul 26, 2019 at 01:15:46AM +0200, Matteo Croce wrote:
> > > The hardware can only offload checksum calculation on first port
> > > due to the Tx FIFO size limitation. Document this in a comment.
> > >
> > > Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
> > > Signed-off-by: Matteo Croce <mcroce@redhat.com>
> >
> > Looks good. Please note there's a similar code path in the probe.
> > You could also add a comment there (or move this check/comment in a
> > common place).
> >
> > Thanks!
> > Antoine
> >
>
> Hi Antoine,
>
> I was making a v2, when I looked at the mvpp2_port_probe() which does:
>
> --------------------------------%<------------------------------
> features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> NETIF_F_TSO;
>
> if (port->pool_long->id == MVPP2_BM_JUMBO && port->id != 0) {
>     dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
>     dev->hw_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
> }
>
> dev->vlan_features |= features;
> -------------------------------->%------------------------------
>
> Is it ok to remove NETIF_F_IP*_CSUM from dev->features and
> dev->hw_features but keep it in dev->vlan_features?
>
> Regards,
> --
> Matteo Croce
> per aspera ad upstream

Hi all,

dev->vlan_features can probably keep the CSUM features safely, which avoids
unnecessary calculation in some cases, but I have another question.
Does the PP2 hardware support checksumming at any offset? I
replaced 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
then stacked 5 VXLANs on top of a mvpp2 device, so that the last IP
header sits at offset 264:

ip link set $dev up
ip addr add 192.168.0.$last/24 dev $dev

for i in {1..5}; do
	ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-1)).$other
	ip link set vx$i up
	ip addr add 192.168.$i.$last/24 dev vx$i
done

00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348: 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298: 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248: 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198: 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148: 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98: 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64

It seems that the HW is capable of doing it, can someone with a
datasheet confirm this?

Regards,
-- 
Matteo Croce
per aspera ad upstream


* RE: [EXT] Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Stefan Chulski @ 2019-07-28 15:22 UTC (permalink / raw)
  To: Matteo Croce, Antoine Tenart, Marcin Wojtas, Maxime Chevallier
  Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhz+PezeLT+gyXdsnyJz2dnKpYkcb2HbqvXJoLdzNxuC6g@mail.gmail.com>

> Hi all,
> 
> probably dev->vlan_features is safe to keep the CSUM features to avoid
> unnecessary calculation in some cases, but I have another question.
> Does the PP2 hardware support checksumming within any offset? I replaced
> 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
> then stacked 5 VxLANS on top of a mvpp2 device, to have the last IP header
> at offset 264:
> 
> ip link set $dev up
> ip addr add 192.168.0.$last/24 dev $dev
> 
> for i in {1..5}; do
> 	ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-1)).$other
> 	ip link set vx$i up
> 	ip addr add 192.168.$i.$last/24 dev vx$i
> done
> 
> 00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348: 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
> 02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298: 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
> 12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248: 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
> c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198: 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
> 02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148: 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
> a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98: 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64
> 
> It seems that the HW is capable of doing it, can someone with a datasheet
> confirm this?

The L3_offset field in the TX descriptor has 7 bits, so the start of Layer 3 must be at an offset below 128 bytes.

Regards,
Stefan
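
As a rough cross-check of the numbers in this exchange, the sketch below
computes where the innermost IP header lands when VXLAN devices are stacked
and whether that offset fits in a 7-bit field. It is a userspace
illustration only: the function and macro names are invented for this note,
and it assumes IPv4 outer headers without VLAN tags, so each VXLAN level
adds 50 bytes (which matches the 50-byte steps in the capture lengths
above).

```c
#include <stdbool.h>

#define ETH_HDR_LEN     14 /* Ethernet header, no VLAN tag */
#define VXLAN_OVERHEAD  50 /* outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8 */
#define L3_OFFSET_BITS  7  /* field width quoted for the TX descriptor */

/* Offset of the innermost IP header after `levels` stacked VXLAN devices. */
unsigned int inner_l3_offset(unsigned int levels)
{
	return ETH_HDR_LEN + levels * VXLAN_OVERHEAD;
}

/* True if the offset is representable in a 7-bit field (0..127). */
bool l3_offset_fits(unsigned int offset)
{
	return offset < (1u << L3_OFFSET_BITS);
}
```

With five levels the innermost header sits at offset 264, which does not
fit; under the same assumptions two levels (offset 114) would fit and three
(offset 164) would not.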


* Re: ip route JSON format is unparseable for "unreachable" routes
From: Stephen Hemminger @ 2019-07-28 16:15 UTC (permalink / raw)
  To: Michael Ziegler; +Cc: netdev
In-Reply-To: <6e88311b-5edc-4c62-1581-0f5b160a5f4e@michaelziegler.name>

On Sun, 28 Jul 2019 13:09:55 +0200
Michael Ziegler <ich@michaelziegler.name> wrote:

> Hi,
> 
> I created a couple "unreachable" routes on one of my systems, like such:
> 
> > ip route add unreachable 10.0.0.0/8     metric 255
> > ip route add unreachable 192.168.0.0/16 metric 255  
> 
> Unfortunately this results in unparseable JSON output from "ip":
> 
> > # ip -j route show  | jq .
> > parse error: Objects must consist of key:value pairs at line 1, column 84  
> 
> The offending JSON objects are these:
> 
> > {"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}
> > {"unreachable","dst":"192.168.0.0/16","metric":255,"flags":[]}  
> "unreachable" cannot appear on its own here, it needs to be some kind of
> field.
> 
> The manpage says to report here, thus I do :) I've searched the
> archives, but I wasn't able to find any existing bug reports about this.
> I'm running version
> 
> > ip utility, iproute2-ss190107  
> 
> on Debian Buster.
> 
> Regards,
> Michael.

Already fixed upstream by:

commit 073661773872709518d35d4d093f3a715281f21d
Author: Matteo Croce <mcroce@redhat.com>
Date:   Mon Mar 18 18:19:29 2019 +0100

    ip route: print route type in JSON output
    
    ip route generates an invalid JSON if the route type has to be printed,
    e.g. when detailed mode is active, or the type is different than unicast:
    
        $ ip -d -j -p route show
        [ {"unicast",
                "dst": "192.168.122.0/24",
                "dev": "virbr0",
                "protocol": "kernel",
                "scope": "link",
                "prefsrc": "192.168.122.1",
                "flags": [ "linkdown" ]
            } ]
    
        $ ip -j -p route show
        [ {"unreachable",
                "dst": "192.168.23.0/24",
                "flags": [ ]
            },{"prohibit",
                "dst": "192.168.24.0/24",
                "flags": [ ]
            },{"blackhole",
                "dst": "192.168.25.0/24",
                "flags": [ ]
            } ]
    
    Fix it by printing the route type as the "type" attribute:
    
        $ ip -d -j -p route show
        [ {
                "type": "unicast",
                "dst": "default",
                "gateway": "192.168.85.1",
                "dev": "wlp3s0",
                "protocol": "dhcp",
                "scope": "global",
                "metric": 600,
                "flags": [ ]
            },{
                "type": "unreachable",
                "dst": "192.168.23.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            },{
                "type": "prohibit",
                "dst": "192.168.24.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            },{
                "type": "blackhole",
                "dst": "192.168.25.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            } ]
    
    Fixes: 663c3cb23103 ("iproute: implement JSON and color output")
    Acked-by: Phil Sutter <phil@nwl.cc>
    Reviewed-and-tested-by: Andrea Claudi <aclaudi@redhat.com>
    Signed-off-by: Matteo Croce <mcroce@redhat.com>
    Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
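
For illustration only, a small C sketch of the shape of the fix described in
the commit message: the route type is emitted as the value of a "type" key
instead of as a bare string, so every member of the object is a key:value
pair. The function name format_route_json and its signature are made up for
this note and are not iproute2 code; the inputs are assumed to need no JSON
escaping.

```c
#include <stdio.h>

/* Writes one route object into buf and returns snprintf()'s result.
 * `type` and `dst` are assumed to contain no characters that would
 * need JSON escaping. */
int format_route_json(char *buf, size_t size, const char *type,
		      const char *dst)
{
	return snprintf(buf, size,
			"{\"type\":\"%s\",\"dst\":\"%s\",\"flags\":[]}",
			type, dst);
}
```

Called with "unreachable" and "10.0.0.0/8", this produces
{"type":"unreachable","dst":"10.0.0.0/8","flags":[]}, which a JSON parser
such as jq accepts.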


* Re: memory leak in fdb_create
From: Nikolay Aleksandrov @ 2019-07-28 16:51 UTC (permalink / raw)
  To: syzbot, bridge, bsingharora, coreteam, davem, duwe, kaber, kadlec,
	linux-kernel, mingo, mpe, netdev, netfilter-devel, pablo, roopa,
	rostedt, syzkaller-bugs
In-Reply-To: <0000000000008be1b2058ebe7805@google.com>

On 28/07/2019 17:20, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 04cf31a759ef575f750a63777cee95500e410994
> Author: Michael Ellerman <mpe@ellerman.id.au>
> Date:   Thu Mar 24 11:04:01 2016 +0000
> 
>     ftrace: Make ftrace_location_range() global
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1538c778600000
> start commit:   abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
> git tree:       upstream
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=1738c778600000
> console output: https://syzkaller.appspot.com/x/log.txt?x=1338c778600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=56f1da14935c3cce
> dashboard link: https://syzkaller.appspot.com/bug?extid=88533dc8b582309bf3ee
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16de5c06a00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10546026a00000
> 
> Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
> Fixes: 04cf31a759ef ("ftrace: Make ftrace_location_range() global")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

I see the problem. If the multicast stats memory allocation fails on bridge
init, the fdb added due to the default vlan remains and the bridge kmem cache
is destroyed while not empty (you can even trigger a BUG because of that).
I'll post a patch shortly, after running a few tests.

Thanks,
 Nik



* [PATCH net v2] mvpp2: refactor the HW checksum setup
From: Matteo Croce @ 2019-07-28 17:35 UTC (permalink / raw)
  To: netdev
  Cc: Antoine Tenart, Maxime Chevallier, Marcin Wojtas, Stefan Chulski,
	LKML, David Miller

The hardware can only offload checksum calculation on the first port due to
the Tx FIFO size limitation, and the L3 offset must be below 128 bytes.
Document this in a comment and move the duplicated code into a function.

Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 35 ++++++++++++-------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 937e4b928b94..a99405135046 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -811,6 +811,26 @@ static int mvpp2_swf_bm_pool_init(struct mvpp2_port *port)
 	return 0;
 }
 
+static void mvpp2_set_hw_csum(struct mvpp2_port *port,
+			      enum mvpp2_bm_pool_log_num new_long_pool)
+{
+	const netdev_features_t csums = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+
+	/* Update L4 checksum when jumbo enable/disable on port.
+	 * Only port 0 supports hardware checksum offload due to
+	 * the Tx FIFO size limitation.
+	 * Also, don't set NETIF_F_HW_CSUM because L3_offset in TX descriptor
+	 * has 7 bits, so the maximum L3 offset is 127.
+	 */
+	if (new_long_pool == MVPP2_BM_JUMBO && port->id != 0) {
+		port->dev->features &= ~csums;
+		port->dev->hw_features &= ~csums;
+	} else {
+		port->dev->features |= csums;
+		port->dev->hw_features |= csums;
+	}
+}
+
 static int mvpp2_bm_update_mtu(struct net_device *dev, int mtu)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
@@ -843,15 +863,7 @@ static int mvpp2_bm_update_mtu(struct net_device *dev, int mtu)
 		/* Add port to new short & long pool */
 		mvpp2_swf_bm_pool_init(port);
 
-		/* Update L4 checksum when jumbo enable/disable on port */
-		if (new_long_pool == MVPP2_BM_JUMBO && port->id != 0) {
-			dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-			dev->hw_features &= ~(NETIF_F_IP_CSUM |
-					      NETIF_F_IPV6_CSUM);
-		} else {
-			dev->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-			dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-		}
+		mvpp2_set_hw_csum(port, new_long_pool);
 	}
 
 	dev->mtu = mtu;
@@ -5209,10 +5221,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		dev->features |= NETIF_F_NTUPLE;
 	}
 
-	if (port->pool_long->id == MVPP2_BM_JUMBO && port->id != 0) {
-		dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-		dev->hw_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-	}
+	mvpp2_set_hw_csum(port, port->pool_long->id);
 
 	dev->vlan_features |= features;
 	dev->gso_max_segs = MVPP2_MAX_TSO_SEGS;
-- 
2.21.0
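
The logic of the new helper can be tried out in userspace with the sketch
below. It is a simplified model, not driver code: the TOY_F_* bit values and
the function name are placeholders invented for this note, and the two
feature words of the real device are collapsed into a single argument.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; the kernel's NETIF_F_*
 * flags are defined differently. */
#define TOY_F_IP_CSUM    (1u << 0)
#define TOY_F_IPV6_CSUM  (1u << 1)

/* Returns `features` with the checksum bits cleared when the port uses
 * the jumbo pool and is not port 0, and with them set otherwise. */
uint32_t toy_set_hw_csum(uint32_t features, bool jumbo_pool, int port_id)
{
	const uint32_t csums = TOY_F_IP_CSUM | TOY_F_IPV6_CSUM;

	if (jumbo_pool && port_id != 0)
		return features & ~csums;
	return features | csums;
}
```

One thing the model makes visible: the helper sets the checksum bits in the
else branch as well, whereas the code it replaces in the probe path only
cleared them.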



* [PATCH net] net: bridge: delete local fdbs on device init failure
From: Nikolay Aleksandrov @ 2019-07-28 18:22 UTC (permalink / raw)
  To: netdev
  Cc: davem, roopa, bridge, Nikolay Aleksandrov,
	syzbot+88533dc8b582309bf3ee

On initialization failure we have to delete all local fdbs which were
inserted due to the default pvid. This problem has been present since the
inception of default_pvid. Note that currently there are 2 cases:
1) in br_dev_init() when br_multicast_init() fails
2) if register_netdevice() fails after calling ndo_init()

This patch takes care of both since br_vlan_flush() is called on both
occasions. Also the new fdb delete would be a no-op on normal bridge device
destruction since the local fdbs would've been already flushed by
br_dev_delete(). This is not an issue for ports since nbp_vlan_init() is
called last when adding a port thus nothing can fail after it.

Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
Tested with the provided reproducer and can no longer trigger the leak.
Also tested the br_multicast_init() failure manually by making it always
return an error.

 net/bridge/br_vlan.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 021cc9f66804..3e6a702e4c21 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -715,6 +715,11 @@ void br_vlan_flush(struct net_bridge *br)
 
 	ASSERT_RTNL();
 
+	/* delete auto-added default pvid local fdbs before flushing vlans
+	 * otherwise these will be leaked on bridge device init failure
+	 */
+	br_fdb_delete_by_port(br, NULL, 0, 1);
+
 	vg = br_vlan_group(br);
 	__vlan_flush(vg);
 	RCU_INIT_POINTER(br->vlgrp, NULL);
-- 
2.21.0



* [PATCH net] hv_sock: Fix hang when a connection is closed
From: Dexuan Cui @ 2019-07-28 18:32 UTC (permalink / raw)
  To: Sunil Muthuswamy, David Miller, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com


hvs_do_close_lock_held() may decrease the reference count to 0 and free the
sk struct completely, and then the following release_sock(sk) may hang.

Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org

---
With the proper kernel debugging options enabled, first a warning can
appear:

kworker/1:0/4467 is freeing memory ..., with a lock still held there!
stack backtrace:
Workqueue: events vmbus_onmessage_work [hv_vmbus]
Call Trace:
 dump_stack+0x67/0x90
 debug_check_no_locks_freed.cold.52+0x78/0x7d
 slab_free_freelist_hook+0x85/0x140
 kmem_cache_free+0xa5/0x380
 __sk_destruct+0x150/0x260
 hvs_close_connection+0x24/0x30 [hv_sock]
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

and then the following release_sock(sk) can hang:

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
...
irq event stamp: 62890
CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: G        W         5.2.0+ #39
Workqueue: events vmbus_onmessage_work [hv_vmbus]
RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
...
Call Trace:
 do_raw_spin_lock+0xab/0xb0
 release_sock+0x19/0xb0
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

 net/vmw_vsock/hyperv_transport.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index f2084e3f7aa4..efbda8ef1eff 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -309,9 +309,16 @@ static void hvs_close_connection(struct vmbus_channel *chan)
 {
 	struct sock *sk = get_per_channel_state(chan);
 
+	/* Grab an extra reference since hvs_do_close_lock_held() may decrease
+	 * the reference count to 0 by calling sock_put(sk).
+	 */
+	sock_hold(sk);
+
 	lock_sock(sk);
 	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
+
+	sock_put(sk);
 }
 
 static void hvs_open_connection(struct vmbus_channel *chan)
-- 
2.19.1
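
The reference-count pattern in the patch above can be modelled in userspace.
This is only a minimal sketch: struct fake_sock and the fake_* helpers are
hypothetical stand-ins invented for illustration, not kernel APIs, and the
"free" is recorded in a flag instead of releasing memory so that the model
itself stays safe to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sock: a reference count, a lock flag,
 * and flags recording whether and how the object was "freed".
 */
struct fake_sock {
	int refcnt;
	bool locked;
	bool freed;
	bool freed_while_locked;
};

static void fake_sock_hold(struct fake_sock *sk)
{
	sk->refcnt++;
}

/* Drop one reference; the object counts as freed when the count hits 0. */
static void fake_sock_put(struct fake_sock *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0) {
		sk->freed = true;
		sk->freed_while_locked = sk->locked;
	}
}

/* Stand-in for hvs_do_close_lock_held(): may drop the last reference. */
static void fake_do_close_lock_held(struct fake_sock *sk)
{
	fake_sock_put(sk);
}

/* Model of hvs_close_connection(), with or without the extra reference. */
static void fake_close_connection(struct fake_sock *sk, bool extra_ref)
{
	if (extra_ref)
		fake_sock_hold(sk);

	sk->locked = true;		/* models lock_sock() */
	fake_do_close_lock_held(sk);
	sk->locked = false;		/* models release_sock() */

	if (extra_ref)
		fake_sock_put(sk);
}
```

Without the extra reference, the model records the object as freed while the
lock is still held, which corresponds to the lockdep warning above. With the
extra reference, the final put happens only after the lock has been released.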


^ permalink raw reply related

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-28 19:15 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Horatiu Vultur, Nikolay Aleksandrov, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190727030223.GA29731@lunn.ch>

The 07/27/2019 05:02, Andrew Lunn wrote:
> > As you properly guessed, this model is quite different from what we are used to.
> 
> Yes, it takes a while to get the idea that the hardware is just an
> accelerator for what the Linux stack can already do. And if the switch
> cannot do some feature, pass the frame to Linux so it can handle it.
This is understood, and not that different from what we are used to.

The surprise was that all multicast traffic is made to go to the CPU.

> You need to keep in mind that there could be other ports in the bridge
> than switch ports, and those ports might be interested in the
> multicast traffic. Hence the CPU needs to see the traffic.
This is a good argument, but I was under the impression that not all HW/drivers
support foreign interfaces (see ocelot_netdevice_dev_check and
mlxsw_sp_port_dev_check).

> But IGMP snooping can be used to optimise this.
Yes, IGMP snooping can limit the storm of IP multicast traffic, but it does
nothing for L2 non-IP multicast traffic.

We could really use something similar for non-IP multicast MAC addresses.

Trying to get back to the original problem:

We have a network which implements the ODVA/DLR ring protocol. This protocol
sends out a beacon frame as often as every 3 us (as far as I recall, default I
believe is 400 us) to this MAC address: 01:21:6C:00:00:01.

Take a quick look at slide 10 in [1].

If we assume that the SwitchDev driver is implemented such that all multicast
traffic goes to the CPU, then we should really have a way to install a HW
offload path in the silicon, such that these packets do not go to the CPU (as
they are known not to be useful, and a frame every 3 us is a significant load
on small DMA connections and CPU resources).
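
To put rough numbers on that load, here is a back-of-the-envelope sketch. The
helper names are made up for illustration, and the 64-byte frame size used in
the note below is an assumption (the minimum Ethernet frame), not the actual
DLR beacon size.

```c
#include <assert.h>
#include <stdint.h>

/* Frames per second for a beacon sent once every interval_us microseconds. */
static uint64_t beacon_frames_per_sec(uint64_t interval_us)
{
	assert(interval_us > 0);
	return 1000000ULL / interval_us;
}

/* Bits per second on the wire: frame_bytes per frame plus 20 bytes of
 * preamble, start-of-frame delimiter and inter-frame gap.
 */
static uint64_t beacon_bits_per_sec(uint64_t interval_us, uint64_t frame_bytes)
{
	return beacon_frames_per_sec(interval_us) * (frame_bytes + 20) * 8;
}
```

A 3 us interval gives 333333 frames per second, against 2500 frames per second
at 400 us. With the assumed 64-byte frames, the 3 us case is roughly
224 Mbit/s on the wire.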

If we assume that the SwitchDev driver is implemented such that only "needed"
multicast packets go to the CPU, then we need a way to get these packets in
case we want to implement the DLR protocol.

I'm sure that both models can work, and I do not think that this is the main
issue here.

Our initial attempt was to allow installing static L2-MAC entries and appending
multiple ports to such an entry in the MAC table. This was rejected, for several
good reasons it seems. But I'm not sure it was clear what we wanted to achieve,
and why we find it to be important. Hopefully this is clear with a real world
use-case.

Any hints or ideas on what would be a better way to solve this problem will be
much appreciated.

/Allan

[1] https://www.odva.org/Portals/0/Library/Conference/2017-ODVA-Conference_Woods_High%20Availability_Guidelines%20for%20Use%20of%20DLR%20in%20EtherNetIP%20Networks_FINAL%20PPT.pdf

^ permalink raw reply

* Linux Plumbers BPF micro-conference CFP (reminder)
From: Alexei Starovoitov @ 2019-07-28 19:24 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, Network Development
In-Reply-To: <CAADnVQJ0ATngyqo8xjXdDsyFuuov3KRtbHMR1LcV8VnEDUK8Fg@mail.gmail.com>

Hey Folks,

The August 2nd deadline to submit a proposal for the BPF uconf
is quickly approaching.
If you're attending LPC in Lisbon and are interested
in the awesome BPF uconf, you need to submit a proposal.

Some of you already submitted them to lpc-bpf@vger
per instructions that were sent back on July 12.
Some proposals were sent via website.
We'd like all proposals to be visible on the website.
Could you please re-enter your proposal there?
Please go to:
https://www.linuxplumbersconf.org/event/4/abstracts/
click on 'submit new proposal'
and copy-paste what you've already sent to lpc-bpf@vger.
Much appreciated, and sorry for the confusion.

There is still room for a few new proposals,
but space is getting very limited.
Please don't delay.

Thanks!

> ---------- Forwarded message ---------
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, Jul 12, 2019 at 7:26 AM
> Subject: Linux Plumbers BPF micro-conference CFP (reminder)
> To: <bpf@vger.kernel.org>
> Cc: <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
> <xdp-newbies@vger.kernel.org>, <iovisor-dev@lists.iovisor.org>,
> <lpc-bpf@vger.kernel.org>, <alexei.starovoitov@gmail.com>
>
>
> This is a call for proposals for the BPF micro-conference at this
> years' Linux Plumbers Conference (LPC) 2019 which will be held in
> Lisbon, Portugal for September 9-11.
>
> The goal of the BPF micro-conference is to bring BPF developers
> together to discuss topics around Linux kernel work related to
> the BPF core infrastructure as well as its many subsystems under
> tracing, networking, security, and BPF user space tooling (LLVM,
> libbpf, bpftool and many others).
>
> The format of the micro-conference has a main focus on discussion,
> therefore each accepted topic will provide a short 1-2 slide
> introduction with subsequent discussion for the rest of the given
> time slot.
>
> The BPF micro-conference is a community-driven event and open to
> all LPC attendees, there is no additional registration required.
>
> Please submit your discussion proposals to the LPC BPF micro-conference
> organizers at:
>
>         lpc-bpf@vger.kernel.org
>
> Proposals must be submitted until August 2nd, and submitters will
> be notified of acceptance at latest by August 9. (Please note that
> proposals must not be sent as html mail as they are otherwise dropped
> by vger.)
>
> The format of the submission and many other details can be found at:
>
>         http://vger.kernel.org/lpc-bpf.html
>
> Looking forward to seeing you all in Lisbon in September!

^ permalink raw reply

* Re: [PATCH net-next v4 2/3] flow_offload: Support get default block from tc immediately
From: Jakub Kicinski @ 2019-07-28 20:16 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <1564296769-32294-3-git-send-email-wenxu@ucloud.cn>

On Sun, 28 Jul 2019 14:52:48 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> When thre indr device register, it can get the default block
> from tc immediately if the block is exist.
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>
> ---
> v3: no change
> v4: get tc default block without callback

Please stop reposting new versions of the patches while discussion is
ongoing; it makes it harder to follow.

The TC default block is there because the indirect registration may
happen _after_ the block is installed and populated.  It's the device
driver that usually does the indirect registration; the tunnel device
and its rules may already be set when the device driver is loaded or
reloaded.

I don't know the nft code, but it seems unlikely it wouldn't have the
same problem/need..

Please explain.

^ permalink raw reply

* Re: Slowness forming TIPC cluster with explicit node addresses
From: Chris Packham @ 2019-07-28 21:04 UTC (permalink / raw)
  To: jon.maloy@ericsson.com, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CH2PR15MB35754D65AB240A74AE488E719AC00@CH2PR15MB3575.namprd15.prod.outlook.com>

On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> 
> > 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> > On
> > Behalf Of Chris Packham
> > Sent: 25-Jul-19 19:37
> > To: tipc-discussion@lists.sourceforge.net
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Slowness forming TIPC cluster with explicit node addresses
> > 
> > Hi,
> > 
> > I'm having problems forming a TIPC cluster between 2 nodes.
> > 
> > This is the basic steps I'm going through on each node.
> > 
> > modprobe tipc
> > ip link set eth2 up
> > tipc node set addr 1.1.5 # or 1.1.6
> > tipc bearer enable media eth dev eth0
> eth2, I assume...
> 

Yes, sorry. I keep switching between Ethernet ports for testing,
so I hand-edited the email.

> > 
> > 
> > Then to confirm if the cluster is formed I use tipc link list
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > ...
> > 
> > Looking at tcpdump the two nodes are sending packets
> > 
> > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60 bytes,
> > MessageSize
> > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > request
> > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60 bytes,
> > MessageSize
> > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > request
> > 
> > Eventually (after a few minutes) the link does come up
> > 
> > [root@node-6 ~]# tipc link list
> > broadcast-link: up
> > 1001006:eth2-1001005:eth2: up
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > 1001005:eth2-1001006:eth2: up
> > 
> > When I remove the "tipc node set addr" things seem to kick into
> > life straight
> > away
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > 
> > So there appears to be some difference in behaviour between having
> > an
> > explicit node address and using the default. Unfortunately our
> > application
> > relies on setting the node addresses.
> I do this many times a day, without any problems. If there would be
> any time difference, I would expect the 'auto configurable' version
> to be slower, because it involves a DAD step.
> Are you sure you don't have any other nodes running in your system?
> 
> ///jon
> 

Nope, the two nodes are connected back to back. Does the number of
Ethernet interfaces make a difference? As you can see I've got 3 on
each node. One is completely disconnected, one is for booting over TFTP
(only used by U-Boot) and the other is the USB Ethernet I'm using for
testing.

> 
> > 
> > 
> > [root@node-5 ~]# uname -a
> > Linux linuxbox 5.2.0-at1+ #8 SMP Thu Jul 25 23:22:41 UTC 2019 ppc
> > GNU/Linux
> > 
> > Any thoughts on the problem?

^ permalink raw reply

* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-28 21:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <CANn89iLqeixzZkop8tqOQka_9ZiKurZL9Vj05bgU99M5Pbenqw@mail.gmail.com>

On 7/28/19 6:54 AM, Eric Dumazet wrote:
> On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>>
>> On 7/27/19 12:05 AM, Eric Dumazet wrote:
>>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>
>>>> The current implementation of TCP MTU probing can considerably
>>>> underestimate the MTU on lossy connections allowing the MSS to get down to
>>>> 48. We have found that in almost all of these cases on our networks these
>>>> paths can handle much larger MTUs meaning the connections are being
>>>> artificially limited. Even though TCP MTU probing can raise the MSS back up
>>>> we have seen this not to be the case causing connections to be "stuck" with
>>>> an MSS of 48 when heavy loss is present.
>>>>
>>>> Prior to pushing out this change we could not keep TCP MTU probing enabled
>>>> b/c of the above reasons. Now with a reasonble floor set we've had it
>>>> enabled for the past 6 months.
>>>
>>> And what reasonable value have you used ???
>>
>> Reasonable for some may not be reasonable for others hence the new
>> sysctl :) We're currently running with a fairly high value based off of
>> the v6 min MTU minus headers and options, etc. We went conservative with
>> our setting initially as it seemed a reasonable first step when
>> re-enabling TCP MTU probing since with no configurable floor we saw a #
>> of cases where connections were using severely reduced mss b/c of loss
>> and not b/c of actual path restriction. I plan to reevaluate the setting
>> at some point, but since the probing method is still the same it means
>> the same clients who got stuck with mss of 48 before will land at
>> whatever floor we set. Looking forward we are interested in trying to
>> improve TCP MTU probing so it does not penalize clients like this.
>>
>> A suggestion for a more reasonable floor default would be 512, which is
>> the same as the min_pmtu. Given both mechanisms are trying to achieve
>> the same goal it seems like they should have a similar min/floor.
>>
>>>
>>>>
>>>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>>>> administrators the ability to control the floor of MSS probing.
>>>>
>>>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>>>> ---
>>>>    Documentation/networking/ip-sysctl.txt | 6 ++++++
>>>>    include/net/netns/ipv4.h               | 1 +
>>>>    net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
>>>>    net/ipv4/tcp_ipv4.c                    | 1 +
>>>>    net/ipv4/tcp_timer.c                   | 2 +-
>>>>    5 files changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>> index df33674799b5..49e95f438ed7 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
>>>>           Path MTU discovery (MTU probing).  If MTU probing is enabled,
>>>>           this is the initial MSS used by the connection.
>>>>
>>>> +tcp_mtu_probe_floor - INTEGER
>>>> +       If MTU probing is enabled this caps the minimum MSS used for search_low
>>>> +       for the connection.
>>>> +
>>>> +       Default : 48
>>>> +
>>>>    tcp_min_snd_mss - INTEGER
>>>>           TCP SYN and SYNACK messages usually advertise an ADVMSS option,
>>>>           as described in RFC 1122 and RFC 6691.
>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>> index bc24a8ec1ce5..c0c0791b1912 100644
>>>> --- a/include/net/netns/ipv4.h
>>>> +++ b/include/net/netns/ipv4.h
>>>> @@ -116,6 +116,7 @@ struct netns_ipv4 {
>>>>           int sysctl_tcp_l3mdev_accept;
>>>>    #endif
>>>>           int sysctl_tcp_mtu_probing;
>>>> +       int sysctl_tcp_mtu_probe_floor;
>>>>           int sysctl_tcp_base_mss;
>>>>           int sysctl_tcp_min_snd_mss;
>>>>           int sysctl_tcp_probe_threshold;
>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>> index 0b980e841927..59ded25acd04 100644
>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
>>>>                   .extra2         = &tcp_min_snd_mss_max,
>>>>           },
>>>>           {
>>>> +               .procname       = "tcp_mtu_probe_floor",
>>>> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
>>>> +               .maxlen         = sizeof(int),
>>>> +               .mode           = 0644,
>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>> +               .extra1         = &tcp_min_snd_mss_min,
>>>> +               .extra2         = &tcp_min_snd_mss_max,
>>>> +       },
>>>> +       {
>>>>                   .procname       = "tcp_probe_threshold",
>>>>                   .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
>>>>                   .maxlen         = sizeof(int),
>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>> index d57641cb3477..e0a372676329 100644
>>>> --- a/net/ipv4/tcp_ipv4.c
>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
>>>>           net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>>>>           net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>>>           net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>>>> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
>>>>
>>>>           net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>>>>           net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>> index c801cd37cc2a..dbd9d2d0ee63 100644
>>>> --- a/net/ipv4/tcp_timer.c
>>>> +++ b/net/ipv4/tcp_timer.c
>>>> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>>>>           } else {
>>>>                   mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>>>>                   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>>> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
>>>>                   mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>>>                   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>>>           }
>>>
>>>
>>> Existing sysctl should be enough ?
>>
>> I don't think so. Changing tcp_min_snd_mss could impact clients that
>> really want/need a small mss. When you added the new sysctl I tried to
>> analyze the mss values we're seeing to understand what we could possibly
>> raise it to. While not a huge amount, we see more clients than I
>> expected announcing mss values in the 180-512 range. Given that I would
>> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
>> above.
> 
> If these clients need mss values in 180-512 ranges, how MTU probing
> would work for them,
> if you set a floor to 512 ?

First, we already seem to be fine with ignoring these paths with
ICMP-based PMTU discovery b/c of our min_pmtu default of 512, and that is
configurable. Second, by adding this sysctl we're giving administrators
the choice to decide if they'd like to attempt to support the very,
very small # of paths which may be below 512 (MSS <= 512 does not mean
MTU <= 512) or cover themselves by being able to raise the floor to not
penalize clients who may be on very lossy networks.
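
To illustrate why MSS <= 512 does not mean MTU <= 512: the advertised MSS
excludes the fixed IP and TCP headers, so the path MTU is larger than the MSS
by at least that overhead. A minimal sketch, assuming option-free headers
(20 bytes of IPv4 or 40 bytes of IPv6, plus 20 bytes of TCP). The function
names are hypothetical.

```c
#include <assert.h>

#define IPV4_HDR_LEN 20	/* IPv4 header without options */
#define IPV6_HDR_LEN 40	/* fixed IPv6 header */
#define TCP_HDR_LEN  20	/* TCP header without options */

/* Smallest MTU that can carry a segment of the given MSS over IPv4. */
static int ipv4_mtu_for_mss(int mss)
{
	assert(mss > 0);
	return mss + IPV4_HDR_LEN + TCP_HDR_LEN;
}

/* Same for IPv6. */
static int ipv6_mtu_for_mss(int mss)
{
	assert(mss > 0);
	return mss + IPV6_HDR_LEN + TCP_HDR_LEN;
}
```

So a client announcing an MSS of 512 over IPv4 implies an MTU of 552, and only
an MSS of 472 or less corresponds to an MTU at or below 512.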

> 
> Are we sure the intent of tcp_base_mss was not to act as a floor ?

My understanding is that tcp_base_mss is meant to be the initial value
of search_low (as per Docs). RFC 4821 [1] Section 7.2 shows that
search_low should be configurable, and in Section 7.7 we see that, in
response to successive black hole detection, search_low should be halved.
So I don't think it was meant to be a floor, but just the initial
search_low param. Also note that in that same section they suggest a floor
of 68 for v4, but a floor of 1280 for v6, which we do not adhere to
currently.
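
The halving and clamping can be modelled in userspace. This is a simplified
sketch of the arithmetic in the patched tcp_mtu_probing(), assuming all
quantities are already expressed as MSS values (the real code converts between
MTU and MSS with tcp_mtu_to_mss() and tcp_mss_to_mtu()). The function names
here are hypothetical.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* One black-hole round: halve the current value, cap it at base_mss, then
 * raise it to the probe floor and to min_snd_mss, in the patch's order.
 */
static int next_search_low_mss(int cur_mss, int base_mss,
			       int probe_floor, int min_snd_mss)
{
	int mss = cur_mss >> 1;

	mss = min_int(base_mss, mss);
	mss = max_int(mss, probe_floor);
	mss = max_int(mss, min_snd_mss);
	return mss;
}

/* Apply the update `rounds` times, as if black holes kept being detected. */
static int mss_after_rounds(int start_mss, int base_mss, int probe_floor,
			    int min_snd_mss, int rounds)
{
	int mss = start_mss;
	int i;

	assert(rounds >= 0);
	for (i = 0; i < rounds; i++)
		mss = next_search_low_mss(mss, base_mss, probe_floor,
					  min_snd_mss);
	return mss;
}
```

Starting from 1024 with a floor of 48, five rounds land at 48; with a floor of
512 the value stops at 512 after the first round and stays there.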

We actually set tcp_base_mss to something close to the value suggested 
towards the end of section 7.2 of the RFC of 1400 bytes minus IP and 
Transport overheads and options. This way we have more realistic 
searching based on the majority of clients that we see. The kernel winds 
up using initial search_low/tcp_base_mss as initial eff_pmtu, so we see 
something like:

21:03:41.314612 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:41.670307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:42.030308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:42.534307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:43.198308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:44.478307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:47.742310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
1:1349, ack 1, win 229, length 1348: HTTP
21:03:56.702310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
1:675, ack 1, win 229, length 674: HTTP

For further evidence this is a real problem, here's a sample of mss
values I found when originally investigating this problem for us:

# ss -emoitn | grep mss | sed "s/.*mss:\([0-9]*\).*/\1/" | sort -u | 
sort -g | head -5

36:11
64:7
72:1
128:13
144:4

From what I could tell these connections were on paths much larger than
the mss they were being forced to use. I determined this by looking at
the mss used for other objects fetched from the same IPs.

Josh

[1] - https://www.ietf.org/rfc/rfc4821.txt

> 
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e
> 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct
> inet_connection_sock *icsk, struct sock *sk)
>                  icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
>          } else {
>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> -               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
> +               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
>                  mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> 
> 
> 
>>
>>>
>>> tcp_min_snd_mss  documentation could be slightly updated.
>>>
>>> And maybe its default value could be raised a bit.
>>>
>>
>> Thanks
>> Josh

^ permalink raw reply

* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-28 21:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <5a054ca5-4077-5e91-69d5-f1add8dc8bfa@akamai.com>

On 7/28/19 2:14 PM, Josh Hunt wrote:
> On 7/28/19 6:54 AM, Eric Dumazet wrote:
>> On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>>>
>>> On 7/27/19 12:05 AM, Eric Dumazet wrote:
>>>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>>
>>>>> The current implementation of TCP MTU probing can considerably
>>>>> underestimate the MTU on lossy connections allowing the MSS to get 
>>>>> down to
>>>>> 48. We have found that in almost all of these cases on our networks 
>>>>> these
>>>>> paths can handle much larger MTUs meaning the connections are being
>>>>> artificially limited. Even though TCP MTU probing can raise the MSS 
>>>>> back up
>>>>> we have seen this not to be the case causing connections to be 
>>>>> "stuck" with
>>>>> an MSS of 48 when heavy loss is present.
>>>>>
>>>>> Prior to pushing out this change we could not keep TCP MTU probing 
>>>>> enabled
>>>>> b/c of the above reasons. Now with a reasonble floor set we've had it
>>>>> enabled for the past 6 months.
>>>>
>>>> And what reasonable value have you used ???
>>>
>>> Reasonable for some may not be reasonable for others hence the new
>>> sysctl :) We're currently running with a fairly high value based off of
>>> the v6 min MTU minus headers and options, etc. We went conservative with
>>> our setting initially as it seemed a reasonable first step when
>>> re-enabling TCP MTU probing since with no configurable floor we saw a #
>>> of cases where connections were using severely reduced mss b/c of loss
>>> and not b/c of actual path restriction. I plan to reevaluate the setting
>>> at some point, but since the probing method is still the same it means
>>> the same clients who got stuck with mss of 48 before will land at
>>> whatever floor we set. Looking forward we are interested in trying to
>>> improve TCP MTU probing so it does not penalize clients like this.
>>>
>>> A suggestion for a more reasonable floor default would be 512, which is
>>> the same as the min_pmtu. Given both mechanisms are trying to achieve
>>> the same goal it seems like they should have a similar min/floor.
>>>
>>>>
>>>>>
>>>>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>>>>> administrators the ability to control the floor of MSS probing.
>>>>>
>>>>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>>>>> ---
>>>>>    Documentation/networking/ip-sysctl.txt | 6 ++++++
>>>>>    include/net/netns/ipv4.h               | 1 +
>>>>>    net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
>>>>>    net/ipv4/tcp_ipv4.c                    | 1 +
>>>>>    net/ipv4/tcp_timer.c                   | 2 +-
>>>>>    5 files changed, 18 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/Documentation/networking/ip-sysctl.txt 
>>>>> b/Documentation/networking/ip-sysctl.txt
>>>>> index df33674799b5..49e95f438ed7 100644
>>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>>> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
>>>>>           Path MTU discovery (MTU probing).  If MTU probing is 
>>>>> enabled,
>>>>>           this is the initial MSS used by the connection.
>>>>>
>>>>> +tcp_mtu_probe_floor - INTEGER
>>>>> +       If MTU probing is enabled this caps the minimum MSS used 
>>>>> for search_low
>>>>> +       for the connection.
>>>>> +
>>>>> +       Default : 48
>>>>> +
>>>>>    tcp_min_snd_mss - INTEGER
>>>>>           TCP SYN and SYNACK messages usually advertise an ADVMSS 
>>>>> option,
>>>>>           as described in RFC 1122 and RFC 6691.
>>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>>> index bc24a8ec1ce5..c0c0791b1912 100644
>>>>> --- a/include/net/netns/ipv4.h
>>>>> +++ b/include/net/netns/ipv4.h
>>>>> @@ -116,6 +116,7 @@ struct netns_ipv4 {
>>>>>           int sysctl_tcp_l3mdev_accept;
>>>>>    #endif
>>>>>           int sysctl_tcp_mtu_probing;
>>>>> +       int sysctl_tcp_mtu_probe_floor;
>>>>>           int sysctl_tcp_base_mss;
>>>>>           int sysctl_tcp_min_snd_mss;
>>>>>           int sysctl_tcp_probe_threshold;
>>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>>> index 0b980e841927..59ded25acd04 100644
>>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>>> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
>>>>>                   .extra2         = &tcp_min_snd_mss_max,
>>>>>           },
>>>>>           {
>>>>> +               .procname       = "tcp_mtu_probe_floor",
>>>>> +               .data           = 
>>>>> &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
>>>>> +               .maxlen         = sizeof(int),
>>>>> +               .mode           = 0644,
>>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>>> +               .extra1         = &tcp_min_snd_mss_min,
>>>>> +               .extra2         = &tcp_min_snd_mss_max,
>>>>> +       },
>>>>> +       {
>>>>>                   .procname       = "tcp_probe_threshold",
>>>>>                   .data           = 
>>>>> &init_net.ipv4.sysctl_tcp_probe_threshold,
>>>>>                   .maxlen         = sizeof(int),
>>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>>> index d57641cb3477..e0a372676329 100644
>>>>> --- a/net/ipv4/tcp_ipv4.c
>>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>>> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net 
>>>>> *net)
>>>>>           net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>>>>>           net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>>>>           net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>>>>> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
>>>>>
>>>>>           net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>>>>>           net->ipv4.sysctl_tcp_keepalive_probes = 
>>>>> TCP_KEEPALIVE_PROBES;
>>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>>> index c801cd37cc2a..dbd9d2d0ee63 100644
>>>>> --- a/net/ipv4/tcp_timer.c
>>>>> +++ b/net/ipv4/tcp_timer.c
>>>>> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct 
>>>>> inet_connection_sock *icsk, struct sock *sk)
>>>>>           } else {
>>>>>                   mss = tcp_mtu_to_mss(sk, 
>>>>> icsk->icsk_mtup.search_low) >> 1;
>>>>>                   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>>> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>>>> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
>>>>>                   mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>>>>                   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, 
>>>>> mss);
>>>>>           }
>>>>
>>>>
>>>> Existing sysctl should be enough ?
>>>
>>> I don't think so. Changing tcp_min_snd_mss could impact clients that
>>> really want/need a small mss. When you added the new sysctl I tried to
>>> analyze the mss values we're seeing to understand what we could possibly
>>> raise it to. While not a huge amount, we see more clients than I
>>> expected announcing mss values in the 180-512 range. Given that I would
>>> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
>>> above.
>>
>> If these clients need mss values in 180-512 ranges, how MTU probing
>> would work for them,
>> if you set a floor to 512 ?
> 
> First, we already seem to be fine with ignoring these paths with ICMP 
> based PMTU discovery b/c of our min_pmtu default of 512 and that is 
> configurable. Second by adding this sysctl we're giving administrators 
> the choice to decide if they'd like to attempt to support these very 
> very small # of paths which may be below 512 (MSS <= 512 does not mean 
> MTU <= 512) or cover themselves by being able to raise the floor to not 
> penalize clients who may be on very lossy networks.
> 
>>
>> Are we sure the intent of tcp_base_mss was not to act as a floor ?
> 
> My understanding is that tcp_base_mss is meant to be the initial value 
> of search_low (as per Docs). Then in RFC 4821 [1] Sections 7.2, shows 
> search_low should be configurable, and 7.7 we see that in response to 
> successive black hole detection search_low should be halved. So I don't 
> think it was meant to be a floor, but just the initial search_low param. 
> Also note that in that same section they suggest a floor of 68 for v4, 
> but a floor of 1280 for v6 which we do not adhere to currently.
> 

Clarification. We == Akamai in regards to setting tcp_base_mss to 
1400-overheads. Upstream default is 1024.

> We actually set tcp_base_mss to something close to the value suggested 
> towards the end of section 7.2 of the RFC of 1400 bytes minus IP and 
> Transport overheads and options. This way we have more realistic 
> searching based on the majority of clients that we see. The kernel winds 
> up using initial search_low/tcp_base_mss as initial eff_pmtu, so we see 
> something like:
> 
> 21:03:41.314612 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:41.670307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:42.030308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:42.534307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:43.198308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:44.478307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:47.742310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
> 1:1349, ack 1, win 229, length 1348: HTTP
> 21:03:56.702310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
> 1:675, ack 1, win 229, length 674: HTTP
> 
> For further evidence this is a real problem here's a sample of mss 
> values I found when originally investigating this problem for us:
> 
> I dug up some #s I found when originally investigating this problem:
> 
> # ss -emoitn | grep mss | sed "s/.*mss:\([0-9]*\).*/\1/" | sort -u | 
> sort -g | head -5
> 
> 36:11
> 64:7
> 72:1
> 128:13
> 144:4
> 
>  From what I could tell these connections were on paths much larger than 
> the mss they were being forced to use. I determined this by looking at 
> the mss used for other objects fetched from the same IPs.
> 
> Josh
> 
> [1] - https://www.ietf.org/rfc/rfc4821.txt
> 
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>>                  icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
>>          } else {
>>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>> -               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>> +               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
>>                  mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>
>>
>>
>>>
>>>>
>>>> tcp_min_snd_mss  documentation could be slightly updated.
>>>>
>>>> And maybe its default value could be raised a bit.
>>>>
>>>
>>> Thanks
>>> Josh
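
To make the effect of the min()/max() change in the quoted hunk concrete, here is a minimal userspace sketch of the backoff arithmetic. It is only a model under stated assumptions: the helper names (backoff_step, backoff_n) are hypothetical, MSS values are used directly instead of going through tcp_mtu_to_mss()/tcp_mss_to_mtu(), and the constants (base MSS 1024, TCP header length 20, minimum send MSS 48) are illustrative inputs rather than values read from a running kernel.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * One backoff step, modeled on the quoted hunk.
 * Simplification (assumption): MSS values are used directly; the real code
 * converts search_low between MTU and MSS around this arithmetic.
 */
static int backoff_step(int cur_mss, int base_mss, int tcp_header_len,
                        int min_snd_mss, int clamp_with_max)
{
    int mss = cur_mss >> 1;                   /* halve the current estimate */

    if (clamp_with_max)
        mss = max_int(base_mss, mss);         /* proposed: base_mss acts as a floor */
    else
        mss = min_int(base_mss, mss);         /* existing: base_mss acts as a cap */

    mss = max_int(mss, 68 - tcp_header_len);  /* IPv4 minimum MTU of 68 */
    mss = max_int(mss, min_snd_mss);          /* tcp_min_snd_mss lower bound */
    return mss;
}

/* Apply the step repeatedly with illustrative constants (all assumed). */
static int backoff_n(int start_mss, int steps, int clamp_with_max)
{
    int mss = start_mss;

    assert(steps >= 0);
    for (int i = 0; i < steps; i++)
        mss = backoff_step(mss, 1024, 20, 48, clamp_with_max);
    return mss;
}
```

With these inputs, repeated steps using min() keep halving until the lower bounds take over, while max() holds the value at the base MSS.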

^ permalink raw reply

* Re: [PATCH net] net: hns: fix LED configuration for marvell phy
From: Andrew Lunn @ 2019-07-28 22:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: liuyonglong, David Miller, netdev, linux-kernel, linuxarm,
	salil.mehta, yisen.zhuang, shiju.jose
In-Reply-To: <20190728132412.GC8718@xo-6d-61-c0.localdomain>

On Sun, Jul 28, 2019 at 03:24:12PM +0200, Pavel Machek wrote:
> On Thu 2019-07-25 06:28:29, Andrew Lunn wrote:
> > On Thu, Jul 25, 2019 at 11:00:08AM +0800, liuyonglong wrote:
> > > > Revert "net: hns: fix LED configuration for marvell phy"
> > > > This reverts commit f4e5f775db5a4631300dccd0de5eafb50a77c131.
> > > >
> > > > Andrew Lunn says this should be handled another way.
> > > >
> > > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > > 
> > > 
> > > Hi Andrew:
> > > 
> > > I see this patch have been reverted, can you tell me the better way to do this?
> > > Thanks very much!
> > 
> > Please take a look at the work Matthias Kaehlcke is doing. It has not
> > got too far yet, but when it is complete, it should define a generic
> > way to configure PHY LEDs.
> 
> I don't remember PHY LED discussion from LED mailing list. Would you have a pointer?

Hi Pavel 

So far, it has not made it onto the generic LED list. And the current
implementation is unlikely to go as far as using the generic LED
code. But I would like the binding to be compatible with it, so that
some time in the future it could be migrated to being part of the
generic LED code. But that would also require extensions to the
generic LED code to support hardware offload of triggers.

	Andrew

^ permalink raw reply

* Re: [PATCH v3] net: dsa: qca8k: enable port flow control
From: Andrew Lunn @ 2019-07-28 22:31 UTC (permalink / raw)
  To: xiaofeis
  Cc: davem, vkoul, netdev, linux-arm-msm, bjorn.andersson,
	vivien.didelot, f.fainelli, niklas.cassel, xiazha
In-Reply-To: <1564275470-52666-1-git-send-email-xiaofeis@codeaurora.org>

On Sun, Jul 28, 2019 at 08:57:50AM +0800, xiaofeis wrote:
> Set phy device advertising to enable MAC flow control.

Hi Xiaofei.

This is half of the needed change for MAC flow control.

phy_support_asym_pause(phy) is used by the MAC to tell the PHY layer
that the MAC supports flow control. The PHY will then advertise
this. When auto-negotiation is completed, the PHY layer will call
qca8k_adjust_link() with the results. It could be that the peer does
not support flow control, or only supports symmetric flow control.  So
in that function, you need to program the MAC with the results of the
auto-neg. This is currently missing. You need to look at phydev->pause
and phydev->asym_pause to decide how to configure the MAC.

       Andrew
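
As a rough illustration of the decision described above, the following standalone sketch resolves which pause directions to enable from the local advertisement and the link partner's bits (the values phylib exposes as phydev->pause and phydev->asym_pause). It follows the usual IEEE 802.3 pause resolution rules as I understand them. The struct and function names are hypothetical, and no qca8k register programming is shown because the thread does not give it.

```c
#include <assert.h>
#include <stdbool.h>

/* Advertised pause capabilities of one link end (hypothetical type). */
struct pause_adv {
    bool pause;       /* symmetric PAUSE bit */
    bool asym_pause;  /* asymmetric direction (ASM_DIR) bit */
};

/* Result the MAC would be programmed with (hypothetical type). */
struct pause_result {
    bool tx_pause;    /* MAC may send pause frames */
    bool rx_pause;    /* MAC honors received pause frames */
};

/* Resolve pause settings from what both ends advertised. */
static struct pause_result resolve_pause(struct pause_adv local,
                                         struct pause_adv peer)
{
    struct pause_result r = { false, false };

    if (local.pause && peer.pause) {
        /* Both ends support symmetric pause. */
        r.tx_pause = true;
        r.rx_pause = true;
    } else if (local.asym_pause && peer.asym_pause) {
        /* Asymmetric: the end that set PAUSE is the one that receives. */
        r.tx_pause = peer.pause;
        r.rx_pause = local.pause;
    }
    return r;
}
```

In a driver, the peer argument would be filled from phydev->pause and phydev->asym_pause inside the adjust_link callback, and the result would select the MAC's flow control bits.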

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Andrew Lunn @ 2019-07-28 23:07 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Horatiu Vultur, Nikolay Aleksandrov, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190728191558.zuopgfqza2iz5d5b@lx-anielsen.microsemi.net>

> Trying to get back to the original problem:
> 
> We have a network which implements the ODVA/DLR ring protocol. This protocol
> sends out a beacon frame as often as every 3 us (as far as I recall, default I
> believe is 400 us) to this MAC address: 01:21:6C:00:00:01.
> 
> Try to take a quick look at slide 10 in [1].
> 
> If we assume that the SwitchDev driver is implemented such that all multicast
> traffic goes to the CPU, then we should really have a way to install a HW
> offload path in the silicon, such that these packets do not go to the CPU (as
> they are known not to be useful, and a frame every 3 us is a significant load
> on small DMA connections and CPU resources).
> 
> If we assume that the SwitchDev driver is implemented such that only "needed"
> multicast packets go to the CPU, then we need a way to get these packets in
> case we want to implement the DLR protocol.
> 
> I'm sure that both models can work, and I do not think that this is the main
> issue here.
> 
> Our initial attempt was to allow installing static L2-MAC entries and appending
> multiple ports to such an entry in the MAC table. This was rejected, for several
> good reasons it seems. But I'm not sure it was clear what we wanted to achieve,
> and why we find it to be important. Hopefully this is clear with a real world
> use-case.
> 
> Any hints or ideas on what would be a better way to solve this problem will be
> much appreciated.

I always try to think about how this would work if I had a bunch of
discrete network interfaces, not a switch. What APIs are involved in
configuring such a system? How does the Linux network stack perform
software DLR? How is the reception and blocking of the multicast group
performed?

Once you understand how it works in the software implementation, it should
then be more obvious which switchdev hooks should be used to
accelerate this using hardware.

	   Andrew
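
One possible answer to the "which APIs" question, for a plain network interface, is a packet socket joining a link-layer multicast group. The sketch below is an assumption about how a userspace DLR implementation might ask for the beacon address quoted above; it is not something the thread proposes. The helper names are hypothetical, and socket creation, binding and the receive loop are left out.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* DLR beacon destination MAC as quoted in the thread. */
static const unsigned char dlr_beacon_mac[6] = {
    0x01, 0x21, 0x6c, 0x00, 0x00, 0x01
};

/* Fill a membership request for the beacon group on one interface. */
static void fill_mcast_mreq(struct packet_mreq *mreq, int ifindex)
{
    memset(mreq, 0, sizeof(*mreq));
    mreq->mr_ifindex = ifindex;
    mreq->mr_type = PACKET_MR_MULTICAST;
    mreq->mr_alen = sizeof(dlr_beacon_mac);
    memcpy(mreq->mr_address, dlr_beacon_mac, sizeof(dlr_beacon_mac));
}

/*
 * Ask the kernel to deliver the group on an AF_PACKET socket.
 * Returns 0 on success, -1 on error (errno set by setsockopt()).
 */
static int join_mcast_group(int fd, int ifindex)
{
    struct packet_mreq mreq;

    fill_mcast_mreq(&mreq, ifindex);
    return setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
                      &mreq, sizeof(mreq));
}
```

On a discrete NIC this request ends up as an entry in the device's multicast filter, which is the kind of software path a switchdev hook could then mirror in hardware.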

^ permalink raw reply

* Re: memory leak in bio_copy_user_iov
From: syzbot @ 2019-07-29  0:38 UTC (permalink / raw)
  To: agk, axboe, coreteam, davem, dm-devel, hdanton, kaber, kadlec,
	linux-block, linux-kernel, linux-raid, netdev, netfilter-devel,
	pablo, shli, snitzer, syzkaller-bugs
In-Reply-To: <000000000000c75fb7058ba0c0e4@google.com>

syzbot has bisected this bug to:

commit 664820265d70a759dceca87b6eb200cd2b93cda8
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Thu Feb 18 20:44:39 2016 +0000

     dm: do not return target from dm_get_live_table_for_ioctl()

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13f4eb64600000
start commit:   0011572c Merge branch 'for-5.2-fixes' of git://git.kernel...
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=100ceb64600000
console output: https://syzkaller.appspot.com/x/log.txt?x=17f4eb64600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=cb38d33cd06d8d48
dashboard link: https://syzkaller.appspot.com/bug?extid=03e5c8ebd22cc6c3a8cb
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13244221a00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=117b2432a00000

Reported-by: syzbot+03e5c8ebd22cc6c3a8cb@syzkaller.appspotmail.com
Fixes: 664820265d70 ("dm: do not return target from  
dm_get_live_table_for_ioctl()")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* [PATCH] net: spider_net: Mark expected switch fall-through
From: Gustavo A. R. Silva @ 2019-07-29  0:32 UTC (permalink / raw)
  To: Ishizaki Kou, David S. Miller
  Cc: netdev, linux-kernel, Gustavo A. R. Silva, Stephen Rothwell,
	Kees Cook

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning:

drivers/net/ethernet/toshiba/spider_net.c: In function 'spider_net_release_tx_chain':
drivers/net/ethernet/toshiba/spider_net.c:783:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (!brutal) {
       ^
drivers/net/ethernet/toshiba/spider_net.c:792:3: note: here
   case SPIDER_NET_DESCR_RESPONSE_ERROR:
   ^~~~

Notice that, in this particular case, the code comment is
modified in accordance with what GCC is expecting to find.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/ethernet/toshiba/spider_net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 5b196ebfed49..0f346761a2b2 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -788,6 +788,7 @@ spider_net_release_tx_chain(struct spider_net_card *card, int brutal)
 			/* fallthrough, if we release the descriptors
 			 * brutally (then we don't care about
 			 * SPIDER_NET_DESCR_CARDOWNED) */
+			/* Fall through */
 
 		case SPIDER_NET_DESCR_RESPONSE_ERROR:
 		case SPIDER_NET_DESCR_PROTECTION_ERROR:
-- 
2.22.0


^ permalink raw reply related

* [PATCH] net: ehea: Mark expected switch fall-through
From: Gustavo A. R. Silva @ 2019-07-29  0:30 UTC (permalink / raw)
  To: Douglas Miller, David S. Miller
  Cc: netdev, linux-kernel, Gustavo A. R. Silva, Stephen Rothwell,
	Kees Cook

Mark switch cases where we are expecting to fall through.

This patch fixes the following warning:

drivers/net/ethernet/ibm/ehea/ehea_main.c: In function 'ehea_mem_notifier':
include/linux/printk.h:311:2: warning: this statement may fall through [-Wimplicit-fallthrough=]
  printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/ibm/ehea/ehea_main.c:3253:3: note: in expansion of macro 'pr_info'
   pr_info("memory offlining canceled");
   ^~~~~~~
drivers/net/ethernet/ibm/ehea/ehea_main.c:3256:2: note: here
  case MEM_ONLINE:
  ^~~~

Notice that, in this particular case, the code comment is
modified in accordance with what GCC is expecting to find.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/net/ethernet/ibm/ehea/ehea_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c b/drivers/net/ethernet/ibm/ehea/ehea_main.c
index 4138a8480347..cca71ba7a74a 100644
--- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
+++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
@@ -3251,7 +3251,7 @@ static int ehea_mem_notifier(struct notifier_block *nb,
 	switch (action) {
 	case MEM_CANCEL_OFFLINE:
 		pr_info("memory offlining canceled");
-		/* Fall through: re-add canceled memory block */
+		/* Fall through - re-add canceled memory block */
 
 	case MEM_ONLINE:
 		pr_info("memory is going online");
-- 
2.22.0
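
For readers unfamiliar with the warning, here is a small standalone sketch of the comment form being introduced. As far as I understand GCC's default -Wimplicit-fallthrough level, a comment directly before the next case label is accepted when it reads "Fall through", optionally followed by " - " and free text, which is why the colon variant is replaced. The enum and function below are hypothetical stand-ins, not the ehea code.

```c
#include <assert.h>

/* Hypothetical stand-in for the notifier actions in the patch. */
enum mem_action { MEM_CANCEL_OFFLINE, MEM_ONLINE, MEM_GOING_OFFLINE };

/* Count how many handling steps run for a given action. */
static int steps_for(enum mem_action action)
{
    int steps = 0;

    switch (action) {
    case MEM_CANCEL_OFFLINE:
        steps++;    /* note the cancellation */
        /* Fall through - re-add canceled memory block */
    case MEM_ONLINE:
        steps++;    /* bring the memory block online */
        break;
    default:
        break;
    }
    return steps;
}
```

Compiling this with -Wimplicit-fallthrough should stay quiet as long as the comment is the last thing before the case label.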


^ permalink raw reply related

* Re: memory leak in bio_copy_user_iov
From: Bob Liu @ 2019-07-29  1:03 UTC (permalink / raw)
  To: syzbot, agk, axboe, coreteam, davem, dm-devel, hdanton, kaber,
	kadlec, linux-block, linux-kernel, linux-raid, netdev,
	netfilter-devel, pablo, shli, snitzer, syzkaller-bugs
In-Reply-To: <000000000000aec4ec058ec71a3d@google.com>

On 7/29/19 8:38 AM, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 664820265d70a759dceca87b6eb200cd2b93cda8
> Author: Mike Snitzer <snitzer@redhat.com>
> Date:   Thu Feb 18 20:44:39 2016 +0000
> 
>     dm: do not return target from dm_get_live_table_for_ioctl()
> 

This (and the previous bisection) does not look related to the reported leak.


A possible reason may be that KASAN can't recognize the failure path of
bio_alloc_bioset(), where mempool_free() is called but not kmalloc(p).

But it's not a real bug, because we have the condition if (nr_iovecs > inline_vecs).

The fix below may avoid the syzbot bug report.

diff --git a/block/bio.c b/block/bio.c
index 4db1008..04a7879 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -513,8 +513,10 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
                        bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
                }
 
-               if (unlikely(!bvl))
-                       goto err_free;
+               if (unlikely(!bvl)) {
+                       mempool_free(p, &bs->bio_pool);
+                       return NULL;
+               }
 
                bio->bi_flags |= idx << BVEC_POOL_OFFSET;
        } else if (nr_iovecs) {
@@ -525,10 +527,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
        bio->bi_max_vecs = nr_iovecs;
        bio->bi_io_vec = bvl;
        return bio;
-
-err_free:
-       mempool_free(p, &bs->bio_pool);
-       return NULL;
 }
 EXPORT_SYMBOL(bio_alloc_bioset);


Regards, -Bob

> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13f4eb64600000
> start commit:   0011572c Merge branch 'for-5.2-fixes' of git://git.kernel...
> git tree:       upstream
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=100ceb64600000
> console output: https://syzkaller.appspot.com/x/log.txt?x=17f4eb64600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=cb38d33cd06d8d48
> dashboard link: https://syzkaller.appspot.com/bug?extid=03e5c8ebd22cc6c3a8cb
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13244221a00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=117b2432a00000
> 
> Reported-by: syzbot+03e5c8ebd22cc6c3a8cb@syzkaller.appspotmail.com
> Fixes: 664820265d70 ("dm: do not return target from dm_get_live_table_for_ioctl()")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection


^ permalink raw reply related

