Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Andrew Lunn @ 2019-07-28 23:07 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Horatiu Vultur, Nikolay Aleksandrov, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190728191558.zuopgfqza2iz5d5b@lx-anielsen.microsemi.net>

> Trying to get back to the original problem:
> 
> We have a network which implements the ODVA/DLR ring protocol. This protocol
> sends out a beacon frame as often as every 3 us (as far as I recall, default I
> believe is 400 us) to this MAC address: 01:21:6C:00:00:01.
> 
> Try take a quick look at slide 10 in [1].
> 
> If we assume that the SwitchDev driver implemented such that all multicast
> traffic goes to the CPU, then we should really have a way to install a HW
> offload path in the silicon, such that these packets does not go to the CPU (as
> they are known not to be use full, and a frame every 3 us is a significant load
> on small DMA connections and CPU resources).
> 
> If we assume that the SwitchDev driver implemented such that only "needed"
> multicast packets goes to the CPU, then we need a way to get these packets in
> case we want to implement the DLR protocol.
> 
> I'm sure that both models can work, and I do not think that this is the main
> issue here.
> 
> Our initial attempt was to allow install static L2-MAC entries and append
> multiple ports to such an entry in the MAC table. This was rejected, for several
> good reasons it seems. But I'm not sure it was clear what we wanted to achieve,
> and why we find it to be important. Hopefully this is clear with a real world
> use-case.
> 
> Any hints or ideas on what would be a better way to solve this problems will be
> much appreciated.

I always try to think about how this would work if i had a bunch of
discrete network interfaces, not a switch. What APIs are involved in
configuring such a system? How does the Linux network stack perform
software DLR? How is the reception and blocking of the multicast group
performed?

Once you understand how it works in the software implement, it should
then be more obvious which switchdev hooks should be used to
accelerate this using hardware.

	   Andrew

^ permalink raw reply

* Re: [PATCH v3] net: dsa: qca8k: enable port flow control
From: Andrew Lunn @ 2019-07-28 22:31 UTC (permalink / raw)
  To: xiaofeis
  Cc: davem, vkoul, netdev, linux-arm-msm, bjorn.andersson,
	vivien.didelot, f.fainelli, niklas.cassel, xiazha
In-Reply-To: <1564275470-52666-1-git-send-email-xiaofeis@codeaurora.org>

On Sun, Jul 28, 2019 at 08:57:50AM +0800, xiaofeis wrote:
> Set phy device advertising to enable MAC flow control.

Hi Xiaofei.

This is half of the needed change for MAC flow control.

phy_support_asym_pause(phy) is used by the MAC to tell the PHY layer
that the MAC supports flow control. The PHY will then advertise
this. When auto-negotiation is completed, the PHY layer will call
qca8k_adjust_link() with the results. It could be that the peer does
not support flow control, or only supports symmetric flow control.  So
in that function, you need to program the MAC with the results of the
auto-neg. This is currently missing. You need to look at phydev->pause
and phydev->asym_pause to decide how to configure the MAC.

       Andrew

^ permalink raw reply

* Re: [PATCH net] net: hns: fix LED configuration for marvell phy
From: Andrew Lunn @ 2019-07-28 22:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: liuyonglong, David Miller, netdev, linux-kernel, linuxarm,
	salil.mehta, yisen.zhuang, shiju.jose
In-Reply-To: <20190728132412.GC8718@xo-6d-61-c0.localdomain>

On Sun, Jul 28, 2019 at 03:24:12PM +0200, Pavel Machek wrote:
> On Thu 2019-07-25 06:28:29, Andrew Lunn wrote:
> > On Thu, Jul 25, 2019 at 11:00:08AM +0800, liuyonglong wrote:
> > > > Revert "net: hns: fix LED configuration for marvell phy"
> > > > This reverts commit f4e5f775db5a4631300dccd0de5eafb50a77c131.
> > > >
> > > > Andrew Lunn says this should be handled another way.
> > > >
> > > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > > 
> > > 
> > > Hi Andrew:
> > > 
> > > I see this patch have been reverted, can you tell me the better way to do this?
> > > Thanks very much!
> > 
> > Please take a look at the work Matthias Kaehlcke is doing. It has not
> > got too far yet, but when it is complete, it should define a generic
> > way to configure PHY LEDs.
> 
> I don't remember PHY LED discussion from LED mailing list. Would you have a pointer?

Hi Pavel 

So far, it has not made it onto the generic LED list. And the current
implementation is unlikely to go as far as using the generic LED
code. But i would like the binding to be compatible with it, so that
some time in the future it could be migrated to being part of the
generic LED code. But that would also require extensions to the
generic LED code to support hardware offload of triggers.

	Andrew

^ permalink raw reply

* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-28 21:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <5a054ca5-4077-5e91-69d5-f1add8dc8bfa@akamai.com>

On 7/28/19 2:14 PM, Josh Hunt wrote:
> On 7/28/19 6:54 AM, Eric Dumazet wrote:
>> On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>>>
>>> On 7/27/19 12:05 AM, Eric Dumazet wrote:
>>>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>>
>>>>> The current implementation of TCP MTU probing can considerably
>>>>> underestimate the MTU on lossy connections allowing the MSS to get 
>>>>> down to
>>>>> 48. We have found that in almost all of these cases on our networks 
>>>>> these
>>>>> paths can handle much larger MTUs meaning the connections are being
>>>>> artificially limited. Even though TCP MTU probing can raise the MSS 
>>>>> back up
>>>>> we have seen this not to be the case causing connections to be 
>>>>> "stuck" with
>>>>> an MSS of 48 when heavy loss is present.
>>>>>
>>>>> Prior to pushing out this change we could not keep TCP MTU probing 
>>>>> enabled
>>>>> b/c of the above reasons. Now with a reasonble floor set we've had it
>>>>> enabled for the past 6 months.
>>>>
>>>> And what reasonable value have you used ???
>>>
>>> Reasonable for some may not be reasonable for others hence the new
>>> sysctl :) We're currently running with a fairly high value based off of
>>> the v6 min MTU minus headers and options, etc. We went conservative with
>>> our setting initially as it seemed a reasonable first step when
>>> re-enabling TCP MTU probing since with no configurable floor we saw a #
>>> of cases where connections were using severely reduced mss b/c of loss
>>> and not b/c of actual path restriction. I plan to reevaluate the setting
>>> at some point, but since the probing method is still the same it means
>>> the same clients who got stuck with mss of 48 before will land at
>>> whatever floor we set. Looking forward we are interested in trying to
>>> improve TCP MTU probing so it does not penalize clients like this.
>>>
>>> A suggestion for a more reasonable floor default would be 512, which is
>>> the same as the min_pmtu. Given both mechanisms are trying to achieve
>>> the same goal it seems like they should have a similar min/floor.
>>>
>>>>
>>>>>
>>>>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>>>>> administrators the ability to control the floor of MSS probing.
>>>>>
>>>>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>>>>> ---
>>>>>    Documentation/networking/ip-sysctl.txt | 6 ++++++
>>>>>    include/net/netns/ipv4.h               | 1 +
>>>>>    net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
>>>>>    net/ipv4/tcp_ipv4.c                    | 1 +
>>>>>    net/ipv4/tcp_timer.c                   | 2 +-
>>>>>    5 files changed, 18 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/Documentation/networking/ip-sysctl.txt 
>>>>> b/Documentation/networking/ip-sysctl.txt
>>>>> index df33674799b5..49e95f438ed7 100644
>>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>>> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
>>>>>           Path MTU discovery (MTU probing).  If MTU probing is 
>>>>> enabled,
>>>>>           this is the initial MSS used by the connection.
>>>>>
>>>>> +tcp_mtu_probe_floor - INTEGER
>>>>> +       If MTU probing is enabled this caps the minimum MSS used 
>>>>> for search_low
>>>>> +       for the connection.
>>>>> +
>>>>> +       Default : 48
>>>>> +
>>>>>    tcp_min_snd_mss - INTEGER
>>>>>           TCP SYN and SYNACK messages usually advertise an ADVMSS 
>>>>> option,
>>>>>           as described in RFC 1122 and RFC 6691.
>>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>>> index bc24a8ec1ce5..c0c0791b1912 100644
>>>>> --- a/include/net/netns/ipv4.h
>>>>> +++ b/include/net/netns/ipv4.h
>>>>> @@ -116,6 +116,7 @@ struct netns_ipv4 {
>>>>>           int sysctl_tcp_l3mdev_accept;
>>>>>    #endif
>>>>>           int sysctl_tcp_mtu_probing;
>>>>> +       int sysctl_tcp_mtu_probe_floor;
>>>>>           int sysctl_tcp_base_mss;
>>>>>           int sysctl_tcp_min_snd_mss;
>>>>>           int sysctl_tcp_probe_threshold;
>>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>>> index 0b980e841927..59ded25acd04 100644
>>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>>> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
>>>>>                   .extra2         = &tcp_min_snd_mss_max,
>>>>>           },
>>>>>           {
>>>>> +               .procname       = "tcp_mtu_probe_floor",
>>>>> +               .data           = 
>>>>> &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
>>>>> +               .maxlen         = sizeof(int),
>>>>> +               .mode           = 0644,
>>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>>> +               .extra1         = &tcp_min_snd_mss_min,
>>>>> +               .extra2         = &tcp_min_snd_mss_max,
>>>>> +       },
>>>>> +       {
>>>>>                   .procname       = "tcp_probe_threshold",
>>>>>                   .data           = 
>>>>> &init_net.ipv4.sysctl_tcp_probe_threshold,
>>>>>                   .maxlen         = sizeof(int),
>>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>>> index d57641cb3477..e0a372676329 100644
>>>>> --- a/net/ipv4/tcp_ipv4.c
>>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>>> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net 
>>>>> *net)
>>>>>           net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>>>>>           net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>>>>           net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>>>>> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
>>>>>
>>>>>           net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>>>>>           net->ipv4.sysctl_tcp_keepalive_probes = 
>>>>> TCP_KEEPALIVE_PROBES;
>>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>>> index c801cd37cc2a..dbd9d2d0ee63 100644
>>>>> --- a/net/ipv4/tcp_timer.c
>>>>> +++ b/net/ipv4/tcp_timer.c
>>>>> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct 
>>>>> inet_connection_sock *icsk, struct sock *sk)
>>>>>           } else {
>>>>>                   mss = tcp_mtu_to_mss(sk, 
>>>>> icsk->icsk_mtup.search_low) >> 1;
>>>>>                   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>>> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>>>> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
>>>>>                   mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>>>>                   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, 
>>>>> mss);
>>>>>           }
>>>>
>>>>
>>>> Existing sysctl should be enough ?
>>>
>>> I don't think so. Changing tcp_min_snd_mss could impact clients that
>>> really want/need a small mss. When you added the new sysctl I tried to
>>> analyze the mss values we're seeing to understand what we could possibly
>>> raise it to. While not a huge amount, we see more clients than I
>>> expected announcing mss values in the 180-512 range. Given that I would
>>> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
>>> above.
>>
>> If these clients need mss values in 180-512 ranges, how MTU probing
>> would work for them,
>> if you set a floor to 512 ?
> 
> First, we already seem to be fine with ignoring these paths with ICMP 
> based PMTU discovery b/c of our min_pmtu default of 512 and that is 
> configurable. Second by adding this sysctl we're giving administrators 
> the choice to decide if they'd like to attempt to support these very 
> very small # of paths which may be below 512 (MSS <= 512 does not mean 
> MTU <= 512) or cover themselves by being able to raise the floor to not 
> penalize clients who may be on very lossy networks.
> 
>>
>> Are we sure the intent of tcp_base_mss was not to act as a floor ?
> 
> My understanding is that tcp_base_mss is meant to be the initial value 
> of search_low (as per Docs). Then in RFC 4821 [1] Sections 7.2, shows 
> search_low should be configurable, and 7.7 we see that in response to 
> successive black hole detection search_low should be halved. So I don't 
> think it was meant to be a floor, but just the initial search_low param. 
> Also note that in that same section they suggest a floor of 68 for v4, 
> but a floor of 1280 for v6 which we do not adhere to currently.
> 

Clarification. We == Akamai in regards to setting tcp_base_mss to 
1400-overheads. Upstream default is 1024.

> We actually set tcp_base_mss to something close to the value suggested 
> towards the end of section 7.2 of the RFC of 1400 bytes minus IP and 
> Transport overheads and options. This way we have more realistic 
> searching based on the majority of clients that we see. The kernel winds 
> up using initial search_low/tcp_base_mss as initial eff_pmtu, so we see 
> something like:
> 
> 21:03:41.314612 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:41.670307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:42.030308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:42.534307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:43.198308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:44.478307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
> 1:1461, ack 1, win 229, length 1460: HTTP
> 21:03:47.742310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
> 1:1349, ack 1, win 229, length 1348: HTTP
> 21:03:56.702310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
> 1:675, ack 1, win 229, length 674: HTTP
> 
> For further evidence this is a real problem here's a sample of mss 
> values I found when originally investigating this problem for us:
> 
> I dug up some #s I found when originally investigating this problem:
> 
> # ss -emoitn | grep mss | sed "s/.*mss:\([0-9]*\).*/\1/" | sort -u | 
> sort -g | head -5
> 
> 36:11
> 64:7
> 72:1
> 128:13
> 144:4
> 
>  From what I could tell these connections were on paths much larger than 
> the mss they were being forced to use. I determined this by looking at 
> the mss used for other objects fetched from the same IPs.
> 
> Josh
> 
> [1] - https://www.ietf.org/rfc/rfc4821.txt
> 
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 
>> c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e 
>>
>> 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct
>> inet_connection_sock *icsk, struct sock *sk)
>>                  icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
>>          } else {
>>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) 
>> >> 1;
>> -               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>> +               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
>>                  mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>
>>
>>
>>>
>>>>
>>>> tcp_min_snd_mss  documentation could be slightly updated.
>>>>
>>>> And maybe its default value could be raised a bit.
>>>>
>>>
>>> Thanks
>>> Josh

^ permalink raw reply

* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-28 21:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <CANn89iLqeixzZkop8tqOQka_9ZiKurZL9Vj05bgU99M5Pbenqw@mail.gmail.com>

On 7/28/19 6:54 AM, Eric Dumazet wrote:
> On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>>
>> On 7/27/19 12:05 AM, Eric Dumazet wrote:
>>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>
>>>> The current implementation of TCP MTU probing can considerably
>>>> underestimate the MTU on lossy connections allowing the MSS to get down to
>>>> 48. We have found that in almost all of these cases on our networks these
>>>> paths can handle much larger MTUs meaning the connections are being
>>>> artificially limited. Even though TCP MTU probing can raise the MSS back up
>>>> we have seen this not to be the case causing connections to be "stuck" with
>>>> an MSS of 48 when heavy loss is present.
>>>>
>>>> Prior to pushing out this change we could not keep TCP MTU probing enabled
>>>> b/c of the above reasons. Now with a reasonble floor set we've had it
>>>> enabled for the past 6 months.
>>>
>>> And what reasonable value have you used ???
>>
>> Reasonable for some may not be reasonable for others hence the new
>> sysctl :) We're currently running with a fairly high value based off of
>> the v6 min MTU minus headers and options, etc. We went conservative with
>> our setting initially as it seemed a reasonable first step when
>> re-enabling TCP MTU probing since with no configurable floor we saw a #
>> of cases where connections were using severely reduced mss b/c of loss
>> and not b/c of actual path restriction. I plan to reevaluate the setting
>> at some point, but since the probing method is still the same it means
>> the same clients who got stuck with mss of 48 before will land at
>> whatever floor we set. Looking forward we are interested in trying to
>> improve TCP MTU probing so it does not penalize clients like this.
>>
>> A suggestion for a more reasonable floor default would be 512, which is
>> the same as the min_pmtu. Given both mechanisms are trying to achieve
>> the same goal it seems like they should have a similar min/floor.
>>
>>>
>>>>
>>>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>>>> administrators the ability to control the floor of MSS probing.
>>>>
>>>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>>>> ---
>>>>    Documentation/networking/ip-sysctl.txt | 6 ++++++
>>>>    include/net/netns/ipv4.h               | 1 +
>>>>    net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
>>>>    net/ipv4/tcp_ipv4.c                    | 1 +
>>>>    net/ipv4/tcp_timer.c                   | 2 +-
>>>>    5 files changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>> index df33674799b5..49e95f438ed7 100644
>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
>>>>           Path MTU discovery (MTU probing).  If MTU probing is enabled,
>>>>           this is the initial MSS used by the connection.
>>>>
>>>> +tcp_mtu_probe_floor - INTEGER
>>>> +       If MTU probing is enabled this caps the minimum MSS used for search_low
>>>> +       for the connection.
>>>> +
>>>> +       Default : 48
>>>> +
>>>>    tcp_min_snd_mss - INTEGER
>>>>           TCP SYN and SYNACK messages usually advertise an ADVMSS option,
>>>>           as described in RFC 1122 and RFC 6691.
>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>> index bc24a8ec1ce5..c0c0791b1912 100644
>>>> --- a/include/net/netns/ipv4.h
>>>> +++ b/include/net/netns/ipv4.h
>>>> @@ -116,6 +116,7 @@ struct netns_ipv4 {
>>>>           int sysctl_tcp_l3mdev_accept;
>>>>    #endif
>>>>           int sysctl_tcp_mtu_probing;
>>>> +       int sysctl_tcp_mtu_probe_floor;
>>>>           int sysctl_tcp_base_mss;
>>>>           int sysctl_tcp_min_snd_mss;
>>>>           int sysctl_tcp_probe_threshold;
>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>> index 0b980e841927..59ded25acd04 100644
>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
>>>>                   .extra2         = &tcp_min_snd_mss_max,
>>>>           },
>>>>           {
>>>> +               .procname       = "tcp_mtu_probe_floor",
>>>> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
>>>> +               .maxlen         = sizeof(int),
>>>> +               .mode           = 0644,
>>>> +               .proc_handler   = proc_dointvec_minmax,
>>>> +               .extra1         = &tcp_min_snd_mss_min,
>>>> +               .extra2         = &tcp_min_snd_mss_max,
>>>> +       },
>>>> +       {
>>>>                   .procname       = "tcp_probe_threshold",
>>>>                   .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
>>>>                   .maxlen         = sizeof(int),
>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>> index d57641cb3477..e0a372676329 100644
>>>> --- a/net/ipv4/tcp_ipv4.c
>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
>>>>           net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>>>>           net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>>>           net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>>>> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
>>>>
>>>>           net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>>>>           net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>> index c801cd37cc2a..dbd9d2d0ee63 100644
>>>> --- a/net/ipv4/tcp_timer.c
>>>> +++ b/net/ipv4/tcp_timer.c
>>>> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>>>>           } else {
>>>>                   mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>>>>                   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>>> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
>>>>                   mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>>>                   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>>>           }
>>>
>>>
>>> Existing sysctl should be enough ?
>>
>> I don't think so. Changing tcp_min_snd_mss could impact clients that
>> really want/need a small mss. When you added the new sysctl I tried to
>> analyze the mss values we're seeing to understand what we could possibly
>> raise it to. While not a huge amount, we see more clients than I
>> expected announcing mss values in the 180-512 range. Given that I would
>> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
>> above.
> 
> If these clients need mss values in 180-512 ranges, how MTU probing
> would work for them,
> if you set a floor to 512 ?

First, we already seem to be fine with ignoring these paths with ICMP 
based PMTU discovery b/c of our min_pmtu default of 512 and that is 
configurable. Second by adding this sysctl we're giving administrators 
the choice to decide if they'd like to attempt to support these very 
very small # of paths which may be below 512 (MSS <= 512 does not mean 
MTU <= 512) or cover themselves by being able to raise the floor to not 
penalize clients who may be on very lossy networks.

> 
> Are we sure the intent of tcp_base_mss was not to act as a floor ?

My understanding is that tcp_base_mss is meant to be the initial value 
of search_low (as per Docs). Then in RFC 4821 [1] Sections 7.2, shows 
search_low should be configurable, and 7.7 we see that in response to 
successive black hole detection search_low should be halved. So I don't 
think it was meant to be a floor, but just the initial search_low param. 
Also note that in that same section they suggest a floor of 68 for v4, 
but a floor of 1280 for v6 which we do not adhere to currently.

We actually set tcp_base_mss to something close to the value suggested 
towards the end of section 7.2 of the RFC of 1400 bytes minus IP and 
Transport overheads and options. This way we have more realistic 
searching based on the majority of clients that we see. The kernel winds 
up using initial search_low/tcp_base_mss as initial eff_pmtu, so we see 
something like:

21:03:41.314612 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:41.670307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:42.030308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:42.534307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:43.198308 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:44.478307 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [P.], seq 
1:1461, ack 1, win 229, length 1460: HTTP
21:03:47.742310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
1:1349, ack 1, win 229, length 1348: HTTP
21:03:56.702310 IP 192.168.0.1.8080 > 192.0.2.1.41523: Flags [.], seq 
1:675, ack 1, win 229, length 674: HTTP

For further evidence this is a real problem here's a sample of mss 
values I found when originally investigating this problem for us:

I dug up some #s I found when originally investigating this problem:

# ss -emoitn | grep mss | sed "s/.*mss:\([0-9]*\).*/\1/" | sort -u | 
sort -g | head -5

36:11
64:7
72:1
128:13
144:4

 From what I could tell these connections were on paths much larger than 
the mss they were being forced to use. I determined this by looking at 
the mss used for other objects fetched from the same IPs.

Josh

[1] - https://www.ietf.org/rfc/rfc4821.txt

> 
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e
> 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct
> inet_connection_sock *icsk, struct sock *sk)
>                  icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
>          } else {
>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> -               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
> +               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
>                  mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> 
> 
> 
>>
>>>
>>> tcp_min_snd_mss  documentation could be slightly updated.
>>>
>>> And maybe its default value could be raised a bit.
>>>
>>
>> Thanks
>> Josh

^ permalink raw reply

* Re: Slowness forming TIPC cluster with explicit node addresses
From: Chris Packham @ 2019-07-28 21:04 UTC (permalink / raw)
  To: jon.maloy@ericsson.com, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CH2PR15MB35754D65AB240A74AE488E719AC00@CH2PR15MB3575.namprd15.prod.outlook.com>

On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> 
> > 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> > On
> > Behalf Of Chris Packham
> > Sent: 25-Jul-19 19:37
> > To: tipc-discussion@lists.sourceforge.net
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Slowness forming TIPC cluster with explicit node addresses
> > 
> > Hi,
> > 
> > I'm having problems forming a TIPC cluster between 2 nodes.
> > 
> > This is the basic steps I'm going through on each node.
> > 
> > modprobe tipc
> > ip link set eth2 up
> > tipc node set addr 1.1.5 # or 1.1.6
> > tipc bearer enable media eth dev eth0
> eth2, I assume...
> 

Yes sorry I keep switching between between Ethernet ports for testing
so I hand edited the email.

> > 
> > 
> > Then to confirm if the cluster is formed I use tipc link list
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > ...
> > 
> > Looking at tcpdump the two nodes are sending packets
> > 
> > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60 bytes,
> > MessageSize
> > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > request
> > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60 bytes,
> > MessageSize
> > 76 bytes, Neighbor Detection Protocol internal, messageType Link
> > request
> > 
> > Eventually (after a few minutes) the link does come up
> > 
> > [root@node-6 ~]# tipc link list
> > broadcast-link: up
> > 1001006:eth2-1001005:eth2: up
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > 1001005:eth2-1001006:eth2: up
> > 
> > When I remove the "tipc node set addr" things seem to kick into
> > life straight
> > away
> > 
> > [root@node-5 ~]# tipc link list
> > broadcast-link: up
> > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > 
> > So there appears to be some difference in behaviour between having
> > an
> > explicit node address and using the default. Unfortunately our
> > application
> > relies on setting the node addresses.
> I do this many times a day, without any problems. If there would be
> any time difference, I would expect the 'auto configurable' version
> to be slower, because it involves a DAD step.
> Are you sure you don't have any other nodes running in your system?
> 
> ///jon
> 

Nope the two nodes are connected back to back. Does the number of
Ethernet interfaces make a difference? As you can see I've got 3 on
each node. One is completely disconnected, one is for booting over TFTP
 (only used by U-boot) and the other is the USB Ethernet I'm using for
testing.

> 
> > 
> > 
> > [root@node-5 ~]# uname -a
> > Linux linuxbox 5.2.0-at1+ #8 SMP Thu Jul 25 23:22:41 UTC 2019 ppc
> > GNU/Linux
> > 
> > Any thoughts on the problem?

^ permalink raw reply

* Re: [PATCH net-next v4 2/3] flow_offload: Support get default block from tc immediately
From: Jakub Kicinski @ 2019-07-28 20:16 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <1564296769-32294-3-git-send-email-wenxu@ucloud.cn>

On Sun, 28 Jul 2019 14:52:48 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> When thre indr device register, it can get the default block
> from tc immediately if the block is exist.
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>
> ---
> v3: no change
> v4: get tc default block without callback

Please stop reposting new versions of the patches while discussion is
ongoing, it makes it harder to follow.

The TC default block is there because the indirect registration may
happen _after_ the block is installed and populated.  It's the device
driver that usually does the indirect registration, the tunnel device
and its rules may already be set when device driver is loaded or
reloaded.

I don't know the nft code, but it seems unlikely it wouldn't have the
same problem/need..

Please explain.

^ permalink raw reply

* Linux Plumbers BPF micro-conference CFP (reminder)
From: Alexei Starovoitov @ 2019-07-28 19:24 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: bpf, Network Development
In-Reply-To: <CAADnVQJ0ATngyqo8xjXdDsyFuuov3KRtbHMR1LcV8VnEDUK8Fg@mail.gmail.com>

Hey Folks,

August 2nd deadline to submit a proposal for BPF uconf
is quickly approaching.
If you're attending LPC in Lisbon and interested
in awesome BPF uconf you need to submit a proposal.

Some of you already submitted them to lpc-bpf@vger
per instructions that were sent back on July 12.
Some proposals were sent via website.
We'd like all proposals to be seen in the website.
Could you please re-enter your proposal there?
Please go to:
https://www.linuxplumbersconf.org/event/4/abstracts/
click on 'submit new proposal'
and copy-paste what you've already sent to lpc-bpf@vger.
Much appreciate it and sorry for confusion.

There is still room for few new proposals,
but space is getting very limited.
Please don't delay.

Thanks!

> ---------- Forwarded message ---------
> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Fri, Jul 12, 2019 at 7:26 AM
> Subject: Linux Plumbers BPF micro-conference CFP (reminder)
> To: <bpf@vger.kernel.org>
> Cc: <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
> <xdp-newbies@vger.kernel.org>, <iovisor-dev@lists.iovisor.org>,
> <lpc-bpf@vger.kernel.org>, <alexei.starovoitov@gmail.com>
>
>
> This is a call for proposals for the BPF micro-conference at this
> years' Linux Plumbers Conference (LPC) 2019 which will be held in
> Lisbon, Portugal for September 9-11.
>
> The goal of the BPF micro-conference is to bring BPF developers
> together to discuss topics around Linux kernel work related to
> the BPF core infrastructure as well as its many subsystems under
> tracing, networking, security, and BPF user space tooling (LLVM,
> libbpf, bpftool and many others).
>
> The format of the micro-conference has a main focus on discussion,
> therefore each accepted topic will provide a short 1-2 slide
> introduction with subsequent discussion for the rest of the given
> time slot.
>
> The BPF micro-conference is a community-driven event and open to
> all LPC attendees, there is no additional registration required.
>
> Please submit your discussion proposals to the LPC BPF micro-conference
> organizers at:
>
>         lpc-bpf@vger.kernel.org
>
> Proposals must be submitted until August 2nd, and submitters will
> be notified of acceptance at latest by August 9. (Please note that
> proposals must not be sent as html mail as they are otherwise dropped
> by vger.)
>
> The format of the submission and many other details can be found at:
>
>         http://vger.kernel.org/lpc-bpf.html
>
> Looking forward to seeing you all in Lisbon in September!

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-28 19:15 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Horatiu Vultur, Nikolay Aleksandrov, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190727030223.GA29731@lunn.ch>

The 07/27/2019 05:02, Andrew Lunn wrote:
> > As you properly guessed, this model is quite different from what we are used to.
> 
> Yes, it takes a while to get the idea that the hardware is just an
> accelerator for what the Linux stack can already do. And if the switch
> cannot do some feature, pass the frame to Linux so it can handle it.
This is understood, and not that different from what we are used to.

The surprise was to make all multicast traffic to go to the CPU.

> You need to keep in mind that there could be other ports in the bridge
> than switch ports, and those ports might be interested in the
> multicast traffic. Hence the CPU needs to see the traffic.
This is a good argument, but I was under the impression that not all HW/drivers
supports foreign interfaces (see ocelot_netdevice_dev_check and
mlxsw_sp_port_dev_check).

> But IGMP snooping can be used to optimise this.
Yes, IGMP snooping can limit the multicast storm of multicast IP traffic, but
not for L2 non-IP multicast traffic.

We could really use something similar for non-IP multicast MAC addresses.

Trying to get back to the original problem:

We have a network which implements the ODVA/DLR ring protocol. This protocol
sends out a beacon frame as often as every 3 us (as far as I recall, default I
believe is 400 us) to this MAC address: 01:21:6C:00:00:01.

Try take a quick look at slide 10 in [1].

If we assume that the SwitchDev driver implemented such that all multicast
traffic goes to the CPU, then we should really have a way to install a HW
offload path in the silicon, such that these packets does not go to the CPU (as
they are known not to be use full, and a frame every 3 us is a significant load
on small DMA connections and CPU resources).

If we assume that the SwitchDev driver implemented such that only "needed"
multicast packets goes to the CPU, then we need a way to get these packets in
case we want to implement the DLR protocol.

I'm sure that both models can work, and I do not think that this is the main
issue here.

Our initial attempt was to allow install static L2-MAC entries and append
multiple ports to such an entry in the MAC table. This was rejected, for several
good reasons it seems. But I'm not sure it was clear what we wanted to achieve,
and why we find it to be important. Hopefully this is clear with a real world
use-case.

Any hints or ideas on what would be a better way to solve this problems will be
much appreciated.

/Allan

[1] https://www.odva.org/Portals/0/Library/Conference/2017-ODVA-Conference_Woods_High%20Availability_Guidelines%20for%20Use%20of%20DLR%20in%20EtherNetIP%20Networks_FINAL%20PPT.pdf

^ permalink raw reply

* [PATCH net] hv_sock: Fix hang when a connection is closed
From: Dexuan Cui @ 2019-07-28 18:32 UTC (permalink / raw)
  To: Sunil Muthuswamy, David Miller, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com

hvs_do_close_lock_held() may decrease the reference count to 0 and free the
sk struct completely, and then the following release_sock(sk) may hang.

Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org

---
With the proper kernel debugging options enabled, first a warning can
appear:

kworker/1:0/4467 is freeing memory ..., with a lock still held there!
stack backtrace:
Workqueue: events vmbus_onmessage_work [hv_vmbus]
Call Trace:
 dump_stack+0x67/0x90
 debug_check_no_locks_freed.cold.52+0x78/0x7d
 slab_free_freelist_hook+0x85/0x140
 kmem_cache_free+0xa5/0x380
 __sk_destruct+0x150/0x260
 hvs_close_connection+0x24/0x30 [hv_sock]
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

and then the following release_sock(sk) can hang:

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
...
irq event stamp: 62890
CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: G        W         5.2.0+ #39
Workqueue: events vmbus_onmessage_work [hv_vmbus]
RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
...
Call Trace:
 do_raw_spin_lock+0xab/0xb0
 release_sock+0x19/0xb0
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

 net/vmw_vsock/hyperv_transport.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index f2084e3f7aa4..efbda8ef1eff 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -309,9 +309,16 @@ static void hvs_close_connection(struct vmbus_channel *chan)
 {
 	struct sock *sk = get_per_channel_state(chan);

+	/* Grab an extra reference since hvs_do_close_lock_held() may decrease
+	 * the reference count to 0 by calling sock_put(sk).
+	 */
+	sock_hold(sk);
+
 	lock_sock(sk);
 	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
+
+	sock_put(sk);
 }

 static void hvs_open_connection(struct vmbus_channel *chan)
-- 
2.19.1

^ permalink raw reply related

* [PATCH net] net: bridge: delete local fdbs on device init failure
From: Nikolay Aleksandrov @ 2019-07-28 18:22 UTC (permalink / raw)
  To: netdev
  Cc: davem, roopa, bridge, Nikolay Aleksandrov,
	syzbot+88533dc8b582309bf3ee

On initialization failure we have to delete all local fdbs which were
inserted due to the default pvid. This problem has been present since the
inception of default_pvid. Note that currently there are 2 cases:
1) in br_dev_init() when br_multicast_init() fails
2) if register_netdevice() fails after calling ndo_init()

This patch takes care of both since br_vlan_flush() is called on both
occasions. Also the new fdb delete would be a no-op on normal bridge device
destruction since the local fdbs would've been already flushed by
br_dev_delete(). This is not an issue for ports since nbp_vlan_init() is
called last when adding a port thus nothing can fail after it.

Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
Tested with the provided reproducer and can no longer trigger the leak.
Also tested the br_multicast_init() failure manually by making it always
return an error.

 net/bridge/br_vlan.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 021cc9f66804..3e6a702e4c21 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -715,6 +715,11 @@ void br_vlan_flush(struct net_bridge *br)

 	ASSERT_RTNL();

+	/* delete auto-added default pvid local fdbs before flushing vlans
+	 * otherwise these will be leaked on bridge device init failure
+	 */
+	br_fdb_delete_by_port(br, NULL, 0, 1);
+
 	vg = br_vlan_group(br);
 	__vlan_flush(vg);
 	RCU_INIT_POINTER(br->vlgrp, NULL);
-- 
2.21.0

^ permalink raw reply related

* [PATCH net v2] mvpp2: refactor the HW checksum setup
From: Matteo Croce @ 2019-07-28 17:35 UTC (permalink / raw)
  To: netdev
  Cc: Antoine Tenart, Maxime Chevallier, Marcin Wojtas, Stefan Chulski,
	LKML, David Miller

The hardware can only offload checksum calculation on first port due to
the Tx FIFO size limitation, and has a maximum L3 offset of 128 bytes.
Document this in a comment and move duplicated code in a function.

Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 35 ++++++++++++-------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 937e4b928b94..a99405135046 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -811,6 +811,26 @@ static int mvpp2_swf_bm_pool_init(struct mvpp2_port *port)
 	return 0;
 }
 
+static void mvpp2_set_hw_csum(struct mvpp2_port *port,
+			      enum mvpp2_bm_pool_log_num new_long_pool)
+{
+	const netdev_features_t csums = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
+
+	/* Update L4 checksum when jumbo enable/disable on port.
+	 * Only port 0 supports hardware checksum offload due to
+	 * the Tx FIFO size limitation.
+	 * Also, don't set NETIF_F_HW_CSUM because L3_offset in TX descriptor
+	 * has 7 bits, so the maximum L3 offset is 128.
+	 */
+	if (new_long_pool == MVPP2_BM_JUMBO && port->id != 0) {
+		port->dev->features &= ~csums;
+		port->dev->hw_features &= ~csums;
+	} else {
+		port->dev->features |= csums;
+		port->dev->hw_features |= csums;
+	}
+}
+
 static int mvpp2_bm_update_mtu(struct net_device *dev, int mtu)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
@@ -843,15 +863,7 @@ static int mvpp2_bm_update_mtu(struct net_device *dev, int mtu)
 		/* Add port to new short & long pool */
 		mvpp2_swf_bm_pool_init(port);
 
-		/* Update L4 checksum when jumbo enable/disable on port */
-		if (new_long_pool == MVPP2_BM_JUMBO && port->id != 0) {
-			dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-			dev->hw_features &= ~(NETIF_F_IP_CSUM |
-					      NETIF_F_IPV6_CSUM);
-		} else {
-			dev->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-			dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-		}
+		mvpp2_set_hw_csum(port, new_long_pool);
 	}
 
 	dev->mtu = mtu;
@@ -5209,10 +5221,7 @@ static int mvpp2_port_probe(struct platform_device *pdev,
 		dev->features |= NETIF_F_NTUPLE;
 	}
 
-	if (port->pool_long->id == MVPP2_BM_JUMBO && port->id != 0) {
-		dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-		dev->hw_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-	}
+	mvpp2_set_hw_csum(port, port->pool_long->id);
 
 	dev->vlan_features |= features;
 	dev->gso_max_segs = MVPP2_MAX_TSO_SEGS;
-- 
2.21.0


^ permalink raw reply related

* Re: memory leak in fdb_create
From: Nikolay Aleksandrov @ 2019-07-28 16:51 UTC (permalink / raw)
  To: syzbot, bridge, bsingharora, coreteam, davem, duwe, kaber, kadlec,
	linux-kernel, mingo, mpe, netdev, netfilter-devel, pablo, roopa,
	rostedt, syzkaller-bugs
In-Reply-To: <0000000000008be1b2058ebe7805@google.com>

On 28/07/2019 17:20, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 04cf31a759ef575f750a63777cee95500e410994
> Author: Michael Ellerman <mpe@ellerman.id.au>
> Date:   Thu Mar 24 11:04:01 2016 +0000
> 
>     ftrace: Make ftrace_location_range() global
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1538c778600000
> start commit:   abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
> git tree:       upstream
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=1738c778600000
> console output: https://syzkaller.appspot.com/x/log.txt?x=1338c778600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=56f1da14935c3cce
> dashboard link: https://syzkaller.appspot.com/bug?extid=88533dc8b582309bf3ee
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16de5c06a00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10546026a00000
> 
> Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
> Fixes: 04cf31a759ef ("ftrace: Make ftrace_location_range() global")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

I see the problem, it'd happen if the multicast stats memory allocation fails on bridge
init then the fdb added due to the default vlan would remain and the bridge kmem cache
would be destroyed while not empty (you can even trigger a BUG because of that).
I'll post a patch shortly after running a few tests.

Thanks,
 Nik


^ permalink raw reply

* Re: ip route JSON format is unparseable for "unreachable" routes
From: Stephen Hemminger @ 2019-07-28 16:15 UTC (permalink / raw)
  To: Michael Ziegler; +Cc: netdev
In-Reply-To: <6e88311b-5edc-4c62-1581-0f5b160a5f4e@michaelziegler.name>

On Sun, 28 Jul 2019 13:09:55 +0200
Michael Ziegler <ich@michaelziegler.name> wrote:

> Hi,
> 
> I created a couple "unreachable" routes on one of my systems, like such:
> 
> > ip route add unreachable 10.0.0.0/8     metric 255
> > ip route add unreachable 192.168.0.0/16 metric 255  
> 
> Unfortunately this results in unparseable JSON output from "ip":
> 
> > # ip -j route show  | jq .
> > parse error: Objects must consist of key:value pairs at line 1, column 84  
> 
> The offending JSON objects are these:
> 
> > {"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}
> > {"unreachable","dst":"192.168.0.0/16","metric":255,"flags":[]}  
> "unreachable" cannot appear on its own here, it needs to be some kind of
> field.
> 
> The manpage says to report here, thus I do :) I've searched the
> archives, but I wasn't able to find any existing bug reports about this.
> I'm running version
> 
> > ip utility, iproute2-ss190107  
> 
> on Debian Buster.
> 
> Regards,
> Michael.

Already fixed upstream by:

commit 073661773872709518d35d4d093f3a715281f21d
Author: Matteo Croce <mcroce@redhat.com>
Date:   Mon Mar 18 18:19:29 2019 +0100

    ip route: print route type in JSON output
    
    ip route generates an invalid JSON if the route type has to be printed,
    eg. when detailed mode is active, or the type is different that unicast:
    
        $ ip -d -j -p route show
        [ {"unicast",
                "dst": "192.168.122.0/24",
                "dev": "virbr0",
                "protocol": "kernel",
                "scope": "link",
                "prefsrc": "192.168.122.1",
                "flags": [ "linkdown" ]
            } ]
    
        $ ip -j -p route show
        [ {"unreachable",
                "dst": "192.168.23.0/24",
                "flags": [ ]
            },{"prohibit",
                "dst": "192.168.24.0/24",
                "flags": [ ]
            },{"blackhole",
                "dst": "192.168.25.0/24",
                "flags": [ ]
            } ]
    
    Fix it by printing the route type as the "type" attribute:
    
        $ ip -d -j -p route show
        [ {
                "type": "unicast",
                "dst": "default",
                "gateway": "192.168.85.1",
                "dev": "wlp3s0",
                "protocol": "dhcp",
                "scope": "global",
                "metric": 600,
                "flags": [ ]
            },{
                "type": "unreachable",
                "dst": "192.168.23.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            },{
                "type": "prohibit",
                "dst": "192.168.24.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            },{
                "type": "blackhole",
                "dst": "192.168.25.0/24",
                "protocol": "boot",
                "scope": "global",
                "flags": [ ]
            } ]
    
    Fixes: 663c3cb23103 ("iproute: implement JSON and color output")
    Acked-by: Phil Sutter <phil@nwl.cc>
    Reviewed-and-tested-by: Andrea Claudi <aclaudi@redhat.com>
    Signed-off-by: Matteo Croce <mcroce@redhat.com>
    Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply

* RE: [EXT] Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Stefan Chulski @ 2019-07-28 15:22 UTC (permalink / raw)
  To: Matteo Croce, Antoine Tenart, Marcin Wojtas, Maxime Chevallier
  Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhz+PezeLT+gyXdsnyJz2dnKpYkcb2HbqvXJoLdzNxuC6g@mail.gmail.com>

> Hi all,
> 
> probably dev->vlan_features is safe to keep the CSUM features to avoid
> unnecessary calculation in some cases, but I have another question.
> Does the PP2 hardware support checksumming within any offset? I replaced
> 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
> then stacked 5 VxLANS on top of a mvpp2 device, to have the last IP header
> at offset 264:
> 
> ip link set $dev up
> ip addr add 192.168.0.$last/24 dev $dev
> 
> for i in {1..5}; do
> 	ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-
> 1)).$other
> 	ip link set vx$i up
> 	ip addr add 192.168.$i.$last/24 dev vx$i done
> 
> 00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348:
> 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
> 02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298:
> 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
> 12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248:
> 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
> c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198:
> 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
> 02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148:
> 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
> a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98:
> 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64
> 
> It seems that the HW is capable of doing it, can someone with a datasheet
> confirm this?

L3_offset in TX descriptor has 7 bits, so beginning of Layer3 should be less than 128 Bytes.

Stefan,
Regards.

^ permalink raw reply

* Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Matteo Croce @ 2019-07-28 14:30 UTC (permalink / raw)
  To: Antoine Tenart, Marcin Wojtas, Stefan Chulski, Maxime Chevallier
  Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhycOc8mvqeQDBcnXueUjrFQMC7hdfAOkxr5k0+xc_tnDw@mail.gmail.com>

On Sun, Jul 28, 2019 at 3:36 AM Matteo Croce <mcroce@redhat.com> wrote:
>
> On Fri, Jul 26, 2019 at 2:57 PM Antoine Tenart
> <antoine.tenart@bootlin.com> wrote:
> >
> > Hi Matteo,
> >
> > On Fri, Jul 26, 2019 at 01:15:46AM +0200, Matteo Croce wrote:
> > > The hardware can only offload checksum calculation on first port
> > > due to the Tx FIFO size limitation. Document this in a comment.
> > >
> > > Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
> > > Signed-off-by: Matteo Croce <mcroce@redhat.com>
> >
> > Looks good. Please note there's a similar code path in the probe.
> > You could also add a comment there (or move this check/comment in a
> > common place).
> >
> > Thanks!
> > Antoine
> >
>
> Hi Antoine,
>
> I was making a v2, when I looked at the mvpp2_port_probe() which does:
>
> --------------------------------%<------------------------------
> features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> NETIF_F_TSO;
>
> if (port->pool_long->id == MVPP2_BM_JUMBO && port->id != 0) {
>     dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
>     dev->hw_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
> }
>
> dev->vlan_features |= features;
> -------------------------------->%------------------------------
>
> Is it ok to remove NETIF_F_IP*_CSUM from dev->features and
> dev->hw_features but keep it in dev->vlan_features?
>
> Regards,
> --
> Matteo Croce
> per aspera ad upstream

Hi all,

probably dev->vlan_features is safe to keep the CSUM features to avoid
unnecessary calculation in some cases, but I have another question.
Does the PP2 hardware support checksumming within any offset? I
replaced 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
then stacked 5 VxLANS on top of a mvpp2 device, to have the last IP
header at offset 264:

ip link set $dev up
ip addr add 192.168.0.$last/24 dev $dev

for i in {1..5}; do
	ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-1)).$other
	ip link set vx$i up
	ip addr add 192.168.$i.$last/24 dev vx$i
done

00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348: 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298: 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248: 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198: 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148: 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98: 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64

It seems that the HW is capable of doing it, can someone with a
datasheet confirm this?

Regards,
-- 
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* Re: memory leak in fdb_create
From: syzbot @ 2019-07-28 14:20 UTC (permalink / raw)
  To: bridge, bsingharora, coreteam, davem, duwe, kaber, kadlec,
	linux-kernel, mingo, mpe, netdev, netfilter-devel, nikolay, pablo,
	roopa, rostedt, syzkaller-bugs
In-Reply-To: <0000000000005e6124058c0cbdbe@google.com>

syzbot has bisected this bug to:

commit 04cf31a759ef575f750a63777cee95500e410994
Author: Michael Ellerman <mpe@ellerman.id.au>
Date:   Thu Mar 24 11:04:01 2016 +0000

     ftrace: Make ftrace_location_range() global

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1538c778600000
start commit:   abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=1738c778600000
console output: https://syzkaller.appspot.com/x/log.txt?x=1338c778600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=56f1da14935c3cce
dashboard link: https://syzkaller.appspot.com/bug?extid=88533dc8b582309bf3ee
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16de5c06a00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10546026a00000

Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
Fixes: 04cf31a759ef ("ftrace: Make ftrace_location_range() global")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH] gigaset: stop maintaining seperately
From: Tilman Schmidt @ 2019-07-28 14:17 UTC (permalink / raw)
  To: Paul Bolle
  Cc: David Miller, Hansjoerg Lipp, Arnd Bergmann, Karsten Keil, netdev,
	linux-kernel
In-Reply-To: <20190726220541.28783-1-pebolle@tiscali.nl>

Thanks to you, Paul, for all your contributions, and specifically for
keeping the driver maintained for four more years after I had to abandon
it for the same reason.

I had a lot of fun working on that driver and I learned a lot in the
course. Now it's time to move on without regrets.

All the best,
Tilman

Am 27.07.2019 um 00:05 schrieb Paul Bolle:
> The Dutch consumer grade ISDN network will be shut down on September 1,
> 2019. This means I'll be converted to some sort of VOIP shortly. At that
> point it would be unwise to try to maintain the gigaset driver, even for
> odd fixes as I do. So I'll stop maintaining it as a seperate driver and
> bump support to CAPI in staging. De facto this means the driver will be
> unmaintained, since no-one seems to be working on CAPI.
> 
> I've lighty tested the hardware specific modules of this driver (bas-gigaset,
> ser-gigaset, and usb-gigaset) for v5.3-rc1. The basic functionality appears to
> be working. It's unclear whether anyone still cares. I'm aware of only one
> person sort of using the driver a few years ago.
> 
> Thanks to Karsten Keil for the ISDN subsystems gigaset was using (I4L and
> CAPI). And many thanks to Hansjoerg Lipp and Tilman Schmidt for writing and
> upstreaming this driver.
> 
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
>  MAINTAINERS | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 783569e3c4b4..e99afbd13355 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6822,13 +6822,6 @@ F:	Documentation/filesystems/gfs2*.txt
>  F:	fs/gfs2/
>  F:	include/uapi/linux/gfs2_ondisk.h
>  
> -GIGASET ISDN DRIVERS
> -M:	Paul Bolle <pebolle@tiscali.nl>
> -L:	gigaset307x-common@lists.sourceforge.net
> -W:	http://gigaset307x.sourceforge.net/
> -S:	Odd Fixes
> -F:	drivers/staging/isdn/gigaset/
> -
>  GNSS SUBSYSTEM
>  M:	Johan Hovold <johan@kernel.org>
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/johan/gnss.git
> 

^ permalink raw reply

* [PATCH net-next] rt2800usb: Add new rt2800usb device PLANEX GW-USMicroN
From: Masanari Iida @ 2019-07-28 14:07 UTC (permalink / raw)
  To: sgruszka, helmut.schaa, kvalo, davem, linux-wireless, netdev,
	linux-kernel
  Cc: Masanari Iida

This patch add a device ID for PLANEX GW-USMicroN.
Without this patch, I had to echo the device IDs in order to
recognize the device.

# lsusb |grep PLANEX
Bus 002 Device 005: ID 2019:ed14 PLANEX GW-USMicroN

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
 drivers/net/wireless/ralink/rt2x00/rt2800usb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
index fdf0504b5f1d..0dfb55c69b73 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
@@ -1086,6 +1086,7 @@ static const struct usb_device_id rt2800usb_device_table[] = {
 	{ USB_DEVICE(0x0846, 0x9013) },
 	{ USB_DEVICE(0x0846, 0x9019) },
 	/* Planex */
+	{ USB_DEVICE(0x2019, 0xed14) },
 	{ USB_DEVICE(0x2019, 0xed19) },
 	/* Ralink */
 	{ USB_DEVICE(0x148f, 0x3573) },
-- 
2.22.0.545.g9c9b961d7eb1


^ permalink raw reply related

* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Eric Dumazet @ 2019-07-28 13:54 UTC (permalink / raw)
  To: Josh Hunt; +Cc: netdev, David Miller
In-Reply-To: <a9ec9cfd-c381-c02e-7d67-e24373c693d6@akamai.com>

On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>
> On 7/27/19 12:05 AM, Eric Dumazet wrote:
> > On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
> >>
> >> The current implementation of TCP MTU probing can considerably
> >> underestimate the MTU on lossy connections allowing the MSS to get down to
> >> 48. We have found that in almost all of these cases on our networks these
> >> paths can handle much larger MTUs meaning the connections are being
> >> artificially limited. Even though TCP MTU probing can raise the MSS back up
> >> we have seen this not to be the case causing connections to be "stuck" with
> >> an MSS of 48 when heavy loss is present.
> >>
> >> Prior to pushing out this change we could not keep TCP MTU probing enabled
> >> b/c of the above reasons. Now with a reasonble floor set we've had it
> >> enabled for the past 6 months.
> >
> > And what reasonable value have you used ???
>
> Reasonable for some may not be reasonable for others hence the new
> sysctl :) We're currently running with a fairly high value based off of
> the v6 min MTU minus headers and options, etc. We went conservative with
> our setting initially as it seemed a reasonable first step when
> re-enabling TCP MTU probing since with no configurable floor we saw a #
> of cases where connections were using severely reduced mss b/c of loss
> and not b/c of actual path restriction. I plan to reevaluate the setting
> at some point, but since the probing method is still the same it means
> the same clients who got stuck with mss of 48 before will land at
> whatever floor we set. Looking forward we are interested in trying to
> improve TCP MTU probing so it does not penalize clients like this.
>
> A suggestion for a more reasonable floor default would be 512, which is
> the same as the min_pmtu. Given both mechanisms are trying to achieve
> the same goal it seems like they should have a similar min/floor.
>
> >
> >>
> >> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
> >> administrators the ability to control the floor of MSS probing.
> >>
> >> Signed-off-by: Josh Hunt <johunt@akamai.com>
> >> ---
> >>   Documentation/networking/ip-sysctl.txt | 6 ++++++
> >>   include/net/netns/ipv4.h               | 1 +
> >>   net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
> >>   net/ipv4/tcp_ipv4.c                    | 1 +
> >>   net/ipv4/tcp_timer.c                   | 2 +-
> >>   5 files changed, 18 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> >> index df33674799b5..49e95f438ed7 100644
> >> --- a/Documentation/networking/ip-sysctl.txt
> >> +++ b/Documentation/networking/ip-sysctl.txt
> >> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
> >>          Path MTU discovery (MTU probing).  If MTU probing is enabled,
> >>          this is the initial MSS used by the connection.
> >>
> >> +tcp_mtu_probe_floor - INTEGER
> >> +       If MTU probing is enabled this caps the minimum MSS used for search_low
> >> +       for the connection.
> >> +
> >> +       Default : 48
> >> +
> >>   tcp_min_snd_mss - INTEGER
> >>          TCP SYN and SYNACK messages usually advertise an ADVMSS option,
> >>          as described in RFC 1122 and RFC 6691.
> >> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> >> index bc24a8ec1ce5..c0c0791b1912 100644
> >> --- a/include/net/netns/ipv4.h
> >> +++ b/include/net/netns/ipv4.h
> >> @@ -116,6 +116,7 @@ struct netns_ipv4 {
> >>          int sysctl_tcp_l3mdev_accept;
> >>   #endif
> >>          int sysctl_tcp_mtu_probing;
> >> +       int sysctl_tcp_mtu_probe_floor;
> >>          int sysctl_tcp_base_mss;
> >>          int sysctl_tcp_min_snd_mss;
> >>          int sysctl_tcp_probe_threshold;
> >> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> >> index 0b980e841927..59ded25acd04 100644
> >> --- a/net/ipv4/sysctl_net_ipv4.c
> >> +++ b/net/ipv4/sysctl_net_ipv4.c
> >> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
> >>                  .extra2         = &tcp_min_snd_mss_max,
> >>          },
> >>          {
> >> +               .procname       = "tcp_mtu_probe_floor",
> >> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
> >> +               .maxlen         = sizeof(int),
> >> +               .mode           = 0644,
> >> +               .proc_handler   = proc_dointvec_minmax,
> >> +               .extra1         = &tcp_min_snd_mss_min,
> >> +               .extra2         = &tcp_min_snd_mss_max,
> >> +       },
> >> +       {
> >>                  .procname       = "tcp_probe_threshold",
> >>                  .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
> >>                  .maxlen         = sizeof(int),
> >> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> >> index d57641cb3477..e0a372676329 100644
> >> --- a/net/ipv4/tcp_ipv4.c
> >> +++ b/net/ipv4/tcp_ipv4.c
> >> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
> >>          net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
> >>          net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
> >>          net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
> >> +       net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
> >>
> >>          net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
> >>          net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
> >> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >> index c801cd37cc2a..dbd9d2d0ee63 100644
> >> --- a/net/ipv4/tcp_timer.c
> >> +++ b/net/ipv4/tcp_timer.c
> >> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
> >>          } else {
> >>                  mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> >>                  mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
> >> -               mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
> >> +               mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
> >>                  mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
> >>                  icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> >>          }
> >
> >
> > Existing sysctl should be enough ?
>
> I don't think so. Changing tcp_min_snd_mss could impact clients that
> really want/need a small mss. When you added the new sysctl I tried to
> analyze the mss values we're seeing to understand what we could possibly
> raise it to. While not a huge amount, we see more clients than I
> expected announcing mss values in the 180-512 range. Given that I would
> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
> above.

If these clients need mss values in 180-512 ranges, how MTU probing
would work for them,
if you set a floor to 512 ?

Are we sure the intent of tcp_base_mss was not to act as a floor ?

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e
100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct
inet_connection_sock *icsk, struct sock *sk)
                icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
        } else {
                mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
-               mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
+               mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
                mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
                mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
                icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);



>
> >
> > tcp_min_snd_mss  documentation could be slightly updated.
> >
> > And maybe its default value could be raised a bit.
> >
>
> Thanks
> Josh

^ permalink raw reply

* Re: [PATCH net] net: hns: fix LED configuration for marvell phy
From: Pavel Machek @ 2019-07-28 13:24 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: liuyonglong, David Miller, netdev, linux-kernel, linuxarm,
	salil.mehta, yisen.zhuang, shiju.jose
In-Reply-To: <20190725042829.GB14276@lunn.ch>

On Thu 2019-07-25 06:28:29, Andrew Lunn wrote:
> On Thu, Jul 25, 2019 at 11:00:08AM +0800, liuyonglong wrote:
> > > Revert "net: hns: fix LED configuration for marvell phy"
> > > This reverts commit f4e5f775db5a4631300dccd0de5eafb50a77c131.
> > >
> > > Andrew Lunn says this should be handled another way.
> > >
> > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > 
> > Hi Andrew:
> > 
> > I see this patch have been reverted, can you tell me the better way to do this?
> > Thanks very much!
> 
> Please take a look at the work Matthias Kaehlcke is doing. It has not
> got too far yet, but when it is complete, it should define a generic
> way to configure PHY LEDs.

I don't remember PHY LED discussion from LED mailing list. Would you have a pointer?
Would it make sense to coordinate with LED subsystem?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* [patch net] net: fix ifindex collision during namespace removal
From: Jiri Pirko @ 2019-07-28 12:56 UTC (permalink / raw)
  To: netdev
  Cc: davem, xemul, edumazet, pabeni, idosch, petrm, sd, f.fainelli,
	stephen, mlxsw, Jiri Pirko

From: Jiri Pirko <jiri@mellanox.com>

Commit aca51397d014 ("netns: Fix arbitrary net_device-s corruptions
on net_ns stop.") introduced a possibility to hit a BUG in case device
is returning back to init_net and two following conditions are met:
1) dev->ifindex value is used in a name of another "dev%d"
   device in init_net.
2) dev->name is used by another device in init_net.

Under real life circumstances this is hard to get. Therefore this has
been present happily for over 10 years. To reproduce:

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
$ ip netns add ns1
$ ip -n ns1 link add dummy1ns1 type dummy
$ ip -n ns1 link add dummy2ns1 type dummy
$ ip link set enp0s2 netns ns1
$ ip -n ns1 link set enp0s2 name dummy0
[  100.858894] virtio_net virtio0 dummy0: renamed from enp0s2
$ ip link add dev4 type dummy
$ ip -n ns1 a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff
3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff
4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff
$ ip netns del ns1
[  158.717795] default_device_exit: failed to move dummy0 to init_net: -17
[  158.719316] ------------[ cut here ]------------
[  158.720591] kernel BUG at net/core/dev.c:9824!
[  158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI
[  158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18
[  158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[  158.727508] Workqueue: netns cleanup_net
[  158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f
[  158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
[  158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
[  158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
[  158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
[  158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
[  158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
[  158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
[  158.750638] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  158.752944] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
[  158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  158.762758] Call Trace:
[  158.763882]  ? dev_change_net_namespace+0xbb0/0xbb0
[  158.766148]  ? devlink_nl_cmd_set_doit+0x520/0x520
[  158.768034]  ? dev_change_net_namespace+0xbb0/0xbb0
[  158.769870]  ops_exit_list.isra.0+0xa8/0x150
[  158.771544]  cleanup_net+0x446/0x8f0
[  158.772945]  ? unregister_pernet_operations+0x4a0/0x4a0
[  158.775294]  process_one_work+0xa1a/0x1740
[  158.776896]  ? pwq_dec_nr_in_flight+0x310/0x310
[  158.779143]  ? do_raw_spin_lock+0x11b/0x280
[  158.780848]  worker_thread+0x9e/0x1060
[  158.782500]  ? process_one_work+0x1740/0x1740
[  158.784454]  kthread+0x31b/0x420
[  158.786082]  ? __kthread_create_on_node+0x3f0/0x3f0
[  158.788286]  ret_from_fork+0x3a/0x50
[  158.789871] ---[ end trace defd6c657c71f936 ]---
[  158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f
[  158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
[  158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
[  158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
[  158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
[  158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
[  158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
[  158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
[  158.829899] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[  158.834923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
[  158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fix this by checking if a device with the same name exists in init_net
and fallback to original code - dev%d to allocate name - in case it does.

This was found using syzkaller.

Fixes: aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2a3be2b279d3..1a24ba26b098 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9817,6 +9817,8 @@ static void __net_exit default_device_exit(struct net *net)
 
 		/* Push remaining network devices to init_net */
 		snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
+		if (__dev_get_by_name(&init_net, fb_name))
+			snprintf(fb_name, IFNAMSIZ, "dev%%d");
 		err = dev_change_net_namespace(dev, &init_net, fb_name);
 		if (err) {
 			pr_emerg("%s: failed to move %s to init_net: %d\n",
-- 
2.21.0


^ permalink raw reply related

* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-07-28 11:16 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Martin Lau, Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Clang-Built-Linux ML, Kees Cook, Nick Desaulniers,
	Nathan Chancellor
In-Reply-To: <57169960-35c2-d9d3-94e4-3b5a43d5aca7@fb.com>

On Sat, Jul 27, 2019 at 7:11 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/27/19 1:16 AM, Sedat Dilek wrote:
> > On Sat, Jul 27, 2019 at 9:36 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>
> >> On Sat, Jul 27, 2019 at 4:24 AM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Fri, Jul 26, 2019 at 2:19 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>
> >>>> On Fri, Jul 26, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 7/26/19 2:02 PM, Sedat Dilek wrote:
> >>>>>> On Fri, Jul 26, 2019 at 10:38 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Yonghong Song,
> >>>>>>>
> >>>>>>> On Fri, Jul 26, 2019 at 5:45 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7/26/19 1:26 AM, Sedat Dilek wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I have opened a new issue in the ClangBuiltLinux issue tracker.
> >>>>>>>>
> >>>>>>>> Glad to know clang 9 has asm goto support and now It can compile
> >>>>>>>> kernel again.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yupp.
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> I am seeing a problem in the area bpf/seccomp causing
> >>>>>>>>> systemd/journald/udevd services to fail.
> >>>>>>>>>
> >>>>>>>>> [Fri Jul 26 08:08:43 2019] systemd[453]: systemd-udevd.service: Failed
> >>>>>>>>> to connect stdout to the journal socket, ignoring: Connection refused
> >>>>>>>>>
> >>>>>>>>> This happens when I use the (LLVM) LLD ld.lld-9 linker but not with
> >>>>>>>>> BFD linker ld.bfd on Debian/buster AMD64.
> >>>>>>>>> In both cases I use clang-9 (prerelease).
> >>>>>>>>
> >>>>>>>> Looks like it is a lld bug.
> >>>>>>>>
> >>>>>>>> I see the stack trace has __bpf_prog_run32() which is used by
> >>>>>>>> kernel bpf interpreter. Could you try to enable bpf jit
> >>>>>>>>      sysctl net.core.bpf_jit_enable = 1
> >>>>>>>> If this passed, it will prove it is interpreter related.
> >>>>>>>>
> >>>>>>>
> >>>>>>> After...
> >>>>>>>
> >>>>>>> sysctl -w net.core.bpf_jit_enable=1
> >>>>>>>
> >>>>>>> I can start all failed systemd services.
> >>>>>>>
> >>>>>>> systemd-journald.service
> >>>>>>> systemd-udevd.service
> >>>>>>> haveged.service
> >>>>>>>
> >>>>>>> This is in maintenance mode.
> >>>>>>>
> >>>>>>> What is next: Do set a permanent sysctl setting for net.core.bpf_jit_enable?
> >>>>>>>
> >>>>>>
> >>>>>> This is what I did:
> >>>>>
> >>>>> I probably won't have cycles to debug this potential lld issue.
> >>>>> Maybe you already did, I suggest you put enough reproducible
> >>>>> details in the bug you filed against lld so they can take a look.
> >>>>>
> >>>>
> >>>> I understand and will put the journalctl-log into the CBL issue
> >>>> tracker and update informations.
> >>>>
> >>>> Thanks for your help understanding the BPF correlations.
> >>>>
> >>>> Is setting 'net.core.bpf_jit_enable = 2' helpful here?
> >>>
> >>> jit_enable=1 is enough.
> >>> Or use CONFIG_BPF_JIT_ALWAYS_ON to workaround.
> >>>
> >>> It sounds like clang miscompiles interpreter.
> >
> > Just to clarify:
> > This does not happen with clang-9 + ld.bfd (GNU/ld linker).
> >
> >>> modprobe test_bpf
> >>> should be able to point out which part of interpreter is broken.
> >>
> >> Maybe we need something like...
> >>
> >> "bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()"
> >>
> >> ...for clang?
> >>
> >
> > Not sure if something like GCC's...
> >
> > -fgcse
> >
> > Perform a global common subexpression elimination pass. This pass also
> > performs global constant and copy propagation.
> >
> > Note: When compiling a program using computed gotos, a GCC extension,
> > you may get better run-time performance if you disable the global
> > common subexpression elimination pass by adding -fno-gcse to the
> > command line.
> >
> > Enabled at levels -O2, -O3, -Os.
> >
> > ...is available for clang.
> >
> > I tried with hopping to turn off "global common subexpression elimination":
> >
> > diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
> > index 383c87300b0d..92f934a1e9ff 100644
> > --- a/arch/x86/net/Makefile
> > +++ b/arch/x86/net/Makefile
> > @@ -3,6 +3,8 @@
> >   # Arch-specific network modules
> >   #
> >
> > +KBUILD_CFLAGS += -O0
>
> This won't work. First, you added to the wrong file. The interpreter
> is at kernel/bpf/core.c.
>

Thanks for the clarification.
I mixed up the x86 BPF JIT compiler with the BPF interpreter.

I see no diff in the disassembled kernel/bpf/core.o in my clang9-bfd
and clang9-lld build-dirs.

l$ objdump -M intel -d linux.clang9-bfd/kernel/bpf/core.o >
bpf_core_o_clang9-bfd.txt
$ objdump -M intel -d linux.clang9-lld/kernel/bpf/core.o >
bpf_core_o_clang9-lld.txt

--- bpf_core_o_clang9-bfd.txt   2019-07-28 13:11:59.363552042 +0200
+++ bpf_core_o_clang9-lld.txt   2019-07-28 13:12:09.975535278 +0200
@@ -1,5 +1,5 @@

-linux.clang9-bfd/kernel/bpf/core.o:     file format elf64-x86-64
+linux.clang9-lld/kernel/bpf/core.o:     file format elf64-x86-64


 Disassembly of section .text:

> Second, kernel may have compilation issues with -O0.
>

Confirmed.

- Sedat -

> > +
> >   ifeq ($(CONFIG_X86_32),y)
> >           obj-$(CONFIG_BPF_JIT) += bpf_jit_comp32.o
> >   else
> >
> > Still see...
> > BROKEN: test_bpf: #294 BPF_MAXINSNS: Jump, gap, jump, ... jited:0
> >
> > - Sedat -
> >

^ permalink raw reply

* ip route JSON format is unparseable for "unreachable" routes
From: Michael Ziegler @ 2019-07-28 11:09 UTC (permalink / raw)
  To: netdev

Hi,

I created a couple "unreachable" routes on one of my systems, like such:

> ip route add unreachable 10.0.0.0/8     metric 255
> ip route add unreachable 192.168.0.0/16 metric 255

Unfortunately this results in unparseable JSON output from "ip":

> # ip -j route show  | jq .
> parse error: Objects must consist of key:value pairs at line 1, column 84

The offending JSON objects are these:

> {"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}
> {"unreachable","dst":"192.168.0.0/16","metric":255,"flags":[]}
"unreachable" cannot appear on its own here, it needs to be some kind of
field.

The manpage says to report here, thus I do :) I've searched the
archives, but I wasn't able to find any existing bug reports about this.
I'm running version

> ip utility, iproute2-ss190107

on Debian Buster.

Regards,
Michael.

^ permalink raw reply

* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-07-28 11:09 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Martin Lau, Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Clang-Built-Linux ML, Kees Cook, Nick Desaulniers,
	Nathan Chancellor
In-Reply-To: <934a2a0a-c3fb-fd75-b8a3-c1042d73ca0c@fb.com>

On Sat, Jul 27, 2019 at 7:08 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/27/19 12:36 AM, Sedat Dilek wrote:
> > On Sat, Jul 27, 2019 at 4:24 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Fri, Jul 26, 2019 at 2:19 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>
> >>> On Fri, Jul 26, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/26/19 2:02 PM, Sedat Dilek wrote:
> >>>>> On Fri, Jul 26, 2019 at 10:38 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Yonghong Song,
> >>>>>>
> >>>>>> On Fri, Jul 26, 2019 at 5:45 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/26/19 1:26 AM, Sedat Dilek wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I have opened a new issue in the ClangBuiltLinux issue tracker.
> >>>>>>>
> >>>>>>> Glad to know clang 9 has asm goto support and now It can compile
> >>>>>>> kernel again.
> >>>>>>>
> >>>>>>
> >>>>>> Yupp.
> >>>>>>
> >>>>>>>>
> >>>>>>>> I am seeing a problem in the area bpf/seccomp causing
> >>>>>>>> systemd/journald/udevd services to fail.
> >>>>>>>>
> >>>>>>>> [Fri Jul 26 08:08:43 2019] systemd[453]: systemd-udevd.service: Failed
> >>>>>>>> to connect stdout to the journal socket, ignoring: Connection refused
> >>>>>>>>
> >>>>>>>> This happens when I use the (LLVM) LLD ld.lld-9 linker but not with
> >>>>>>>> BFD linker ld.bfd on Debian/buster AMD64.
> >>>>>>>> In both cases I use clang-9 (prerelease).
> >>>>>>>
> >>>>>>> Looks like it is a lld bug.
> >>>>>>>
> >>>>>>> I see the stack trace has __bpf_prog_run32() which is used by
> >>>>>>> kernel bpf interpreter. Could you try to enable bpf jit
> >>>>>>>      sysctl net.core.bpf_jit_enable = 1
> >>>>>>> If this passed, it will prove it is interpreter related.
> >>>>>>>
> >>>>>>
> >>>>>> After...
> >>>>>>
> >>>>>> sysctl -w net.core.bpf_jit_enable=1
> >>>>>>
> >>>>>> I can start all failed systemd services.
> >>>>>>
> >>>>>> systemd-journald.service
> >>>>>> systemd-udevd.service
> >>>>>> haveged.service
> >>>>>>
> >>>>>> This is in maintenance mode.
> >>>>>>
> >>>>>> What is next: Do set a permanent sysctl setting for net.core.bpf_jit_enable?
> >>>>>>
> >>>>>
> >>>>> This is what I did:
> >>>>
> >>>> I probably won't have cycles to debug this potential lld issue.
> >>>> Maybe you already did, I suggest you put enough reproducible
> >>>> details in the bug you filed against lld so they can take a look.
> >>>>
> >>>
> >>> I understand and will put the journalctl-log into the CBL issue
> >>> tracker and update informations.
> >>>
> >>> Thanks for your help understanding the BPF correlations.
> >>>
> >>> Is setting 'net.core.bpf_jit_enable = 2' helpful here?
> >>
> >> jit_enable=1 is enough.
> >> Or use CONFIG_BPF_JIT_ALWAYS_ON to workaround.
> >>
> >> It sounds like clang miscompiles interpreter.
> >> modprobe test_bpf
> >> should be able to point out which part of interpreter is broken.
> >
> > Maybe we need something like...
> >
> > "bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()"
> >
> > ...for clang?
>
> Not sure how do you get conclusion it is gcse causing the problem.
> But anyway, adding such flag in the kernel is not a good idea.
> clang/llvm should be fixed instead. Esp. there is still time
> for 9.0.0 release to fix bugs.
>

To clarify: This is a snapshot release of clang-9 built with tc-build.

Building with -O0 is not possible as I see asm-goto failing.

- Sedat -

[1] https://github.com/ClangBuiltLinux/tc-build

> >
> > - Sedat -
> >
> > [1] https://git.kernel.org/linus/3193c0836f203a91bef96d88c64cccf0be090d9c
> >

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox