Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 22:21 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ioana Ciornei, netdev@vger.kernel.org, davem@davemloft.net,
	Vladimir Oltean, Claudiu Manoil, Alexandru Marginean,
	michael@walle.cc, f.fainelli@gmail.com
In-Reply-To: <20200618220331.GA279339@lunn.ch>

On Fri, Jun 19, 2020 at 12:03:31AM +0200, Andrew Lunn wrote:
> > Are there really instances where the ethernet driver has to manage multiple
> > different types of PCSs? I am not sure this type of snippet of code is really
> > going to occur.
> 
> Hi Ioana
> 
> The Marvell mv88e6390 family has three PCS's, one for SGMII/1000BaseX,
> a 10Gbase-X4/X2 and a 10GBAse-R. So this sort of code could appear.

It in fact already does, but only for the 1G and 10GBase-R cases.
We don't have a PHY interface mode for 10Gbase-X4/X2, so I never
added that, but if it happens, we end up with a third case.

It may be tempting to suggest that moving the mv88e6390 PCS support
to common code would be a good idea, but it's really not - the PCS
don't follow IEEE 802.3 register layout.  The 1G registers are in
the PHYXS MMD, following Clause 22 layout at offset 0x2000.  The
10GBASE-R registers are in the PHYXS MMD at offset 0x1000.

So, I don't think it's sensible to compare the mv88e6390 switch
with this; the mv88e6390 serdes PCS is unlikely to be useful
outside of the DSA switch environment.

However, the Lynx PCS appears to be used in a range of different
situations, which include DSA switches and conventional ethernet
drivers.  That means we need some kind of solution where the code
driving the PCS does not rely on the code structures for the
device it is embedded with.

The solution I've suggested for DSA may help us get towards having
generic PCS drivers, but it doesn't fully solve the problem.  I
would ideally like DSA drivers to have access to "struct phylink"
so that they can attach the PCS themselves without having any of
the DSA layering veneers get in the way between phylink and the
PCS code - thereby allowing the PCS code to be re-used cleanly
with a conventional network driver.

Basically, I'm thinking is:

int phylink_attach_pcs(struct phylink *pl, const struct phylink_pcs_ops *ops,
		       void *pcs_priv);

which the PCS driver itself would call when requested to by the
DSA driver or ethernet driver.

Thoughts?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Vladimir Oltean @ 2020-06-18 22:25 UTC (permalink / raw)
  To: Andrew Lunn, Ioana Ciornei
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Claudiu Manoil,
	Alexandru Marginean, michael@walle.cc, linux@armlinux.org.uk,
	f.fainelli@gmail.com
In-Reply-To: <20200618221352.GB279339@lunn.ch>

Hi Andrew,

On 6/19/20 1:13 AM, Andrew Lunn wrote:
> 
>>   MAINTAINERS                     |   7 +
>>   drivers/net/phy/Kconfig         |   6 +
>>   drivers/net/phy/Makefile        |   1 +
>>   drivers/net/phy/mdio-lynx-pcs.c | 358 ++++++++++++++++++++++++++++++++
>>   include/linux/mdio-lynx-pcs.h   |  43 ++++
>>   5 files changed, 415 insertions(+)
>>   create mode 100644 drivers/net/phy/mdio-lynx-pcs.c
>>   create mode 100644 include/linux/mdio-lynx-pcs.h
> 
> Hi Ioana
> 
> We should think about naming convention here.
> 
> All MDIO bus driver, MDIO multiplexors etc use mdio- as a prefix.
> 
> This is not a bus driver, so i don't think it should use the mdio-
> prefix. How about pcs-lynx.c?
> 
> In terms of Kconfig, MDIO_ prefix is used for MDIO bus drivers etc.  I
> don't think it is appropriate here. How about PCS_LYNX? I don't think
> any other subsystem is using PCS_ as a prefix.
> 
>> --- a/drivers/net/phy/Kconfig
>> +++ b/drivers/net/phy/Kconfig
>> @@ -235,6 +235,12 @@ config MDIO_XPCS
>>          This module provides helper functions for Synopsys DesignWare XPCS
>>          controllers.
>>
>> +config MDIO_LYNX_PCS
>> +     bool
>> +     help
>> +       This module provides helper functions for Lynx PCS enablement
>> +       representing the PCS as an MDIO device.
>> +
>>   endif
>>   endif
> 
> Maybe add this at the end, and add a
> 
> comment "PCS device drivers"
> 
> before it? I'm assuming with time we will have more of these drivers.
> 
>         Andrew
> 

This driver faithfully follows the model of drivers/net/phy/mdio-xpcs.c. 
Should we rename that as well?

Thanks,
-Vladimir

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 22:27 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ioana Ciornei, netdev, davem, vladimir.oltean, claudiu.manoil,
	alexandru.marginean, michael, f.fainelli
In-Reply-To: <20200618221352.GB279339@lunn.ch>

On Fri, Jun 19, 2020 at 12:13:52AM +0200, Andrew Lunn wrote:
> >  MAINTAINERS                     |   7 +
> >  drivers/net/phy/Kconfig         |   6 +
> >  drivers/net/phy/Makefile        |   1 +
> >  drivers/net/phy/mdio-lynx-pcs.c | 358 ++++++++++++++++++++++++++++++++
> >  include/linux/mdio-lynx-pcs.h   |  43 ++++
> >  5 files changed, 415 insertions(+)
> >  create mode 100644 drivers/net/phy/mdio-lynx-pcs.c
> >  create mode 100644 include/linux/mdio-lynx-pcs.h
> 
> Hi Ioana
> 
> We should think about naming convention here.
> 
> All MDIO bus driver, MDIO multiplexors etc use mdio- as a prefix.
> 
> This is not a bus driver, so i don't think it should use the mdio-
> prefix. How about pcs-lynx.c?
> 
> In terms of Kconfig, MDIO_ prefix is used for MDIO bus drivers etc.  I
> don't think it is appropriate here. How about PCS_LYNX? I don't think
> any other subsystem is using PCS_ as a prefix.
> 
> > --- a/drivers/net/phy/Kconfig
> > +++ b/drivers/net/phy/Kconfig
> > @@ -235,6 +235,12 @@ config MDIO_XPCS
> >  	  This module provides helper functions for Synopsys DesignWare XPCS
> >  	  controllers.
> >  
> > +config MDIO_LYNX_PCS
> > +	bool
> > +	help
> > +	  This module provides helper functions for Lynx PCS enablement
> > +	  representing the PCS as an MDIO device.
> > +
> >  endif
> >  endif
> 
> Maybe add this at the end, and add a
> 
> comment "PCS device drivers"
> 
> before it? I'm assuming with time we will have more of these drivers.

It think we will.

The other thing is, drivers/net/phy is becoming a little cluttered -
we have approaching 100 files there.

Should we be thinking about drivers/net/phy/mdio/ (for mdio*),
drivers/net/phy/lib/ for the core phylib code or leaving it where
it is, and, hmm, drivers/net/phy/media/ maybe for the PHY and PCS
drivers?  Or something like that?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [net-next PATCH] net: Avoid overwriting valid skb->napi_id
From: Eric Dumazet @ 2020-06-18 22:29 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: netdev, David Miller, Alexander Duyck, Eliezer Tamir,
	Samudrala, Sridhar
In-Reply-To: <159251533557.7557.5381023439094175695.stgit@anambiarhost.jf.intel.com>

On Thu, Jun 18, 2020 at 2:21 PM Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
>
> This will be useful to allow busy poll for tunneled traffic. In case of
> busy poll for sessions over tunnels, the underlying physical device's
> queues need to be polled.
>
> Tunnels schedule NAPI either via netif_rx() for backlog queue or
> schedule the gro_cell_poll(). netif_rx() propagates the valid skb->napi_id
> to the socket. OTOH, gro_cell_poll() stamps the skb->napi_id again by
> calling skb_mark_napi_id() with the tunnel NAPI which is not a busy poll
> candidate.


Yes the tunnel NAPI id should be 0 really (look for NAPI_STATE_NO_BUSY_POLL)

> This was preventing tunneled traffic to use busy poll. A valid
> NAPI ID in the skb indicates it was already marked for busy poll by a
> NAPI driver and hence needs to be copied into the socket.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  include/net/busy_poll.h |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 86e028388bad..b001fa91c14e 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -114,7 +114,11 @@ static inline void skb_mark_napi_id(struct sk_buff *skb,
>                                     struct napi_struct *napi)
>  {
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> -       skb->napi_id = napi->napi_id;
> +       /* If the skb was already marked with a valid NAPI ID, avoid overwriting
> +        * it.
> +        */
> +       if (skb->napi_id < MIN_NAPI_ID)
> +               skb->napi_id = napi->napi_id;


It should be faster to not depend on MIN_NAPI_ID (aka NR_CPUS + 1)

    if (napi->napi_id)
       skb->napi_id = napi->napi_id;

Probably not a big deal.

Reviewed-by: Eric Dumazet <edumazet@google.com>






>
>  #endif
>  }
>
>

^ permalink raw reply

* Re: [Patch net] net: change addr_list_lock back to static key
From: Andrew Lunn @ 2020-06-18 22:29 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Cong Wang, netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo,
	Dmitry Vyukov
In-Reply-To: <CA+h21hpjsth_1t1ZaBcTd1i3RPXZGqzSyegSSPS2Ns=uq5-HJw@mail.gmail.com>

On Thu, Jun 18, 2020 at 11:33:44PM +0300, Vladimir Oltean wrote:
> On Thu, 18 Jun 2020 at 23:06, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 12:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > > >
> > > > It's me with the stacked DSA devices again:
> > >
> > > It looks like DSA never uses netdev API to link master
> > > device with slave devices? If so, their dev->lower_level
> > > are always 1, therefore triggers this warning.
> > >
> > > I think it should call one of these netdev_upper_dev_link()
> > > API's when creating a slave device.
> > >
> >
> > I don't know whether DSA is too special to use the API, but
> > something like this should work:
> >
> > diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> > index 4c7f086a047b..f7a2a281e7f0 100644
> > --- a/net/dsa/slave.c
> > +++ b/net/dsa/slave.c
> > @@ -1807,6 +1807,11 @@ int dsa_slave_create(struct dsa_port *port)
> >                            ret, slave_dev->name);
> >                 goto out_phy;
> >         }
> > +       ret = netdev_upper_dev_link(slave_dev, master, NULL);
> > +       if (ret) {
> > +               unregister_netdevice(slave_dev);
> > +               goto out_phy;
> > +       }
> >
> >         return 0;
> >
> > @@ -1832,6 +1837,7 @@ void dsa_slave_destroy(struct net_device *slave_dev)
> >         netif_carrier_off(slave_dev);
> >         rtnl_lock();
> >         phylink_disconnect_phy(dp->pl);
> > +       netdev_upper_dev_unlink(slave_dev, dp->master);
> >         rtnl_unlock();
> >
> >         dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
> 
> Thanks. This is a good approximation of what needed to be done:
> - netdev_upper_dev_link needs to be under rtnl,
> - "dp->master" should be "dsa_slave_to_master(slave_dev)" since it's
> actually a union if you look at struct dsa_port).

> - And, most importantly, I think the hierarchy should be reversed: a
> (virtual) DSA switch port net device (slave) should be an upper of the
> (real) DSA master (the host port). Think of it like this: a DSA switch
> is a sort of port multiplier for a host port, based on a frame header.
> But, it works!

Hi Vladimir

So you are suggesting this?

> > +       ret = netdev_upper_dev_link(master, slave_dev, NULL);

  Andrew

^ permalink raw reply

* Re: [Patch net] net: change addr_list_lock back to static key
From: Vladimir Oltean @ 2020-06-18 22:32 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Cong Wang, netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo,
	Dmitry Vyukov
In-Reply-To: <20200618222959.GC279339@lunn.ch>

On Fri, 19 Jun 2020 at 01:30, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Thu, Jun 18, 2020 at 11:33:44PM +0300, Vladimir Oltean wrote:
> > On Thu, 18 Jun 2020 at 23:06, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > > > >
> > > > > It's me with the stacked DSA devices again:
> > > >
> > > > It looks like DSA never uses netdev API to link master
> > > > device with slave devices? If so, their dev->lower_level
> > > > are always 1, therefore triggers this warning.
> > > >
> > > > I think it should call one of these netdev_upper_dev_link()
> > > > API's when creating a slave device.
> > > >
> > >
> > > I don't know whether DSA is too special to use the API, but
> > > something like this should work:
> > >
> > > diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> > > index 4c7f086a047b..f7a2a281e7f0 100644
> > > --- a/net/dsa/slave.c
> > > +++ b/net/dsa/slave.c
> > > @@ -1807,6 +1807,11 @@ int dsa_slave_create(struct dsa_port *port)
> > >                            ret, slave_dev->name);
> > >                 goto out_phy;
> > >         }
> > > +       ret = netdev_upper_dev_link(slave_dev, master, NULL);
> > > +       if (ret) {
> > > +               unregister_netdevice(slave_dev);
> > > +               goto out_phy;
> > > +       }
> > >
> > >         return 0;
> > >
> > > @@ -1832,6 +1837,7 @@ void dsa_slave_destroy(struct net_device *slave_dev)
> > >         netif_carrier_off(slave_dev);
> > >         rtnl_lock();
> > >         phylink_disconnect_phy(dp->pl);
> > > +       netdev_upper_dev_unlink(slave_dev, dp->master);
> > >         rtnl_unlock();
> > >
> > >         dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
> >
> > Thanks. This is a good approximation of what needed to be done:
> > - netdev_upper_dev_link needs to be under rtnl,
> > - "dp->master" should be "dsa_slave_to_master(slave_dev)" since it's
> > actually a union if you look at struct dsa_port).
>
> > - And, most importantly, I think the hierarchy should be reversed: a
> > (virtual) DSA switch port net device (slave) should be an upper of the
> > (real) DSA master (the host port). Think of it like this: a DSA switch
> > is a sort of port multiplier for a host port, based on a frame header.
> > But, it works!
>
> Hi Vladimir
>
> So you are suggesting this?
>
> > > +       ret = netdev_upper_dev_link(master, slave_dev, NULL);
>
>   Andrew

Yes, basically this:

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 4c7f086a047b..6aff8cfc9cf1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1807,6 +1807,13 @@ int dsa_slave_create(struct dsa_port *port)
                           ret, slave_dev->name);
                goto out_phy;
        }
+       rtnl_lock();
+       ret = netdev_upper_dev_link(master, slave_dev, NULL);
+       rtnl_unlock();
+       if (ret) {
+               unregister_netdevice(slave_dev);
+               goto out_phy;
+       }

        return 0;

@@ -1826,12 +1833,14 @@ int dsa_slave_create(struct dsa_port *port)

 void dsa_slave_destroy(struct net_device *slave_dev)
 {
+       struct net_device *master = dsa_slave_to_master(slave_dev);
        struct dsa_port *dp = dsa_slave_to_port(slave_dev);
        struct dsa_slave_priv *p = netdev_priv(slave_dev);

        netif_carrier_off(slave_dev);
        rtnl_lock();
        phylink_disconnect_phy(dp->pl);
+       netdev_upper_dev_unlink(master, slave_dev);
        rtnl_unlock();

        dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);

Do you see a problem with it?

Thanks,
-Vladimir

^ permalink raw reply related

* Re: [PATCH RFC v3 bpf-next 2/4] bpf: Add bpf_copy_from_user() helper.
From: Andrii Nakryiko @ 2020-06-18 22:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Paul E . McKenney, Networking,
	bpf, Kernel Team
In-Reply-To: <20200611222340.24081-3-alexei.starovoitov@gmail.com>

On Thu, Jun 11, 2020 at 3:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Sleepable BPF programs can now use copy_from_user() to access user memory.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 11 ++++++++++-
>  kernel/bpf/helpers.c           | 22 ++++++++++++++++++++++
>  kernel/trace/bpf_trace.c       |  2 ++
>  tools/include/uapi/linux/bpf.h | 11 ++++++++++-
>  5 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6819000682a5..c8c9217f3ac9 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1632,6 +1632,7 @@ extern const struct bpf_func_proto bpf_ringbuf_reserve_proto;
>  extern const struct bpf_func_proto bpf_ringbuf_submit_proto;
>  extern const struct bpf_func_proto bpf_ringbuf_discard_proto;
>  extern const struct bpf_func_proto bpf_ringbuf_query_proto;
> +extern const struct bpf_func_proto bpf_copy_from_user_proto;
>
>  const struct bpf_func_proto *bpf_tracing_func_proto(
>         enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0bef454c9598..a38c806d34ad 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3260,6 +3260,13 @@ union bpf_attr {
>   *             case of **BPF_CSUM_LEVEL_QUERY**, the current skb->csum_level
>   *             is returned or the error code -EACCES in case the skb is not
>   *             subject to CHECKSUM_UNNECESSARY.
> + *
> + * int bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)

Can we also add bpf_copy_str_from_user (or bpf_copy_from_user_str,
whichever makes more sense) as well?

> + *     Description
> + *             Read *size* bytes from user space address *user_ptr* and store
> + *             the data in *dst*. This is a wrapper of copy_from_user().
> + *     Return
> + *             0 on success, or a negative error in case of failure.
>   */

[...]

^ permalink raw reply

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
From: Eric Dumazet @ 2020-06-18 22:34 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan,
	Michal Kubecek
In-Reply-To: <20200618215053.qxnjegm4h5i3mvfu@bsd-mbp.dhcp.thefacebook.com>



On 6/18/20 2:50 PM, Jonathan Lemon wrote:
> On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
>>
>>
>> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
>>> Adds a "rx_hd_split" private flag parameter to ethtool.
>>>
>>> This enables header splitting, and sets up the fragment mappings.
>>> The feature is currently only enabled for netgpu channels.
>>
>> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
>> to enable TCP RX zerocopy on x86.
>>
>> Patch for mlx4 has not been sent upstream yet.
>>
>> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
>> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
>>
>> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
>>
>>
>> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
>> so that the exact number of bytes can be tuned, instead of hard coded in the driver.
> 
> Sounds reasonable - in the long run, it would be ideal to have the
> hardware actually perform header splitting, but for now using a tunable
> fixed offset will work.  In the same vein, there should be a similar
> setting for the TCP option padding on the sender side.
> 

Some NIC have variable header split (Intel ixgbe I am pretty sure)

We use a mix of NIC, some with variable header splits, some with fixed pseudo header split (mlx4)

Because of this, we had to limit TCP advmss to 4108 (4096 + 12), regardless of the NIC abilities.



^ permalink raw reply

* Re: [PATCH RFC v3 bpf-next 3/4] libbpf: support sleepable progs
From: Andrii Nakryiko @ 2020-06-18 22:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Paul E . McKenney, Networking,
	bpf, Kernel Team
In-Reply-To: <20200611222340.24081-4-alexei.starovoitov@gmail.com>

On Thu, Jun 11, 2020 at 3:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Pass request to load program as sleepable via ".s" suffix in the section name.
> If it happens in the future that all map types and helpers are allowed with
> BPF_F_SLEEPABLE flag "fmod_ret/" and "lsm/" can be aliased to "fmod_ret.s/" and
> "lsm.s/" to make all lsm and fmod_ret programs sleepable by default. The fentry
> and fexit programs would always need to have sleepable vs non-sleepable
> distinction, since not all fentry/fexit progs will be attached to sleepable
> kernel functions.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: KP Singh <kpsingh@google.com>
> ---

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  tools/lib/bpf/libbpf.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>

[...]

^ permalink raw reply

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
From: Eric Dumazet @ 2020-06-18 22:36 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, kernel-team, axboe, Govindarajulu Varadarajan,
	Michal Kubecek
In-Reply-To: <20200618215053.qxnjegm4h5i3mvfu@bsd-mbp.dhcp.thefacebook.com>



On 6/18/20 2:50 PM, Jonathan Lemon wrote:

> In the same vein, there should be a similar
> setting for the TCP option padding on the sender side.


We had no need for such hack in the TCP stack :)



^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Andrew Lunn @ 2020-06-18 22:37 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Ioana Ciornei, netdev, davem, vladimir.oltean, claudiu.manoil,
	alexandru.marginean, michael, f.fainelli
In-Reply-To: <20200618222727.GM1551@shell.armlinux.org.uk>

> The other thing is, drivers/net/phy is becoming a little cluttered -
> we have approaching 100 files there.
> 
> Should we be thinking about drivers/net/phy/mdio/ (for mdio*),
> drivers/net/phy/lib/ for the core phylib code or leaving it where
> it is, and, hmm, drivers/net/phy/media/ maybe for the PHY and PCS
> drivers?  Or something like that?

Hi Russell

Do you have any experience how good git is at following file moves
like this. We don't want to make it too hard for backporters of fixes.

If it is not going to be too painful, then yes, mdio, phy and pcs
subdirectories would be nice.

       Andrew

^ permalink raw reply

* Re: [PATCH RFC v3 bpf-next 4/4] selftests/bpf: basic sleepable tests
From: Andrii Nakryiko @ 2020-06-18 22:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Paul E . McKenney, Networking,
	bpf, Kernel Team
In-Reply-To: <20200611222340.24081-5-alexei.starovoitov@gmail.com>

On Thu, Jun 11, 2020 at 3:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Modify few tests to sanity test sleepable bpf functionality.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: KP Singh <kpsingh@google.com>
> ---
>  tools/testing/selftests/bpf/bench.c             |  2 ++
>  .../selftests/bpf/benchs/bench_trigger.c        | 17 +++++++++++++++++
>  tools/testing/selftests/bpf/progs/lsm.c         | 14 ++++++++++++--
>  .../testing/selftests/bpf/progs/trigger_bench.c |  7 +++++++
>  4 files changed, 38 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index 944ad4721c83..1a427685a8a8 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -317,6 +317,7 @@ extern const struct bench bench_trig_tp;
>  extern const struct bench bench_trig_rawtp;
>  extern const struct bench bench_trig_kprobe;
>  extern const struct bench bench_trig_fentry;
> +extern const struct bench bench_trig_fentry_sleep;
>  extern const struct bench bench_trig_fmodret;
>  extern const struct bench bench_rb_libbpf;
>  extern const struct bench bench_rb_custom;
> @@ -338,6 +339,7 @@ static const struct bench *benchs[] = {
>         &bench_trig_rawtp,
>         &bench_trig_kprobe,
>         &bench_trig_fentry,
> +       &bench_trig_fentry_sleep,

Can you please add results to commit description for fentry and
fentry_sleep benchmark, just for comparison?

>         &bench_trig_fmodret,
>         &bench_rb_libbpf,
>         &bench_rb_custom,

[...]

>
> @@ -28,6 +30,9 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
>         is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
>                     vma->vm_end >= vma->vm_mm->start_stack);
>
> +       bpf_copy_from_user(args, sizeof(args), (void *)vma->vm_mm->arg_start);

is there some way to ensure that user memory definitely is not paged
in (and do it from user-space selftest part), so that
bpf_copy_from_user() *definitely* sleeps? That would be the real test.
As is, even bpf_probe_read_user() might succeed as well, isn't that
right?

Seems like doing madvise(MADV_DONTNEED) should be able to accomplish
that? So instead of reading arg_start, we can pre-setup mmap()'ed
file, MADV_DONTNEED it, then trigger LSM program and let it attempt
reading that chunk of memory?


> +       /*bpf_printk("args=%s\n", args);*/

debugging leftover?

> +
>         if (is_stack && monitored_pid == pid) {
>                 mprotect_count++;
>                 ret = -EPERM;
> @@ -36,7 +41,7 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
>         return ret;
>  }
>
> -SEC("lsm/bprm_committed_creds")
> +SEC("lsm.s/bprm_committed_creds")
>  int BPF_PROG(test_void_hook, struct linux_binprm *bprm)
>  {
>         __u32 pid = bpf_get_current_pid_tgid() >> 32;
> @@ -46,3 +51,8 @@ int BPF_PROG(test_void_hook, struct linux_binprm *bprm)
>
>         return 0;
>  }

nit: empty line here, don't squash those functions together :)


> +SEC("lsm/task_free") /* lsm/ is ok, lsm.s/ fails */
> +int BPF_PROG(test_task_free, struct task_struct *task)
> +{
> +       return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> index 8b36b6640e7e..9a4d09590b3d 100644
> --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> @@ -39,6 +39,13 @@ int bench_trigger_fentry(void *ctx)
>         return 0;
>  }
>
> +SEC("fentry.s/__x64_sys_getpgid")
> +int bench_trigger_fentry_sleep(void *ctx)
> +{
> +       __sync_add_and_fetch(&hits, 1);
> +       return 0;
> +}
> +
>  SEC("fmod_ret/__x64_sys_getpgid")
>  int bench_trigger_fmodret(void *ctx)
>  {
> --
> 2.23.0
>

^ permalink raw reply

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
From: Eric Dumazet @ 2020-06-18 22:45 UTC (permalink / raw)
  To: Michal Kubecek, Eric Dumazet
  Cc: Jonathan Lemon, netdev, kernel-team, axboe,
	Govindarajulu Varadarajan
In-Reply-To: <20200618202526.zcbuzzxtln2ljawn@lion.mk-sys.cz>



On 6/18/20 1:25 PM, Michal Kubecek wrote:
> On Thu, Jun 18, 2020 at 11:12:57AM -0700, Eric Dumazet wrote:
>> On 6/18/20 9:09 AM, Jonathan Lemon wrote:
>>> Adds a "rx_hd_split" private flag parameter to ethtool.
>>>
>>> This enables header splitting, and sets up the fragment mappings.
>>> The feature is currently only enabled for netgpu channels.
>>
>> We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
>> to enable TCP RX zerocopy on x86.
>>
>> Patch for mlx4 has not been sent upstream yet.
>>
>> For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
>> and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.
>>
>> (In our case, most of our data TCP packets only have 12 bytes of TCP options)
>>
>> I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
>> so that the exact number of bytes can be tuned, instead of hard coded in the driver.
> 
> I fully agree that such generic parameter would be a better solution
> than a private flag. But I have my doubts about adding more tunables.
> The point is that the concept of tunables looks like a workaround for
> the lack of extensibility of the ioctl interface where the space for
> adding new parameters to existing subcommands was limited (or none).
> 
> With netlink, adding new parameters is much easier and as only three
> tunables were added in 6 years (or four with your proposal), we don't
> have to worry about having too many different attributes (current code
> isn't even designed to scale well to many tunables).
> 
> This new header split parameter could IMHO be naturally put together
> with rx-copybreak and tx-copybreak and possibly any future parameters
> to control how packet contents is passed between NIC/driver and
> networking stack.

This is what I suggested, maybe this was not clear.

Currently known tunables are :

enum tunable_id {
	ETHTOOL_ID_UNSPEC,
	ETHTOOL_RX_COPYBREAK,
	ETHTOOL_TX_COPYBREAK,
	ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
	/*
	 * Add your fresh new tunable attribute above and remember to update
	 * tunable_strings[] in net/core/ethtool.c
	 */
	__ETHTOOL_TUNABLE_COUNT,
};

Ie add a new ETHTOOL_RX_HEADER_SPLIT value.


Or maybe I am misunderstanding your point.

> 
>> (Patch for the counter part of [1] was resent 10 days ago on netdev@ by Govindarajulu Varadarajan)
>> (Not sure if this has been merged yet)
> 
> Not yet, I want to take another look in the rest of this week.
> 
> Michal
> 
>> [1]
>>
>> commit f0db9b073415848709dd59a6394969882f517da9
>> Author: Govindarajulu Varadarajan <_govind@gmx.com>
>> Date:   Wed Sep 3 03:17:20 2014 +0530
>>
>>     ethtool: Add generic options for tunables
>>     
>>     This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
>>     tunable values from driver.
>>     
>>     Add get_tunable and set_tunable to ethtool_ops. Driver implements these
>>     functions for getting/setting tunable value.
>>     
>>     Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
>>     Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Peter Geis @ 2020-06-18 22:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Zefan Li, Linux Kernel Network Developers,
	Cameron Berkenpas, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <20200618212615.GG110603@carbon.dhcp.thefacebook.com>

On Thu, Jun 18, 2020 at 5:26 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 02:09:43PM -0700, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > > > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> > > > >
> > > > > Cc: Roman Gushchin <guro@fb.com>
> > > > >
> > > > > Thanks for fixing this.
> > > > >
> > > > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > > > even when cgroup_sk_alloc is disabled.
> > > > > >
> > > > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > > > would terminate this function if called there. And for sk_alloc()
> > > > > > skcd->val is always zero. So it's safe to factor out the code
> > > > > > to make it more readable.
> > > > > >
> > > > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> > > > >
> > > > > but I don't think the bug was introduced by this commit, because there
> > > > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > > > write_classid(), which can be triggered by writing to ifpriomap or
> > > > > classid in cgroupfs. This commit just made it much easier to happen
> > > > > with systemd invovled.
> > > > >
> > > > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> > > >
> > > > Good point.
> > > >
> > > > I take a deeper look, it looks like commit d979a39d7242e06
> > > > is the one to blame, because it is the first commit that began to
> > > > hold cgroup refcnt in cgroup_sk_alloc().
> > >
> > > I agree, ut seems that the issue is not related to bpf and probably
> > > can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> > > seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
> > >
> > > Btw, based on the number of reported-by tags it seems that there was
> > > a real issue which the patch is fixing. Maybe you'll a couple of words
> > > about how it reveals itself in the real life?
> >
> > I still have no idea how exactly this is triggered. According to the
> > people who reported this bug, they just need to wait for some hours
> > to trigger. So I am not sure what to add here, just the stack trace?
>
> Yeah, stack trace is definitely useful. So at least if someone will encounter the same
> error in the future, they can search for the stacktrace and find the fix.

I can provide at least a little insight into my configuration that
allowed the bug to trigger.
I'm running the following configuration:
Ubuntu 20.04 - aarch64
Mainline 5.6.17 kernel using Ubuntu's config.
LXD with several containers active.
Docker with several containers active.

Every crash was the same, with nfsd triggering the sequence.
My only nfs client at the time was a Windows 10 device.
Disabling nfsd prevented the bug from occurring.

Here was the last stack trace:
[23352.431106] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[23352.431938] rockpro64 kernel: Mem abort info:
[23352.432199] rockpro64 kernel:   ESR = 0x96000004
[23352.432485] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[23352.432965] rockpro64 kernel:   SET = 0, FnV = 0
[23352.433248] rockpro64 kernel:   EA = 0, S1PTW = 0
[23352.433536] rockpro64 kernel: Data abort info:
[23352.433803] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[23352.434153] rockpro64 kernel:   CM = 0, WnR = 0
[23352.434475] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=0000000094d4c000
[23352.435174] rockpro64 kernel: [0000000000000010]
pgd=0000000094f3d003, pud=00000000bdb7f003, pmd=0000000000000000
[23352.435963] rockpro64 kernel: Internal error: Oops: 96000004 [#1] SMP
[23352.436396] rockpro64 kernel: Modules linked in: xt_TCPMSS
nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter ip_set_hash_ip
ip_set_hash_net xt_set ip_set cfg80211 nft_counter xt_length
xt_connmark xt_multiport xt_mark nf_log_ip>
[23352.436519] rockpro64 kernel:  ghash_ce enclosure snd_soc_es8316
scsi_transport_sas snd_seq_midi sha2_ce snd_seq_midi_event
snd_soc_simple_card snd_rawmidi snd_soc_audio_graph_card sha256_arm64
panfrost snd_soc_simple_card_utils sha1>
[23352.444216] rockpro64 kernel:  async_pq async_xor xor xor_neon
async_tx uas raid6_pq raid1 raid0 multipath linear usb_storage
xhci_plat_hcd dwc3 rtc_rk808 clk_rk808 rk808_regulator ulpi udc_core
fusb302 tcpm typec fan53555 rk808 pwm_>
[23352.455532] rockpro64 kernel: CPU: 3 PID: 1237 Comm: nfsd Not
tainted 5.6.17+ #74
[23352.456054] rockpro64 kernel: Hardware name: pine64
rockpro64_rk3399/rockpro64_rk3399, BIOS
2020.07-rc2-00124-g515f613253-dirty 05/19/2020
[23352.457010] rockpro64 kernel: pstate: 60400005 (nZCv daif +PAN -UAO)
[23352.457445] rockpro64 kernel: pc : __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.457918] rockpro64 kernel: lr : ip_finish_output+0x98/0xd0
[23352.458287] rockpro64 kernel: sp : ffff80001325b900
[23352.458581] rockpro64 kernel: x29: ffff80001325b900 x28: ffff000012f0fae0
[23352.459051] rockpro64 kernel: x27: 0000000000000001 x26: ffff00005f0ddc00
[23352.459521] rockpro64 kernel: x25: 0000000000000118 x24: ffff0000dcd3c270
[23352.459990] rockpro64 kernel: x23: 0000000000000010 x22: ffff800011b1aec0
[23352.460458] rockpro64 kernel: x21: ffff0000efcacc40 x20: 0000000000000010
[23352.460928] rockpro64 kernel: x19: ffff0000dcd3bf00 x18: 0000000000000000
[23352.461396] rockpro64 kernel: x17: 0000000000000000 x16: 0000000000000000
[23352.461863] rockpro64 kernel: x15: 0000000000000000 x14: 0000000000000004
[23352.462332] rockpro64 kernel: x13: 0000000000000001 x12: 0000000000201400
[23352.462802] rockpro64 kernel: x11: 0000000000000000 x10: 0000000000000000
[23352.463271] rockpro64 kernel: x9 : ffff800010b6f6d0 x8 : 0000000000000260
[23352.463738] rockpro64 kernel: x7 : 0000000000000000 x6 : ffff0000dc12a000
[23352.464208] rockpro64 kernel: x5 : ffff0000dcd3bf00 x4 : 0000000000000028
[23352.464677] rockpro64 kernel: x3 : 0000000000000000 x2 : ffff000012f0fb08
[23352.465145] rockpro64 kernel: x1 : ffff00005f0ddd40 x0 : 0000000000000000
[23352.465616] rockpro64 kernel: Call trace:
[23352.465843] rockpro64 kernel:  __cgroup_bpf_run_filter_skb+0x2a8/0x400
[23352.466283] rockpro64 kernel:  ip_finish_output+0x98/0xd0
[23352.466625] rockpro64 kernel:  ip_output+0xb0/0x130
[23352.466920] rockpro64 kernel:  ip_local_out+0x4c/0x60
[23352.467233] rockpro64 kernel:  __ip_queue_xmit+0x128/0x380
[23352.467584] rockpro64 kernel:  ip_queue_xmit+0x10/0x18
[23352.467903] rockpro64 kernel:  __tcp_transmit_skb+0x470/0xaf0
[23352.468274] rockpro64 kernel:  tcp_write_xmit+0x39c/0x1110
[23352.468623] rockpro64 kernel:  __tcp_push_pending_frames+0x40/0x100
[23352.469040] rockpro64 kernel:  tcp_send_fin+0x6c/0x240
[23352.469358] rockpro64 kernel:  tcp_shutdown+0x60/0x68
[23352.469669] rockpro64 kernel:  inet_shutdown+0xb0/0x120
[23352.469997] rockpro64 kernel:  kernel_sock_shutdown+0x1c/0x28
[23352.470464] rockpro64 kernel:  svc_tcp_sock_detach+0xd0/0x110 [sunrpc]
[23352.470980] rockpro64 kernel:  svc_delete_xprt+0x74/0x240 [sunrpc]
[23352.471445] rockpro64 kernel:  svc_recv+0x45c/0xb10 [sunrpc]
[23352.471864] rockpro64 kernel:  nfsd+0xdc/0x150 [nfsd]
[23352.472179] rockpro64 kernel:  kthread+0xfc/0x128
[23352.472461] rockpro64 kernel:  ret_from_fork+0x10/0x18
[23352.472785] rockpro64 kernel: Code: 9100c0c6 17ffff7b f9431cc0
91004017 (f9400814)
[23352.473324] rockpro64 kernel: ---[ end trace 978df9e144fd1235 ]---
[29973.397069] rockpro64 kernel: Unable to handle kernel read from
unreadable memory at virtual address 0000000000000010
[29973.397966] rockpro64 kernel: Mem abort info:
[29973.398224] rockpro64 kernel:   ESR = 0x96000004
[29973.398503] rockpro64 kernel:   EC = 0x25: DABT (current EL), IL = 32 bits
[29973.398976] rockpro64 kernel:   SET = 0, FnV = 0
[29973.399254] rockpro64 kernel:   EA = 0, S1PTW = 0
[29973.399537] rockpro64 kernel: Data abort info:
[29973.399799] rockpro64 kernel:   ISV = 0, ISS = 0x00000004
[29973.400143] rockpro64 kernel:   CM = 0, WnR = 0
[29973.400416] rockpro64 kernel: user pgtable: 4k pages, 48-bit VAs,
pgdp=00000000dcdd1000
[29973.400989] rockpro64 kernel: [0000000000000010] pgd=0000000000000000
[29973.401490] rockpro64 kernel: Internal error: Oops: 96000004 [#2] SMP

I hope this helps.

>
> Thanks!

^ permalink raw reply

* Re: [Patch net] net: change addr_list_lock back to static key
From: Andrew Lunn @ 2020-06-18 22:46 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Cong Wang, netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo,
	Dmitry Vyukov
In-Reply-To: <CA+h21hrZM8Dqi7AYPkKgsAm5-q=TxEdTaci=Tq35VfoOxt_5rw@mail.gmail.com>

> > Hi Vladimir
> >
> > So you are suggesting this?
> >
> > > > +       ret = netdev_upper_dev_link(master, slave_dev, NULL);
> >
> >   Andrew
> 
> Yes, basically this:
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 4c7f086a047b..6aff8cfc9cf1 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1807,6 +1807,13 @@ int dsa_slave_create(struct dsa_port *port)
>                            ret, slave_dev->name);
>                 goto out_phy;
>         }
> +       rtnl_lock();
> +       ret = netdev_upper_dev_link(master, slave_dev, NULL);
> +       rtnl_unlock();
> +       if (ret) {
> +               unregister_netdevice(slave_dev);
> +               goto out_phy;
> +       }
> 
>         return 0;
> 
> @@ -1826,12 +1833,14 @@ int dsa_slave_create(struct dsa_port *port)
> 
>  void dsa_slave_destroy(struct net_device *slave_dev)
>  {
> +       struct net_device *master = dsa_slave_to_master(slave_dev);
>         struct dsa_port *dp = dsa_slave_to_port(slave_dev);
>         struct dsa_slave_priv *p = netdev_priv(slave_dev);
> 
>         netif_carrier_off(slave_dev);
>         rtnl_lock();
>         phylink_disconnect_phy(dp->pl);
> +       netdev_upper_dev_unlink(master, slave_dev);
>         rtnl_unlock();
> 
>         dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
> 
> Do you see a problem with it?

I was initially not sure you could do this. But it looks like you can
have N : M relationships between uppers and lowers. I suppose it does
make sense. You can have multiple VLAN uppers to one base device. You
can have multiple lowers to one bond device, etc.

I wonder what 'side effects' there are for declaring this linkage. It
is not something i've looked into before, since we never used it. So i
don't see a problem with this, other than i don't know what problems
we might run into :-)

  Andrew


^ permalink raw reply

* Re: [PATCH net-next] of: mdio: preserve phy dev_flags in of_phy_connect()
From: Andrew Lunn @ 2020-06-18 22:48 UTC (permalink / raw)
  To: rentao.bupt
  Cc: Florian Fainelli, Heiner Kallweit, Russell King, Rob Herring,
	Frank Rowand, netdev, devicetree, linux-kernel, openbmc, taoren
In-Reply-To: <20200618220444.5064-1-rentao.bupt@gmail.com>

On Thu, Jun 18, 2020 at 03:04:44PM -0700, rentao.bupt@gmail.com wrote:
> From: Tao Ren <rentao.bupt@gmail.com>
> 
> Replace assignment "=" with OR "|=" for "phy->dev_flags" so "dev_flags"
> configured in phy probe() function can be preserved.
> 
> The idea is similar to commit e7312efbd5de ("net: phy: modify assignment
> to OR for dev_flags in phy_attach_direct").
> 
> Signed-off-by: Tao Ren <rentao.bupt@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH v1 3/3] net/fsl: enable extended scanning in xgmac_mdio
From: Florian Fainelli @ 2020-06-18 22:49 UTC (permalink / raw)
  To: Calvin Johnson, Jeremy Linton, Russell King - ARM Linux admin,
	Jon, Cristi Sovaiala, Ioana Ciornei, Andrew Lunn, Andy Shevchenko,
	Madalin Bucur
  Cc: netdev, linux.cj
In-Reply-To: <20200617171536.12014-4-calvin.johnson@oss.nxp.com>



On 6/17/2020 10:15 AM, Calvin Johnson wrote:
> From: Jeremy Linton <jeremy.linton@arm.com>
> 
> Since we know the xgmac hardware always has a c45
> complaint bus, lets try scanning for c22 capable
> phys first. If we fail to find any, then it with
> fall back to c45 automatically.

s/complaint/compliant/
s/lets/let's/
s/phys/PHYs/
s/with/will/

> 
> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
> Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com>
> 
> ---
> 
>  drivers/net/ethernet/freescale/xgmac_mdio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/freescale/xgmac_mdio.c b/drivers/net/ethernet/freescale/xgmac_mdio.c
> index fb7f8caff643..5732ca13b821 100644
> --- a/drivers/net/ethernet/freescale/xgmac_mdio.c
> +++ b/drivers/net/ethernet/freescale/xgmac_mdio.c
> @@ -263,6 +263,7 @@ static int xgmac_mdio_probe(struct platform_device *pdev)
>  	bus->read = xgmac_mdio_read;
>  	bus->write = xgmac_mdio_write;
>  	bus->parent = &pdev->dev;
> +	bus->probe_capabilities = MDIOBUS_C22_C45;
>  	snprintf(bus->id, MII_BUS_ID_SIZE, "%llx", (unsigned long long)res->start);
>  
>  	/* Set the PHY base address */
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 22:49 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Ioana Ciornei, netdev, davem, vladimir.oltean, claudiu.manoil,
	alexandru.marginean, michael, f.fainelli
In-Reply-To: <20200618223740.GD279339@lunn.ch>

On Fri, Jun 19, 2020 at 12:37:40AM +0200, Andrew Lunn wrote:
> > The other thing is, drivers/net/phy is becoming a little cluttered -
> > we have approaching 100 files there.
> > 
> > Should we be thinking about drivers/net/phy/mdio/ (for mdio*),
> > drivers/net/phy/lib/ for the core phylib code or leaving it where
> > it is, and, hmm, drivers/net/phy/media/ maybe for the PHY and PCS
> > drivers?  Or something like that?
> 
> Hi Russell
> 
> Do you have any experience how good git is at following file moves
> like this. We don't want to make it too hard for backporters of fixes.
> 
> If it is not going to be too painful, then yes, mdio, phy and pcs
> subdirectories would be nice.

It becomes problematical if git doesn't have the objects referenced
on the index line in the patch file when applying.  If it has the
objects, then git can work out where the file moved to via it's
rename detection.

I think the stable team probably have a better idea of how big an
issue it may be, but over the years there have been some quite big
reorganisations done.  The biggest I remember is a whole raft of
drivers moving from (iirc) drivers/net into
drivers/net/ethernet/<vendor> - moving the files in drivers/net/phy
would be quite a bit smaller in comparison.  I think drivers/media
has also seen quite a lot of renames, and drivers/video has been
significantly reorganised.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: general protection fault in __bfs (2)
From: syzbot @ 2020-06-18 22:52 UTC (permalink / raw)
  To: amitc, andy.shevchenko, bgolaszewski, bp, davem, douly.fnst, hpa,
	idosch, jon.maloy, konrad.wilk, len.brown, linus.walleij,
	linux-gpio, linux-kernel, mingo, netdev, petrm, puwen, rppt,
	syzkaller-bugs, tglx, tipc-discussion, x86, ying.xue
In-Reply-To: <00000000000086d87305801011c4@google.com>

syzbot suspects this bug was fixed by commit:

commit 46ca11177ed593f39d534f8d2c74ec5344e90c11
Author: Amit Cohen <amitc@mellanox.com>
Date:   Thu May 21 12:11:45 2020 +0000

    selftests: mlxsw: qos_mc_aware: Specify arping timeout as an integer

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1315b059100000
start commit:   8834f560 Linux 5.0-rc5
git tree:       upstream
kernel config:  https://syzkaller.appspot.com/x/.config?x=8f00801d7b7c4fe6
dashboard link: https://syzkaller.appspot.com/bug?extid=c58fa3b1231d2ea0c4d3
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15bab650c00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17a331df400000

If the result looks correct, please mark the bug fixed by replying with:

#syz fix: selftests: mlxsw: qos_mc_aware: Specify arping timeout as an integer

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [Patch net] net: change addr_list_lock back to static key
From: Vladimir Oltean @ 2020-06-18 22:53 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Cong Wang, netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo,
	Dmitry Vyukov
In-Reply-To: <20200618224632.GE279339@lunn.ch>

On Fri, 19 Jun 2020 at 01:46, Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > Hi Vladimir
> > >
> > > So you are suggesting this?
> > >
> > > > > +       ret = netdev_upper_dev_link(master, slave_dev, NULL);
> > >
> > >   Andrew
> >
> > Yes, basically this:
> >
> > diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> > index 4c7f086a047b..6aff8cfc9cf1 100644
> > --- a/net/dsa/slave.c
> > +++ b/net/dsa/slave.c
> > @@ -1807,6 +1807,13 @@ int dsa_slave_create(struct dsa_port *port)
> >                            ret, slave_dev->name);
> >                 goto out_phy;
> >         }
> > +       rtnl_lock();
> > +       ret = netdev_upper_dev_link(master, slave_dev, NULL);
> > +       rtnl_unlock();
> > +       if (ret) {
> > +               unregister_netdevice(slave_dev);
> > +               goto out_phy;
> > +       }
> >
> >         return 0;
> >
> > @@ -1826,12 +1833,14 @@ int dsa_slave_create(struct dsa_port *port)
> >
> >  void dsa_slave_destroy(struct net_device *slave_dev)
> >  {
> > +       struct net_device *master = dsa_slave_to_master(slave_dev);
> >         struct dsa_port *dp = dsa_slave_to_port(slave_dev);
> >         struct dsa_slave_priv *p = netdev_priv(slave_dev);
> >
> >         netif_carrier_off(slave_dev);
> >         rtnl_lock();
> >         phylink_disconnect_phy(dp->pl);
> > +       netdev_upper_dev_unlink(master, slave_dev);
> >         rtnl_unlock();
> >
> >         dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
> >
> > Do you see a problem with it?
>
> I was initially not sure you could do this. But it looks like you can
> have N : M relationships between uppers and lowers. I suppose it does
> make sense. You can have multiple VLAN uppers to one base device. You
> can have multiple lowers to one bond device, etc.
>
> I wonder what 'side effects' there are for declaring this linkage. It
> is not something i've looked into before, since we never used it. So i
> don't see a problem with this, other than i don't know what problems
> we might run into :-)
>
>   Andrew
>

It was surprising to me as well, since I was used to the bridge model
(a port can have only one bridge master). But it looks like, that is
the difference between netdev_upper_dev_link and
netdev_master_upper_dev_link. This uses the former, and the bridge
layer uses the latter.

So I guess it is ok.

Thanks,
-Vladimir

^ permalink raw reply

* RE: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Ioana Ciornei @ 2020-06-18 22:56 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux admin, netdev@vger.kernel.org,
	davem@davemloft.net, Vladimir Oltean, Claudiu Manoil,
	Alexandru Marginean, michael@walle.cc, f.fainelli@gmail.com
In-Reply-To: <20200618220331.GA279339@lunn.ch>

> Subject: Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
> 
> > Are there really instances where the ethernet driver has to manage
> > multiple different types of PCSs? I am not sure this type of snippet
> > of code is really going to occur.
> 
> Hi Ioana
> 
> The Marvell mv88e6390 family has three PCS's, one for SGMII/1000BaseX, a
> 10Gbase-X4/X2 and a 10GBAse-R. So this sort of code could appear.
> 

I should have been more clear as I was wondering about different types of PCS IPs
(i.e. different vendors and such).

We are in the same case as the Marvell mv88e6390 family, it seems, and we treat
this directly in the Lynx PCS module by a 'switch case' statement by the interface type.
The MAC driver just calls the functions, no choosing necessary on its part, all of that
is done by the PCS module.

Ioana

^ permalink raw reply

* RE: [net-next 02/15] iecm: Add framework set of header files
From: Michael, Alice @ 2020-06-18 23:07 UTC (permalink / raw)
  To: Joe Perches, Kirsher, Jeffrey T, davem@davemloft.net
  Cc: netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	Brady, Alan, Burra, Phani R, Hay, Joshua A, Chittim, Madhu,
	Linga, Pavan Kumar, Skidmore, Donald C, Brandeburg, Jesse,
	Samudrala, Sridhar
In-Reply-To: <57110139759cc77f21a150a3faed0d2584dfbe21.camel@perches.com>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Joe Perches
> Sent: Wednesday, June 17, 2020 10:33 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net
> Cc: Michael, Alice <alice.michael@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; Brady, Alan
> <alan.brady@intel.com>; Burra, Phani R <phani.r.burra@intel.com>; Hay,
> Joshua A <joshua.a.hay@intel.com>; Chittim, Madhu
> <madhu.chittim@intel.com>; Linga, Pavan Kumar
> <pavan.kumar.linga@intel.com>; Skidmore, Donald C
> <donald.c.skidmore@intel.com>; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>; Samudrala, Sridhar
> <sridhar.samudrala@intel.com>
> Subject: Re: [net-next 02/15] iecm: Add framework set of header files
> 
> On Wed, 2020-06-17 at 22:13 -0700, Jeff Kirsher wrote:
> > From: Alice Michael <alice.michael@intel.com>
> []
> > diff --git a/include/linux/net/intel/iecm_controlq_api.h
> > b/include/linux/net/intel/iecm_controlq_api.h
> []
> > +enum iecm_ctlq_err {
> > +	IECM_CTLQ_RC_OK		= 0,  /* Success */
> 
> Why is it necessary to effectively duplicate the generic error codes with
> different error numbers?
> 
Ah this is a good point.  This was an old API we were looking at early on and abandoned and I am working to pull this chunk of memory out of the patches for a V2.  Thanks

> > +	IECM_CTLQ_RC_EPERM	= 1,  /* Operation not permitted */
> > +	IECM_CTLQ_RC_ENOENT	= 2,  /* No such element */
> > +	IECM_CTLQ_RC_ESRCH	= 3,  /* Bad opcode */
> > +	IECM_CTLQ_RC_EINTR	= 4,  /* Operation interrupted */
> > +	IECM_CTLQ_RC_EIO	= 5,  /* I/O error */
> > +	IECM_CTLQ_RC_ENXIO	= 6,  /* No such resource */
> > +	IECM_CTLQ_RC_E2BIG	= 7,  /* Arg too long */
> > +	IECM_CTLQ_RC_EAGAIN	= 8,  /* Try again */
> > +	IECM_CTLQ_RC_ENOMEM	= 9,  /* Out of memory */
> > +	IECM_CTLQ_RC_EACCES	= 10, /* Permission denied */
> > +	IECM_CTLQ_RC_EFAULT	= 11, /* Bad address */
> > +	IECM_CTLQ_RC_EBUSY	= 12, /* Device or resource busy */
> > +	IECM_CTLQ_RC_EEXIST	= 13, /* object already exists */
> > +	IECM_CTLQ_RC_EINVAL	= 14, /* Invalid argument */
> > +	IECM_CTLQ_RC_ENOTTY	= 15, /* Not a typewriter */
> > +	IECM_CTLQ_RC_ENOSPC	= 16, /* No space left or allocation
> failure */
> > +	IECM_CTLQ_RC_ENOSYS	= 17, /* Function not implemented */
> > +	IECM_CTLQ_RC_ERANGE	= 18, /* Parameter out of range */
> > +	IECM_CTLQ_RC_EFLUSHED	= 19, /* Cmd flushed due to prev cmd
> error */
> > +	IECM_CTLQ_RC_BAD_ADDR	= 20, /* Descriptor contains a bad
> pointer */
> > +	IECM_CTLQ_RC_EMODE	= 21, /* Op not allowed in current
> dev mode */
> > +	IECM_CTLQ_RC_EFBIG	= 22, /* File too big */
> > +	IECM_CTLQ_RC_ENOSEC	= 24, /* Missing security manifest */
> > +	IECM_CTLQ_RC_EBADSIG	= 25, /* Bad RSA signature */
> > +	IECM_CTLQ_RC_ESVN	= 26, /* SVN number prohibits this package
> */
> > +	IECM_CTLQ_RC_EBADMAN	= 27, /* Manifest hash mismatch */
> > +	IECM_CTLQ_RC_EBADBUF	= 28, /* Buffer hash mismatches
> manifest */
> > +};
> 


^ permalink raw reply

* af_decnet.c: missing semi-colon and/or indentation?
From: Randy Dunlap @ 2020-06-18 23:19 UTC (permalink / raw)
  To: netdev@vger.kernel.org, David Miller


Please see lines 1250-1251.


	case TIOCINQ:
		lock_sock(sk);
		skb = skb_peek(&scp->other_receive_queue);
		if (skb) {
			amount = skb->len;
		} else {
			skb_queue_walk(&sk->sk_receive_queue, skb)     <<<<<
				amount += skb->len;                    <<<<<
		}
		release_sock(sk);
		err = put_user(amount, (int __user *)arg);
		break;



or is this some kind of GCC nested function magic?


commit bec571ec762a4cf855ad4446f833086fc154b60e
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 28 16:43:52 2009 -0700

    decnet: Use SKB queue and list helpers instead of doing it by-hand.



thanks.
-- 
~Randy
Reported-by: Randy Dunlap <rdunlap@infradead.org>

^ permalink raw reply

* Re: [PATCH net-next v8 2/5] net: phy: Add a helper to return the index for of the internal delay
From: kernel test robot @ 2020-06-18 23:29 UTC (permalink / raw)
  To: Dan Murphy, andrew, f.fainelli, hkallweit1, davem, robh
  Cc: kbuild-all, netdev, linux-kernel, devicetree, Dan Murphy
In-Reply-To: <20200618211011.28837-3-dmurphy@ti.com>

[-- Attachment #1: Type: text/plain, Size: 1526 bytes --]

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Dan-Murphy/RGMII-Internal-delay-common-property/20200619-051238
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git cb8e59cc87201af93dfbb6c3dccc8fcad72a09c2
config: i386-defconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/net/phy/phy_device.c: In function 'phy_get_int_delay_property':
>> drivers/net/phy/phy_device.c:2678:1: error: expected ';' before '}' token
    2678 | }
         | ^

vim +2678 drivers/net/phy/phy_device.c

  2660	
  2661	#if IS_ENABLED(CONFIG_OF_MDIO)
  2662	static int phy_get_int_delay_property(struct device *dev, const char *name)
  2663	{
  2664		s32 int_delay;
  2665		int ret;
  2666	
  2667		ret = device_property_read_u32(dev, name, &int_delay);
  2668		if (ret)
  2669			return ret;
  2670	
  2671		return int_delay;
  2672	}
  2673	#else
  2674	static inline int phy_get_int_delay_property(struct device *dev,
  2675						     const char *name)
  2676	{
  2677		return -EINVAL
> 2678	}
  2679	#endif
  2680	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28707 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
From: John Fastabend @ 2020-06-18 23:30 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team
In-Reply-To: <CAEf4BzZmWO=hO0kmtwkACEfHZm+H7+FZ+5moaLie2=13U3xU=g@mail.gmail.com>

Andrii Nakryiko wrote:
> On Thu, Jun 18, 2020 at 11:58 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> > > <john.fastabend@gmail.com> wrote:
> > > >
> > > > Andrii Nakryiko wrote:
> > > > > Switch most of BPF helper definitions from returning int to long. These
> > > > > definitions are coming from comments in BPF UAPI header and are used to
> > > > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > > > BPF programs.
> > > > >
> > > > > In actual in-kernel implementation, all the helpers are defined as returning
> > > > > u64, but due to some historical reasons, most of them are actually defined as
> > > > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > > > error).
> > > >
> > > > Could we change the helpers side to return correct types now? Meaning if the
> > > > UAPI claims its an int lets actually return the int.
> > >
> > > I'm not sure how exactly you see this being done. BPF ABI dictates
> > > that the helper's result is passed in a full 64-bit r0 register. Are
> > > you suggesting that in addition to RET_ANYTHING we should add
> > > RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> > > guaranteed to be zero? And then make helpers actually return 32-bit
> > > values without up-casting them to u64?
> >
> > Yes this is what I was thinking, having a RET_ANYTHING32 but I would
> > assume the upper 32-bits could be anything not zeroed. For +alu32
> > and programmer using correct types I would expect clang to generate
> > good code here and mostly not need to zero upper bits.
> >
> > I think this discussion can be independent of your changes though and
> > its not at the top of my todo list so probably wont get to investigating
> > more for awhile.
> 
> I'm confused. If the verifier doesn't make any assumptions about upper
> 32-bits for RET_ANYTHING32, how is it different from RET_ANYTHING and
> today's logic? What you described is exactly what is happening when
> bpf_helpers_def.h has BPF helpers defined as returning int.
> 

Agreed. I recall it helping the 32-bit bounds on the verifier side
somewhere. But lets drop it maybe it really is not useful. I'll go
try and recall the details later.

[...] Aggressively pruning

> >
> > Agreed. Sorry for the confusion on my side. Poked at this a bit more this
> > morning trying to see why I don't hit the same pattern when we have many
> > cases very similar to above.
> >
> > In your C code you never check for negative return codes? Oops, this
> > means you can walk backwards off the front of payload? This is probably
> > not valid either from program logic side and/or verifier will probably
> > yell. Commented where I think you want a <0 check here,
> 
> You are missing that I'm using unsigned u64. So (s64)-1 ==
> (u64)0xFFFFFFFFFFFFFFFF. So negative errors are effectively turned
> into too large length and I filter them out with the same (len >
> MAX_SZ) check. This allows to do just one comparison instead of two,
> and also helps avoid some Clang optimizations that Yonghong is trying
> to undo right now (if (a > X && a < Y) turned into if (x < Y - X),
> with assembly that verifier can't verify). So no bug there, very
> deliberate choice of types.

I caught it just after I sent above ;) In our codebase we do need to
handle errors and truncated strings differently so really do need the
two conditions. I guess we could find some clever way around it but
in practice on latest kernels we've not seen much trouble around
these with +alu32.

Interesting about the optimization I've not seen that one yet.  

[...]

> See above. In practice (it might be no-ALU32-only thing, don't know),
> doing two ifs is both less efficient and quite often leads to
> unverifiable code. Users have to do hacks to complicate control flow
> enough to prevent Clang from doing Hi/Lo combining. I learned a new
> inlined assembly trick recently to prevent this, but either way it's
> unpleasant and unnecessary.

In the end we also run on ancient kernels so have lots of tricks.

[...] more pruning

> > > My point was that this int -> long switch doesn't degrade ALU32 and
> > > helps no-ALU32, and thus is good :)
> >
> > With the long vs int I do see worse code when using the <0 check.
> > Using C function below which I took from some real code and renamed
> > variables.
> >
> > int myfunc(void *a, void *b, void *c) {
> >         void *payload = a;
> >         int len;
> >
> >         len = probe_read_str(payload, 1000, a);
> >         if (len < 0) return len;
> >         if (len <= 1000) {
> >                 payload += len;
> >         }
> >         len = probe_read_str(payload, 1000, b);
> >         if (len <= 1000) {
> >                 payload += len;
> >         }
> >         return 1;
> > }
> >
> > Then here is the side-by-side of generated code, with +ALU32.
> >
> >   int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
> > -------------------------------                ---------------------------------
> >        0:       r6 = r2                         0:      r6 = r2
> >        1:       r7 = r1                         1:      r7 = r1
> >        2:       w2 = 1000                       2:      w2 = 1000
> >        3:       r3 = r7                         3:      r3 = r7
> >        4:       call 45                         4:      call 45
> >        5:       if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
> >        6:       w2 = w0                         6:      if w0 s< 0 goto +10 <LBB0_4>
> >        7:       r1 = r7                         7:      r2 <<= 32
> >        8:       r1 += r2                        8:      r2 s>>= 32
> >        9:       if w0 s< 1001 goto +1 <LBB0_3>  9:      r1 = r7
> >       10:       r1 = r7                        10:      r1 += r2
> >       11:       w2 = 1000                      11:      if w0 s< 1001 goto +1 <LBB0_3>
> >       12:       r3 = r6                        12:      r1 = r7
> >       13:       call 45                        13:      w2 = 1000
> >       14:       w0 = 1                         14:      r3 = r6
> >       15:       exit                           15:      call 45
> >                                                16:      w0 = 1
> >                                                17:      exit
> >
> > So a couple extra instruction, but more concerning we created a
> > <<=,s>> pattern. I'll need to do some more tests but my concern
> > is that could break verifier for real programs we have. I guess
> > it didn't in the selftests? Surely, this thread has at least
> > pointed out some gaps in our test cases. I guess I _could_ make
> > len a u64 type to remove the sext but then <0 check on a u64?!
> 
> I addressed <0 check above. As for <<=,s>>=, I wish Clang was a bit
> smarter and just did w2 = w2 or something like that, given we just
> checked that w0 is non-negative. But then again, I wouldn't do two ifs
> and wouldn't use signed int for len.

It is smart enough once you get all the types aligned. So after pulling
in int->long change ideally we would change codebase to use long types as
well. Maybe we should modify the tests in selftests as well OTOH
its nice to test what happens when folks leave the return types as int.

> 
> >
> > >
> > > Overall, long as a return type matches reality and BPF ABI
> > > specification. BTW, one of the varlen programs from patch 2 doesn't
> > > even validate successfully on latest kernel with latest Clang right
> > > now, if helpers return int, even though it's completely correct code.
> > > That's a real problem we have to deal with in few major BPF
> > > applications right now, and we have to use inline assembly to enforce
> > > Clang to do the right thing. A bunch of those problems are simply
> > > avoided with correct return types for helpers.
> >
> > Do the real programs check <0? Did I miss something? I'll try
> > applying your patch to our real code base and see what happens.
> 
> That would be great. Self-tests do work, but having more testing with
> real-world application would certainly help as well.

Thanks for all the follow up.

I ran the change through some CI on my side and it passed so I can
complain about a few shifts here and there or just update my code or
just not change the return types on my side but I'm convinced its OK
in most cases and helps in some so...

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox