Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Alex Williamson @ 2019-08-23 15:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866867150DAABA422F25FF8D1A40@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Fri, 23 Aug 2019 14:53:06 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, August 23, 2019 7:58 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>; Cornelia
> > Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Fri, 23 Aug 2019 08:14:39 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > Hi Alex,
> > >
> > >  
> > > > -----Original Message-----
> > > > From: Jiri Pirko <jiri@resnulli.us>
> > > > Sent: Friday, August 23, 2019 1:42 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > > > Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > <cohuck@redhat.com>;  
> > > > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > > <cjia@nvidia.com>; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > Thu, Aug 22, 2019 at 03:33:30PM CEST, parav@mellanox.com wrote:  
> > > > >
> > > > >  
> > > > >> -----Original Message-----
> > > > >> From: Jiri Pirko <jiri@resnulli.us>
> > > > >> Sent: Thursday, August 22, 2019 5:50 PM
> > > > >> To: Parav Pandit <parav@mellanox.com>
> > > > >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > > >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > > > >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > > > <cohuck@redhat.com>;  
> > > > >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > > >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > > >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > > >>
> > > > >> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:  
> > > > >> >
> > > > >> >  
> > > > >> >> -----Original Message-----
> > > > >> >> From: Jiri Pirko <jiri@resnulli.us>
> > > > >> >> Sent: Thursday, August 22, 2019 3:28 PM
> > > > >> >> To: Parav Pandit <parav@mellanox.com>
> > > > >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > > >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > > > >> >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > > > >> <cohuck@redhat.com>;  
> > > > >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > > >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > > >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > > >> >>
> > > > >> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:  
> > > > >> >> >
> > > > >> >> >  
> > > > >> >> >> -----Original Message-----
> > > > >> >> >> From: Jiri Pirko <jiri@resnulli.us>
> > > > >> >> >> Sent: Thursday, August 22, 2019 2:59 PM
> > > > >> >> >> To: Parav Pandit <parav@mellanox.com>
> > > > >> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri
> > > > >> >> >> Pirko <jiri@mellanox.com>; David S . Miller
> > > > >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > > > >> >> >> <kwankhede@nvidia.com>; Cornelia Huck  
> > > > >> >> <cohuck@redhat.com>;  
> > > > >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > > >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > > >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev
> > > > >> >> >> core
> > > > >> >> >>
> > > > >> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com  
> > wrote:  
> > > > >> >> >> >
> > > > >> >> >> >  
> > > > >> >> >> >> -----Original Message-----
> > > > >> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
> > > > >> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
> > > > >> >> >> >> To: Parav Pandit <parav@mellanox.com>
> > > > >> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > > > >> >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > > > >> >> >> >> <kwankhede@nvidia.com>; Cornelia Huck
> > > > >> >> >> >> <cohuck@redhat.com>; kvm@vger.kernel.org;
> > > > >> >> >> >> linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > > > >> >> >> >> netdev@vger.kernel.org
> > > > >> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and
> > > > >> >> >> >> mdev core
> > > > >> >> >> >>  
> > > > >> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
> > > > >> >> >> >> > > > > In fact, proposing that the user does not set
> > > > >> >> >> >> > > > > it, mdev-core provides one  
> > > > >> >> >> >> > > automatically.  
> > > > >> >> >> >> > > > >  
> > > > >> >> >> >> > > > > > > Since there seems to be some prefix
> > > > >> >> >> >> > > > > > > overhead, as I ask about above in how many
> > > > >> >> >> >> > > > > > > characters we actually have to work with in
> > > > >> >> >> >> > > > > > > IFNAMESZ, maybe we start with
> > > > >> >> >> >> > > > > > > 8 characters (matching your "index"
> > > > >> >> >> >> > > > > > > namespace) and expand as necessary for  
> > > > >> >> >> disambiguation.  
> > > > >> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ,
> > > > >> >> >> >> > > > > > > let's start with  
> > > > >> 12.  
> > > > >> >> >> >> > > > > > > Thanks,
> > > > >> >> >> >> > > > > > >  
> > > > >> >> >> >> > > > > > If user is going to choose the alias, why does
> > > > >> >> >> >> > > > > > it have to be limited to  
> > > > >> >> >> >> sha1?  
> > > > >> >> >> >> > > > > > Or you just told it as an example?
> > > > >> >> >> >> > > > > >
> > > > >> >> >> >> > > > > > It can be an alpha-numeric string.  
> > > > >> >> >> >> > > > >
> > > > >> >> >> >> > > > > No, I'm proposing a different solution where
> > > > >> >> >> >> > > > > mdev-core creates an alias based on an
> > > > >> >> >> >> > > > > abbreviated sha1.  The user does not provide the  
> > > > >> >> >> >> alias.  
> > > > >> >> >> >> > > > >  
> > > > >> >> >> >> > > > > > Instead of mdev imposing number of characters
> > > > >> >> >> >> > > > > > on the alias, it should be best  
> > > > >> >> >> >> > > > > left to the user.  
> > > > >> >> >> >> > > > > > Because in future if netdev improves on the
> > > > >> >> >> >> > > > > > naming scheme, mdev will be  
> > > > >> >> >> >> > > > > limiting it, which is not right.  
> > > > >> >> >> >> > > > > > So not restricting alias size seems right to me.
> > > > >> >> >> >> > > > > > User configuring mdev for networking devices
> > > > >> >> >> >> > > > > > in a given kernel knows what  
> > > > >> >> >> >> > > > > user is doing.  
> > > > >> >> >> >> > > > > > So user can choose alias name size as it finds suitable.  
> > > > >> >> >> >> > > > >
> > > > >> >> >> >> > > > > That's not what I'm proposing, please read again.
> > > > >> >> >> >> > > > > Thanks,  
> > > > >> >> >> >> > > >
> > > > >> >> >> >> > > > I understood your point. But mdev doesn't know how
> > > > >> >> >> >> > > > user is going to use  
> > > > >> >> >> >> > > udev/systemd to name the netdev.  
> > > > >> >> >> >> > > > So even if mdev chose to pick 12 characters, it
> > > > >> >> >> >> > > > could result in  
> > > > >> >> collision.  
> > > > >> >> >> >> > > > Hence the proposal to provide the alias by the
> > > > >> >> >> >> > > > user, as user know the best  
> > > > >> >> >> >> > > policy for its use case in the environment its using.  
> > > > >> >> >> >> > > > So 12 character sha1 method will still work by user.  
> > > > >> >> >> >> > >
> > > > >> >> >> >> > > Haven't you already provided examples where certain
> > > > >> >> >> >> > > drivers or subsystems have unique netdev prefixes?
> > > > >> >> >> >> > > If mdev provides a unique alias within the
> > > > >> >> >> >> > > subsystem, couldn't we simply define a netdev prefix
> > > > >> >> >> >> > > for the mdev subsystem and avoid all other
> > > > >> >> >> >> > > collisions?  I'm not in favor of the user providing
> > > > >> >> >> >> > > both a uuid and an alias/instance.  Thanks,
> > > > >> >> >> >> > >  
> > > > >> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1
> > > > >> >> >> >> > first 9 characters have  
> > > > >> >> >> >> collision?
> > > > >> >> >> >>
> > > > >> >> >> >> I think it would be a mistake to waste so many chars on
> > > > >> >> >> >> a prefix, but
> > > > >> >> >> >> 9 characters of sha1 likely wouldn't have a collision
> > > > >> >> >> >> before we have 10s of thousands of devices.  Thanks,
> > > > >> >> >> >>
> > > > >> >> >> >> Alex  
> > > > >> >> >> >
> > > > >> >> >> >Jiri, Dave,
> > > > >> >> >> >Are you ok with it for devlink/netdev part?
> > > > >> >> >> >Mdev core will create an alias from a UUID.
> > > > >> >> >> >
> > > > >> >> >> >This will be supplied during devlink port attr set such
> > > > >> >> >> >as,
> > > > >> >> >> >
> > > > >> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port,
> > > > >> >> >> >const char *mdev_alias);
> > > > >> >> >> >
> > > > >> >> >> >This alias is used to generate representor netdev's  
> > phys_port_name.  
> > > > >> >> >> >This alias from the mdev device's sysfs will be used by
> > > > >> >> >> >the udev/systemd to  
> > > > >> >> >> generate predicable netdev's name.  
> > > > >> >> >> >Example: enm<mdev_alias_first_12_chars>  
> > > > >> >> >>
> > > > >> >> >> What happens in unlikely case of 2 UUIDs collide?
> > > > >> >> >>  
> > > > >> >> >Since users sees two devices with same phys_port_name, user
> > > > >> >> >should destroy  
> > > > >> >> recently created mdev and recreate mdev with different UUID?
> > > > >> >>
> > > > >> >> Driver should make sure phys port name wont collide,  
> > > > >> >So when mdev creation is initiated, mdev core calculates the
> > > > >> >alias and if there  
> > > > >> is any other mdev with same alias exist, it returns -EEXIST error
> > > > >> before progressing further.  
> > > > >> >This way user will get to know upfront in event of collision
> > > > >> >before the mdev  
> > > > >> device gets created.  
> > > > >> >How about that?  
> > > > >>
> > > > >> Sounds fine to me. Now the question is how many chars do we want to  
> > have.  
> > > > >>  
> > > > >12 characters from Alex's suggestion similar to git?  
> > > >
> > > > Ok.
> > > >  
> > >
> > > Can you please confirm this scheme looks good now? I like to get patches  
> > started.
> > 
> > My only concern is your comment that in the event of an abbreviated
> > sha1 collision (as exceptionally rare as that might be at 12-chars), we'd fail the
> > device create, while my original suggestion was that vfio-core would add an
> > extra character to the alias.  For non-networking devices, the sha1 is
> > unnecessary, so the extension behavior seems preferred.  The user is only
> > responsible to provide a unique uuid.  Perhaps the failure behavior could be
> > applied based on the mdev device_api.  A module option on mdev to specify the
> > default number of alias chars would also be useful for testing so that we can set
> > it low enough to validate the collision behavior.  Thanks,
> >   
> 
> Idea is to have mdev alias as optional.
> Each mdev_parent says whether it wants mdev_core to generate an alias
> or not. So only networking device drivers would set it to true.
> For rest, alias won't be generated, and won't be compared either
> during creation time. User continue to provide only uuid.

Ok

> I am tempted to have alias collision detection only within children
> mdevs of the same parent, but doing so will always mandate to prefix
> in netdev name. And currently we are left with only 3 characters to
> prefix it, so that may not be good either. Hence, I think mdev core
> wide alias is better with 12 characters.

I suppose it depends on the API, if the vendor driver can ask the mdev
core for an alias as part of the device creation process, then it could
manage the netdev namespace for all its devices, choosing how many
characters to use, and fail the creation if it can't meet a uniqueness
requirement.  IOW, mdev-core would always provide a full sha1 and
therefore gets itself out of the uniqueness/collision aspects.

> I do not understand how an extra character reduces collision, if
> that's what you meant.

If the default were for example 3-chars, we might already have device
'abc'.  A collision would expose one more char of the new device, so we
might add device with alias 'abcd'.  I mentioned previously that this
leaves an issue for userspace that we can't change the alias of device
abc, so without additional information, userspace can only determine via
elimination the mapping of alias to device, but userspace has more
information available to it in the form of sysfs links.

> Module options are almost not encouraged
> anymore with other subsystems/drivers.

We don't live in a world of absolutes.  I agree that the defaults
should work in the vast majority of cases.  Requiring a user to twiddle
module options to make things work is undesirable, verging on a bug.  A
module option to enable some specific feature, unsafe condition, or test
that is outside of the typical use case is reasonable, imo.

> For testing collision rate, a sample user space script and sample
> mtty is easy and get us collision count too. We shouldn't put that
> using module option in production kernel. I practically have the code
> ready to play with; Changing 12 to smaller value is easy with module
> reload.
> 
> #define MDEV_ALIAS_LEN 12

If it can't be tested with a shipping binary, it probably won't be
tested.  Thanks,

Alex

^ permalink raw reply

* [PATCH net-next] drop_monitor: Make timestamps y2038 safe
From: Ido Schimmel @ 2019-08-23 15:47 UTC (permalink / raw)
  To: netdev; +Cc: davem, nhorman, arnd, andrew, ayal, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@mellanox.com>

Timestamps are currently communicated to user space as 'struct
timespec', which is not considered y2038 safe since it uses a 32-bit
signed value for seconds.

Fix this while the API is still not part of any official kernel release
by using 64-bit nanoseconds timestamps instead.

Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
Arnd, I have followed your recommendation to use 64-bit nanoseconds
timestamps. I would appreciate it if you could review this change.

Thanks!
---
 include/uapi/linux/net_dropmon.h |  2 +-
 net/core/drop_monitor.c          | 14 ++++++--------
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/net_dropmon.h b/include/uapi/linux/net_dropmon.h
index 75a35dccb675..8bf79a9eb234 100644
--- a/include/uapi/linux/net_dropmon.h
+++ b/include/uapi/linux/net_dropmon.h
@@ -75,7 +75,7 @@ enum net_dm_attr {
 	NET_DM_ATTR_PC,				/* u64 */
 	NET_DM_ATTR_SYMBOL,			/* string */
 	NET_DM_ATTR_IN_PORT,			/* nested */
-	NET_DM_ATTR_TIMESTAMP,			/* struct timespec */
+	NET_DM_ATTR_TIMESTAMP,			/* u64 */
 	NET_DM_ATTR_PROTO,			/* u16 */
 	NET_DM_ATTR_PAYLOAD,			/* binary */
 	NET_DM_ATTR_PAD,
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index bfc024024aa3..cc60cc22e2db 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -552,7 +552,7 @@ static size_t net_dm_packet_report_size(size_t payload_len)
 	       /* NET_DM_ATTR_IN_PORT */
 	       net_dm_in_port_size() +
 	       /* NET_DM_ATTR_TIMESTAMP */
-	       nla_total_size(sizeof(struct timespec)) +
+	       nla_total_size(sizeof(u64)) +
 	       /* NET_DM_ATTR_ORIG_LEN */
 	       nla_total_size(sizeof(u32)) +
 	       /* NET_DM_ATTR_PROTO */
@@ -592,7 +592,6 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, struct sk_buff *skb,
 	u64 pc = (u64)(uintptr_t) NET_DM_SKB_CB(skb)->pc;
 	char buf[NET_DM_MAX_SYMBOL_LEN];
 	struct nlattr *attr;
-	struct timespec ts;
 	void *hdr;
 	int rc;
 
@@ -615,8 +614,8 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, struct sk_buff *skb,
 	if (rc)
 		goto nla_put_failure;
 
-	if (ktime_to_timespec_cond(skb->tstamp, &ts) &&
-	    nla_put(msg, NET_DM_ATTR_TIMESTAMP, sizeof(ts), &ts))
+	if (nla_put_u64_64bit(msg, NET_DM_ATTR_TIMESTAMP,
+			      ktime_to_ns(skb->tstamp), NET_DM_ATTR_PAD))
 		goto nla_put_failure;
 
 	if (nla_put_u32(msg, NET_DM_ATTR_ORIG_LEN, skb->len))
@@ -716,7 +715,7 @@ net_dm_hw_packet_report_size(size_t payload_len,
 	       /* NET_DM_ATTR_IN_PORT */
 	       net_dm_in_port_size() +
 	       /* NET_DM_ATTR_TIMESTAMP */
-	       nla_total_size(sizeof(struct timespec)) +
+	       nla_total_size(sizeof(u64)) +
 	       /* NET_DM_ATTR_ORIG_LEN */
 	       nla_total_size(sizeof(u32)) +
 	       /* NET_DM_ATTR_PROTO */
@@ -730,7 +729,6 @@ static int net_dm_hw_packet_report_fill(struct sk_buff *msg,
 {
 	struct net_dm_hw_metadata *hw_metadata;
 	struct nlattr *attr;
-	struct timespec ts;
 	void *hdr;
 
 	hw_metadata = NET_DM_SKB_CB(skb)->hw_metadata;
@@ -761,8 +759,8 @@ static int net_dm_hw_packet_report_fill(struct sk_buff *msg,
 			goto nla_put_failure;
 	}
 
-	if (ktime_to_timespec_cond(skb->tstamp, &ts) &&
-	    nla_put(msg, NET_DM_ATTR_TIMESTAMP, sizeof(ts), &ts))
+	if (nla_put_u64_64bit(msg, NET_DM_ATTR_TIMESTAMP,
+			      ktime_to_ns(skb->tstamp), NET_DM_ATTR_PAD))
 		goto nla_put_failure;
 
 	if (nla_put_u32(msg, NET_DM_ATTR_ORIG_LEN, skb->len))
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next 5/6] net: dsa: program VLAN on CPU port from slave
From: Vladimir Oltean @ 2019-08-23 15:44 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, David S. Miller, Florian Fainelli, Andrew Lunn
In-Reply-To: <20190822201323.1292-6-vivien.didelot@gmail.com>

On Thu, 22 Aug 2019 at 23:13, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> DSA currently programs a VLAN on the CPU port implicitly after the
> related notifier is received by a switch.
>
> While we still need to do this transparent programmation of the DSA
> links in the fabric, programming the CPU port this way may cause
> problems in some corners such as the tag_8021q driver.
>
> Because the dedicated CPU port is specific to a slave, make their
> programmation explicit a few layers up, in the slave code.
>
> Note that technically, DSA links have a dedicated CPU port as well,
> but since they are only used as conduit between interconnected switches
> of a fabric, programming them transparently this way is fine.
>
> Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
> ---
>  net/dsa/slave.c  | 14 ++++++++++++++
>  net/dsa/switch.c |  5 ++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 82e48d247b81..8267c156a51a 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -332,6 +332,10 @@ static int dsa_slave_vlan_add(struct net_device *dev,
>         if (err)
>                 return err;
>
> +       err = dsa_port_vlan_add(dp->cpu_dp, &vlan, trans);
> +       if (err)
> +               return err;
> +
>         return 0;
>  }
>
> @@ -383,6 +387,9 @@ static int dsa_slave_vlan_del(struct net_device *dev,
>         if (dp->bridge_dev && !br_vlan_enabled(dp->bridge_dev))
>                 return 0;
>
> +       /* Do not deprogram the CPU port as it may be shared with other user
> +        * ports which can be members of this VLAN as well.
> +        */

+1 for the comments, the deletion of dp->cpu_dp is less likely to get
patched into the code in the future now.

>         return dsa_port_vlan_del(dp, SWITCHDEV_OBJ_PORT_VLAN(obj));
>  }
>
> @@ -1121,6 +1128,10 @@ static int dsa_slave_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
>         if (ret && ret != -EOPNOTSUPP)
>                 return ret;
>
> +       ret = dsa_port_vid_add(dp->cpu_dp, vid, 0);
> +       if (ret && ret != -EOPNOTSUPP)
> +               return ret;
> +

I think it's worth understanding what the EOPNOTSUPP -> 0 is avoiding.

>         return 0;
>  }
>
> @@ -1151,6 +1162,9 @@ static int dsa_slave_vlan_rx_kill_vid(struct net_device *dev, __be16 proto,
>         if (ret == -EOPNOTSUPP)
>                 ret = 0;
>
> +       /* Do not deprogram the CPU port as it may be shared with other user
> +        * ports which can be members of this VLAN as well.
> +        */
>         return ret;
>  }
>
> diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> index 489eb7b430a4..6a9607518823 100644
> --- a/net/dsa/switch.c
> +++ b/net/dsa/switch.c
> @@ -232,7 +232,7 @@ static bool dsa_switch_vlan_match(struct dsa_switch *ds, int port,
>         if (ds->index == info->sw_index && port == info->port)
>                 return true;
>
> -       if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
> +       if (dsa_is_dsa_port(ds, port))

Much better, thank you.



>                 return true;
>
>         return false;
> @@ -288,6 +288,9 @@ static int dsa_switch_vlan_del(struct dsa_switch *ds,
>         if (ds->index == info->sw_index)
>                 return ds->ops->port_vlan_del(ds, info->port, info->vlan);
>
> +       /* Do not deprogram the DSA links as they may be used as conduit
> +        * for other VLAN members in the fabric.
> +        */
>         return 0;
>  }
>
> --
> 2.23.0
>

^ permalink raw reply

* [PATCH net-next] dpaa2-eth: Add pause frame support
From: Ioana Radulescu @ 2019-08-23 15:19 UTC (permalink / raw)
  To: netdev, davem; +Cc: ioana.ciornei

Starting with firmware version MC10.18.0, we have support for
L2 flow control. Asymmetrical configuration (Rx or Tx only) is
supported, but not pause frame autonegotioation.

The hardware can automatically send pause frames when the number
of buffers in the pool goes below a predefined threshold. Due to
this, flow control is incompatible with Rx frame queue taildrop
(both mechanisms target the case when processing of ingress
frames can't keep up with the Rx rate; for large frames, the number
of buffers in the pool may never get low enough to trigger pause
frames as long as taildrop is enabled). So we set pause frame
generation and Rx FQ taildrop as mutually exclusive.

Pause frame configuration is done via ethtool. By default, we start
with flow control enabled on both Rx and Tx. Changes are propagated
to hardware through firmware commands; current FC state is stored
in the driver and we only interrogate the firmware when we receive
a notification that something (flow control or other link options)
has changed.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c   | 85 +++++++++++++++++++---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h   |  7 ++
 .../net/ethernet/freescale/dpaa2/dpaa2-ethtool.c   | 74 +++++++++++++++----
 drivers/net/ethernet/freescale/dpaa2/dpni-cmd.h    |  3 +-
 drivers/net/ethernet/freescale/dpaa2/dpni.c        | 40 +++++++++-
 drivers/net/ethernet/freescale/dpaa2/dpni.h        |  5 ++
 6 files changed, 186 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 0acb115..e0816d6 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -1208,9 +1208,37 @@ static void disable_ch_napi(struct dpaa2_eth_priv *priv)
 	}
 }
 
+static void dpaa2_eth_set_rx_taildrop(struct dpaa2_eth_priv *priv, bool enable)
+{
+	struct dpni_taildrop td = {0};
+	int i, err;
+
+	if (priv->rx_td_enabled == enable)
+		return;
+
+	td.enable = enable;
+	td.threshold = DPAA2_ETH_TAILDROP_THRESH;
+
+	for (i = 0; i < priv->num_fqs; i++) {
+		if (priv->fq[i].type != DPAA2_RX_FQ)
+			continue;
+		err = dpni_set_taildrop(priv->mc_io, 0, priv->mc_token,
+					DPNI_CP_QUEUE, DPNI_QUEUE_RX, 0,
+					priv->fq[i].flowid, &td);
+		if (err) {
+			netdev_err(priv->net_dev,
+				   "dpni_set_taildrop() failed\n");
+			break;
+		}
+	}
+
+	priv->rx_td_enabled = enable;
+}
+
 static int link_state_update(struct dpaa2_eth_priv *priv)
 {
 	struct dpni_link_state state = {0};
+	bool tx_pause;
 	int err;
 
 	err = dpni_get_link_state(priv->mc_io, 0, priv->mc_token, &state);
@@ -1220,11 +1248,18 @@ static int link_state_update(struct dpaa2_eth_priv *priv)
 		return err;
 	}
 
+	/* If Tx pause frame settings have changed, we need to update
+	 * Rx FQ taildrop configuration as well. We configure taildrop
+	 * only when pause frame generation is disabled.
+	 */
+	tx_pause = !!(state.options & DPNI_LINK_OPT_PAUSE) ^
+		   !!(state.options & DPNI_LINK_OPT_ASYM_PAUSE);
+	dpaa2_eth_set_rx_taildrop(priv, !tx_pause);
+
 	/* Chech link state; speed / duplex changes are not treated yet */
 	if (priv->link_state.up == state.up)
-		return 0;
+		goto out;
 
-	priv->link_state = state;
 	if (state.up) {
 		netif_carrier_on(priv->net_dev);
 		netif_tx_start_all_queues(priv->net_dev);
@@ -1236,6 +1271,9 @@ static int link_state_update(struct dpaa2_eth_priv *priv)
 	netdev_info(priv->net_dev, "Link Event: state %s\n",
 		    state.up ? "up" : "down");
 
+out:
+	priv->link_state = state;
+
 	return 0;
 }
 
@@ -2443,6 +2481,32 @@ static void set_enqueue_mode(struct dpaa2_eth_priv *priv)
 		priv->enqueue = dpaa2_eth_enqueue_fq;
 }
 
+static int set_pause(struct dpaa2_eth_priv *priv)
+{
+	struct device *dev = priv->net_dev->dev.parent;
+	struct dpni_link_cfg link_cfg = {0};
+	int err;
+
+	/* Get the default link options so we don't override other flags */
+	err = dpni_get_link_cfg(priv->mc_io, 0, priv->mc_token, &link_cfg);
+	if (err) {
+		dev_err(dev, "dpni_get_link_cfg() failed\n");
+		return err;
+	}
+
+	link_cfg.options |= DPNI_LINK_OPT_PAUSE;
+	link_cfg.options &= ~DPNI_LINK_OPT_ASYM_PAUSE;
+	err = dpni_set_link_cfg(priv->mc_io, 0, priv->mc_token, &link_cfg);
+	if (err) {
+		dev_err(dev, "dpni_set_link_cfg() failed\n");
+		return err;
+	}
+
+	priv->link_state.options = link_cfg.options;
+
+	return 0;
+}
+
 /* Configure the DPNI object this interface is associated with */
 static int setup_dpni(struct fsl_mc_device *ls_dev)
 {
@@ -2498,6 +2562,13 @@ static int setup_dpni(struct fsl_mc_device *ls_dev)
 
 	set_enqueue_mode(priv);
 
+	/* Enable pause frame support */
+	if (dpaa2_eth_has_pause_support(priv)) {
+		err = set_pause(priv);
+		if (err)
+			goto close;
+	}
+
 	priv->cls_rules = devm_kzalloc(dev, sizeof(struct dpaa2_eth_cls_rule) *
 				       dpaa2_eth_fs_count(priv), GFP_KERNEL);
 	if (!priv->cls_rules)
@@ -2529,7 +2600,6 @@ static int setup_rx_flow(struct dpaa2_eth_priv *priv,
 	struct device *dev = priv->net_dev->dev.parent;
 	struct dpni_queue queue;
 	struct dpni_queue_id qid;
-	struct dpni_taildrop td;
 	int err;
 
 	err = dpni_get_queue(priv->mc_io, 0, priv->mc_token,
@@ -2554,15 +2624,6 @@ static int setup_rx_flow(struct dpaa2_eth_priv *priv,
 		return err;
 	}
 
-	td.enable = 1;
-	td.threshold = DPAA2_ETH_TAILDROP_THRESH;
-	err = dpni_set_taildrop(priv->mc_io, 0, priv->mc_token, DPNI_CP_QUEUE,
-				DPNI_QUEUE_RX, 0, fq->flowid, &td);
-	if (err) {
-		dev_err(dev, "dpni_set_threshold() failed\n");
-		return err;
-	}
-
 	/* xdp_rxq setup */
 	err = xdp_rxq_info_reg(&fq->channel->xdp_rxq, priv->net_dev,
 			       fq->flowid);
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index 9af18c2..8a0e65b 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -392,6 +392,7 @@ struct dpaa2_eth_priv {
 	struct dpaa2_eth_drv_stats __percpu *percpu_extras;
 
 	u16 mc_token;
+	u8 rx_td_enabled;
 
 	struct dpni_link_state link_state;
 	bool do_link_poll;
@@ -476,6 +477,12 @@ enum dpaa2_eth_rx_dist {
 #define DPAA2_ETH_DIST_L4DST		BIT(8)
 #define DPAA2_ETH_DIST_ALL		(~0ULL)
 
+#define DPNI_PAUSE_VER_MAJOR		7
+#define DPNI_PAUSE_VER_MINOR		13
+#define dpaa2_eth_has_pause_support(priv)			\
+	(dpaa2_eth_cmp_dpni_ver((priv), DPNI_PAUSE_VER_MAJOR,	\
+				DPNI_PAUSE_VER_MINOR) >= 0)
+
 static inline
 unsigned int dpaa2_eth_needed_headroom(struct dpaa2_eth_priv *priv,
 				       struct sk_buff *skb)
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
index 7b182f4..7000638 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
@@ -78,29 +78,20 @@ static int
 dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
 			     struct ethtool_link_ksettings *link_settings)
 {
-	struct dpni_link_state state = {0};
-	int err = 0;
 	struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
 
-	err = dpni_get_link_state(priv->mc_io, 0, priv->mc_token, &state);
-	if (err) {
-		netdev_err(net_dev, "ERROR %d getting link state\n", err);
-		goto out;
-	}
-
 	/* At the moment, we have no way of interrogating the DPMAC
 	 * from the DPNI side - and for that matter there may exist
 	 * no DPMAC at all. So for now we just don't report anything
 	 * beyond the DPNI attributes.
 	 */
-	if (state.options & DPNI_LINK_OPT_AUTONEG)
+	if (priv->link_state.options & DPNI_LINK_OPT_AUTONEG)
 		link_settings->base.autoneg = AUTONEG_ENABLE;
-	if (!(state.options & DPNI_LINK_OPT_HALF_DUPLEX))
+	if (!(priv->link_state.options & DPNI_LINK_OPT_HALF_DUPLEX))
 		link_settings->base.duplex = DUPLEX_FULL;
-	link_settings->base.speed = state.rate;
+	link_settings->base.speed = priv->link_state.rate;
 
-out:
-	return err;
+	return 0;
 }
 
 #define DPNI_DYNAMIC_LINK_SET_VER_MAJOR		7
@@ -125,6 +116,9 @@ dpaa2_eth_set_link_ksettings(struct net_device *net_dev,
 		}
 	}
 
+	/* Make sure current options are not overwritten */
+	cfg.options = priv->link_state.options;
+
 	cfg.rate = link_settings->base.speed;
 	if (link_settings->base.autoneg == AUTONEG_ENABLE)
 		cfg.options |= DPNI_LINK_OPT_AUTONEG;
@@ -145,6 +139,58 @@ dpaa2_eth_set_link_ksettings(struct net_device *net_dev,
 	return err;
 }
 
+static void dpaa2_eth_get_pauseparam(struct net_device *net_dev,
+				     struct ethtool_pauseparam *pause)
+{
+	struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
+	u64 link_options = priv->link_state.options;
+
+	pause->rx_pause = !!(link_options & DPNI_LINK_OPT_PAUSE);
+	pause->tx_pause = pause->rx_pause ^
+			  !!(link_options & DPNI_LINK_OPT_ASYM_PAUSE);
+}
+
+static int dpaa2_eth_set_pauseparam(struct net_device *net_dev,
+				    struct ethtool_pauseparam *pause)
+{
+	struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
+	struct dpni_link_cfg cfg = {0};
+	int err;
+
+	if (!dpaa2_eth_has_pause_support(priv)) {
+		netdev_info(net_dev, "No pause frame support for DPNI version < %d.%d\n",
+			    DPNI_PAUSE_VER_MAJOR, DPNI_PAUSE_VER_MINOR);
+		return -EOPNOTSUPP;
+	}
+
+	if (pause->autoneg)
+		netdev_err(net_dev, "Pause frame autoneg not supported\n");
+
+	cfg.rate = priv->link_state.rate;
+	cfg.options = priv->link_state.options;
+	if (pause->rx_pause)
+		cfg.options |= DPNI_LINK_OPT_PAUSE;
+	else
+		cfg.options &= ~DPNI_LINK_OPT_PAUSE;
+	if (!!pause->rx_pause ^ !!pause->tx_pause)
+		cfg.options |= DPNI_LINK_OPT_ASYM_PAUSE;
+	else
+		cfg.options &= ~DPNI_LINK_OPT_ASYM_PAUSE;
+
+	if (cfg.options == priv->link_state.options)
+		return 0;
+
+	err = dpni_set_link_cfg(priv->mc_io, 0, priv->mc_token, &cfg);
+	if (err) {
+		netdev_err(net_dev, "dpni_set_link_state failed\n");
+		return err;
+	}
+
+	priv->link_state.options = cfg.options;
+
+	return 0;
+}
+
 static void dpaa2_eth_get_strings(struct net_device *netdev, u32 stringset,
 				  u8 *data)
 {
@@ -722,6 +768,8 @@ const struct ethtool_ops dpaa2_ethtool_ops = {
 	.get_link = ethtool_op_get_link,
 	.get_link_ksettings = dpaa2_eth_get_link_ksettings,
 	.set_link_ksettings = dpaa2_eth_set_link_ksettings,
+	.get_pauseparam = dpaa2_eth_get_pauseparam,
+	.set_pauseparam = dpaa2_eth_set_pauseparam,
 	.get_sset_count = dpaa2_eth_get_sset_count,
 	.get_ethtool_stats = dpaa2_eth_get_ethtool_stats,
 	.get_strings = dpaa2_eth_get_strings,
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpni-cmd.h b/drivers/net/ethernet/freescale/dpaa2/dpni-cmd.h
index 7b44d7d..d9b6918 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpni-cmd.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpni-cmd.h
@@ -84,6 +84,7 @@
 
 #define DPNI_CMDID_SET_RX_FS_DIST			DPNI_CMD(0x273)
 #define DPNI_CMDID_SET_RX_HASH_DIST			DPNI_CMD(0x274)
+#define DPNI_CMDID_GET_LINK_CFG				DPNI_CMD(0x278)
 
 /* Macros for accessing command fields smaller than 1byte */
 #define DPNI_MASK(field)	\
@@ -284,7 +285,7 @@ struct dpni_rsp_get_statistics {
 	__le64 counter[DPNI_STATISTICS_CNT];
 };
 
-struct dpni_cmd_set_link_cfg {
+struct dpni_cmd_link_cfg {
 	/* cmd word 0 */
 	__le64 pad0;
 	/* cmd word 1 */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpni.c b/drivers/net/ethernet/freescale/dpaa2/dpni.c
index 220dfc8..05e3089 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpni.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpni.c
@@ -838,13 +838,13 @@ int dpni_set_link_cfg(struct fsl_mc_io *mc_io,
 		      const struct dpni_link_cfg *cfg)
 {
 	struct fsl_mc_command cmd = { 0 };
-	struct dpni_cmd_set_link_cfg *cmd_params;
+	struct dpni_cmd_link_cfg *cmd_params;
 
 	/* prepare command */
 	cmd.header = mc_encode_cmd_header(DPNI_CMDID_SET_LINK_CFG,
 					  cmd_flags,
 					  token);
-	cmd_params = (struct dpni_cmd_set_link_cfg *)cmd.params;
+	cmd_params = (struct dpni_cmd_link_cfg *)cmd.params;
 	cmd_params->rate = cpu_to_le32(cfg->rate);
 	cmd_params->options = cpu_to_le64(cfg->options);
 
@@ -853,6 +853,42 @@ int dpni_set_link_cfg(struct fsl_mc_io *mc_io,
 }
 
 /**
+ * dpni_get_link_cfg() - return the link configuration
+ * @mc_io:	Pointer to MC portal's I/O object
+ * @cmd_flags:	Command flags; one or more of 'MC_CMD_FLAG_'
+ * @token:	Token of DPNI object
+ * @cfg:	Link configuration from dpni object
+ *
+ * Return:	'0' on Success; Error code otherwise.
+ */
+int dpni_get_link_cfg(struct fsl_mc_io *mc_io,
+		      u32 cmd_flags,
+		      u16 token,
+		      struct dpni_link_cfg *cfg)
+{
+	struct fsl_mc_command cmd = { 0 };
+	struct dpni_cmd_link_cfg *rsp_params;
+	int err;
+
+	/* prepare command */
+	cmd.header = mc_encode_cmd_header(DPNI_CMDID_GET_LINK_CFG,
+					  cmd_flags,
+					  token);
+
+	/* send command to mc*/
+	err = mc_send_command(mc_io, &cmd);
+	if (err)
+		return err;
+
+	/* retrieve response parameters */
+	rsp_params = (struct dpni_cmd_link_cfg *)cmd.params;
+	cfg->rate = le32_to_cpu(rsp_params->rate);
+	cfg->options = le64_to_cpu(rsp_params->options);
+
+	return err;
+}
+
+/**
  * dpni_get_link_state() - Return the link state (either up or down)
  * @mc_io:	Pointer to MC portal's I/O object
  * @cmd_flags:	Command flags; one or more of 'MC_CMD_FLAG_'
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpni.h b/drivers/net/ethernet/freescale/dpaa2/dpni.h
index a521242..3e8fc6c 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpni.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpni.h
@@ -485,6 +485,11 @@ int dpni_set_link_cfg(struct fsl_mc_io			*mc_io,
 		      u16				token,
 		      const struct dpni_link_cfg	*cfg);
 
+int dpni_get_link_cfg(struct fsl_mc_io			*mc_io,
+		      u32				cmd_flags,
+		      u16				token,
+		      struct dpni_link_cfg		*cfg);
+
 /**
  * struct dpni_link_state - Structure representing DPNI link state
  * @rate: Rate
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Jiri Pirko @ 2019-08-23 15:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Alex Williamson, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866867150DAABA422F25FF8D1A40@AM0PR05MB4866.eurprd05.prod.outlook.com>

Fri, Aug 23, 2019 at 04:53:06PM CEST, parav@mellanox.com wrote:
>
>
>> -----Original Message-----
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Friday, August 23, 2019 7:58 PM
>> To: Parav Pandit <parav@mellanox.com>
>> Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S . Miller
>> <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>; Cornelia
>> Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
>> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
>> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>> 
>> On Fri, 23 Aug 2019 08:14:39 +0000
>> Parav Pandit <parav@mellanox.com> wrote:
>> 
>> > Hi Alex,
>> >
>> >
>> > > -----Original Message-----
>> > > From: Jiri Pirko <jiri@resnulli.us>
>> > > Sent: Friday, August 23, 2019 1:42 PM
>> > > To: Parav Pandit <parav@mellanox.com>
>> > > Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
>> > > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
>> > > Wankhede <kwankhede@nvidia.com>; Cornelia Huck
>> <cohuck@redhat.com>;
>> > > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
>> > > <cjia@nvidia.com>; netdev@vger.kernel.org
>> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>> > >
>> > > Thu, Aug 22, 2019 at 03:33:30PM CEST, parav@mellanox.com wrote:
>> > > >
>> > > >
>> > > >> -----Original Message-----
>> > > >> From: Jiri Pirko <jiri@resnulli.us>
>> > > >> Sent: Thursday, August 22, 2019 5:50 PM
>> > > >> To: Parav Pandit <parav@mellanox.com>
>> > > >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
>> > > >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
>> > > >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
>> > > <cohuck@redhat.com>;
>> > > >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
>> > > >> <cjia@nvidia.com>; netdev@vger.kernel.org
>> > > >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>> > > >>
>> > > >> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:
>> > > >> >
>> > > >> >
>> > > >> >> -----Original Message-----
>> > > >> >> From: Jiri Pirko <jiri@resnulli.us>
>> > > >> >> Sent: Thursday, August 22, 2019 3:28 PM
>> > > >> >> To: Parav Pandit <parav@mellanox.com>
>> > > >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
>> > > >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
>> > > >> >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
>> > > >> <cohuck@redhat.com>;
>> > > >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
>> > > >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
>> > > >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>> > > >> >>
>> > > >> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:
>> > > >> >> >
>> > > >> >> >
>> > > >> >> >> -----Original Message-----
>> > > >> >> >> From: Jiri Pirko <jiri@resnulli.us>
>> > > >> >> >> Sent: Thursday, August 22, 2019 2:59 PM
>> > > >> >> >> To: Parav Pandit <parav@mellanox.com>
>> > > >> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri
>> > > >> >> >> Pirko <jiri@mellanox.com>; David S . Miller
>> > > >> >> >> <davem@davemloft.net>; Kirti Wankhede
>> > > >> >> >> <kwankhede@nvidia.com>; Cornelia Huck
>> > > >> >> <cohuck@redhat.com>;
>> > > >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
>> > > >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
>> > > >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev
>> > > >> >> >> core
>> > > >> >> >>
>> > > >> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com
>> wrote:
>> > > >> >> >> >
>> > > >> >> >> >
>> > > >> >> >> >> -----Original Message-----
>> > > >> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
>> > > >> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
>> > > >> >> >> >> To: Parav Pandit <parav@mellanox.com>
>> > > >> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
>> > > >> >> >> >> <davem@davemloft.net>; Kirti Wankhede
>> > > >> >> >> >> <kwankhede@nvidia.com>; Cornelia Huck
>> > > >> >> >> >> <cohuck@redhat.com>; kvm@vger.kernel.org;
>> > > >> >> >> >> linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
>> > > >> >> >> >> netdev@vger.kernel.org
>> > > >> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and
>> > > >> >> >> >> mdev core
>> > > >> >> >> >>
>> > > >> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
>> > > >> >> >> >> > > > > In fact, proposing that the user does not set
>> > > >> >> >> >> > > > > it, mdev-core provides one
>> > > >> >> >> >> > > automatically.
>> > > >> >> >> >> > > > >
>> > > >> >> >> >> > > > > > > Since there seems to be some prefix
>> > > >> >> >> >> > > > > > > overhead, as I ask about above in how many
>> > > >> >> >> >> > > > > > > characters we actually have to work with in
>> > > >> >> >> >> > > > > > > IFNAMESZ, maybe we start with
>> > > >> >> >> >> > > > > > > 8 characters (matching your "index"
>> > > >> >> >> >> > > > > > > namespace) and expand as necessary for
>> > > >> >> >> disambiguation.
>> > > >> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ,
>> > > >> >> >> >> > > > > > > let's start with
>> > > >> 12.
>> > > >> >> >> >> > > > > > > Thanks,
>> > > >> >> >> >> > > > > > >
>> > > >> >> >> >> > > > > > If user is going to choose the alias, why does
>> > > >> >> >> >> > > > > > it have to be limited to
>> > > >> >> >> >> sha1?
>> > > >> >> >> >> > > > > > Or you just told it as an example?
>> > > >> >> >> >> > > > > >
>> > > >> >> >> >> > > > > > It can be an alpha-numeric string.
>> > > >> >> >> >> > > > >
>> > > >> >> >> >> > > > > No, I'm proposing a different solution where
>> > > >> >> >> >> > > > > mdev-core creates an alias based on an
>> > > >> >> >> >> > > > > abbreviated sha1.  The user does not provide the
>> > > >> >> >> >> alias.
>> > > >> >> >> >> > > > >
>> > > >> >> >> >> > > > > > Instead of mdev imposing number of characters
>> > > >> >> >> >> > > > > > on the alias, it should be best
>> > > >> >> >> >> > > > > left to the user.
>> > > >> >> >> >> > > > > > Because in future if netdev improves on the
>> > > >> >> >> >> > > > > > naming scheme, mdev will be
>> > > >> >> >> >> > > > > limiting it, which is not right.
>> > > >> >> >> >> > > > > > So not restricting alias size seems right to me.
>> > > >> >> >> >> > > > > > User configuring mdev for networking devices
>> > > >> >> >> >> > > > > > in a given kernel knows what
>> > > >> >> >> >> > > > > user is doing.
>> > > >> >> >> >> > > > > > So user can choose alias name size as it finds suitable.
>> > > >> >> >> >> > > > >
>> > > >> >> >> >> > > > > That's not what I'm proposing, please read again.
>> > > >> >> >> >> > > > > Thanks,
>> > > >> >> >> >> > > >
>> > > >> >> >> >> > > > I understood your point. But mdev doesn't know how
>> > > >> >> >> >> > > > user is going to use
>> > > >> >> >> >> > > udev/systemd to name the netdev.
>> > > >> >> >> >> > > > So even if mdev chose to pick 12 characters, it
>> > > >> >> >> >> > > > could result in
>> > > >> >> collision.
>> > > >> >> >> >> > > > Hence the proposal to provide the alias by the
>> > > >> >> >> >> > > > user, as user know the best
>> > > >> >> >> >> > > policy for its use case in the environment its using.
>> > > >> >> >> >> > > > So 12 character sha1 method will still work by user.
>> > > >> >> >> >> > >
>> > > >> >> >> >> > > Haven't you already provided examples where certain
>> > > >> >> >> >> > > drivers or subsystems have unique netdev prefixes?
>> > > >> >> >> >> > > If mdev provides a unique alias within the
>> > > >> >> >> >> > > subsystem, couldn't we simply define a netdev prefix
>> > > >> >> >> >> > > for the mdev subsystem and avoid all other
>> > > >> >> >> >> > > collisions?  I'm not in favor of the user providing
>> > > >> >> >> >> > > both a uuid and an alias/instance.  Thanks,
>> > > >> >> >> >> > >
>> > > >> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1
>> > > >> >> >> >> > first 9 characters have
>> > > >> >> >> >> collision?
>> > > >> >> >> >>
>> > > >> >> >> >> I think it would be a mistake to waste so many chars on
>> > > >> >> >> >> a prefix, but
>> > > >> >> >> >> 9 characters of sha1 likely wouldn't have a collision
>> > > >> >> >> >> before we have 10s of thousands of devices.  Thanks,
>> > > >> >> >> >>
>> > > >> >> >> >> Alex
>> > > >> >> >> >
>> > > >> >> >> >Jiri, Dave,
>> > > >> >> >> >Are you ok with it for devlink/netdev part?
>> > > >> >> >> >Mdev core will create an alias from a UUID.
>> > > >> >> >> >
>> > > >> >> >> >This will be supplied during devlink port attr set such
>> > > >> >> >> >as,
>> > > >> >> >> >
>> > > >> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port,
>> > > >> >> >> >const char *mdev_alias);
>> > > >> >> >> >
>> > > >> >> >> >This alias is used to generate representor netdev's
>> phys_port_name.
>> > > >> >> >> >This alias from the mdev device's sysfs will be used by
>> > > >> >> >> >the udev/systemd to
>> > > >> >> >> generate predicable netdev's name.
>> > > >> >> >> >Example: enm<mdev_alias_first_12_chars>
>> > > >> >> >>
>> > > >> >> >> What happens in unlikely case of 2 UUIDs collide?
>> > > >> >> >>
>> > > >> >> >Since users sees two devices with same phys_port_name, user
>> > > >> >> >should destroy
>> > > >> >> recently created mdev and recreate mdev with different UUID?
>> > > >> >>
>> > > >> >> Driver should make sure phys port name wont collide,
>> > > >> >So when mdev creation is initiated, mdev core calculates the
>> > > >> >alias and if there
>> > > >> is any other mdev with same alias exist, it returns -EEXIST error
>> > > >> before progressing further.
>> > > >> >This way user will get to know upfront in event of collision
>> > > >> >before the mdev
>> > > >> device gets created.
>> > > >> >How about that?
>> > > >>
>> > > >> Sounds fine to me. Now the question is how many chars do we want to
>> have.
>> > > >>
>> > > >12 characters from Alex's suggestion similar to git?
>> > >
>> > > Ok.
>> > >
>> >
>> > Can you please confirm this scheme looks good now? I like to get patches
>> started.
>> 
>> My only concern is your comment that in the event of an abbreviated
>> sha1 collision (as exceptionally rare as that might be at 12-chars), we'd fail the
>> device create, while my original suggestion was that vfio-core would add an
>> extra character to the alias.  For non-networking devices, the sha1 is
>> unnecessary, so the extension behavior seems preferred.  The user is only
>> responsible to provide a unique uuid.  Perhaps the failure behavior could be
>> applied based on the mdev device_api.  A module option on mdev to specify the
>> default number of alias chars would also be useful for testing so that we can set
>> it low enough to validate the collision behavior.  Thanks,
>> 
>
>Idea is to have mdev alias as optional.
>Each mdev_parent says whether it wants mdev_core to generate an alias or not.
>So only networking device drivers would set it to true.
>For rest, alias won't be generated, and won't be compared either during creation time.
>User continue to provide only uuid.
>I am tempted to have alias collision detection only within children mdevs of the same parent, but doing so will always mandate to prefix in netdev name.
>And currently we are left with only 3 characters to prefix it, so that may not be good either.
>Hence, I think mdev core wide alias is better with 12 characters.
>
>I do not understand how an extra character reduces collision, if that's what you meant.

Also, that breaks the naming consistency for different creation order.


>Module options are almost not encouraged anymore with other subsystems/drivers.
>
>For testing collision rate, a sample user space script and sample mtty is easy and get us collision count too.
>We shouldn't put that using module option in production kernel.
>I practically have the code ready to play with; Changing 12 to smaller value is easy with module reload.
>
>#define MDEV_ALIAS_LEN 12
>
>> Alex
>> 
>> > > >> >> in this case that it does
>> > > >> >> not provide 2 same attrs for 2 different ports.
>> > > >> >> Hmm, so the order of creation matters. That is not good.
>> > > >> >>
>> > > >> >> >>
>> > > >> >> >> >I took Ethernet mdev as an example.
>> > > >> >> >> >New prefix 'm' stands for mediated device.
>> > > >> >> >> >Remaining 12 characters are first 12 chars of the mdev alias.
>> > > >> >> >>
>> > > >> >> >> Does this resolve the identification of devlink port representor?
>> > > >> >> >Not sure if I understood your question correctly, attemping
>> > > >> >> >to answer
>> > > >> below.
>> > > >> >> >phys_port_name of devlink port is defined by the first 12
>> > > >> >> >characters of mdev
>> > > >> >> alias.
>> > > >> >> >> I assume you want to use the same 12(or so) chars, don't you?
>> > > >> >> >Mdev's netdev will also use the same mdev alias from the
>> > > >> >> >sysfs to rename
>> > > >> >> netdev name from ethX to enm<mdev_alias>, where en=Etherenet,
>> > > >> m=mdev.
>> > > >> >> >
>> > > >> >> >So yes, same 12 characters are use for mdev's netdev and mdev
>> > > >> >> >devlink port's
>> > > >> >> phys_port_name.
>> > > >> >> >
>> > > >> >> >Is that what are you asking?
>> > > >> >>
>> > > >> >> Yes. Then you have 3 chars to handle the rest of the name (pci, pf)...
>

^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-23 14:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <20190823082820.605deb07@x1.home>



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, August 23, 2019 7:58 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S . Miller
> <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>; Cornelia
> Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> On Fri, 23 Aug 2019 08:14:39 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > Hi Alex,
> >
> >
> > > -----Original Message-----
> > > From: Jiri Pirko <jiri@resnulli.us>
> > > Sent: Friday, August 23, 2019 1:42 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > > Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> <cohuck@redhat.com>;
> > > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > <cjia@nvidia.com>; netdev@vger.kernel.org
> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >
> > > Thu, Aug 22, 2019 at 03:33:30PM CEST, parav@mellanox.com wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Jiri Pirko <jiri@resnulli.us>
> > > >> Sent: Thursday, August 22, 2019 5:50 PM
> > > >> To: Parav Pandit <parav@mellanox.com>
> > > >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > > >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> > > <cohuck@redhat.com>;
> > > >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >>
> > > >> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:
> > > >> >
> > > >> >
> > > >> >> -----Original Message-----
> > > >> >> From: Jiri Pirko <jiri@resnulli.us>
> > > >> >> Sent: Thursday, August 22, 2019 3:28 PM
> > > >> >> To: Parav Pandit <parav@mellanox.com>
> > > >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > > >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > > >> >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> > > >> <cohuck@redhat.com>;
> > > >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >> >>
> > > >> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:
> > > >> >> >
> > > >> >> >
> > > >> >> >> -----Original Message-----
> > > >> >> >> From: Jiri Pirko <jiri@resnulli.us>
> > > >> >> >> Sent: Thursday, August 22, 2019 2:59 PM
> > > >> >> >> To: Parav Pandit <parav@mellanox.com>
> > > >> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri
> > > >> >> >> Pirko <jiri@mellanox.com>; David S . Miller
> > > >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > > >> >> >> <kwankhede@nvidia.com>; Cornelia Huck
> > > >> >> <cohuck@redhat.com>;
> > > >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > > >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > > >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev
> > > >> >> >> core
> > > >> >> >>
> > > >> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com
> wrote:
> > > >> >> >> >
> > > >> >> >> >
> > > >> >> >> >> -----Original Message-----
> > > >> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
> > > >> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
> > > >> >> >> >> To: Parav Pandit <parav@mellanox.com>
> > > >> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > > >> >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > > >> >> >> >> <kwankhede@nvidia.com>; Cornelia Huck
> > > >> >> >> >> <cohuck@redhat.com>; kvm@vger.kernel.org;
> > > >> >> >> >> linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > > >> >> >> >> netdev@vger.kernel.org
> > > >> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and
> > > >> >> >> >> mdev core
> > > >> >> >> >>
> > > >> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
> > > >> >> >> >> > > > > In fact, proposing that the user does not set
> > > >> >> >> >> > > > > it, mdev-core provides one
> > > >> >> >> >> > > automatically.
> > > >> >> >> >> > > > >
> > > >> >> >> >> > > > > > > Since there seems to be some prefix
> > > >> >> >> >> > > > > > > overhead, as I ask about above in how many
> > > >> >> >> >> > > > > > > characters we actually have to work with in
> > > >> >> >> >> > > > > > > IFNAMESZ, maybe we start with
> > > >> >> >> >> > > > > > > 8 characters (matching your "index"
> > > >> >> >> >> > > > > > > namespace) and expand as necessary for
> > > >> >> >> disambiguation.
> > > >> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ,
> > > >> >> >> >> > > > > > > let's start with
> > > >> 12.
> > > >> >> >> >> > > > > > > Thanks,
> > > >> >> >> >> > > > > > >
> > > >> >> >> >> > > > > > If user is going to choose the alias, why does
> > > >> >> >> >> > > > > > it have to be limited to
> > > >> >> >> >> sha1?
> > > >> >> >> >> > > > > > Or you just told it as an example?
> > > >> >> >> >> > > > > >
> > > >> >> >> >> > > > > > It can be an alpha-numeric string.
> > > >> >> >> >> > > > >
> > > >> >> >> >> > > > > No, I'm proposing a different solution where
> > > >> >> >> >> > > > > mdev-core creates an alias based on an
> > > >> >> >> >> > > > > abbreviated sha1.  The user does not provide the
> > > >> >> >> >> alias.
> > > >> >> >> >> > > > >
> > > >> >> >> >> > > > > > Instead of mdev imposing number of characters
> > > >> >> >> >> > > > > > on the alias, it should be best
> > > >> >> >> >> > > > > left to the user.
> > > >> >> >> >> > > > > > Because in future if netdev improves on the
> > > >> >> >> >> > > > > > naming scheme, mdev will be
> > > >> >> >> >> > > > > limiting it, which is not right.
> > > >> >> >> >> > > > > > So not restricting alias size seems right to me.
> > > >> >> >> >> > > > > > User configuring mdev for networking devices
> > > >> >> >> >> > > > > > in a given kernel knows what
> > > >> >> >> >> > > > > user is doing.
> > > >> >> >> >> > > > > > So user can choose alias name size as it finds suitable.
> > > >> >> >> >> > > > >
> > > >> >> >> >> > > > > That's not what I'm proposing, please read again.
> > > >> >> >> >> > > > > Thanks,
> > > >> >> >> >> > > >
> > > >> >> >> >> > > > I understood your point. But mdev doesn't know how
> > > >> >> >> >> > > > user is going to use
> > > >> >> >> >> > > udev/systemd to name the netdev.
> > > >> >> >> >> > > > So even if mdev chose to pick 12 characters, it
> > > >> >> >> >> > > > could result in
> > > >> >> collision.
> > > >> >> >> >> > > > Hence the proposal to provide the alias by the
> > > >> >> >> >> > > > user, as user know the best
> > > >> >> >> >> > > policy for its use case in the environment its using.
> > > >> >> >> >> > > > So 12 character sha1 method will still work by user.
> > > >> >> >> >> > >
> > > >> >> >> >> > > Haven't you already provided examples where certain
> > > >> >> >> >> > > drivers or subsystems have unique netdev prefixes?
> > > >> >> >> >> > > If mdev provides a unique alias within the
> > > >> >> >> >> > > subsystem, couldn't we simply define a netdev prefix
> > > >> >> >> >> > > for the mdev subsystem and avoid all other
> > > >> >> >> >> > > collisions?  I'm not in favor of the user providing
> > > >> >> >> >> > > both a uuid and an alias/instance.  Thanks,
> > > >> >> >> >> > >
> > > >> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1
> > > >> >> >> >> > first 9 characters have
> > > >> >> >> >> collision?
> > > >> >> >> >>
> > > >> >> >> >> I think it would be a mistake to waste so many chars on
> > > >> >> >> >> a prefix, but
> > > >> >> >> >> 9 characters of sha1 likely wouldn't have a collision
> > > >> >> >> >> before we have 10s of thousands of devices.  Thanks,
> > > >> >> >> >>
> > > >> >> >> >> Alex
> > > >> >> >> >
> > > >> >> >> >Jiri, Dave,
> > > >> >> >> >Are you ok with it for devlink/netdev part?
> > > >> >> >> >Mdev core will create an alias from a UUID.
> > > >> >> >> >
> > > >> >> >> >This will be supplied during devlink port attr set such
> > > >> >> >> >as,
> > > >> >> >> >
> > > >> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port,
> > > >> >> >> >const char *mdev_alias);
> > > >> >> >> >
> > > >> >> >> >This alias is used to generate representor netdev's
> phys_port_name.
> > > >> >> >> >This alias from the mdev device's sysfs will be used by
> > > >> >> >> >the udev/systemd to
> > > >> >> >> generate predicable netdev's name.
> > > >> >> >> >Example: enm<mdev_alias_first_12_chars>
> > > >> >> >>
> > > >> >> >> What happens in unlikely case of 2 UUIDs collide?
> > > >> >> >>
> > > >> >> >Since users sees two devices with same phys_port_name, user
> > > >> >> >should destroy
> > > >> >> recently created mdev and recreate mdev with different UUID?
> > > >> >>
> > > >> >> Driver should make sure phys port name wont collide,
> > > >> >So when mdev creation is initiated, mdev core calculates the
> > > >> >alias and if there
> > > >> is any other mdev with same alias exist, it returns -EEXIST error
> > > >> before progressing further.
> > > >> >This way user will get to know upfront in event of collision
> > > >> >before the mdev
> > > >> device gets created.
> > > >> >How about that?
> > > >>
> > > >> Sounds fine to me. Now the question is how many chars do we want to
> have.
> > > >>
> > > >12 characters from Alex's suggestion similar to git?
> > >
> > > Ok.
> > >
> >
> > Can you please confirm this scheme looks good now? I like to get patches
> started.
> 
> My only concern is your comment that in the event of an abbreviated
> sha1 collision (as exceptionally rare as that might be at 12-chars), we'd fail the
> device create, while my original suggestion was that vfio-core would add an
> extra character to the alias.  For non-networking devices, the sha1 is
> unnecessary, so the extension behavior seems preferred.  The user is only
> responsible to provide a unique uuid.  Perhaps the failure behavior could be
> applied based on the mdev device_api.  A module option on mdev to specify the
> default number of alias chars would also be useful for testing so that we can set
> it low enough to validate the collision behavior.  Thanks,
> 

Idea is to have mdev alias as optional.
Each mdev_parent says whether it wants mdev_core to generate an alias or not.
So only networking device drivers would set it to true.
For rest, alias won't be generated, and won't be compared either during creation time.
User continue to provide only uuid.
I am tempted to have alias collision detection only within children mdevs of the same parent, but doing so will always mandate to prefix in netdev name.
And currently we are left with only 3 characters to prefix it, so that may not be good either.
Hence, I think mdev core wide alias is better with 12 characters.

I do not understand how an extra character reduces collision, if that's what you meant.
Module options are almost not encouraged anymore with other subsystems/drivers.

For testing collision rate, a sample user space script and sample mtty is easy and get us collision count too.
We shouldn't put that using module option in production kernel.
I practically have the code ready to play with; Changing 12 to smaller value is easy with module reload.

#define MDEV_ALIAS_LEN 12

> Alex
> 
> > > >> >> in this case that it does
> > > >> >> not provide 2 same attrs for 2 different ports.
> > > >> >> Hmm, so the order of creation matters. That is not good.
> > > >> >>
> > > >> >> >>
> > > >> >> >> >I took Ethernet mdev as an example.
> > > >> >> >> >New prefix 'm' stands for mediated device.
> > > >> >> >> >Remaining 12 characters are first 12 chars of the mdev alias.
> > > >> >> >>
> > > >> >> >> Does this resolve the identification of devlink port representor?
> > > >> >> >Not sure if I understood your question correctly, attemping
> > > >> >> >to answer
> > > >> below.
> > > >> >> >phys_port_name of devlink port is defined by the first 12
> > > >> >> >characters of mdev
> > > >> >> alias.
> > > >> >> >> I assume you want to use the same 12(or so) chars, don't you?
> > > >> >> >Mdev's netdev will also use the same mdev alias from the
> > > >> >> >sysfs to rename
> > > >> >> netdev name from ethX to enm<mdev_alias>, where en=Etherenet,
> > > >> m=mdev.
> > > >> >> >
> > > >> >> >So yes, same 12 characters are use for mdev's netdev and mdev
> > > >> >> >devlink port's
> > > >> >> phys_port_name.
> > > >> >> >
> > > >> >> >Is that what are you asking?
> > > >> >>
> > > >> >> Yes. Then you have 3 chars to handle the rest of the name (pci, pf)...


^ permalink raw reply

* KASAN: use-after-free Read in rxrpc_release_call
From: syzbot @ 2019-08-23 14:42 UTC (permalink / raw)
  To: davem, dhowells, linux-afs, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    fed07ef3 Merge tag 'mlx5-updates-2019-08-21' of git://git...
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1256e22e600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e34a4fe936eac597
dashboard link: https://syzkaller.appspot.com/bug?extid=eed305768ece6682bb7f
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+eed305768ece6682bb7f@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in rxrpc_release_call+0xb2d/0xb60  
net/rxrpc/call_object.c:481
Read of size 8 at addr ffff888062ffeb50 by task syz-executor.5/4764

CPU: 1 PID: 4764 Comm: syz-executor.5 Not tainted 5.3.0-rc5+ #143
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
  kasan_report+0x12/0x17 mm/kasan/common.c:612
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
  rxrpc_release_call+0xb2d/0xb60 net/rxrpc/call_object.c:481
  rxrpc_release_calls_on_socket+0x6e7/0x1320 net/rxrpc/call_object.c:517
  rxrpc_release_sock net/rxrpc/af_rxrpc.c:898 [inline]
  rxrpc_release+0x40c/0x840 net/rxrpc/af_rxrpc.c:930
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459829
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe5ddebec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000459829
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe5ddebf6d4
R13: 00000000004f8f72 R14: 00000000004d1a70 R15: 00000000ffffffff

Allocated by task 4766:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:487 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:460
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:501
  kmem_cache_alloc_trace+0x158/0x790 mm/slab.c:3550
  kmalloc include/linux/slab.h:552 [inline]
  kzalloc include/linux/slab.h:748 [inline]
  rxrpc_alloc_connection+0x86/0x5f0 net/rxrpc/conn_object.c:41
  rxrpc_alloc_client_connection net/rxrpc/conn_client.c:176 [inline]
  rxrpc_get_client_conn net/rxrpc/conn_client.c:339 [inline]
  rxrpc_connect_call+0x648/0x4c00 net/rxrpc/conn_client.c:697
  rxrpc_new_client_call+0x978/0x19d0 net/rxrpc/call_object.c:289
  rxrpc_new_client_call_for_sendmsg net/rxrpc/sendmsg.c:594 [inline]
  rxrpc_do_sendmsg+0xff5/0x1d53 net/rxrpc/sendmsg.c:651
  rxrpc_sendmsg+0x4d6/0x5f0 net/rxrpc/af_rxrpc.c:585
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
  __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
  __do_sys_sendmmsg net/socket.c:2442 [inline]
  __se_sys_sendmmsg net/socket.c:2439 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 16:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:449
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:457
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  rxrpc_destroy_connection+0x1f2/0x2d0 net/rxrpc/conn_object.c:372
  __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
  rcu_do_batch kernel/rcu/tree.c:2114 [inline]
  rcu_core+0x67f/0x1580 kernel/rcu/tree.c:2314
  rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2323
  __do_softirq+0x262/0x98c kernel/softirq.c:292

The buggy address belongs to the object at ffff888062ffe900
  which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 592 bytes inside of
  1024-byte region [ffff888062ffe900, ffff888062ffed00)
The buggy address belongs to the page:
page:ffffea00018bff80 refcount:1 mapcount:0 mapping:ffff8880aa400c40  
index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea000181be88 ffffea0002324108 ffff8880aa400c40
raw: 0000000000000000 ffff888062ffe000 0000000100000007 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff888062ffea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888062ffea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff888062ffeb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                  ^
  ffff888062ffeb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888062ffec00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH net-next v4 0/2] r8152: save EEE
From: Andrew Lunn @ 2019-08-23 14:37 UTC (permalink / raw)
  To: Hayes Wang; +Cc: netdev, nic_swsd, linux-kernel
In-Reply-To: <1394712342-15778-311-Taiwan-albertk@realtek.com>

On Fri, Aug 23, 2019 at 03:33:39PM +0800, Hayes Wang wrote:
> v4:
> For patch #2, remove redundant calling of "ocp_reg_write(tp, OCP_EEE_ADV, 0)".
> 
> v3:
> For patch #2, fix the mistake caused by copying and pasting.
> 
> v2:
> Adjust patch #1. The EEE has been disabled in the beginning of
> r8153_hw_phy_cfg() and r8153b_hw_phy_cfg(), so only check if
> it is necessary to enable EEE.

Hi Hayes

That was 3 revisions of the patches in less than 30 minutes. Slow
down, take your time, review your work yourself before posting it,
etc.

Aim for no more than one revision, posted to the list, per day.

    Andrew

^ permalink raw reply

* WARNING in sk_msg_check_to_free
From: syzbot @ 2019-08-23 14:33 UTC (permalink / raw)
  To: ast, bpf, daniel, davem, john.fastabend, kafai, linux-kernel,
	netdev, songliubraving, syzkaller-bugs, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    fed07ef3 Merge tag 'mlx5-updates-2019-08-21' of git://git...
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=150102bc600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e34a4fe936eac597
dashboard link: https://syzkaller.appspot.com/bug?extid=ea3c54a7b2364123d818
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+ea3c54a7b2364123d818@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 0 PID: 14478 at include/linux/skmsg.h:129  
sk_msg_check_to_free.isra.0.part.0+0x15/0x19 include/linux/skmsg.h:129
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 14478 Comm: syz-executor.0 Not tainted 5.3.0-rc5+ #143
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:219
  __warn.cold+0x20/0x4c kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:sk_msg_check_to_free.isra.0.part.0+0x15/0x19  
include/linux/skmsg.h:129
Code: 77 ff ff ff e8 4e de 03 fc eb 96 4c 89 f7 e8 e4 dd 03 fc eb c3 55 48  
89 e5 e8 f9 b4 c9 fb 48 c7 c7 80 3a 48 88 e8 c1 55 b3 fb <0f> 0b 5d c3 e8  
e4 b4 c9 fb e8 dd ff ff ff 48 8b 45 d0 0f b6 00 41
RSP: 0018:ffff88806381fb98 EFLAGS: 00010286
RAX: 0000000000000024 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815c2456 RDI: ffffed100c703f65
RBP: ffff88806381fb98 R08: 0000000000000024 R09: ffffed1015d060d1
R10: ffffed1015d060d0 R11: ffff8880ae830687 R12: 000000000000000d
R13: ffff8880986c1550 R14: 0000000000000001 R15: 0000000000000007
  sk_msg_check_to_free include/linux/skmsg.h:129 [inline]
  __sk_msg_free.cold+0xa/0x2e net/core/skmsg.c:190
  sk_msg_free+0x44/0x60 net/core/skmsg.c:207
  tls_sw_release_resources_tx+0x268/0x6b0 net/tls/tls_sw.c:2092
  tls_sk_proto_cleanup net/tls/tls_main.c:275 [inline]
  tls_sk_proto_close+0x6a7/0x990 net/tls/tls_main.c:305
  inet_release+0xed/0x200 net/ipv4/af_inet.c:427
  inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x413511
Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48  
83 ec 08 e8 0a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48  
89 c2 e8 53 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffdca8135c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000413511
RDX: 0000001b31920000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000001 R08: 0000000025038832 R09: 0000000025038836
R10: 00007ffdca8136a0 R11: 0000000000000293 R12: 000000000075bf20
R13: 000000000009f412 R14: 0000000000760b20 R15: ffffffffffffffff
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Alex Williamson @ 2019-08-23 14:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866DED407D6F1C653D5D560D1A40@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Fri, 23 Aug 2019 08:14:39 +0000
Parav Pandit <parav@mellanox.com> wrote:

> Hi Alex,
> 
> 
> > -----Original Message-----
> > From: Jiri Pirko <jiri@resnulli.us>
> > Sent: Friday, August 23, 2019 1:42 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > Wankhede <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > Thu, Aug 22, 2019 at 03:33:30PM CEST, parav@mellanox.com wrote:  
> > >
> > >  
> > >> -----Original Message-----
> > >> From: Jiri Pirko <jiri@resnulli.us>
> > >> Sent: Thursday, August 22, 2019 5:50 PM
> > >> To: Parav Pandit <parav@mellanox.com>
> > >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > <cohuck@redhat.com>;  
> > >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >>
> > >> Thu, Aug 22, 2019 at 12:04:02PM CEST, parav@mellanox.com wrote:  
> > >> >
> > >> >  
> > >> >> -----Original Message-----
> > >> >> From: Jiri Pirko <jiri@resnulli.us>
> > >> >> Sent: Thursday, August 22, 2019 3:28 PM
> > >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> > >> >> Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > >> <cohuck@redhat.com>;  
> > >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >> >>
> > >> >> Thu, Aug 22, 2019 at 11:42:13AM CEST, parav@mellanox.com wrote:  
> > >> >> >
> > >> >> >  
> > >> >> >> -----Original Message-----
> > >> >> >> From: Jiri Pirko <jiri@resnulli.us>
> > >> >> >> Sent: Thursday, August 22, 2019 2:59 PM
> > >> >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> >> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> > >> >> >> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > >> >> >> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck  
> > >> >> <cohuck@redhat.com>;  
> > >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >> >> >>
> > >> >> >> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com wrote:  
> > >> >> >> >
> > >> >> >> >  
> > >> >> >> >> -----Original Message-----
> > >> >> >> >> From: Alex Williamson <alex.williamson@redhat.com>
> > >> >> >> >> Sent: Wednesday, August 21, 2019 10:56 AM
> > >> >> >> >> To: Parav Pandit <parav@mellanox.com>
> > >> >> >> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > >> >> >> >> <davem@davemloft.net>; Kirti Wankhede
> > >> >> >> >> <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > >> >> >> >> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia
> > >> >> >> >> <cjia@nvidia.com>; netdev@vger.kernel.org
> > >> >> >> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev
> > >> >> >> >> core
> > >> >> >> >>  
> > >> >> >> >> > > > > Just an example of the alias, not proposing how it's set.
> > >> >> >> >> > > > > In fact, proposing that the user does not set it,
> > >> >> >> >> > > > > mdev-core provides one  
> > >> >> >> >> > > automatically.  
> > >> >> >> >> > > > >  
> > >> >> >> >> > > > > > > Since there seems to be some prefix overhead, as
> > >> >> >> >> > > > > > > I ask about above in how many characters we
> > >> >> >> >> > > > > > > actually have to work with in IFNAMESZ, maybe we
> > >> >> >> >> > > > > > > start with
> > >> >> >> >> > > > > > > 8 characters (matching your "index" namespace)
> > >> >> >> >> > > > > > > and expand as necessary for  
> > >> >> >> disambiguation.  
> > >> >> >> >> > > > > > > If we can eliminate overhead in IFNAMESZ, let's
> > >> >> >> >> > > > > > > start with  
> > >> 12.  
> > >> >> >> >> > > > > > > Thanks,
> > >> >> >> >> > > > > > >  
> > >> >> >> >> > > > > > If user is going to choose the alias, why does it
> > >> >> >> >> > > > > > have to be limited to  
> > >> >> >> >> sha1?  
> > >> >> >> >> > > > > > Or you just told it as an example?
> > >> >> >> >> > > > > >
> > >> >> >> >> > > > > > It can be an alpha-numeric string.  
> > >> >> >> >> > > > >
> > >> >> >> >> > > > > No, I'm proposing a different solution where
> > >> >> >> >> > > > > mdev-core creates an alias based on an abbreviated
> > >> >> >> >> > > > > sha1.  The user does not provide the  
> > >> >> >> >> alias.  
> > >> >> >> >> > > > >  
> > >> >> >> >> > > > > > Instead of mdev imposing number of characters on
> > >> >> >> >> > > > > > the alias, it should be best  
> > >> >> >> >> > > > > left to the user.  
> > >> >> >> >> > > > > > Because in future if netdev improves on the naming
> > >> >> >> >> > > > > > scheme, mdev will be  
> > >> >> >> >> > > > > limiting it, which is not right.  
> > >> >> >> >> > > > > > So not restricting alias size seems right to me.
> > >> >> >> >> > > > > > User configuring mdev for networking devices in a
> > >> >> >> >> > > > > > given kernel knows what  
> > >> >> >> >> > > > > user is doing.  
> > >> >> >> >> > > > > > So user can choose alias name size as it finds suitable.  
> > >> >> >> >> > > > >
> > >> >> >> >> > > > > That's not what I'm proposing, please read again.
> > >> >> >> >> > > > > Thanks,  
> > >> >> >> >> > > >
> > >> >> >> >> > > > I understood your point. But mdev doesn't know how
> > >> >> >> >> > > > user is going to use  
> > >> >> >> >> > > udev/systemd to name the netdev.  
> > >> >> >> >> > > > So even if mdev chose to pick 12 characters, it could
> > >> >> >> >> > > > result in  
> > >> >> collision.  
> > >> >> >> >> > > > Hence the proposal to provide the alias by the user,
> > >> >> >> >> > > > as user know the best  
> > >> >> >> >> > > policy for its use case in the environment its using.  
> > >> >> >> >> > > > So 12 character sha1 method will still work by user.  
> > >> >> >> >> > >
> > >> >> >> >> > > Haven't you already provided examples where certain
> > >> >> >> >> > > drivers or subsystems have unique netdev prefixes?  If
> > >> >> >> >> > > mdev provides a unique alias within the subsystem,
> > >> >> >> >> > > couldn't we simply define a netdev prefix for the mdev
> > >> >> >> >> > > subsystem and avoid all other collisions?  I'm not in
> > >> >> >> >> > > favor of the user providing both a uuid and an
> > >> >> >> >> > > alias/instance.  Thanks,
> > >> >> >> >> > >  
> > >> >> >> >> > For a given prefix, say ens2f0, can two UUID->sha1 first 9
> > >> >> >> >> > characters have  
> > >> >> >> >> collision?
> > >> >> >> >>
> > >> >> >> >> I think it would be a mistake to waste so many chars on a
> > >> >> >> >> prefix, but
> > >> >> >> >> 9 characters of sha1 likely wouldn't have a collision before
> > >> >> >> >> we have 10s of thousands of devices.  Thanks,
> > >> >> >> >>
> > >> >> >> >> Alex  
> > >> >> >> >
> > >> >> >> >Jiri, Dave,
> > >> >> >> >Are you ok with it for devlink/netdev part?
> > >> >> >> >Mdev core will create an alias from a UUID.
> > >> >> >> >
> > >> >> >> >This will be supplied during devlink port attr set such as,
> > >> >> >> >
> > >> >> >> >devlink_port_attrs_mdev_set(struct devlink_port *port, const
> > >> >> >> >char *mdev_alias);
> > >> >> >> >
> > >> >> >> >This alias is used to generate representor netdev's phys_port_name.
> > >> >> >> >This alias from the mdev device's sysfs will be used by the
> > >> >> >> >udev/systemd to  
> > >> >> >> generate predicable netdev's name.  
> > >> >> >> >Example: enm<mdev_alias_first_12_chars>  
> > >> >> >>
> > >> >> >> What happens in unlikely case of 2 UUIDs collide?
> > >> >> >>  
> > >> >> >Since users sees two devices with same phys_port_name, user
> > >> >> >should destroy  
> > >> >> recently created mdev and recreate mdev with different UUID?
> > >> >>
> > >> >> Driver should make sure phys port name wont collide,  
> > >> >So when mdev creation is initiated, mdev core calculates the alias
> > >> >and if there  
> > >> is any other mdev with same alias exist, it returns -EEXIST error
> > >> before progressing further.  
> > >> >This way user will get to know upfront in event of collision before
> > >> >the mdev  
> > >> device gets created.  
> > >> >How about that?  
> > >>
> > >> Sounds fine to me. Now the question is how many chars do we want to have.
> > >>  
> > >12 characters from Alex's suggestion similar to git?  
> > 
> > Ok.
> >   
> 
> Can you please confirm this scheme looks good now? I like to get patches started.

My only concern is your comment that in the event of an abbreviated
sha1 collision (as exceptionally rare as that might be at 12-chars),
we'd fail the device create, while my original suggestion was that
vfio-core would add an extra character to the alias.  For
non-networking devices, the sha1 is unnecessary, so the extension
behavior seems preferred.  The user is only responsible to provide a
unique uuid.  Perhaps the failure behavior could be applied based on
the mdev device_api.  A module option on mdev to specify the default
number of alias chars would also be useful for testing so that we can
set it low enough to validate the collision behavior.  Thanks,

Alex

> > >> >> in this case that it does
> > >> >> not provide 2 same attrs for 2 different ports.
> > >> >> Hmm, so the order of creation matters. That is not good.
> > >> >>  
> > >> >> >>  
> > >> >> >> >I took Ethernet mdev as an example.
> > >> >> >> >New prefix 'm' stands for mediated device.
> > >> >> >> >Remaining 12 characters are first 12 chars of the mdev alias.  
> > >> >> >>
> > >> >> >> Does this resolve the identification of devlink port representor?  
> > >> >> >Not sure if I understood your question correctly, attemping to
> > >> >> >answer  
> > >> below.  
> > >> >> >phys_port_name of devlink port is defined by the first 12
> > >> >> >characters of mdev  
> > >> >> alias.  
> > >> >> >> I assume you want to use the same 12(or so) chars, don't you?  
> > >> >> >Mdev's netdev will also use the same mdev alias from the sysfs to
> > >> >> >rename  
> > >> >> netdev name from ethX to enm<mdev_alias>, where en=Etherenet,  
> > >> m=mdev.  
> > >> >> >
> > >> >> >So yes, same 12 characters are use for mdev's netdev and mdev
> > >> >> >devlink port's  
> > >> >> phys_port_name.  
> > >> >> >
> > >> >> >Is that what are you asking?  
> > >> >>
> > >> >> Yes. Then you have 3 chars to handle the rest of the name (pci, pf)...  


^ permalink raw reply

* Re: [PATCH net-next] bnxt_en: Fix allocation of zero statistics block size regression.
From: Jonathan Lemon @ 2019-08-23 14:24 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev, kernel-team
In-Reply-To: <1566539501-5884-1-git-send-email-michael.chan@broadcom.com>



On 22 Aug 2019, at 22:51, Michael Chan wrote:

> Recent commit added logic to determine the appropriate statistics 
> block
> size to allocate and the size is stored in bp->hw_ring_stats_size.  
> But
> if the firmware spec is older than 1.6.0, it is 0 and not initialized.
> This causes the allocation to fail with size 0 and bnxt_open() to
> abort.  Fix it by always initializing bp->hw_ring_stats_size to the
> legacy default size value.
>
> Fixes: 4e7485066373 ("bnxt_en: Allocate the larger per-ring statistics 
> block for 57500 chips.")
> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

Thanks, Michael!
-- 
Jonathan

^ permalink raw reply

* Re: [PATCH] ethernet: Delete unnecessary checks before the macro call “dev_kfree_skb”
From: Christophe JAILLET @ 2019-08-23 14:08 UTC (permalink / raw)
  To: Markus Elfring, netdev, linux-arm-kernel, linux-stm32,
	intel-wired-lan, bcm-kernel-feedback-list, UNGLinuxDriver,
	Alexandre Torgue, Alexios Zavras, Allison Randal, Bryan Whitehead,
	Claudiu Manoil, David S. Miller, Doug Berger, Douglas Miller,
	Florian Fainelli, Giuseppe Cavallaro, Greg Kroah-Hartman,
	Jeff Kirsher, Jilayne Lovejoy, Jonathan Lemon, Jose Abreu,
	Kate Stewart
  Cc: kernel-janitors, LKML
In-Reply-To: <af1ae1cf-4a01-5e3a-edc2-058668487137@web.de>

Hi,

in this patch, there is one piece that looked better before. (see below)

Removing the 'if (skb)' is fine, but concatening everything in one 
statement just to save 2 variables and a few LOC is of no use, IMHO, and 
the code is less readable.

just my 2c.


CJ


diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index d3a0b614dbfa..8b19ddcdafaa 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -2515,19 +2515,14 @@ static int bcmgenet_dma_teardown(struct 
bcmgenet_priv *priv)
  static void bcmgenet_fini_dma(struct bcmgenet_priv *priv)
  {
      struct netdev_queue *txq;
-    struct sk_buff *skb;
-    struct enet_cb *cb;
      int i;

      bcmgenet_fini_rx_napi(priv);
      bcmgenet_fini_tx_napi(priv);

-    for (i = 0; i < priv->num_tx_bds; i++) {
-        cb = priv->tx_cbs + i;
-        skb = bcmgenet_free_tx_cb(&priv->pdev->dev, cb);
-        if (skb)
-            dev_kfree_skb(skb);
-    }
+    for (i = 0; i < priv->num_tx_bds; i++)
+ dev_kfree_skb(bcmgenet_free_tx_cb(&priv->pdev->dev,
+                          priv->tx_cbs + i));

      for (i = 0; i < priv->hw_params->tx_queues; i++) {
          txq = netdev_get_tx_queue(priv->dev, priv->tx_rings[i].queue);

^ permalink raw reply related

* [PATCH net-next] net/rds: Whitelist rdma_cookie and rx_tstamp for usercopy
From: Dag Moxnes @ 2019-08-23 14:03 UTC (permalink / raw)
  To: santosh.shilimkar, netdev, linux-rdma, rds-devel; +Cc: davem, dag.moxnes

Add the RDMA cookie and RX timestamp to the usercopy whitelist.

After the introduction of hardened usercopy whitelisting
(https://lwn.net/Articles/727322/), a warning is displayed when the
RDMA cookie or RX timestamp is copied to userspace:

kernel: WARNING: CPU: 3 PID: 5750 at
mm/usercopy.c:81 usercopy_warn+0x8e/0xa6
[...]
kernel: Call Trace:
kernel: __check_heap_object+0xb8/0x11b
kernel: __check_object_size+0xe3/0x1bc
kernel: put_cmsg+0x95/0x115
kernel: rds_recvmsg+0x43d/0x620 [rds]
kernel: sock_recvmsg+0x43/0x4a
kernel: ___sys_recvmsg+0xda/0x1e6
kernel: ? __handle_mm_fault+0xcae/0xf79
kernel: __sys_recvmsg+0x51/0x8a
kernel: SyS_recvmsg+0x12/0x1c
kernel: do_syscall_64+0x79/0x1ae

When the whitelisting feature was introduced, the memory for the RDMA
cookie and RX timestamp in RDS was not added to the whitelist, causing
the warning above.

Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Tested-by: jenny.x.xu@oracle.com
---
 net/rds/ib_recv.c | 11 ++++++++---
 net/rds/rds.h     |  9 +++++++--
 net/rds/recv.c    | 22 ++++++++++++----------
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 3cae88cbda..fecd0abdc7 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -1038,9 +1038,14 @@ int rds_ib_recv_init(void)
 	si_meminfo(&si);
 	rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE;
 
-	rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming",
-					sizeof(struct rds_ib_incoming),
-					0, SLAB_HWCACHE_ALIGN, NULL);
+	rds_ib_incoming_slab =
+		kmem_cache_create_usercopy("rds_ib_incoming",
+					   sizeof(struct rds_ib_incoming),
+					   0, SLAB_HWCACHE_ALIGN,
+					   offsetof(struct rds_ib_incoming,
+						    ii_inc.i_usercopy),
+					   sizeof(struct rds_inc_usercopy),
+					   NULL);
 	if (!rds_ib_incoming_slab)
 		goto out;
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index f0066d1684..e792a67dd5 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -271,6 +271,12 @@ struct rds_ext_header_rdma_dest {
 #define	RDS_MSG_RX_END		2
 #define	RDS_MSG_RX_CMSG		3
 
+/* The following values are whitelisted for usercopy */
+struct rds_inc_usercopy {
+	rds_rdma_cookie_t	rdma_cookie;
+	ktime_t			rx_tstamp;
+};
+
 struct rds_incoming {
 	refcount_t		i_refcount;
 	struct list_head	i_item;
@@ -280,8 +286,7 @@ struct rds_incoming {
 	unsigned long		i_rx_jiffies;
 	struct in6_addr		i_saddr;
 
-	rds_rdma_cookie_t	i_rdma_cookie;
-	ktime_t			i_rx_tstamp;
+	struct rds_inc_usercopy i_usercopy;
 	u64			i_rx_lat_trace[RDS_RX_MAX_TRACES];
 };
 
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 853de48760..7e451c8259 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -47,8 +47,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 	INIT_LIST_HEAD(&inc->i_item);
 	inc->i_conn = conn;
 	inc->i_saddr = *saddr;
-	inc->i_rdma_cookie = 0;
-	inc->i_rx_tstamp = ktime_set(0, 0);
+	inc->i_usercopy.rdma_cookie = 0;
+	inc->i_usercopy.rx_tstamp = ktime_set(0, 0);
 
 	memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
@@ -62,8 +62,8 @@ void rds_inc_path_init(struct rds_incoming *inc, struct rds_conn_path *cp,
 	inc->i_conn = cp->cp_conn;
 	inc->i_conn_path = cp;
 	inc->i_saddr = *saddr;
-	inc->i_rdma_cookie = 0;
-	inc->i_rx_tstamp = ktime_set(0, 0);
+	inc->i_usercopy.rdma_cookie = 0;
+	inc->i_usercopy.rx_tstamp = ktime_set(0, 0);
 }
 EXPORT_SYMBOL_GPL(rds_inc_path_init);
 
@@ -186,7 +186,7 @@ static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock
 		case RDS_EXTHDR_RDMA_DEST:
 			/* We ignore the size for now. We could stash it
 			 * somewhere and use it for error checking. */
-			inc->i_rdma_cookie = rds_rdma_make_cookie(
+			inc->i_usercopy.rdma_cookie = rds_rdma_make_cookie(
 					be32_to_cpu(buffer.rdma_dest.h_rdma_rkey),
 					be32_to_cpu(buffer.rdma_dest.h_rdma_offset));
 
@@ -380,7 +380,7 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 				      be32_to_cpu(inc->i_hdr.h_len),
 				      inc->i_hdr.h_dport);
 		if (sock_flag(sk, SOCK_RCVTSTAMP))
-			inc->i_rx_tstamp = ktime_get_real();
+			inc->i_usercopy.rx_tstamp = ktime_get_real();
 		rds_inc_addref(inc);
 		inc->i_rx_lat_trace[RDS_MSG_RX_END] = local_clock();
 		list_add_tail(&inc->i_item, &rs->rs_recv_queue);
@@ -540,16 +540,18 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
 {
 	int ret = 0;
 
-	if (inc->i_rdma_cookie) {
+	if (inc->i_usercopy.rdma_cookie) {
 		ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST,
-				sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie);
+				sizeof(inc->i_usercopy.rdma_cookie),
+				&inc->i_usercopy.rdma_cookie);
 		if (ret)
 			goto out;
 	}
 
-	if ((inc->i_rx_tstamp != 0) &&
+	if ((inc->i_usercopy.rx_tstamp != 0) &&
 	    sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
-		struct __kernel_old_timeval tv = ns_to_kernel_old_timeval(inc->i_rx_tstamp);
+		struct __kernel_old_timeval tv =
+			ns_to_kernel_old_timeval(inc->i_usercopy.rx_tstamp);
 
 		if (!sock_flag(rds_rs_to_sk(rs), SOCK_TSTAMP_NEW)) {
 			ret = put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_OLD,
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v3 2/3] net: ethernet: mediatek: Re-add support SGMII
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

* Re-add SGMII support but now with PHYLINK API support
  So the SGMII changes are more clear
* Move SGMII block setup from mtk_gmac_sgmii_path_setup() to
  mtk_mac_config()
* Merge mtk_setup_hw_path() into mtk_mac_config()
* Remove mediatek,physpeed property, fixed-link supports now any speed so
  speed = <2500>; is now valid with PHYLINK.
* Demagic SGMII register values
* Use phylink state to setup fixed-link mode

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* Redo validate(), it was totally wrong. Noticed by Russell King.
v1->v2:
* SGMII port only support SGMII at 1Gbit, 1000BASE-X and 2500BASE-X.
  Also SGMII mode only does auto-negotiation.
* Change validate() to support mt76x8 changes.
---
 drivers/net/ethernet/mediatek/mtk_eth_path.c |  75 +--------
 drivers/net/ethernet/mediatek/mtk_eth_soc.c  | 157 ++++++++++++++++---
 drivers/net/ethernet/mediatek/mtk_eth_soc.h  |  37 ++++-
 drivers/net/ethernet/mediatek/mtk_sgmii.c    |  65 +++++---
 4 files changed, 219 insertions(+), 115 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_path.c b/drivers/net/ethernet/mediatek/mtk_eth_path.c
index 28960e4c4e43..ef11cf3d1ccc 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_path.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_path.c
@@ -239,10 +239,9 @@ static int mtk_eth_mux_setup(struct mtk_eth *eth, int path)
 	return err;
 }
 
-static int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
 {
-	unsigned int val = 0;
-	int sid, err, path;
+	int err, path;
 
 	path = (mac_id == 0) ?  MTK_ETH_PATH_GMAC1_SGMII :
 				MTK_ETH_PATH_GMAC2_SGMII;
@@ -252,33 +251,10 @@ static int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id)
 	if (err)
 		return err;
 
-	/* The path GMAC to SGMII will be enabled once the SGMIISYS is being
-	 * setup done.
-	 */
-	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-
-	regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
-			   SYSCFG0_SGMII_MASK, ~(u32)SYSCFG0_SGMII_MASK);
-
-	/* Decide how GMAC and SGMIISYS be mapped */
-	sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ? 0 : mac_id;
-
-	/* Setup SGMIISYS with the determined property */
-	if (MTK_HAS_FLAGS(eth->sgmii->flags[sid], MTK_SGMII_PHYSPEED_AN))
-		err = mtk_sgmii_setup_mode_an(eth->sgmii, sid);
-	else
-		err = mtk_sgmii_setup_mode_force(eth->sgmii, sid);
-
-	if (err)
-		return err;
-
-	regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
-			   SYSCFG0_SGMII_MASK, val);
-
 	return 0;
 }
 
-static int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
 {
 	int err, path = 0;
 
@@ -296,7 +272,7 @@ static int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id)
 	return 0;
 }
 
-static int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
+int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
 {
 	int err, path;
 
@@ -311,46 +287,3 @@ static int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id)
 	return 0;
 }
 
-int mtk_setup_hw_path(struct mtk_eth *eth, int mac_id, int phymode)
-{
-	int err;
-
-	/* No mux'ing for MT7628/88 */
-	if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628))
-		return 0;
-
-	switch (phymode) {
-	case PHY_INTERFACE_MODE_TRGMII:
-	case PHY_INTERFACE_MODE_RGMII_TXID:
-	case PHY_INTERFACE_MODE_RGMII_RXID:
-	case PHY_INTERFACE_MODE_RGMII_ID:
-	case PHY_INTERFACE_MODE_RGMII:
-	case PHY_INTERFACE_MODE_MII:
-	case PHY_INTERFACE_MODE_REVMII:
-	case PHY_INTERFACE_MODE_RMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_RGMII)) {
-			err = mtk_gmac_rgmii_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	case PHY_INTERFACE_MODE_SGMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SGMII)) {
-			err = mtk_gmac_sgmii_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	case PHY_INTERFACE_MODE_GMII:
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_GEPHY)) {
-			err = mtk_gmac_gephy_path_setup(eth, mac_id);
-			if (err)
-				return err;
-		}
-		break;
-	default:
-		break;
-	}
-
-	return 0;
-}
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index a04baad6337c..d471a8d16aa5 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -193,8 +193,8 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 	struct mtk_mac *mac = container_of(config, struct mtk_mac,
 					   phylink_config);
 	struct mtk_eth *eth = mac->hw;
-	u32 mcr_cur, mcr_new;
-	int val, ge_mode = 0;
+	u32 mcr_cur, mcr_new, sid;
+	int val, ge_mode, err;
 
 	/* MT76x8 has no hardware settings between for the MAC */
 	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) &&
@@ -208,29 +208,42 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 					  MTK_GMAC1_TRGMII))
 				goto err_phy;
 			/* fall through */
-		case PHY_INTERFACE_MODE_GMII:
 		case PHY_INTERFACE_MODE_RGMII_TXID:
 		case PHY_INTERFACE_MODE_RGMII_RXID:
 		case PHY_INTERFACE_MODE_RGMII_ID:
 		case PHY_INTERFACE_MODE_RGMII:
-			break;
 		case PHY_INTERFACE_MODE_MII:
-			ge_mode = 1;
-			break;
 		case PHY_INTERFACE_MODE_REVMII:
-			ge_mode = 2;
-			break;
 		case PHY_INTERFACE_MODE_RMII:
-			if (mac->id)
-				goto err_phy;
-			ge_mode = 3;
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_RGMII)) {
+				err = mtk_gmac_rgmii_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
+			break;
+		case PHY_INTERFACE_MODE_1000BASEX:
+		case PHY_INTERFACE_MODE_2500BASEX:
+		case PHY_INTERFACE_MODE_SGMII:
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_SGMII)) {
+				err = mtk_gmac_sgmii_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
+			break;
+		case PHY_INTERFACE_MODE_GMII:
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_GEPHY)) {
+				err = mtk_gmac_gephy_path_setup(eth, mac->id);
+				if (err)
+					goto init_err;
+			}
 			break;
 		default:
 			goto err_phy;
 		}
 
 		/* Setup clock for 1st gmac */
-		if (!mac->id &&
+		if (!mac->id && state->interface != PHY_INTERFACE_MODE_SGMII &&
+		    !phy_interface_mode_is_8023z(state->interface) &&
 		    MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII)) {
 			if (MTK_HAS_CAPS(mac->hw->soc->caps,
 					 MTK_TRGMII_MT7621_CLK)) {
@@ -245,6 +258,23 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 			}
 		}
 
+		ge_mode = 0;
+		switch (state->interface) {
+		case PHY_INTERFACE_MODE_MII:
+			ge_mode = 1;
+			break;
+		case PHY_INTERFACE_MODE_REVMII:
+			ge_mode = 2;
+			break;
+		case PHY_INTERFACE_MODE_RMII:
+			if (mac->id)
+				goto err_phy;
+			ge_mode = 3;
+			break;
+		default:
+			break;
+		}
+
 		/* put the gmac into the right mode */
 		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
 		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
@@ -254,6 +284,40 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 		mac->interface = state->interface;
 	}
 
+	/* SGMII */
+	if (state->interface == PHY_INTERFACE_MODE_SGMII ||
+	    phy_interface_mode_is_8023z(state->interface)) {
+		/* The path GMAC to SGMII will be enabled once the SGMIISYS is
+		 * being setup done.
+		 */
+		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+
+		regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
+				   SYSCFG0_SGMII_MASK,
+				   ~(u32)SYSCFG0_SGMII_MASK);
+
+		/* Decide how GMAC and SGMIISYS be mapped */
+		sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ?
+		       0 : mac->id;
+
+		/* Setup SGMIISYS with the determined property */
+		if (state->interface != PHY_INTERFACE_MODE_SGMII)
+			err = mtk_sgmii_setup_mode_force(eth->sgmii, sid,
+							 state);
+		else if (phylink_autoneg_inband(mode))
+			err = mtk_sgmii_setup_mode_an(eth->sgmii, sid);
+
+		if (err)
+			goto init_err;
+
+		regmap_update_bits(eth->ethsys, ETHSYS_SYSCFG0,
+				   SYSCFG0_SGMII_MASK, val);
+	} else if (phylink_autoneg_inband(mode)) {
+		dev_err(eth->dev,
+			"In-band mode not supported in non SGMII mode!\n");
+		return;
+	}
+
 	/* Setup gmac */
 	mcr_cur = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 	mcr_new = mcr_cur;
@@ -264,6 +328,7 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 		   MAC_MCR_BACKOFF_EN | MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_LINK;
 
 	switch (state->speed) {
+	case SPEED_2500:
 	case SPEED_1000:
 		mcr_new |= MAC_MCR_SPEED_1000;
 		break;
@@ -288,6 +353,11 @@ static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
 err_phy:
 	dev_err(eth->dev, "%s: GMAC%d mode %s not supported!\n", __func__,
 		mac->id, phy_modes(state->interface));
+	return;
+
+init_err:
+	dev_err(eth->dev, "%s: GMAC%d mode %s err: %d!\n", __func__,
+		mac->id, phy_modes(state->interface), err);
 }
 
 static int mtk_mac_link_state(struct phylink_config *config,
@@ -326,7 +396,10 @@ static int mtk_mac_link_state(struct phylink_config *config,
 
 static void mtk_mac_an_restart(struct phylink_config *config)
 {
-	/* Do nothing */
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+
+	mtk_sgmii_restart_an(mac->hw, mac->id);
 }
 
 static void mtk_mac_link_down(struct phylink_config *config, unsigned int mode,
@@ -366,7 +439,10 @@ static void mtk_validate(struct phylink_config *config,
 	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
 	      phy_interface_mode_is_rgmii(state->interface)) &&
 	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
-	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
+	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII) &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_SGMII) &&
+	      (state->interface == PHY_INTERFACE_MODE_SGMII ||
+	       phy_interface_mode_is_8023z(state->interface)))) {
 		linkmode_zero(supported);
 		return;
 	}
@@ -374,18 +450,58 @@ static void mtk_validate(struct phylink_config *config,
 	phylink_set_port_modes(mask);
 	phylink_set(mask, Autoneg);
 
-	if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_SGMII:
+		phylink_set(mask, 10baseT_Half);
+		phylink_set(mask, 10baseT_Full);
+		phylink_set(mask, 100baseT_Half);
+		phylink_set(mask, 100baseT_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_TRGMII:
 		phylink_set(mask, 1000baseT_Full);
-	} else {
+		break;
+	case PHY_INTERFACE_MODE_2500BASEX:
+		phylink_set(mask, 2500baseX_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_1000BASEX:
+		phylink_set(mask, 1000baseX_Full);
+		break;
+	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+		phylink_set(mask, 1000baseX_Full);
+		/* fall through */
+	case PHY_INTERFACE_MODE_GMII:
+		phylink_set(mask, 1000baseT_Full);
+		phylink_set(mask, 1000baseT_Half);
+		/* fall through */
+	case PHY_INTERFACE_MODE_MII:
+	case PHY_INTERFACE_MODE_RMII:
+	case PHY_INTERFACE_MODE_REVMII:
+	case PHY_INTERFACE_MODE_NA:
+	default:
 		phylink_set(mask, 10baseT_Half);
 		phylink_set(mask, 10baseT_Full);
 		phylink_set(mask, 100baseT_Half);
 		phylink_set(mask, 100baseT_Full);
+		break;
+	}
 
-		if (state->interface != PHY_INTERFACE_MODE_MII) {
-			phylink_set(mask, 1000baseT_Half);
+	if (state->interface == PHY_INTERFACE_MODE_NA) {
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_SGMII)) {
 			phylink_set(mask, 1000baseT_Full);
 			phylink_set(mask, 1000baseX_Full);
+			phylink_set(mask, 2500baseX_Full);
+		}
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII)) {
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseT_Half);
+			phylink_set(mask, 1000baseX_Full);
+		}
+		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GEPHY)) {
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseT_Half);
 		}
 	}
 
@@ -394,6 +510,11 @@ static void mtk_validate(struct phylink_config *config,
 
 	linkmode_and(supported, supported, mask);
 	linkmode_and(state->advertising, state->advertising, mask);
+
+	/* We can only operate at 2500BaseX or 1000BaseX. If requested
+	 * to advertise both, only report advertising at 2500BaseX.
+	 */
+	phylink_helper_basex_speed(state);
 }
 
 static const struct phylink_mac_ops mtk_phylink_ops = {
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 7f5f541daad7..76bd12cb8150 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -412,14 +412,38 @@
 /* Register to auto-negotiation restart */
 #define SGMSYS_PCS_CONTROL_1	0x0
 #define SGMII_AN_RESTART	BIT(9)
+#define SGMII_ISOLATE		BIT(10)
+#define SGMII_AN_ENABLE		BIT(12)
+#define SGMII_LINK_STATYS	BIT(18)
+#define SGMII_AN_ABILITY	BIT(19)
+#define SGMII_AN_COMPLETE	BIT(21)
+#define SGMII_PCS_FAULT		BIT(23)
+#define SGMII_AN_EXPANSION_CLR	BIT(30)
 
 /* Register to programmable link timer, the unit in 2 * 8ns */
 #define SGMSYS_PCS_LINK_TIMER	0x18
 #define SGMII_LINK_TIMER_DEFAULT	(0x186a0 & GENMASK(19, 0))
 
 /* Register to control remote fault */
-#define SGMSYS_SGMII_MODE	0x20
-#define SGMII_REMOTE_FAULT_DIS	BIT(8)
+#define SGMSYS_SGMII_MODE		0x20
+#define SGMII_IF_MODE_BIT0		BIT(0)
+#define SGMII_SPEED_DUPLEX_AN		BIT(1)
+#define SGMII_SPEED_10			0x0
+#define SGMII_SPEED_100			BIT(2)
+#define SGMII_SPEED_1000		BIT(3)
+#define SGMII_DUPLEX_FULL		BIT(4)
+#define SGMII_IF_MODE_BIT5		BIT(5)
+#define SGMII_REMOTE_FAULT_DIS		BIT(8)
+#define SGMII_CODE_SYNC_SET_VAL		BIT(9)
+#define SGMII_CODE_SYNC_SET_EN		BIT(10)
+#define SGMII_SEND_AN_ERROR_EN		BIT(11)
+#define SGMII_IF_MODE_MASK		GENMASK(5, 1)
+
+/* Register to set SGMII speed, ANA RG_ Control Signals III*/
+#define SGMSYS_ANA_RG_CS3	0x2028
+#define RG_PHY_SPEED_MASK	(BIT(2) | BIT(3))
+#define RG_PHY_SPEED_1_25G	0x0
+#define RG_PHY_SPEED_3_125G	BIT(2)
 
 /* Register to power up QPHY */
 #define SGMSYS_QPHY_PWR_STATE_CTRL 0xe8
@@ -897,7 +921,12 @@ u32 mtk_r32(struct mtk_eth *eth, unsigned reg);
 int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *np,
 		   u32 ana_rgc3);
 int mtk_sgmii_setup_mode_an(struct mtk_sgmii *ss, int id);
-int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id);
-int mtk_setup_hw_path(struct mtk_eth *eth, int mac_id, int phymode);
+int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id,
+			       const struct phylink_link_state *state);
+void mtk_sgmii_restart_an(struct mtk_eth *eth, int mac_id);
+
+int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id);
+int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id);
+int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id);
 
 #endif /* MTK_ETH_H */
diff --git a/drivers/net/ethernet/mediatek/mtk_sgmii.c b/drivers/net/ethernet/mediatek/mtk_sgmii.c
index ff509d42d818..4db27dfc7ec1 100644
--- a/drivers/net/ethernet/mediatek/mtk_sgmii.c
+++ b/drivers/net/ethernet/mediatek/mtk_sgmii.c
@@ -16,8 +16,7 @@
 int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *r, u32 ana_rgc3)
 {
 	struct device_node *np;
-	const char *str;
-	int i, err;
+	int i;
 
 	ss->ana_rgc3 = ana_rgc3;
 
@@ -29,19 +28,6 @@ int mtk_sgmii_init(struct mtk_sgmii *ss, struct device_node *r, u32 ana_rgc3)
 		ss->regmap[i] = syscon_node_to_regmap(np);
 		if (IS_ERR(ss->regmap[i]))
 			return PTR_ERR(ss->regmap[i]);
-
-		err = of_property_read_string(np, "mediatek,physpeed", &str);
-		if (err)
-			return err;
-
-		if (!strcmp(str, "2500"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_2500;
-		else if (!strcmp(str, "1000"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_1000;
-		else if (!strcmp(str, "auto"))
-			ss->flags[i] |= MTK_SGMII_PHYSPEED_AN;
-		else
-			return -EINVAL;
 	}
 
 	return 0;
@@ -73,27 +59,45 @@ int mtk_sgmii_setup_mode_an(struct mtk_sgmii *ss, int id)
 	return 0;
 }
 
-int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id)
+int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id,
+			       const struct phylink_link_state *state)
 {
 	unsigned int val;
-	int mode;
 
 	if (!ss->regmap[id])
 		return -EINVAL;
 
 	regmap_read(ss->regmap[id], ss->ana_rgc3, &val);
-	val &= ~GENMASK(3, 2);
-	mode = ss->flags[id] & MTK_SGMII_PHYSPEED_MASK;
-	val |= (mode == MTK_SGMII_PHYSPEED_1000) ? 0 : BIT(2);
+	val &= ~RG_PHY_SPEED_MASK;
+	if (state->interface == PHY_INTERFACE_MODE_2500BASEX)
+		val |= RG_PHY_SPEED_3_125G;
 	regmap_write(ss->regmap[id], ss->ana_rgc3, val);
 
 	/* Disable SGMII AN */
 	regmap_read(ss->regmap[id], SGMSYS_PCS_CONTROL_1, &val);
-	val &= ~BIT(12);
+	val &= ~SGMII_AN_ENABLE;
 	regmap_write(ss->regmap[id], SGMSYS_PCS_CONTROL_1, val);
 
 	/* SGMII force mode setting */
-	val = 0x31120019;
+	regmap_read(ss->regmap[id], SGMSYS_SGMII_MODE, &val);
+	val &= ~SGMII_IF_MODE_MASK;
+
+	switch (state->speed) {
+	case SPEED_10:
+		val |= SGMII_SPEED_10;
+		break;
+	case SPEED_100:
+		val |= SGMII_SPEED_100;
+		break;
+	case SPEED_2500:
+	case SPEED_1000:
+		val |= SGMII_SPEED_1000;
+		break;
+	};
+
+	if (state->duplex == DUPLEX_FULL)
+		val |= SGMII_DUPLEX_FULL;
+
 	regmap_write(ss->regmap[id], SGMSYS_SGMII_MODE, val);
 
 	/* Release PHYA power down state */
@@ -103,3 +107,20 @@ int mtk_sgmii_setup_mode_force(struct mtk_sgmii *ss, int id)
 
 	return 0;
 }
+
+void mtk_sgmii_restart_an(struct mtk_eth *eth, int mac_id)
+{
+	struct mtk_sgmii *ss = eth->sgmii;
+	unsigned int val, sid;
+
+	/* Decide how GMAC and SGMIISYS be mapped */
+	sid = (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_SGMII)) ?
+	       0 : mac_id;
+
+	if (!ss->regmap[sid])
+		return;
+
+	regmap_read(ss->regmap[sid], SGMSYS_PCS_CONTROL_1, &val);
+	val |= SGMII_AN_RESTART;
+	regmap_write(ss->regmap[sid], SGMSYS_PCS_CONTROL_1, val);
+}
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v3 1/3] net: ethernet: mediatek: Add basic PHYLINK support
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

This convert the basics to PHYLINK API.
SGMII support is not in this patch.

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* Make link_down() similar as link_up() suggested by Russell King
v1->v2:
* Also report 1000Base-X support suggested by Russell King
* Reverse christmas on many places suggested by David Miller
* Rebase too pickup the mt76x8 changes.
---
 drivers/net/ethernet/mediatek/Kconfig       |   2 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 424 +++++++++++---------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  31 +-
 3 files changed, 265 insertions(+), 192 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/Kconfig b/drivers/net/ethernet/mediatek/Kconfig
index b76cf2e1c9dc..4968352ba188 100644
--- a/drivers/net/ethernet/mediatek/Kconfig
+++ b/drivers/net/ethernet/mediatek/Kconfig
@@ -9,7 +9,7 @@ if NET_VENDOR_MEDIATEK
 
 config NET_MEDIATEK_SOC
 	tristate "MediaTek SoC Gigabit Ethernet support"
-	select PHYLIB
+	select PHYLINK
 	---help---
 	  This driver supports the gigabit ethernet MACs in the
 	  MediaTek SoC family.
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8ddbb8dcf032..a04baad6337c 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -18,6 +18,7 @@
 #include <linux/tcp.h>
 #include <linux/interrupt.h>
 #include <linux/pinctrl/devinfo.h>
+#include <linux/phylink.h>
 
 #include "mtk_eth_soc.h"
 
@@ -186,168 +187,224 @@ static void mtk_gmac0_rgmii_adjust(struct mtk_eth *eth, int speed)
 	mtk_w32(eth, val, TRGMII_TCK_CTRL);
 }
 
-static void mtk_phy_link_adjust(struct net_device *dev)
+static void mtk_mac_config(struct phylink_config *config, unsigned int mode,
+			   const struct phylink_link_state *state)
 {
-	struct mtk_mac *mac = netdev_priv(dev);
-	u16 lcl_adv = 0, rmt_adv = 0;
-	u8 flowctrl;
-	u32 mcr = MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG |
-		  MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN |
-		  MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN |
-		  MAC_MCR_BACKPR_EN;
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	struct mtk_eth *eth = mac->hw;
+	u32 mcr_cur, mcr_new;
+	int val, ge_mode = 0;
+
+	/* MT76x8 has no hardware settings between for the MAC */
+	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) &&
+	    mac->interface != state->interface) {
+		/* Setup soc pin functions */
+		switch (state->interface) {
+		case PHY_INTERFACE_MODE_TRGMII:
+			if (mac->id)
+				goto err_phy;
+			if (!MTK_HAS_CAPS(mac->hw->soc->caps,
+					  MTK_GMAC1_TRGMII))
+				goto err_phy;
+			/* fall through */
+		case PHY_INTERFACE_MODE_GMII:
+		case PHY_INTERFACE_MODE_RGMII_TXID:
+		case PHY_INTERFACE_MODE_RGMII_RXID:
+		case PHY_INTERFACE_MODE_RGMII_ID:
+		case PHY_INTERFACE_MODE_RGMII:
+			break;
+		case PHY_INTERFACE_MODE_MII:
+			ge_mode = 1;
+			break;
+		case PHY_INTERFACE_MODE_REVMII:
+			ge_mode = 2;
+			break;
+		case PHY_INTERFACE_MODE_RMII:
+			if (mac->id)
+				goto err_phy;
+			ge_mode = 3;
+			break;
+		default:
+			goto err_phy;
+		}
 
-	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
-		return;
+		/* Setup clock for 1st gmac */
+		if (!mac->id &&
+		    MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII)) {
+			if (MTK_HAS_CAPS(mac->hw->soc->caps,
+					 MTK_TRGMII_MT7621_CLK)) {
+				if (mt7621_gmac0_rgmii_adjust(mac->hw,
+							      state->interface))
+					goto err_phy;
+			} else {
+				if (state->interface !=
+				    PHY_INTERFACE_MODE_TRGMII)
+					mtk_gmac0_rgmii_adjust(mac->hw,
+							       state->speed);
+			}
+		}
 
-	switch (dev->phydev->speed) {
+		/* put the gmac into the right mode */
+		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
+		val |= SYSCFG0_GE_MODE(ge_mode, mac->id);
+		regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+
+		mac->interface = state->interface;
+	}
+
+	/* Setup gmac */
+	mcr_cur = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
+	mcr_new = mcr_cur;
+	mcr_new &= ~(MAC_MCR_SPEED_100 | MAC_MCR_SPEED_1000 |
+		     MAC_MCR_FORCE_DPX | MAC_MCR_FORCE_TX_FC |
+		     MAC_MCR_FORCE_RX_FC);
+	mcr_new |= MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG | MAC_MCR_FORCE_MODE |
+		   MAC_MCR_BACKOFF_EN | MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_LINK;
+
+	switch (state->speed) {
 	case SPEED_1000:
-		mcr |= MAC_MCR_SPEED_1000;
+		mcr_new |= MAC_MCR_SPEED_1000;
 		break;
 	case SPEED_100:
-		mcr |= MAC_MCR_SPEED_100;
+		mcr_new |= MAC_MCR_SPEED_100;
 		break;
 	}
-
-	if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_GMAC1_TRGMII) && !mac->id) {
-		if (MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII_MT7621_CLK)) {
-			if (mt7621_gmac0_rgmii_adjust(mac->hw,
-						      dev->phydev->interface))
-				return;
-		} else {
-			if (!mac->trgmii)
-				mtk_gmac0_rgmii_adjust(mac->hw,
-						       dev->phydev->speed);
-		}
+	if (state->duplex == DUPLEX_FULL) {
+		mcr_new |= MAC_MCR_FORCE_DPX;
+		if (state->pause & MLO_PAUSE_TX)
+			mcr_new |= MAC_MCR_FORCE_TX_FC;
+		if (state->pause & MLO_PAUSE_RX)
+			mcr_new |= MAC_MCR_FORCE_RX_FC;
 	}
 
-	if (dev->phydev->link)
-		mcr |= MAC_MCR_FORCE_LINK;
+	/* Only update control register when needed! */
+	if (mcr_new != mcr_cur)
+		mtk_w32(mac->hw, mcr_new, MTK_MAC_MCR(mac->id));
 
-	if (dev->phydev->duplex) {
-		mcr |= MAC_MCR_FORCE_DPX;
+	return;
 
-		if (dev->phydev->pause)
-			rmt_adv = LPA_PAUSE_CAP;
-		if (dev->phydev->asym_pause)
-			rmt_adv |= LPA_PAUSE_ASYM;
+err_phy:
+	dev_err(eth->dev, "%s: GMAC%d mode %s not supported!\n", __func__,
+		mac->id, phy_modes(state->interface));
+}
 
-		lcl_adv = linkmode_adv_to_lcl_adv_t(dev->phydev->advertising);
-		flowctrl = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
+static int mtk_mac_link_state(struct phylink_config *config,
+			      struct phylink_link_state *state)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 pmsr = mtk_r32(mac->hw, MTK_MAC_MSR(mac->id));
 
-		if (flowctrl & FLOW_CTRL_TX)
-			mcr |= MAC_MCR_FORCE_TX_FC;
-		if (flowctrl & FLOW_CTRL_RX)
-			mcr |= MAC_MCR_FORCE_RX_FC;
+	state->link = (pmsr & MAC_MSR_LINK);
+	state->duplex = (pmsr & MAC_MSR_DPX) >> 1;
 
-		netif_dbg(mac->hw, link, dev, "rx pause %s, tx pause %s\n",
-			  flowctrl & FLOW_CTRL_RX ? "enabled" : "disabled",
-			  flowctrl & FLOW_CTRL_TX ? "enabled" : "disabled");
+	switch (pmsr & (MAC_MSR_SPEED_1000 | MAC_MSR_SPEED_100)) {
+	case 0:
+		state->speed = SPEED_10;
+		break;
+	case MAC_MSR_SPEED_100:
+		state->speed = SPEED_100;
+		break;
+	case MAC_MSR_SPEED_1000:
+		state->speed = SPEED_1000;
+		break;
+	default:
+		state->speed = SPEED_UNKNOWN;
+		break;
 	}
 
-	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
+	state->pause &= (MLO_PAUSE_RX | MLO_PAUSE_TX);
+	if (pmsr & MAC_MSR_RX_FC)
+		state->pause |= MLO_PAUSE_RX;
+	if (pmsr & MAC_MSR_TX_FC)
+		state->pause |= MLO_PAUSE_TX;
 
-	if (!of_phy_is_fixed_link(mac->of_node))
-		phy_print_status(dev->phydev);
+	return 1;
 }
 
-static int mtk_phy_connect_node(struct mtk_eth *eth, struct mtk_mac *mac,
-				struct device_node *phy_node)
+static void mtk_mac_an_restart(struct phylink_config *config)
 {
-	struct phy_device *phydev;
-	int phy_mode;
-
-	phy_mode = of_get_phy_mode(phy_node);
-	if (phy_mode < 0) {
-		dev_err(eth->dev, "incorrect phy-mode %d\n", phy_mode);
-		return -EINVAL;
-	}
-
-	phydev = of_phy_connect(eth->netdev[mac->id], phy_node,
-				mtk_phy_link_adjust, 0, phy_mode);
-	if (!phydev) {
-		dev_err(eth->dev, "could not connect to PHY\n");
-		return -ENODEV;
-	}
+	/* Do nothing */
+}
 
-	dev_info(eth->dev,
-		 "connected mac %d to PHY at %s [uid=%08x, driver=%s]\n",
-		 mac->id, phydev_name(phydev), phydev->phy_id,
-		 phydev->drv->name);
+static void mtk_mac_link_down(struct phylink_config *config, unsigned int mode,
+			      phy_interface_t interface)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 
-	return 0;
+	mcr &= (MAC_MCR_TX_EN | MAC_MCR_RX_EN);
+	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
 }
 
-static int mtk_phy_connect(struct net_device *dev)
+static void mtk_mac_link_up(struct phylink_config *config, unsigned int mode,
+			    phy_interface_t interface,
+			    struct phy_device *phy)
 {
-	struct mtk_mac *mac = netdev_priv(dev);
-	struct mtk_eth *eth;
-	struct device_node *np;
-	u32 val;
-	int err;
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
 
-	eth = mac->hw;
-	np = of_parse_phandle(mac->of_node, "phy-handle", 0);
-	if (!np && of_phy_is_fixed_link(mac->of_node))
-		if (!of_phy_register_fixed_link(mac->of_node))
-			np = of_node_get(mac->of_node);
-	if (!np)
-		return -ENODEV;
+	mcr |= MAC_MCR_TX_EN | MAC_MCR_RX_EN;
+	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
+}
 
-	err = mtk_setup_hw_path(eth, mac->id, of_get_phy_mode(np));
-	if (err)
-		goto err_phy;
-
-	mac->ge_mode = 0;
-	switch (of_get_phy_mode(np)) {
-	case PHY_INTERFACE_MODE_TRGMII:
-		mac->trgmii = true;
-	case PHY_INTERFACE_MODE_RGMII_TXID:
-	case PHY_INTERFACE_MODE_RGMII_RXID:
-	case PHY_INTERFACE_MODE_RGMII_ID:
-	case PHY_INTERFACE_MODE_RGMII:
-	case PHY_INTERFACE_MODE_SGMII:
-		break;
-	case PHY_INTERFACE_MODE_MII:
-	case PHY_INTERFACE_MODE_GMII:
-		mac->ge_mode = 1;
-		break;
-	case PHY_INTERFACE_MODE_REVMII:
-		mac->ge_mode = 2;
-		break;
-	case PHY_INTERFACE_MODE_RMII:
-		if (!mac->id)
-			goto err_phy;
-		mac->ge_mode = 3;
-		break;
-	default:
-		goto err_phy;
-	}
+static void mtk_validate(struct phylink_config *config,
+			 unsigned long *supported,
+			 struct phylink_link_state *state)
+{
+	struct mtk_mac *mac = container_of(config, struct mtk_mac,
+					   phylink_config);
+	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
-	/* No MT7628/88 support for now */
-	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628)) {
-		/* put the gmac into the right mode */
-		regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
-		val |= SYSCFG0_GE_MODE(mac->ge_mode, mac->id);
-		regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+	if (state->interface != PHY_INTERFACE_MODE_NA &&
+	    state->interface != PHY_INTERFACE_MODE_MII &&
+	    state->interface != PHY_INTERFACE_MODE_GMII &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
+	      phy_interface_mode_is_rgmii(state->interface)) &&
+	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
+	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
+		linkmode_zero(supported);
+		return;
 	}
 
-	/* couple phydev to net_device */
-	if (mtk_phy_connect_node(eth, mac, np))
-		goto err_phy;
+	phylink_set_port_modes(mask);
+	phylink_set(mask, Autoneg);
 
-	of_node_put(np);
+	if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
+		phylink_set(mask, 1000baseT_Full);
+	} else {
+		phylink_set(mask, 10baseT_Half);
+		phylink_set(mask, 10baseT_Full);
+		phylink_set(mask, 100baseT_Half);
+		phylink_set(mask, 100baseT_Full);
+
+		if (state->interface != PHY_INTERFACE_MODE_MII) {
+			phylink_set(mask, 1000baseT_Half);
+			phylink_set(mask, 1000baseT_Full);
+			phylink_set(mask, 1000baseX_Full);
+		}
+	}
 
-	return 0;
+	phylink_set(mask, Pause);
+	phylink_set(mask, Asym_Pause);
 
-err_phy:
-	if (of_phy_is_fixed_link(mac->of_node))
-		of_phy_deregister_fixed_link(mac->of_node);
-	of_node_put(np);
-	dev_err(eth->dev, "%s: invalid phy\n", __func__);
-	return -EINVAL;
+	linkmode_and(supported, supported, mask);
+	linkmode_and(state->advertising, state->advertising, mask);
 }
 
+static const struct phylink_mac_ops mtk_phylink_ops = {
+	.validate = mtk_validate,
+	.mac_link_state = mtk_mac_link_state,
+	.mac_an_restart = mtk_mac_an_restart,
+	.mac_config = mtk_mac_config,
+	.mac_link_down = mtk_mac_link_down,
+	.mac_link_up = mtk_mac_link_up,
+};
+
 static int mtk_mdio_init(struct mtk_eth *eth)
 {
 	struct device_node *mii_np;
@@ -2013,6 +2070,14 @@ static int mtk_open(struct net_device *dev)
 {
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
+	int err;
+
+	err = phylink_of_phy_connect(mac->phylink, mac->of_node, 0);
+	if (err) {
+		netdev_err(dev, "%s: could not attach PHY: %d\n", __func__,
+			   err);
+		return err;
+	}
 
 	/* we run 2 netdevs on the same dma ring so we only bring it up once */
 	if (!refcount_read(&eth->dma_refcnt)) {
@@ -2030,7 +2095,7 @@ static int mtk_open(struct net_device *dev)
 	else
 		refcount_inc(&eth->dma_refcnt);
 
-	phy_start(dev->phydev);
+	phylink_start(mac->phylink);
 	netif_start_queue(dev);
 	return 0;
 }
@@ -2063,8 +2128,11 @@ static int mtk_stop(struct net_device *dev)
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
 
+	phylink_stop(mac->phylink);
+
 	netif_tx_disable(dev);
-	phy_stop(dev->phydev);
+
+	phylink_disconnect_phy(mac->phylink);
 
 	/* only shutdown DMA if this is the last user */
 	if (!refcount_dec_and_test(&eth->dma_refcnt))
@@ -2159,15 +2227,6 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	ethsys_reset(eth, RSTCTRL_FE);
 	ethsys_reset(eth, RSTCTRL_PPE);
 
-	regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
-	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!eth->mac[i])
-			continue;
-		val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, eth->mac[i]->id);
-		val |= SYSCFG0_GE_MODE(eth->mac[i]->ge_mode, eth->mac[i]->id);
-	}
-	regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
-
 	if (eth->pctl) {
 		/* Set GE2 driving and slew rate */
 		regmap_write(eth->pctl, GPIO_DRV_SEL10, 0xa00);
@@ -2180,11 +2239,11 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	}
 
 	/* Set linkdown as the default for each GMAC. Its own MCR would be set
-	 * up with the more appropriate value when mtk_phy_link_adjust call is
-	 * being invoked.
+	 * up with the more appropriate value when mtk_mac_config call is being
+	 * invoked.
 	 */
 	for (i = 0; i < MTK_MAC_COUNT; i++)
-		mtk_w32(eth, 0, MTK_MAC_MCR(i));
+		mtk_w32(eth, MAC_MCR_FORCE_LINK_DOWN, MTK_MAC_MCR(i));
 
 	/* Indicates CDM to parse the MTK special tag from CPU
 	 * which also is working out for untag packets.
@@ -2212,7 +2271,7 @@ static int mtk_hw_init(struct mtk_eth *eth)
 	mtk_w32(eth, MTK_RX_DONE_INT, MTK_QDMA_INT_GRP2);
 	mtk_w32(eth, 0x21021000, MTK_FE_INT_GRP);
 
-	for (i = 0; i < 2; i++) {
+	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		u32 val = mtk_r32(eth, MTK_GDMA_FWD_CFG(i));
 
 		/* setup the forward port to send frame to PDMA */
@@ -2264,7 +2323,7 @@ static int __init mtk_init(struct net_device *dev)
 			dev->dev_addr);
 	}
 
-	return mtk_phy_connect(dev);
+	return 0;
 }
 
 static void mtk_uninit(struct net_device *dev)
@@ -2272,20 +2331,20 @@ static void mtk_uninit(struct net_device *dev)
 	struct mtk_mac *mac = netdev_priv(dev);
 	struct mtk_eth *eth = mac->hw;
 
-	phy_disconnect(dev->phydev);
-	if (of_phy_is_fixed_link(mac->of_node))
-		of_phy_deregister_fixed_link(mac->of_node);
+	phylink_disconnect_phy(mac->phylink);
 	mtk_tx_irq_disable(eth, ~0);
 	mtk_rx_irq_disable(eth, ~0);
 }
 
 static int mtk_do_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
+	struct mtk_mac *mac = netdev_priv(dev);
+
 	switch (cmd) {
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSMIIREG:
-		return phy_mii_ioctl(dev->phydev, ifr, cmd);
+		return phylink_mii_ioctl(mac->phylink, ifr, cmd);
 	default:
 		break;
 	}
@@ -2326,16 +2385,6 @@ static void mtk_pending_work(struct work_struct *work)
 				     eth->dev->pins->default_state);
 	mtk_hw_init(eth);
 
-	for (i = 0; i < MTK_MAC_COUNT; i++) {
-		if (!eth->mac[i] ||
-		    of_phy_is_fixed_link(eth->mac[i]->of_node))
-			continue;
-		err = phy_init_hw(eth->netdev[i]->phydev);
-		if (err)
-			dev_err(eth->dev, "%s: PHY init failed.\n",
-				eth->netdev[i]->name);
-	}
-
 	/* restart DMA and enable IRQs */
 	for (i = 0; i < MTK_MAC_COUNT; i++) {
 		if (!test_bit(i, &restart))
@@ -2398,9 +2447,7 @@ static int mtk_get_link_ksettings(struct net_device *ndev,
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	phy_ethtool_ksettings_get(ndev->phydev, cmd);
-
-	return 0;
+	return phylink_ethtool_ksettings_get(mac->phylink, cmd);
 }
 
 static int mtk_set_link_ksettings(struct net_device *ndev,
@@ -2411,7 +2458,7 @@ static int mtk_set_link_ksettings(struct net_device *ndev,
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	return phy_ethtool_ksettings_set(ndev->phydev, cmd);
+	return phylink_ethtool_ksettings_set(mac->phylink, cmd);
 }
 
 static void mtk_get_drvinfo(struct net_device *dev,
@@ -2445,22 +2492,10 @@ static int mtk_nway_reset(struct net_device *dev)
 	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
 		return -EBUSY;
 
-	return genphy_restart_aneg(dev->phydev);
-}
+	if (!mac->phylink)
+		return -ENOTSUPP;
 
-static u32 mtk_get_link(struct net_device *dev)
-{
-	struct mtk_mac *mac = netdev_priv(dev);
-	int err;
-
-	if (unlikely(test_bit(MTK_RESETTING, &mac->hw->state)))
-		return -EBUSY;
-
-	err = genphy_update_link(dev->phydev);
-	if (err)
-		return ethtool_op_get_link(dev);
-
-	return dev->phydev->link;
+	return phylink_ethtool_nway_reset(mac->phylink);
 }
 
 static void mtk_get_strings(struct net_device *dev, u32 stringset, u8 *data)
@@ -2580,7 +2615,7 @@ static const struct ethtool_ops mtk_ethtool_ops = {
 	.get_msglevel		= mtk_get_msglevel,
 	.set_msglevel		= mtk_set_msglevel,
 	.nway_reset		= mtk_nway_reset,
-	.get_link		= mtk_get_link,
+	.get_link		= ethtool_op_get_link,
 	.get_strings		= mtk_get_strings,
 	.get_sset_count		= mtk_get_sset_count,
 	.get_ethtool_stats	= mtk_get_ethtool_stats,
@@ -2608,9 +2643,10 @@ static const struct net_device_ops mtk_netdev_ops = {
 
 static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 {
-	struct mtk_mac *mac;
 	const __be32 *_id = of_get_property(np, "reg", NULL);
-	int id, err;
+	struct phylink *phylink;
+	int phy_mode, id, err;
+	struct mtk_mac *mac;
 
 	if (!_id) {
 		dev_err(eth->dev, "missing mac id\n");
@@ -2654,6 +2690,32 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	u64_stats_init(&mac->hw_stats->syncp);
 	mac->hw_stats->reg_offset = id * MTK_STAT_OFFSET;
 
+	/* phylink create */
+	phy_mode = of_get_phy_mode(np);
+	if (phy_mode < 0) {
+		dev_err(eth->dev, "incorrect phy-mode\n");
+		err = -EINVAL;
+		goto free_netdev;
+	}
+
+	/* mac config is not set */
+	mac->interface = PHY_INTERFACE_MODE_NA;
+	mac->mode = MLO_AN_PHY;
+	mac->speed = SPEED_UNKNOWN;
+
+	mac->phylink_config.dev = &eth->netdev[id]->dev;
+	mac->phylink_config.type = PHYLINK_NETDEV;
+
+	phylink = phylink_create(&mac->phylink_config,
+				 of_fwnode_handle(mac->of_node),
+				 phy_mode, &mtk_phylink_ops);
+	if (IS_ERR(phylink)) {
+		err = PTR_ERR(phylink);
+		goto free_netdev;
+	}
+
+	mac->phylink = phylink;
+
 	SET_NETDEV_DEV(eth->netdev[id], eth->dev);
 	eth->netdev[id]->watchdog_timeo = 5 * HZ;
 	eth->netdev[id]->netdev_ops = &mtk_netdev_ops;
@@ -2682,8 +2744,7 @@ static int mtk_probe(struct platform_device *pdev)
 {
 	struct device_node *mac_np;
 	struct mtk_eth *eth;
-	int err;
-	int i;
+	int err, i;
 
 	eth = devm_kzalloc(&pdev->dev, sizeof(*eth), GFP_KERNEL);
 	if (!eth)
@@ -2869,6 +2930,7 @@ static int mtk_probe(struct platform_device *pdev)
 static int mtk_remove(struct platform_device *pdev)
 {
 	struct mtk_eth *eth = platform_get_drvdata(pdev);
+	struct mtk_mac *mac;
 	int i;
 
 	/* stop all devices to make sure that dma is properly shut down */
@@ -2876,6 +2938,8 @@ static int mtk_remove(struct platform_device *pdev)
 		if (!eth->netdev[i])
 			continue;
 		mtk_stop(eth->netdev[i]);
+		mac = netdev_priv(eth->netdev[i]);
+		phylink_disconnect_phy(mac->phylink);
 	}
 
 	mtk_hw_deinit(eth);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index cc1466ae0926..7f5f541daad7 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -14,6 +14,7 @@
 #include <linux/of_net.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
+#include <linux/phylink.h>
 
 #define MTK_QDMA_PAGE_SIZE	2048
 #define	MTK_MAX_RX_LENGTH	1536
@@ -330,12 +331,19 @@
 #define MAC_MCR_SPEED_100	BIT(2)
 #define MAC_MCR_FORCE_DPX	BIT(1)
 #define MAC_MCR_FORCE_LINK	BIT(0)
-#define MAC_MCR_FIXED_LINK	(MAC_MCR_MAX_RX_1536 | MAC_MCR_IPG_CFG | \
-				 MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN | \
-				 MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN | \
-				 MAC_MCR_BACKPR_EN | MAC_MCR_FORCE_RX_FC | \
-				 MAC_MCR_FORCE_TX_FC | MAC_MCR_SPEED_1000 | \
-				 MAC_MCR_FORCE_DPX | MAC_MCR_FORCE_LINK)
+#define MAC_MCR_FORCE_LINK_DOWN	(MAC_MCR_FORCE_MODE)
+
+/* Mac status registers */
+#define MTK_MAC_MSR(x)		(0x10108 + (x * 0x100))
+#define MAC_MSR_EEE1G		BIT(7)
+#define MAC_MSR_EEE100M		BIT(6)
+#define MAC_MSR_RX_FC		BIT(5)
+#define MAC_MSR_TX_FC		BIT(4)
+#define MAC_MSR_SPEED_1000	BIT(3)
+#define MAC_MSR_SPEED_100	BIT(2)
+#define MAC_MSR_SPEED_MASK	(MAC_MSR_SPEED_1000 | MAC_MSR_SPEED_100)
+#define MAC_MSR_DPX		BIT(1)
+#define MAC_MSR_LINK		BIT(0)
 
 /* TRGMII RXC control register */
 #define TRGMII_RCK_CTRL		0x10300
@@ -858,22 +866,23 @@ struct mtk_eth {
 /* struct mtk_mac -	the structure that holds the info about the MACs of the
  *			SoC
  * @id:			The number of the MAC
- * @ge_mode:            Interface mode kept for setup restoring
+ * @interface:		Interface mode kept for detecting change in hw settings
  * @of_node:		Our devicetree node
  * @hw:			Backpointer to our main datastruture
  * @hw_stats:		Packet statistics counter
- * @trgmii		Indicate if the MAC uses TRGMII connected to internal
-			switch
  */
 struct mtk_mac {
 	int				id;
-	int				ge_mode;
+	phy_interface_t			interface;
+	unsigned int			mode;
+	int				speed;
 	struct device_node		*of_node;
+	struct phylink			*phylink;
+	struct phylink_config		phylink_config;
 	struct mtk_eth			*hw;
 	struct mtk_hw_stats		*hw_stats;
 	__be32				hwlro_ip[MTK_MAX_LRO_IP_CNT];
 	int				hwlro_ip_cnt;
-	bool				trgmii;
 };
 
 /* the struct describing the SoC. these are declared in the soc_xyz.c files */
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v3 0/3] net: ethernet: mediatek: convert to PHYLINK
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst

These patches converts mediatek driver to PHYLINK API.

v2->v3:
* Phylink improvements and clean-ups after review
v1->v2:
* Rebase for mt76x8 changes
* Phylink improvements and clean-ups after review
* SGMII port doesn't support 2.5Gbit in SGMII mode only in BASE-X mode.
  Refactor the code.

René van Dorst (3):
  net: ethernet: mediatek: Add basic PHYLINK support
  net: ethernet: mediatek: Re-add support SGMII
  dt-bindings: net: ethernet: Update mt7622 docs and dts to reflect the
    new phylink API

 .../arm/mediatek/mediatek,sgmiisys.txt        |   2 -
 .../dts/mediatek/mt7622-bananapi-bpi-r64.dts  |  28 +-
 arch/arm64/boot/dts/mediatek/mt7622.dtsi      |   1 -
 drivers/net/ethernet/mediatek/Kconfig         |   2 +-
 drivers/net/ethernet/mediatek/mtk_eth_path.c  |  75 +--
 drivers/net/ethernet/mediatek/mtk_eth_soc.c   | 529 ++++++++++++------
 drivers/net/ethernet/mediatek/mtk_eth_soc.h   |  68 ++-
 drivers/net/ethernet/mediatek/mtk_sgmii.c     |  65 ++-
 8 files changed, 477 insertions(+), 293 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH net-next v3 3/3] dt-bindings: net: ethernet: Update mt7622 docs and dts to reflect the new phylink API
From: René van Dorst @ 2019-08-23 13:45 UTC (permalink / raw)
  To: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger
  Cc: netdev, linux-arm-kernel, linux-mediatek, linux-mips,
	Russell King, Frank Wunderlich, Stefan Roese, René van Dorst
In-Reply-To: <20190823134516.27559-1-opensource@vdorst.com>

This patch the removes the recently added mediatek,physpeed property.
Use the fixed-link property speed = <2500> to set the phy in 2.5Gbit.
See mt7622-bananapi-bpi-r64.dts for a working example.

Signed-off-by: René van Dorst <opensource@vdorst.com>
--
v2->v3:
* no change
v1->v2:
* SGMII port only support BASE-X at 2.5Gbit.
---
 .../arm/mediatek/mediatek,sgmiisys.txt        |  2 --
 .../dts/mediatek/mt7622-bananapi-bpi-r64.dts  | 28 +++++++++++++------
 arch/arm64/boot/dts/mediatek/mt7622.dtsi      |  1 -
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
index f5518f26a914..30cb645c0e54 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
@@ -9,8 +9,6 @@ Required Properties:
 	- "mediatek,mt7622-sgmiisys", "syscon"
 	- "mediatek,mt7629-sgmiisys", "syscon"
 - #clock-cells: Must be 1
-- mediatek,physpeed: Should be one of "auto", "1000" or "2500" to match up
-		     the capability of the target PHY.
 
 The SGMIISYS controller uses the common clk binding from
 Documentation/devicetree/bindings/clock/clock-bindings.txt
diff --git a/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts b/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
index 710c5c3d87d3..83e10591e0e5 100644
--- a/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
+++ b/arch/arm64/boot/dts/mediatek/mt7622-bananapi-bpi-r64.dts
@@ -115,24 +115,34 @@
 };
 
 &eth {
-	pinctrl-names = "default";
-	pinctrl-0 = <&eth_pins>;
 	status = "okay";
+	gmac0: mac@0 {
+		compatible = "mediatek,eth-mac";
+		reg = <0>;
+		phy-mode = "2500base-x";
+
+		fixed-link {
+			speed = <2500>;
+			full-duplex;
+			pause;
+		};
+	};
 
 	gmac1: mac@1 {
 		compatible = "mediatek,eth-mac";
 		reg = <1>;
-		phy-handle = <&phy5>;
+		phy-mode = "rgmii";
+
+		fixed-link {
+			speed = <1000>;
+			full-duplex;
+			pause;
+		};
 	};
 
-	mdio-bus {
+	mdio: mdio-bus {
 		#address-cells = <1>;
 		#size-cells = <0>;
-
-		phy5: ethernet-phy@5 {
-			reg = <5>;
-			phy-mode = "sgmii";
-		};
 	};
 };
 
diff --git a/arch/arm64/boot/dts/mediatek/mt7622.dtsi b/arch/arm64/boot/dts/mediatek/mt7622.dtsi
index d1e13d340e26..dac51e98204c 100644
--- a/arch/arm64/boot/dts/mediatek/mt7622.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt7622.dtsi
@@ -931,6 +931,5 @@
 			     "syscon";
 		reg = <0 0x1b128000 0 0x3000>;
 		#clock-cells = <1>;
-		mediatek,physpeed = "2500";
 	};
 };
-- 
2.20.1


^ permalink raw reply related

* [PATCH net] ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
From: Sabrina Dubroca @ 2019-08-23 13:44 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca

Currently, ipv6_find_idev returns NULL when ipv6_add_dev fails,
ignoring the specific error value. This results in addrconf_add_dev
returning ENOBUFS in all cases, which is unfortunate in cases such as:

    # ip link add dummyX type dummy
    # ip link set dummyX mtu 1200 up
    # ip addr add 2000::/64 dev dummyX
    RTNETLINK answers: No buffer space available

Commit a317a2f19da7 ("ipv6: fail early when creating netdev named all
or default") introduced error returns in ipv6_add_dev. Before that,
that function would simply return NULL for all failures.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 net/ipv6/addrconf.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ced995f3fec4..6a576ff92c39 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -478,7 +478,7 @@ static struct inet6_dev *ipv6_find_idev(struct net_device *dev)
 	if (!idev) {
 		idev = ipv6_add_dev(dev);
 		if (IS_ERR(idev))
-			return NULL;
+			return idev;
 	}
 
 	if (dev->flags&IFF_UP)
@@ -2466,8 +2466,8 @@ static struct inet6_dev *addrconf_add_dev(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev)
-		return ERR_PTR(-ENOBUFS);
+	if (IS_ERR(idev))
+		return idev;
 
 	if (idev->cnf.disable_ipv6)
 		return ERR_PTR(-EACCES);
@@ -3159,7 +3159,7 @@ static void init_loopback(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -3374,7 +3374,7 @@ static void addrconf_sit_config(struct net_device *dev)
 	 */
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -3399,7 +3399,7 @@ static void addrconf_gre_config(struct net_device *dev)
 	ASSERT_RTNL();
 
 	idev = ipv6_find_idev(dev);
-	if (!idev) {
+	if (IS_ERR(idev)) {
 		pr_debug("%s: add_dev failed\n", __func__);
 		return;
 	}
@@ -4773,8 +4773,8 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh,
 			 IFA_F_MCAUTOJOIN | IFA_F_OPTIMISTIC;
 
 	idev = ipv6_find_idev(dev);
-	if (!idev)
-		return -ENOBUFS;
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
 
 	if (!ipv6_allow_optimistic_dad(net, idev))
 		cfg.ifa_flags &= ~IFA_F_OPTIMISTIC;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH bpf-next v5 03/11] xsk: add support to allow unaligned chunk placement
From: Laatz, Kevin @ 2019-08-23 13:35 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <3AEEC88E-8D45-41C5-AFBF-51512826B1A7@gmail.com>

On 22/08/2019 19:43, Jonathan Lemon wrote:
> On 21 Aug 2019, at 18:44, Kevin Laatz wrote:
>> Currently, addresses are chunk size aligned. This means, we are very
>> restricted in terms of where we can place chunk within the umem. For
>> example, if we have a chunk size of 2k, then our chunks can only be 
>> placed
>> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>>
>> This patch introduces the ability to use unaligned chunks. With these
>> changes, we are no longer bound to having to place chunks at a 2k (or
>> whatever your chunk size is) interval. Since we are no longer dealing 
>> with
>> aligned chunks, they can now cross page boundaries. Checks for page
>> contiguity have been added in order to keep track of which pages are
>> followed by a physically contiguous page.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>>
>> ---
>> v2:
>>   - Add checks for the flags coming from userspace
>>   - Fix how we get chunk_size in xsk_diag.c
>>   - Add defines for masking the new descriptor format
>>   - Modified the rx functions to use new descriptor format
>>   - Modified the tx functions to use new descriptor format
>>
>> v3:
>>   - Add helper function to do address/offset masking/addition
>>
>> v4:
>>   - fixed page_start calculation in __xsk_rcv_memcpy().
>>   - move offset handling to the xdp_umem_get_* functions
>>   - modified the len field in xdp_umem_reg struct. We now use 16 bits 
>> from
>>     this for the flags field.
>>   - removed next_pg_contig field from xdp_umem_page struct. Using low 12
>>     bits of addr to store flags instead.
>>   - other minor changes based on review comments
>>
>> v5:
>>   - Added accessors for getting addr and offset
>>   - Added helper function to add offset to addr
>>   - Fixed offset handling in xsk_rcv
>>   - Removed bitfields from xdp_umem_reg
>>   - Added struct size checking for xdp_umem_reg in xsk_setsockopt to 
>> handle
>>     different versions of the struct.
>>   - fix conflicts after 'bpf-af-xdp-wakeup' was merged.
>> ---
>>  include/net/xdp_sock.h      | 75 +++++++++++++++++++++++++++--
>>  include/uapi/linux/if_xdp.h |  9 ++++
>>  net/xdp/xdp_umem.c          | 19 ++++++--
>>  net/xdp/xsk.c               | 96 +++++++++++++++++++++++++++++--------
>>  net/xdp/xsk_diag.c          |  2 +-
>>  net/xdp/xsk_queue.h         | 68 ++++++++++++++++++++++----
>>  6 files changed, 232 insertions(+), 37 deletions(-)
>>
>>
[...]

>> @@ -196,17 +221,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
>> xdp_buff *xdp)
>>          goto out_unlock;
>>      }
>>
>> -    if (!xskq_peek_addr(xs->umem->fq, &addr) ||
>> +    if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>>          len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>>          err = -ENOSPC;
>>          goto out_drop;
>>      }
>>
>> -    addr += xs->umem->headroom;
>> -
>> -    buffer = xdp_umem_get_data(xs->umem, addr);
>> +    buffer = xdp_umem_get_data(xs->umem, addr + offset);
>>      memcpy(buffer, xdp->data_meta, len + metalen);
>> -    addr += metalen;
>> +    offset += metalen;
>> +
>> +    addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>>      err = xskq_produce_batch_desc(xs->rx, addr, len);
>>      if (err)
>>          goto out_drop;
>
> Can't just add address and offset any longer.  This should read:
>
>     addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>     buffer = xdp_umem_get_data(xs->umem, addr);
>
>     addr = xsk_umem_adjust_offset(xs->umem, addr, metalen);
>
>
> so that offset and then metalen are added.  (or preserve the
> address across the calls like memcpy_addr earlier).


Will fix this, thanks!

-Kevin


^ permalink raw reply

* Re: [PATCH 0/3] Add NETIF_F_HW_BRIDGE feature
From: Andrew Lunn @ 2019-08-23 13:25 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: Nikolay Aleksandrov, roopa, davem, UNGLinuxDriver,
	alexandre.belloni, allan.nielsen, netdev, linux-kernel, bridge
In-Reply-To: <20190823122657.njk2tcgur2zu74i7@soft-dev3.microsemi.net>

> > > Why do the devices have to be from the same driver ?
> After seeing yours and Andrews comments I realize that I try to do two
> things, but I have only explained one of them.
> 
> Here is what I was trying to do:
> A. Prevent ports from going into promisc mode, if it is not needed.

The switch definition is promisc is a bit odd. You really need to
split it into two use cases.

The Linux interface is not a member of a bridge. In this case, promisc
mode would mean all frames ingressing the port should be forwarded to
the CPU. Without promisc, you can program the hardware to just accept
frames with the interfaces MAC address. So this is just the usual
behaviour of an interface.

When the interface is part of the bridge, then you can turn on all the
learning and not forward frames to the CPU, unless the CPU asks for
them. But you need to watch out for various flags. By default, you
should flood to the CPU, unknown destinations to the CPU etc. But some
of these can be turned off by flags.

> B. Prevent adding the CPU to the flood-mask (in Ocelot we have a
> flood-mask controlling who should be included when flooding due to
> destination unknown).

So destination unknown should be flooded to the CPU. The CPU might
know where to send the frame.

> To solve item "B", the network driver needs to detect if there is a
> foreign interfaces added to the bridge. If that is the case then to add
> the CPU port to the flooding mask otherwise no.

It is not just a foreign interface. What about the MAC address on the
bridge interface?

> > > This is too specific targeting some devices.
> Maybe I was wrong to mention specific HW in the commit message. The
> purpose of the patch was to add an optimization (not to copy all the
> frames to the CPU) for HW that is capable of learning and flooding the
> frames.

To some extent, this is also tied to your hardware not learning MAC
addresses from frames passed from the CPU. You should also consider
fixing this. The SW bridge does send out notifications when it
adds/removes MAC addresses to its tables. You probably want to receive
this modifications, and use them to program your hardware to forward
frames to the CPU when needed.

       Andrew

^ permalink raw reply

* Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-23 13:19 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexander Duyck, Ilya Maximets, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <217e90f5-206a-bb80-6699-f6dd750b57d9@intel.com>

On Thu, Aug 22, 2019 at 11:10 PM Björn Töpel <bjorn.topel@intel.com> wrote:
>
> On 2019-08-22 19:32, William Tu wrote:
> > On Thu, Aug 22, 2019 at 10:21 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >>
> >> On Thu, Aug 22, 2019 at 10:12 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>
> >>> Tx code doesn't clear the descriptors' status after cleaning.
> >>> So, if the budget is larger than number of used elems in a ring, some
> >>> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >>> prod_tail far beyond the prod_head breaking the completion queue ring.
> >>>
> >>> Fix that by limiting the number of descriptors to clean by the number
> >>> of used descriptors in the tx ring.
> >>>
> >>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> >>> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
> >>> 'next_to_clean' and 'next_to_use' indexes.
> >>>
> >>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>> ---
> >>>
> >>> Version 3:
> >>>    * Reverted some refactoring made for v2.
> >>>    * Eliminated 'budget' for tx clean.
> >>>    * prefetch returned.
> >>>
> >>> Version 2:
> >>>    * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
> >>>      'ixgbe_xsk_clean_tx_ring()'.
> >>>
> >>>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 ++++++++------------
> >>>   1 file changed, 11 insertions(+), 18 deletions(-)
> >>
> >> Thanks for addressing my concerns.
> >>
> >> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > Thanks.
> >
> > Tested-by: William Tu <u9012063@gmail.com>
> >
>
> Will, thanks for testing! For this patch, did you notice any performance
> degradation?
>

No noticeable performance degradation seen this time.
Thanks,
William

^ permalink raw reply

* Re: [PATCH net-next v5] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-23 12:59 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <CAA93jw5_LN_-zhHh=zZA8r6Zvv1CvA_AikT_rCgWyT8ytQM_rg@mail.gmail.com>

> 1) Since we're still duking it out over the meaning of the bits - not
> just the SCE thing, but as best as I can
> tell (but could be wrong) the NQB idea wants to put something into the
> l4s fast queue? Or is NQB supposed to
> be a third queue?

We can add support for NQB in the future, by expanding the
dualpi2_skb_classify() function. This is however out of scope at the
moment as NQB is not yet adopted by the TSV WG. I'd guess we may want more
than just the NQB DSCP codepoint in the L queue, which then warrant
another way to classify traffic, e.g., using tc filter hints.

> In those cases, the ecn_mask should just be mask.

That is actually what it is at the moment: a mask on the two ecn bits.

> 2) Is the intent to make the drop probability 0 by default? (10 in the
> pie rfc, not mentioned in the l4s rfc as yet)

I assume you are referring to §5.1 of the PIE RFC, i.e., the switch to
pure drop once the computed marking probability is >10%?

The default for dualpi2 is also to enter a pure-drop mode on overload.
More precisely, we define overload as reaching a marking probability of
100% in the L queue, meaning an internal PI probability of 50% (as it
gets mutiplied by the coupling factor which defaults to 2).
This is equivalent to a PIE probability of 25% (as the classic queue gets a 
squared probability).
This drop mode means that packets in both queues will be subject to
random drops with a PI^2 probability. Additionally, all packets not
dropped in the L queue are CE marked.

We used to have a parameter to configure this overload threshold (IIRC
it was still in the pre-v4 patches), but found no real use for lowering
it, hence its removal.

Note that the drop on overload can be disabled, resulting in increased
latencies in both queues, 100% CE marking in the L queue, and eventually
a taildrop behaviour once the packet limit is reached.

> 3) has this been tested on a hw mq system as yet? (10gigE is typically
> 64 queues)

Yes, in a setup where 1/32/64/128VMs were behind an Intel X540-*, which indeed
has 64 internal queues. The VMs use a mix of long/short cubic/DCTCP connections
towards another server. I could not think about another use-case where a 10G
software switch would prove to be a bottleneck, i.e., where a queue would
happen.
The qdisc is however not optimized for mq systems, could it cause performance 
degradation if the server was severely resource constrained?

Also, ensuring it was able to saturate 10G meant gro was required on the 
hypervisor, thus that the step threshold of dualpi2 has to be increased to 
compensate for those large bursts. Maybe that is where being mq-aware would 
help, i.e., by instantiating one dualpi2 instance per HW queue?
The AQM scheme itself is CPU friendly (lighter than PIE)--i.e., computing the
probability takes <10 arithmetic ops and 5 comparisons once every 16ms, while
enqueue/dequeue can involve ~10 comparisons and at most 2 rng calls)--so
should not increase the load too much issues if it was duplicated.

Best,
Olivier

^ permalink raw reply

* Re: IPv6 routes inserted into the kernel with 'route' end up with invalid type
From: Toke Høiland-Jørgensen @ 2019-08-23 12:59 UTC (permalink / raw)
  To: David Ahern, David S. Miller
  Cc: netdev, bird-users, Tom Bird, Maria Jan Matejka, Ondrej Zajicek
In-Reply-To: <423380ec-b999-d620-9bd6-78c2dabfde99@gmail.com>

David Ahern <dsahern@gmail.com> writes:

> On 8/23/19 8:43 AM, Toke Høiland-Jørgensen wrote:
>> Hi David
>> 
>> Tom noticed[0] that on newer kernels, the Bird routing daemon rejects IPv6
>> routes received from the kernel if those routes were inserted with the
>> old 'route' utility (i.e., when they're inserted through the ioctl
>> interface).
>> 
>> We tracked it down to the routes having an rtm_type of RTN_UNKNOWN, and
>> a bit of git archaeology points suggestively at this commit:
>> 
>> e8478e80e5a ("net/ipv6: Save route type in rt6_info")
>> 
>> The same setup works with older kernels, so this seems like it's a
>> regression, the age of 'route' notwithstanding. Any good ideas for the
>> proper way to fix this?
>> 
>
> Should be fixed by:
>
> commit c7036d97acd2527cef145b5ef9ad1a37ed21bbe6
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Jun 19 10:50:24 2019 -0700
>
>     ipv6: Default fib6_type to RTN_UNICAST when not set
>
>     A user reported that routes are getting installed with type 0
> (RTN_UNSPEC)
>     where before the routes were RTN_UNICAST. One example is from accel-ppp
>     which apparently still uses the ioctl interface and does not set
>     rtmsg_type. Another is the netlink interface where ipv6 does not require
>     rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
>     ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.
>
>     Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
>     Signed-off-by: David Ahern <dsahern@gmail.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>


Ah, great! Guess that hasn't made its way to the stable and distribution
kernels yet. Thanks for the pointer! :)

-Toke

^ permalink raw reply

* Re: IPv6 routes inserted into the kernel with 'route' end up with invalid type
From: David Ahern @ 2019-08-23 12:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David S. Miller
  Cc: netdev, bird-users, Tom Bird, Maria Jan Matejka, Ondrej Zajicek
In-Reply-To: <87mug0m4rp.fsf@toke.dk>

On 8/23/19 8:43 AM, Toke Høiland-Jørgensen wrote:
> Hi David
> 
> Tom noticed[0] that on newer kernels, the Bird routing daemon rejects IPv6
> routes received from the kernel if those routes were inserted with the
> old 'route' utility (i.e., when they're inserted through the ioctl
> interface).
> 
> We tracked it down to the routes having an rtm_type of RTN_UNKNOWN, and
> a bit of git archaeology points suggestively at this commit:
> 
> e8478e80e5a ("net/ipv6: Save route type in rt6_info")
> 
> The same setup works with older kernels, so this seems like it's a
> regression, the age of 'route' notwithstanding. Any good ideas for the
> proper way to fix this?
> 

Should be fixed by:

commit c7036d97acd2527cef145b5ef9ad1a37ed21bbe6
Author: David Ahern <dsahern@gmail.com>
Date:   Wed Jun 19 10:50:24 2019 -0700

    ipv6: Default fib6_type to RTN_UNICAST when not set

    A user reported that routes are getting installed with type 0
(RTN_UNSPEC)
    where before the routes were RTN_UNICAST. One example is from accel-ppp
    which apparently still uses the ioctl interface and does not set
    rtmsg_type. Another is the netlink interface where ipv6 does not require
    rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
    ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.

    Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
    Signed-off-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCHv2 net] ipv6/addrconf: allow adding multicast addr if IFA_F_MCAUTOJOIN is set
From: David Ahern @ 2019-08-23 12:41 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Madhu Challa, David S . Miller, Jianlin Shi
In-Reply-To: <20190822082012.GE18865@dhcp-12-139.nay.redhat.com>

On 8/22/19 4:20 AM, Hangbin Liu wrote:
> On Mon, Aug 19, 2019 at 10:33:58PM -0400, David Ahern wrote:
>> On 8/19/19 10:19 PM, Hangbin Liu wrote:
>>> But in ipv6_add_addr() it will check the address type and reject multicast
>>> address directly. So this feature is never worked for IPv6.
>>
>> If true, that is really disappointing.
>>
>> We need to get a functional test script started for various address cases.
> 
> Do you mean an `ip addr add` testing for all kinds of address types?
> 
yes.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox