Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: phy: phy_led_triggers: Fix a possible null-pointer dereference in phy_led_trigger_change_speed()
From: Jia-Ju Bai @ 2019-07-30  8:03 UTC (permalink / raw)
  To: David Miller, andrew; +Cc: f.fainelli, hkallweit1, netdev, linux-kernel
In-Reply-To: <20190729.204113.316505378355498068.davem@davemloft.net>



On 2019/7/30 11:41, David Miller wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Tue, 30 Jul 2019 05:32:29 +0200
>
>> On Tue, Jul 30, 2019 at 10:25:36AM +0800, Jia-Ju Bai wrote:
>>>
>>> On 2019/7/29 21:45, Andrew Lunn wrote:
>>>> On Mon, Jul 29, 2019 at 05:24:24PM +0800, Jia-Ju Bai wrote:
>>>>> In phy_led_trigger_change_speed(), there is an if statement on line 48
>>>>> to check whether phy->last_triggered is NULL:
>>>>>      if (!phy->last_triggered)
>>>>>
>>>>> When phy->last_triggered is NULL, it is used on line 52:
>>>>>      led_trigger_event(&phy->last_triggered->trigger, LED_OFF);
>>>>>
>>>>> Thus, a possible null-pointer dereference may occur.
>>>>>
>>>>> To fix this bug, led_trigger_event(&phy->last_triggered->trigger,
>>>>> LED_OFF) is called when phy->last_triggered is not NULL.
>>>>>
>>>>> This bug is found by a static analysis tool STCheck written by us.
>>>> Who is 'us'?
>>> Me and my colleague...
>> Well, we can leave it very vague, giving no idea who 'us' is. But
>> often you want to name the company behind it, or the university, or
>> the sponsor, etc.
> I agree, if you are going to mention that there is a tool you should be
> clear exactly who and what organization are behind it

Thanks for the advice.
I will add my organization in the patch.


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* Re: KASAN: use-after-free Read in psi_task_change
From: syzbot @ 2019-07-30  7:13 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, linux-kernel, netdev, syzkaller-bugs,
	tglx
In-Reply-To: <000000000000df9d48058e9228cd@google.com>

syzbot has bisected this bug to:

commit e9db4ef6bf4ca9894bb324c76e01b8f1a16b2650
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Sat Jun 30 13:17:47 2018 +0000

     bpf: sockhash fix omitted bucket lock in sock_close

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10f4840c600000
start commit:   bed38c3e Merge tag 'powerpc-5.3-2' of git://git.kernel.org..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=12f4840c600000
console output: https://syzkaller.appspot.com/x/log.txt?x=14f4840c600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9aec8cb13b5f7389
dashboard link: https://syzkaller.appspot.com/bug?extid=f17ba6f9b8d9cc0498d0
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10dc7b34600000

Reported-by: syzbot+f17ba6f9b8d9cc0498d0@syzkaller.appspotmail.com
Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* [PATCH net v2] ipvs: Improve robustness to the ipvs sysctl
From: hujunwei @ 2019-07-30  7:11 UTC (permalink / raw)
  To: wensong, horms, pablo, kadlec, fw, davem, Julian Anastasov,
	Florian Westphal
  Cc: netdev@vger.kernel.org, lvs-devel, netfilter-devel, coreteam,
	Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <1997375e-815d-137f-20c9-0829a8587ee9@huawei.com>

From: Junwei Hu <hujunwei4@huawei.com>

The ipvs module parse the user buffer and save it to sysctl,
then check if the value is valid. invalid value occurs
over a period of time.
Here, I add a variable, struct ctl_table tmp, used to read
the value from the user buffer, and save only when it is valid.
I delete proc_do_sync_mode and use extra1/2 in table for the
proc_dointvec_minmax call.

Fixes: f73181c8288f ("ipvs: add support for sync threads")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
---
V1->V2:
- delete proc_do_sync_mode and use proc_dointvec_minmax call.
---
 net/netfilter/ipvs/ip_vs_ctl.c | 69 +++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 060565e..7aed7b0 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1737,12 +1737,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int val = *valp;
 	int rc;

-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
 	if (write && (*valp != val)) {
-		if ((*valp < 0) || (*valp > 3)) {
-			/* Restore the correct value */
+		if (val < 0 || val > 3) {
+			rc = -EINVAL;
+		} else {
 			*valp = val;
-		} else {
 			update_defense_level(ipvs);
 		}
 	}
@@ -1756,33 +1762,20 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int *valp = table->data;
 	int val[2];
 	int rc;
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = table->maxlen,
+		.mode = table->mode,
+	};

-	/* backup the value first */
 	memcpy(val, valp, sizeof(val));
-
-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
-	if (write && (valp[0] < 0 || valp[1] < 0 ||
-	    (valp[0] >= valp[1] && valp[1]))) {
-		/* Restore the correct value */
-		memcpy(valp, val, sizeof(val));
-	}
-	return rc;
-}
-
-static int
-proc_do_sync_mode(struct ctl_table *table, int write,
-		     void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-	int *valp = table->data;
-	int val = *valp;
-	int rc;
-
-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
-	if (write && (*valp != val)) {
-		if ((*valp < 0) || (*valp > 1)) {
-			/* Restore the correct value */
-			*valp = val;
-		}
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+	if (write) {
+		if (val[0] < 0 || val[1] < 0 ||
+		    (val[0] >= val[1] && val[1]))
+			rc = -EINVAL;
+		else
+			memcpy(valp, val, sizeof(val));
 	}
 	return rc;
 }
@@ -1795,12 +1788,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int val = *valp;
 	int rc;

-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
 	if (write && (*valp != val)) {
-		if (*valp < 1 || !is_power_of_2(*valp)) {
-			/* Restore the correct value */
+		if (val < 1 || !is_power_of_2(val))
+			rc = -EINVAL;
+		else
 			*valp = val;
-		}
 	}
 	return rc;
 }
@@ -1860,7 +1859,9 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 		.procname	= "sync_version",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_do_sync_mode,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	},
 	{
 		.procname	= "sync_ports",
-- 
1.7.12.4


^ permalink raw reply related

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Ido Schimmel @ 2019-07-30  7:06 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190730062721.p4vrxo5sxbtulkrx@lx-anielsen.microsemi.net>

On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
> The 07/29/2019 20:51, Ido Schimmel wrote:
> > Can you please clarify what you're trying to achieve? I just read the
> > thread again and my impression is that you're trying to locally receive
> > packets with a certain link layer multicast address.
> Yes. The thread is also a bit confusing because we half way through realized
> that we misunderstood how the multicast packets should be handled (sorry about
> that). To begin with we had a driver where multicast packets was only copied to
> the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
> how other drivers are doing it, so we changed the driver to include the CPU in
> the default multicast flood-mask.

OK, so what prevents you from removing all other ports from the
flood-mask and letting the CPU handle the flooding? Then you can install
software tc filters to limit the flooding.

> This changes the objective a bit. To begin with we needed to get more packets to
> the CPU (which could have been done using tc ingress rules and a trap action).
> 
> Now after we changed the driver, we realized that we need something to limit the
> flooding of certain L2 multicast packets. This is the new problem we are trying
> to solve!
> 
> Example: Say we have a bridge with 4 slave interfaces, then we want to install a
> forwarding rule saying that packets to a given L2-multicast MAC address, should
> only be flooded to 2 of the 4 ports.
> 
> (instead of adding rules to get certain packets to the CPU, we are now adding
> other rules to prevent other packets from going to the CPU and other ports where
> they are not needed/wanted).
> 
> This is exactly the same thing as IGMP snooping does dynamically, but only for
> IP multicast.
> 
> The "bridge mdb" allow users to manually/static add/del a port to a multicast
> group, but still it operates on IP multicast address (not L2 multicast
> addresses).
> 
> > Nik suggested SIOCADDMULTI.
> It is not clear to me how this should be used to limit the flooding, maybe we
> can make some hacks, but as far as I understand the intend of this is maintain
> the list of addresses an interface should receive. I'm not sure this should
> influence how for forwarding decisions are being made.
> 
> > and I suggested a tc filter to get the packet to the CPU.
> The TC solution is a good solution to the original problem where wanted to copy
> more frames to the CPU. But we were convinced that this is not the right
> approach, and that the CPU by default should receive all multicast packets, and
> we should instead try to find a way to limit the flooding of certain frames as
> an optimization.

This can still work. In Linux, ingress tc filters are executed before the
bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
eth2:

# tc filter add dev eth0 ingress pref 1 flower skip_sw \
	dst_mac 01:21:6C:00:00:01 action trap
# tc filter add dev eth2 egress pref 1 flower skip_hw \
	dst_mac 01:21:6C:00:00:01 action drop

The first filter is only present in HW ('skip_sw') and should result in
your HW passing you the sole copy of the packet.

The second filter is only present in SW ('skip_hw', not using HW egress
ACL that you don't have) and drops the packet after it was flooded by
the SW bridge.

As I mentioned earlier, you can install the filter once in your HW and
share it between different ports using a shared block. This means you
only consume one TCAM entry.

Note that this allows you to keep flooding all other multicast packets
in HW.

> > If you now want to limit the ports to which this packet is flooded, then
> > you can use tc filters in *software*:
> > 
> > # tc qdisc add dev eth2 clsact
> > # tc filter add dev eth2 egress pref 1 flower skip_hw \
> > 	dst_mac 01:21:6C:00:00:01 action drop
> Yes. This can work in the SW bridge.
> 
> > If you want to forward the packet in hardware and locally receive it,
> > you can chain several mirred action and then a trap action.
> I'm not I fully understand how this should be done, but it does sound like it
> becomes quite complicated. Also, as far as I understand it will mean that we
> will be using TCAM/ACL resources to do something that could have been done with
> a simple MAC entry.
> 
> > Both options avoid HW egress ACLs which your design does not support.
> True, but what is wrong with expanding the functionality of the normal
> forwarding/MAC operations to allow multiple destinations?
> 
> It is not an uncommon feature (I just browsed the manual of some common L2
> switches and they all has this feature).
> 
> It seems to fit nicely into the existing user-interface:
> 
> bridge fdb add    01:21:6C:00:00:01 port eth0
> bridge fdb append 01:21:6C:00:00:01 port eth1

Wouldn't it be better to instead extend the MDB entries so that they are
either keyed by IP or MAC? I believe FDB should remain as unicast-only.
As a bonus, existing drivers could benefit from it, as MDB entries are
already notified by MAC.

> 
> It seems that it can be added to the existing implementation with out adding
> significant complexity.
> 
> It will be easy to offload in HW.
> 
> I do not believe that it will be a performance issue, if this is a concern then
> we may have to do a bit of benchmarking, or we can make it a configuration
> option.
> 
> Long story short, we (Horatiu and I) learned a lot from the discussion here, and
> I think we should try do a new patch with the learning we got. Then it is easier
> to see what it actually means to the exiting code, complexity, exiting drivers,
> performance, default behavioral, backwards compatibly, and other valid concerns.
> 
> If the patch is no good, and cannot be fixed, then we will go back and look
> further into alternative solutions.

Overall, I tend to agree with Nik. I think your use case is too specific
to justify the amount of changes you want to make in the bridge driver.
We also provided other alternatives. That being said, you're more than
welcome to send the patches and we can continue the discussion then.

^ permalink raw reply

* Re: [PATCH net-next v2] can: fix ioctl function removal
From: Oliver Hartkopp @ 2019-07-30  6:48 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-can, sfr, kernel test robot, Marc Kleine-Budde
In-Reply-To: <20190730064333.1581-1-socketcan@hartkopp.net>

According to Marc the original patch has already been applied by Dave.

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=473d924d7d46cb57aa4c1863261d18366af345af

Thanks for the support & sorry for the noise!

Best regards,
Oliver

On 30/07/2019 08.43, Oliver Hartkopp wrote:
> Commit 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler") replaced
> the almost empty can_ioctl() function with sock_no_ioctl() which always returns
> -EOPNOTSUPP.
> 
> Even though we don't have any ioctl() functions on socket/network layer we need
> to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
> to the network driver layer.
> 
> This patch fixes the wrong return codes in the CAN network layer protocols.
> 
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Fixes: 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler")
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
> ---
> 
>   v2: Changed SHA1 tag to be a least 12 digits long
> 
>   net/can/bcm.c | 9 ++++++++-
>   net/can/raw.c | 9 ++++++++-
>   2 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index 8da986b19d88..bf1d0bbecec8 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1680,6 +1680,13 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>   	return size;
>   }
>   
> +int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +			 unsigned long arg)
> +{
> +	/* no ioctls for socket layer -> hand it down to NIC layer */
> +	return -ENOIOCTLCMD;
> +}
> +
>   static const struct proto_ops bcm_ops = {
>   	.family        = PF_CAN,
>   	.release       = bcm_release,
> @@ -1689,7 +1696,7 @@ static const struct proto_ops bcm_ops = {
>   	.accept        = sock_no_accept,
>   	.getname       = sock_no_getname,
>   	.poll          = datagram_poll,
> -	.ioctl         = sock_no_ioctl,
> +	.ioctl         = bcm_sock_no_ioctlcmd,
>   	.gettstamp     = sock_gettstamp,
>   	.listen        = sock_no_listen,
>   	.shutdown      = sock_no_shutdown,
> diff --git a/net/can/raw.c b/net/can/raw.c
> index ff720272f7b7..da386f1fa815 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -837,6 +837,13 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>   	return size;
>   }
>   
> +int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +			 unsigned long arg)
> +{
> +	/* no ioctls for socket layer -> hand it down to NIC layer */
> +	return -ENOIOCTLCMD;
> +}
> +
>   static const struct proto_ops raw_ops = {
>   	.family        = PF_CAN,
>   	.release       = raw_release,
> @@ -846,7 +853,7 @@ static const struct proto_ops raw_ops = {
>   	.accept        = sock_no_accept,
>   	.getname       = raw_getname,
>   	.poll          = datagram_poll,
> -	.ioctl         = sock_no_ioctl,
> +	.ioctl         = raw_sock_no_ioctlcmd,
>   	.gettstamp     = sock_gettstamp,
>   	.listen        = sock_no_listen,
>   	.shutdown      = sock_no_shutdown,
> 

^ permalink raw reply

* [PATCH net-next v2] can: fix ioctl function removal
From: Oliver Hartkopp @ 2019-07-30  6:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-can, sfr, Oliver Hartkopp, kernel test robot

Commit 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler") replaced
the almost empty can_ioctl() function with sock_no_ioctl() which always returns
-EOPNOTSUPP.

Even though we don't have any ioctl() functions on socket/network layer we need
to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
to the network driver layer.

This patch fixes the wrong return codes in the CAN network layer protocols.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
---

 v2: Changed SHA1 tag to be a least 12 digits long

 net/can/bcm.c | 9 ++++++++-
 net/can/raw.c | 9 ++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index 8da986b19d88..bf1d0bbecec8 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,6 +1680,13 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
+int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+			 unsigned long arg)
+{
+	/* no ioctls for socket layer -> hand it down to NIC layer */
+	return -ENOIOCTLCMD;
+}
+
 static const struct proto_ops bcm_ops = {
 	.family        = PF_CAN,
 	.release       = bcm_release,
@@ -1689,7 +1696,7 @@ static const struct proto_ops bcm_ops = {
 	.accept        = sock_no_accept,
 	.getname       = sock_no_getname,
 	.poll          = datagram_poll,
-	.ioctl         = sock_no_ioctl,
+	.ioctl         = bcm_sock_no_ioctlcmd,
 	.gettstamp     = sock_gettstamp,
 	.listen        = sock_no_listen,
 	.shutdown      = sock_no_shutdown,
diff --git a/net/can/raw.c b/net/can/raw.c
index ff720272f7b7..da386f1fa815 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,6 +837,13 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
+int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+			 unsigned long arg)
+{
+	/* no ioctls for socket layer -> hand it down to NIC layer */
+	return -ENOIOCTLCMD;
+}
+
 static const struct proto_ops raw_ops = {
 	.family        = PF_CAN,
 	.release       = raw_release,
@@ -846,7 +853,7 @@ static const struct proto_ops raw_ops = {
 	.accept        = sock_no_accept,
 	.getname       = raw_getname,
 	.poll          = datagram_poll,
-	.ioctl         = sock_no_ioctl,
+	.ioctl         = raw_sock_no_ioctlcmd,
 	.gettstamp     = sock_gettstamp,
 	.listen        = sock_no_listen,
 	.shutdown      = sock_no_shutdown,
-- 
2.20.1


^ permalink raw reply related

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-30  6:39 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <03708d00-a8d9-4a9d-4188-9fe0e38de2b8@huawei.com>



On 2019/7/30 14:35, liuyonglong wrote:
> :/sys/kernel/debug/tracing$ cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 45/45   #P:128
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>     kworker/64:2-1028  [064] ....   172.295687: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.295726: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.296902: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.296938: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.321213: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.343209: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.343245: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.343882: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.343918: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.362658: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.385961: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.385996: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.386646: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.386681: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.411286: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.433225: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.433260: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.433887: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.433922: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.452862: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>         ifconfig-1324  [011] ....   177.325585: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>   kworker/u257:0-8     [012] ....   177.325642: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x04 val:0x01e1
>   kworker/u257:0-8     [012] ....   177.325654: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x04 val:0x05e1
>   kworker/u257:0-8     [012] ....   177.325708: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [012] ....   177.325744: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [012] ....   177.325779: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>   kworker/u257:0-8     [012] ....   177.325788: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1240
>   kworker/u257:0-8     [012] ....   177.325843: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x798d
>   kworker/u257:0-8     [003] ....   178.360488: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   179.384479: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   180.408477: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   181.432474: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79a9
>   kworker/u257:0-8     [000] ....   181.432510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x7800
>   kworker/u257:0-8     [000] ....   181.432546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [000] ....   181.432582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
>   kworker/u257:0-8     [000] ....   182.456510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   182.456546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x4800
>   kworker/u257:0-8     [000] ....   182.456582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [000] ....   182.456618: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
>   kworker/u257:0-8     [001] ....   183.480476: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   184.504478: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   185.528486: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   186.552475: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>         ifconfig-1327  [011] ....   187.196036: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>         ifconfig-1327  [011] ....   187.196046: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1840
> 

Add dmesg below:

[  177.325619] hns3 0000:bd:00.1 eth5: net open
[  177.325859] hns3 0000:bd:00.1: invalid speed (-1)
[  177.330180] 8021q: adding VLAN 0 to HW filter on device eth5
[  177.334569] hns3 0000:bd:00.1 eth5: failed to adjust link.
[  177.345674] IPv6: ADDRCONF(NETDEV_CHANGE): eth5: link becomes ready
[  178.021616] hns3 0000:bd:00.1 eth5: link up
[  178.808459] hns3 0000:bd:00.1 eth5: link down
[  182.808452] hns3 0000:bd:00.1 eth5: link up
[  187.191563] hns3 0000:bd:00.1 eth5: net stop
[  187.196049] hns3 0000:bd:00.1 eth5: link down


> 
> 
> On 2019/7/30 14:08, Heiner Kallweit wrote:
>> On 30.07.2019 06:03, liuyonglong wrote:
>>>
>>>
>>> On 2019/7/30 4:57, Heiner Kallweit wrote:
>>>> On 29.07.2019 05:59, liuyonglong wrote:
>>>>>
>>>>>
>>>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>>>> copper link status should read twice, or it may get a fake link
>>>>>>> up status, and cause up->down->up at the first time when link up.
>>>>>>> This happens more oftem at Realtek phy.
>>>>>>>
>>>>>> This is not correct, there is no fake link up status.
>>>>>> Read the comment in genphy_update_link, only link-down events
>>>>>> are latched. Means if the first read returns link up, then there
>>>>>> is no need for a second read. And in polling mode we don't do a
>>>>>> second read because we want to detect also short link drops.
>>>>>>
>>>>>> It would be helpful if you could describe your actual problem
>>>>>> and whether you use polling or interrupt mode.
>>>>>>
>>>>>
>>>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>>>
>>>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>>>
>>>>>
>>>>> According to the datasheet:
>>>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>>>> reg 1.2=1 now, means link is up
>>>>>
>>>>> We can see that, when we read the link up, the auto-negotiation
>>>>> is not complete yet, so the speed is invalid.
>>>>>
>>>>> I don't know why this happen, maybe this state is keep from bios?
>>>>> Or we may do something else in the phy initialize to fix it?
>>>>> And also confuse that why read twice can fix it?
>>>>>
>>>> I suppose that basically any delay would do.
>>>>
>>>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>>
>>>>>>> I add a fake status read, and can solve this problem.
>>>>>>>
>>>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>>>> read in polling mode, so I don't know whether my solution is
>>>>>>> correct.
>>>>>>>
>>>>
>>>> Can you test whether the following fixes the issue for you?
>>>> Also it would be interesting which exact PHY models you tested
>>>> and whether you built the respective PHY drivers or whether you
>>>> rely on the genphy driver. Best use the second patch to get the
>>>> needed info. It may make sense anyway to add the call to
>>>> phy_attached_info() to the hns3 driver.
>>>>
>>>
>>> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
>>> [   40.932133] hns3 0000:bd:00.3 eth7: net open
>>> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
>>> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
>>> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>
>>> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
>>> [   41.563511] hns3 0000:bd:00.2 eth6: net open
>>> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
>>> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
>>> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
>>> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
>>>
>>> I am using RTL8211F, but you can see that, both genphy driver and
>>> RTL8211F driver have the same issue.
>>>
>>>>
>>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>>> index 6b5cb87f3..fbecfe210 100644
>>>> --- a/drivers/net/phy/phy_device.c
>>>> +++ b/drivers/net/phy/phy_device.c
>>>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>>>  
>>>>  	linkmode_zero(phydev->lp_advertising);
>>>>  
>>>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>>>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>>>> +	    (phydev->autoneg_complete || phydev->link)) {
>>>>  		if (phydev->is_gigabit_capable) {
>>>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>>>  			if (lpagb < 0)
>>>>
>>>
>>> I have try this patch, have no effect. I suppose that at this time,
>>> the autoneg actually not complete yet.
>>>
>>> Maybe the wrong phy state is passed from bios? Invalid speed just
>>> happen at the first time when ethX up, after that, repeat the
>>> ifconfig down/ifconfig up command can not see that again.
>>>
>>> So I think the bios should power off the phy(writing reg 1.11 to 1)
>>> before it starts the OS? Or any other way to fix this in the OS?
>>>
>> To get a better idea of whats going on it would be good to see a full
>> MDIO trace. Can you enable MDIO tracing via the following sysctl file
>> /sys/kernel/debug/tracing/events/mdio/enable
>> and provide the generated trace?
>>
>> Due to polling mode each second entries will be generated, so you
>> better stop network after the issue occurred.
>>
>>>
>>>
>>>
>>
>> Heiner
>>
>> .
>>
> 
> 
> .
> 


^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-30  6:35 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <5087ee34-5776-f02b-d7e5-bce005ba3b92@gmail.com>

:/sys/kernel/debug/tracing$ cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 45/45   #P:128
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
    kworker/64:2-1028  [064] ....   172.295687: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.295726: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.296902: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.296938: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.321213: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.343209: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.343245: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.343882: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.343918: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.362658: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.385961: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.385996: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.386646: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.386681: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.411286: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.433225: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.433260: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.433887: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.433922: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.452862: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
        ifconfig-1324  [011] ....   177.325585: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
  kworker/u257:0-8     [012] ....   177.325642: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x04 val:0x01e1
  kworker/u257:0-8     [012] ....   177.325654: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x04 val:0x05e1
  kworker/u257:0-8     [012] ....   177.325708: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [012] ....   177.325744: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [012] ....   177.325779: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
  kworker/u257:0-8     [012] ....   177.325788: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1240
  kworker/u257:0-8     [012] ....   177.325843: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x798d
  kworker/u257:0-8     [003] ....   178.360488: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   179.384479: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   180.408477: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   181.432474: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79a9
  kworker/u257:0-8     [000] ....   181.432510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x7800
  kworker/u257:0-8     [000] ....   181.432546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [000] ....   181.432582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
  kworker/u257:0-8     [000] ....   182.456510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   182.456546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x4800
  kworker/u257:0-8     [000] ....   182.456582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [000] ....   182.456618: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
  kworker/u257:0-8     [001] ....   183.480476: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   184.504478: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   185.528486: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   186.552475: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
        ifconfig-1327  [011] ....   187.196036: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
        ifconfig-1327  [011] ....   187.196046: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1840



On 2019/7/30 14:08, Heiner Kallweit wrote:
> On 30.07.2019 06:03, liuyonglong wrote:
>>
>>
>> On 2019/7/30 4:57, Heiner Kallweit wrote:
>>> On 29.07.2019 05:59, liuyonglong wrote:
>>>>
>>>>
>>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>>> copper link status should read twice, or it may get a fake link
>>>>>> up status, and cause up->down->up at the first time when link up.
>>>>>> This happens more oftem at Realtek phy.
>>>>>>
>>>>> This is not correct, there is no fake link up status.
>>>>> Read the comment in genphy_update_link, only link-down events
>>>>> are latched. Means if the first read returns link up, then there
>>>>> is no need for a second read. And in polling mode we don't do a
>>>>> second read because we want to detect also short link drops.
>>>>>
>>>>> It would be helpful if you could describe your actual problem
>>>>> and whether you use polling or interrupt mode.
>>>>>
>>>>
>>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>>
>>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>>
>>>>
>>>> According to the datasheet:
>>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>>> reg 1.2=1 now, means link is up
>>>>
>>>> We can see that, when we read the link up, the auto-negotiation
>>>> is not complete yet, so the speed is invalid.
>>>>
>>>> I don't know why this happen, maybe this state is keep from bios?
>>>> Or we may do something else in the phy initialize to fix it?
>>>> And also confuse that why read twice can fix it?
>>>>
>>> I suppose that basically any delay would do.
>>>
>>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>
>>>>>> I add a fake status read, and can solve this problem.
>>>>>>
>>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>>> read in polling mode, so I don't know whether my solution is
>>>>>> correct.
>>>>>>
>>>
>>> Can you test whether the following fixes the issue for you?
>>> Also it would be interesting which exact PHY models you tested
>>> and whether you built the respective PHY drivers or whether you
>>> rely on the genphy driver. Best use the second patch to get the
>>> needed info. It may make sense anyway to add the call to
>>> phy_attached_info() to the hns3 driver.
>>>
>>
>> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
>> [   40.932133] hns3 0000:bd:00.3 eth7: net open
>> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
>> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
>> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>
>> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
>> [   41.563511] hns3 0000:bd:00.2 eth6: net open
>> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
>> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
>> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
>> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
>>
>> I am using RTL8211F, but you can see that, both genphy driver and
>> RTL8211F driver have the same issue.
>>
>>>
>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>> index 6b5cb87f3..fbecfe210 100644
>>> --- a/drivers/net/phy/phy_device.c
>>> +++ b/drivers/net/phy/phy_device.c
>>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>>  
>>>  	linkmode_zero(phydev->lp_advertising);
>>>  
>>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>>> +	    (phydev->autoneg_complete || phydev->link)) {
>>>  		if (phydev->is_gigabit_capable) {
>>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>>  			if (lpagb < 0)
>>>
>>
>> I have try this patch, have no effect. I suppose that at this time,
>> the autoneg actually not complete yet.
>>
>> Maybe the wrong phy state is passed from bios? Invalid speed just
>> happen at the first time when ethX up, after that, repeat the
>> ifconfig down/ifconfig up command can not see that again.
>>
>> So I think the bios should power off the phy(writing reg 1.11 to 1)
>> before it starts the OS? Or any other way to fix this in the OS?
>>
> To get a better idea of whats going on it would be good to see a full
> MDIO trace. Can you enable MDIO tracing via the following sysctl file
> /sys/kernel/debug/tracing/events/mdio/enable
> and provide the generated trace?
> 
> Due to polling mode each second entries will be generated, so you
> better stop network after the issue occurred.
> 
>>
>>
>>
> 
> Heiner
> 
> .
> 


^ permalink raw reply

* Re: [PATCH net-next] can: fix ioctl function removal
From: Marc Kleine-Budde @ 2019-07-30  6:30 UTC (permalink / raw)
  To: Oliver Hartkopp, davem, netdev; +Cc: linux-can, kernel test robot
In-Reply-To: <20190729204056.2976-1-socketcan@hartkopp.net>


[-- Attachment #1.1: Type: text/plain, Size: 1183 bytes --]

On 7/29/19 10:40 PM, Oliver Hartkopp wrote:
> Commit 60649d4e0af ("can: remove obsolete empty ioctl() handler") replaced the
> almost empty can_ioctl() function with sock_no_ioctl() which always returns
> -EOPNOTSUPP.
> 
> Even though we don't have any ioctl() functions on socket/network layer we need
> to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
> to the network driver layer.
> 
> This patch fixes the wrong return codes in the CAN network layer protocols.
> 
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Fixes: 60649d4e0af ("can: remove obsolete empty ioctl() handler")
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>

David Miller has already applied the patch.

| commit 473d924d7d46cb57aa4c1863261d18366af345af
| Author: Oliver Hartkopp <socketcan@hartkopp.net>
| Date:   Mon Jul 29 22:40:56 2019 +0200

Thanks,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-30  6:27 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190729175136.GA28572@splinter>

The 07/29/2019 20:51, Ido Schimmel wrote:
> Can you please clarify what you're trying to achieve? I just read the
> thread again and my impression is that you're trying to locally receive
> packets with a certain link layer multicast address.
Yes. The thread is also a bit confusing because we half way through realized
that we misunderstood how the multicast packets should be handled (sorry about
that). To begin with we had a driver where multicast packets was only copied to
the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
how other drivers are doing it, so we changed the driver to include the CPU in
the default multicast flood-mask.

This changes the objective a bit. To begin with we needed to get more packets to
the CPU (which could have been done using tc ingress rules and a trap action).

Now after we changed the driver, we realized that we need something to limit the
flooding of certain L2 multicast packets. This is the new problem we are trying
to solve!

Example: Say we have a bridge with 4 slave interfaces, then we want to install a
forwarding rule saying that packets to a given L2-multicast MAC address, should
only be flooded to 2 of the 4 ports.

(instead of adding rules to get certain packets to the CPU, we are now adding
other rules to prevent other packets from going to the CPU and other ports where
they are not needed/wanted).

This is exactly the same thing as IGMP snooping does dynamically, but only for
IP multicast.

The "bridge mdb" allow users to manually/static add/del a port to a multicast
group, but still it operates on IP multicast address (not L2 multicast
addresses).

> Nik suggested SIOCADDMULTI.
It is not clear to me how this should be used to limit the flooding, maybe we
can make some hacks, but as far as I understand the intend of this is maintain
the list of addresses an interface should receive. I'm not sure this should
influence how for forwarding decisions are being made.

> and I suggested a tc filter to get the packet to the CPU.
The TC solution is a good solution to the original problem where wanted to copy
more frames to the CPU. But we were convinced that this is not the right
approach, and that the CPU by default should receive all multicast packets, and
we should instead try to find a way to limit the flooding of certain frames as
an optimization.

> If you now want to limit the ports to which this packet is flooded, then
> you can use tc filters in *software*:
> 
> # tc qdisc add dev eth2 clsact
> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> 	dst_mac 01:21:6C:00:00:01 action drop
Yes. This can work in the SW bridge.

> If you want to forward the packet in hardware and locally receive it,
> you can chain several mirred action and then a trap action.
I'm not I fully understand how this should be done, but it does sound like it
becomes quite complicated. Also, as far as I understand it will mean that we
will be using TCAM/ACL resources to do something that could have been done with
a simple MAC entry.

> Both options avoid HW egress ACLs which your design does not support.
True, but what is wrong with expanding the functionality of the normal
forwarding/MAC operations to allow multiple destinations?

It is not an uncommon feature (I just browsed the manual of some common L2
switches and they all has this feature).

It seems to fit nicely into the existing user-interface:

bridge fdb add    01:21:6C:00:00:01 port eth0
bridge fdb append 01:21:6C:00:00:01 port eth1

It seems that it can be added to the existing implementation with out adding
significant complexity.

It will be easy to offload in HW.

I do not believe that it will be a performance issue, if this is a concern then
we may have to do a bit of benchmarking, or we can make it a configuration
option.

Long story short, we (Horatiu and I) learned a lot from the discussion here, and
I think we should try do a new patch with the learning we got. Then it is easier
to see what it actually means to the exiting code, complexity, exiting drivers,
performance, default behavioral, backwards compatibly, and other valid concerns.

If the patch is no good, and cannot be fixed, then we will go back and look
further into alternative solutions.

-- 
/Allan

^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: Heiner Kallweit @ 2019-07-30  6:08 UTC (permalink / raw)
  To: liuyonglong, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <1d4be6ad-ffe6-2325-ceab-9f35da617ee9@huawei.com>

On 30.07.2019 06:03, liuyonglong wrote:
> 
> 
> On 2019/7/30 4:57, Heiner Kallweit wrote:
>> On 29.07.2019 05:59, liuyonglong wrote:
>>>
>>>
>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>> copper link status should read twice, or it may get a fake link
>>>>> up status, and cause up->down->up at the first time when link up.
>>>>> This happens more oftem at Realtek phy.
>>>>>
>>>> This is not correct, there is no fake link up status.
>>>> Read the comment in genphy_update_link, only link-down events
>>>> are latched. Means if the first read returns link up, then there
>>>> is no need for a second read. And in polling mode we don't do a
>>>> second read because we want to detect also short link drops.
>>>>
>>>> It would be helpful if you could describe your actual problem
>>>> and whether you use polling or interrupt mode.
>>>>
>>>
>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>
>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>
>>>
>>> According to the datasheet:
>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>> reg 1.2=1 now, means link is up
>>>
>>> We can see that, when we read the link up, the auto-negotiation
>>> is not complete yet, so the speed is invalid.
>>>
>>> I don't know why this happen, maybe this state is keep from bios?
>>> Or we may do something else in the phy initialize to fix it?
>>> And also confuse that why read twice can fix it?
>>>
>> I suppose that basically any delay would do.
>>
>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>
>>>>> I add a fake status read, and can solve this problem.
>>>>>
>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>> read in polling mode, so I don't know whether my solution is
>>>>> correct.
>>>>>
>>
>> Can you test whether the following fixes the issue for you?
>> Also it would be interesting which exact PHY models you tested
>> and whether you built the respective PHY drivers or whether you
>> rely on the genphy driver. Best use the second patch to get the
>> needed info. It may make sense anyway to add the call to
>> phy_attached_info() to the hns3 driver.
>>
> 
> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
> [   40.932133] hns3 0000:bd:00.3 eth7: net open
> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
> 
> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
> [   41.563511] hns3 0000:bd:00.2 eth6: net open
> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
> 
> I am using RTL8211F, but you can see that, both genphy driver and
> RTL8211F driver have the same issue.
> 
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 6b5cb87f3..fbecfe210 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>  
>>  	linkmode_zero(phydev->lp_advertising);
>>  
>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>> +	    (phydev->autoneg_complete || phydev->link)) {
>>  		if (phydev->is_gigabit_capable) {
>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>  			if (lpagb < 0)
>>
> 
> I have try this patch, have no effect. I suppose that at this time,
> the autoneg actually not complete yet.
> 
> Maybe the wrong phy state is passed from bios? Invalid speed just
> happen at the first time when ethX up, after that, repeat the
> ifconfig down/ifconfig up command can not see that again.
> 
> So I think the bios should power off the phy(writing reg 1.11 to 1)
> before it starts the OS? Or any other way to fix this in the OS?
> 
To get a better idea of whats going on it would be good to see a full
MDIO trace. Can you enable MDIO tracing via the following sysctl file
/sys/kernel/debug/tracing/events/mdio/enable
and provide the generated trace?

Due to polling mode each second entries will be generated, so you
better stop network after the issue occurred.

> 
> 
> 

Heiner

^ permalink raw reply

* Re: [patch net-next 0/3] net: devlink: Finish network namespace support
From: Jiri Pirko @ 2019-07-30  6:08 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, davem, jakub.kicinski, sthemmin, mlxsw
In-Reply-To: <eebd6bc7-6466-2c93-4077-72a39f3b8596@gmail.com>

Mon, Jul 29, 2019 at 10:17:25PM CEST, dsahern@gmail.com wrote:
>On 7/27/19 3:44 AM, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Devlink from the beginning counts with network namespaces, but the
>> instances has been fixed to init_net. The first patch allows user
>> to move existing devlink instances into namespaces:
>> 
>
>so you intend for an asic, for example, to have multiple devlink
>instances where each instance governs a set of related ports (e.g.,
>ports that share a set of hardware resources) and those instances can be
>managed from distinct network namespaces?

No, no multiple devlink instances for asic intended.

>

^ permalink raw reply

* Re: [patch iproute2 1/2] devlink: introduce cmdline option to switch to a different namespace
From: Jiri Pirko @ 2019-07-30  6:07 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, netdev, davem, jakub.kicinski,
	sthemmin, mlxsw
In-Reply-To: <d590ac7b-9dc7-41e2-e411-79ac4654c709@gmail.com>

Mon, Jul 29, 2019 at 10:21:11PM CEST, dsahern@gmail.com wrote:
>On 7/27/19 4:21 AM, Jiri Pirko wrote:
>>>> diff --git a/devlink/devlink.c b/devlink/devlink.c
>>>> index d8197ea3a478..9242cc05ad0c 100644
>>>> --- a/devlink/devlink.c
>>>> +++ b/devlink/devlink.c
>>>> @@ -32,6 +32,7 @@
>>>>  #include "mnlg.h"
>>>>  #include "json_writer.h"
>>>>  #include "utils.h"
>>>> +#include "namespace.h"
>>>>  
>>>>  #define ESWITCH_MODE_LEGACY "legacy"
>>>>  #define ESWITCH_MODE_SWITCHDEV "switchdev"
>>>> @@ -6332,7 +6333,7 @@ static int cmd_health(struct dl *dl)
>>>>  static void help(void)
>>>>  {
>>>>  	pr_err("Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }\n"
>>>> -	       "       devlink [ -f[orce] ] -b[atch] filename\n"
>>>> +	       "       devlink [ -f[orce] ] -b[atch] filename -N[etns]
>>>>  netnsname\n"
>>>
>>> 'ip' uses lower-case n for this; why not be consistent?
>> 
>> Because "n" is taken :/
>
>that's really unfortunate. commands within a package should have similar
>syntax and -n/N are backwards between ip/tc and devlink. That's the
>stuff that drives users crazy.

I agree. But this particular ship has sailed :(

^ permalink raw reply

* Re: [patch net-next 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jiri Pirko @ 2019-07-30  6:06 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190729115906.6bc2176d@cakuba.netronome.com>

Mon, Jul 29, 2019 at 08:59:06PM CEST, jakub.kicinski@netronome.com wrote:
>On Sat, 27 Jul 2019 11:44:59 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> When user does create new netdevsim instance using sysfs bus file,
>> create the devlink instance and related netdev instance in the namespace
>> of the caller.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
>> diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
>> index 1a0ff3d7747b..6aeed0c600f8 100644
>> --- a/drivers/net/netdevsim/bus.c
>> +++ b/drivers/net/netdevsim/bus.c
>> @@ -283,6 +283,7 @@ nsim_bus_dev_new(unsigned int id, unsigned int port_count)
>>  	nsim_bus_dev->dev.bus = &nsim_bus;
>>  	nsim_bus_dev->dev.type = &nsim_bus_dev_type;
>>  	nsim_bus_dev->port_count = port_count;
>> +	nsim_bus_dev->initial_net = current->nsproxy->net_ns;
>>  
>>  	err = device_register(&nsim_bus_dev->dev);
>>  	if (err)
>
>The saved initial_net is only to carry the net info from the adding
>process to the probe callback, because probe can be asynchronous?

Exactly.


>I'm not entirely sure netdevsim can probe asynchronously in practice
>given we never return EPROBE_DEFER, but would you mind at least adding
>a comment stating that next to the definition of the field, otherwise 
>I worry someone may mistakenly use this field instead of the up-to-date
>devlink's netns.

Will do.


>
>> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
>> index 79c05af2a7c0..cdf53d0e0c49 100644
>> --- a/drivers/net/netdevsim/netdevsim.h
>> +++ b/drivers/net/netdevsim/netdevsim.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/netdevice.h>
>>  #include <linux/u64_stats_sync.h>
>>  #include <net/devlink.h>
>> +#include <net/net_namespace.h>
>
>You can just do a forward declaration, no need to pull in the header.

Sure, but why?


>
>>  #include <net/xdp.h>
>>  
>>  #define DRV_NAME	"netdevsim"
>
>> @@ -213,6 +215,7 @@ struct nsim_bus_dev {
>>  	struct device dev;
>>  	struct list_head list;
>>  	unsigned int port_count;
>> +	struct net *initial_net;
>>  	unsigned int num_vfs;
>>  	struct nsim_vf_config *vfconfigs;
>>  };
>
>Otherwise makes perfect sense, with the above nits addressed feel free
>to add:
>
>Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [patch net-next 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-07-30  6:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190729.105216.2073541569967891866.davem@davemloft.net>

Mon, Jul 29, 2019 at 07:52:16PM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Sat, 27 Jul 2019 11:44:57 +0200
>
>> +	if ((netns_pid_attr && (netns_fd_attr || netns_id_attr)) ||
>> +	    (netns_fd_attr && (netns_pid_attr || netns_id_attr)) ||
>> +	    (netns_id_attr && (netns_pid_attr || netns_fd_attr))) {
>> +		NL_SET_ERR_MSG(info->extack, "multiple netns identifying attributes specified");
>> +		return ERR_PTR(-EINVAL);
>> +	}
>
>How about:
>
>	if (!!a + !!b + !!c > 1) {
>	...

I just copied the logic from the existing code. But sure :)


>
>> +
>> +	if (netns_pid_attr) {
>> +		net = get_net_ns_by_pid(nla_get_u32(netns_pid_attr));
>> +	} else if (netns_fd_attr) {
>> +		net = get_net_ns_by_fd(nla_get_u32(netns_fd_attr));
>> +	} else if (netns_id_attr) {
>> +		net = get_net_ns_by_id(sock_net(skb->sk),
>> +				       nla_get_u32(netns_id_attr));
>> +		if (!net)
>> +			net = ERR_PTR(-EINVAL);
>> +	}
>> +	if (IS_ERR(net)) {
>
>I think this is going to be one of those cases where a compiler won't be able
>to prove that 'net' is guaranteed to be initialized at this spot.  Please
>rearrange this code somehow so that is unlikely to happen.

It does not complain though. The function cannot be entered unless at
least one is. I'll check again.


>
>Thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Heiner Kallweit @ 2019-07-30  6:00 UTC (permalink / raw)
  To: Tao Ren, Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, Arun Parameswaran,
	Justin Chen, Vladimir Oltean, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <aff2728d-5db1-50fd-767c-29b355890323@fb.com>

On 30.07.2019 07:05, Tao Ren wrote:
> On 7/29/19 8:35 PM, Andrew Lunn wrote:
>> On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
>>> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
>>> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
>>> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
>>> mode (different sets of MII Control/Status registers being used), let's
>>> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.
>>
>> Hi Tao
>>
>> What exactly does it get wrong?
>>
>>      Thanks
>> 	Andrew
> 
> Hi Andrew,
> 
> BCM54616S is set to RGMII-Fiber (1000Base-X) mode on my platform, and none of the features (1000BaseT/100BaseT/10BaseT) can be detected by genphy_read_abilities(), because the PHY only reports 1000BaseX_Full|Half ability in this mode.
> 
Are you going to use the PHY in copper or fibre mode?
In case you use fibre mode, why do you need the copper modes set as supported?
Or does the PHY just start in fibre mode and you want to switch it to copper mode?

> 
> Thanks,
> 
> Tao
> 
Heiner

^ permalink raw reply

* Re: [PATCH bpf-next 00/10] CO-RE offset relocations
From: Song Liu @ 2019-07-30  5:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Song Liu, Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzbKYi53TdF9nAB3i3gAuca8FjM_P3F5aHp1uQ6coMgZ9A@mail.gmail.com>



> On Jul 29, 2019, at 4:09 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Mon, Jul 29, 2019 at 1:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
>> 
>> On Mon, Jul 29, 2019 at 1:20 PM Song Liu <liu.song.a23@gmail.com> wrote:
>>> 
>>> On Wed, Jul 24, 2019 at 1:34 PM Andrii Nakryiko <andriin@fb.com> wrote:
>>>> 
>>>> This patch set implements central part of CO-RE (Compile Once - Run
>>>> Everywhere, see [0] and [1] for slides and video): relocating field offsets.
>>>> Most of the details are written down as comments to corresponding parts of the
>>>> code.
>>>> 
>>>> Patch #1 adds loading of .BTF.ext offset relocations section and macros to
>>>> work with its contents.
>>>> Patch #2 implements CO-RE relocations algorithm in libbpf.
>>>> Patches #3-#10 adds selftests validating various parts of relocation handling,
>>>> type compatibility, etc.
>>>> 
>>>> For all tests to work, you'll need latest Clang/LLVM supporting
>>>> __builtin_preserve_access_index intrinsic, used for recording offset
>>>> relocations. Kernel on which selftests run should have BTF information built
>>>> in (CONFIG_DEBUG_INFO_BTF=y).
>>>> 
>>>>  [0] http://vger.kernel.org/bpfconf2019.html#session-2
>>>>  [1] http://vger.kernel.org/lpc-bpf2018.html#session-2CO-RE relocations
>>>> 
>>>> This patch set implements central part of CO-RE (Compile Once - Run
>>>> Everywhere, see [0] and [1] for slides and video): relocating field offsets.
>>>> Most of the details are written down as comments to corresponding parts of the
>>>> code.
>>>> 
>>>> Patch #1 adds loading of .BTF.ext offset relocations section and macros to
>>>> work with its contents.
>>>> Patch #2 implements CO-RE relocations algorithm in libbpf.
>>>> Patches #3-#10 adds selftests validating various parts of relocation handling,
>>>> type compatibility, etc.
>>>> 
>>>> For all tests to work, you'll need latest Clang/LLVM supporting
>>>> __builtin_preserve_access_index intrinsic, used for recording offset
>>>> relocations. Kernel on which selftests run should have BTF information built
>>>> in (CONFIG_DEBUG_INFO_BTF=y).
>>>> 
>>>>  [0] http://vger.kernel.org/bpfconf2019.html#session-2
>>>>  [1] http://vger.kernel.org/lpc-bpf2018.html#session-2
>>>> 
>>>> Andrii Nakryiko (10):
>>>>  libbpf: add .BTF.ext offset relocation section loading
>>>>  libbpf: implement BPF CO-RE offset relocation algorithm
>>>>  selftests/bpf: add CO-RE relocs testing setup
>>>>  selftests/bpf: add CO-RE relocs struct flavors tests
>>>>  selftests/bpf: add CO-RE relocs nesting tests
>>>>  selftests/bpf: add CO-RE relocs array tests
>>>>  selftests/bpf: add CO-RE relocs enum/ptr/func_proto tests
>>>>  selftests/bpf: add CO-RE relocs modifiers/typedef tests
>>>>  selftest/bpf: add CO-RE relocs ptr-as-array tests
>>>>  selftests/bpf: add CO-RE relocs ints tests
>>>> 
>>>> tools/lib/bpf/btf.c                           |  64 +-
>>>> tools/lib/bpf/btf.h                           |   4 +
>>>> tools/lib/bpf/libbpf.c                        | 866 +++++++++++++++++-
>>>> tools/lib/bpf/libbpf.h                        |   1 +
>>>> tools/lib/bpf/libbpf_internal.h               |  91 ++
>>>> .../selftests/bpf/prog_tests/core_reloc.c     | 363 ++++++++
>>>> .../bpf/progs/btf__core_reloc_arrays.c        |   3 +
>>>> .../btf__core_reloc_arrays___diff_arr_dim.c   |   3 +
>>>> ...btf__core_reloc_arrays___diff_arr_val_sz.c |   3 +
>>>> .../btf__core_reloc_arrays___err_non_array.c  |   3 +
>>>> ...btf__core_reloc_arrays___err_too_shallow.c |   3 +
>>>> .../btf__core_reloc_arrays___err_too_small.c  |   3 +
>>>> ..._core_reloc_arrays___err_wrong_val_type1.c |   3 +
>>>> ..._core_reloc_arrays___err_wrong_val_type2.c |   3 +
>>>> .../bpf/progs/btf__core_reloc_flavors.c       |   3 +
>>>> .../btf__core_reloc_flavors__err_wrong_name.c |   3 +
>>>> .../bpf/progs/btf__core_reloc_ints.c          |   3 +
>>>> .../bpf/progs/btf__core_reloc_ints___bool.c   |   3 +
>>>> .../btf__core_reloc_ints___err_bitfield.c     |   3 +
>>>> .../btf__core_reloc_ints___err_wrong_sz_16.c  |   3 +
>>>> .../btf__core_reloc_ints___err_wrong_sz_32.c  |   3 +
>>>> .../btf__core_reloc_ints___err_wrong_sz_64.c  |   3 +
>>>> .../btf__core_reloc_ints___err_wrong_sz_8.c   |   3 +
>>>> .../btf__core_reloc_ints___reverse_sign.c     |   3 +
>>>> .../bpf/progs/btf__core_reloc_mods.c          |   3 +
>>>> .../progs/btf__core_reloc_mods___mod_swap.c   |   3 +
>>>> .../progs/btf__core_reloc_mods___typedefs.c   |   3 +
>>>> .../bpf/progs/btf__core_reloc_nesting.c       |   3 +
>>>> .../btf__core_reloc_nesting___anon_embed.c    |   3 +
>>>> ...f__core_reloc_nesting___dup_compat_types.c |   5 +
>>>> ...core_reloc_nesting___err_array_container.c |   3 +
>>>> ...tf__core_reloc_nesting___err_array_field.c |   3 +
>>>> ...e_reloc_nesting___err_dup_incompat_types.c |   4 +
>>>> ...re_reloc_nesting___err_missing_container.c |   3 +
>>>> ...__core_reloc_nesting___err_missing_field.c |   3 +
>>>> ..._reloc_nesting___err_nonstruct_container.c |   3 +
>>>> ...e_reloc_nesting___err_partial_match_dups.c |   4 +
>>>> .../btf__core_reloc_nesting___err_too_deep.c  |   3 +
>>>> .../btf__core_reloc_nesting___extra_nesting.c |   3 +
>>>> ..._core_reloc_nesting___struct_union_mixup.c |   3 +
>>>> .../bpf/progs/btf__core_reloc_primitives.c    |   3 +
>>>> ...f__core_reloc_primitives___diff_enum_def.c |   3 +
>>>> ..._core_reloc_primitives___diff_func_proto.c |   3 +
>>>> ...f__core_reloc_primitives___diff_ptr_type.c |   3 +
>>>> ...tf__core_reloc_primitives___err_non_enum.c |   3 +
>>>> ...btf__core_reloc_primitives___err_non_int.c |   3 +
>>>> ...btf__core_reloc_primitives___err_non_ptr.c |   3 +
>>>> .../bpf/progs/btf__core_reloc_ptr_as_arr.c    |   3 +
>>>> .../btf__core_reloc_ptr_as_arr___diff_sz.c    |   3 +
>>>> .../selftests/bpf/progs/core_reloc_types.h    | 642 +++++++++++++
>>>> .../bpf/progs/test_core_reloc_arrays.c        |  58 ++
>>>> .../bpf/progs/test_core_reloc_flavors.c       |  65 ++
>>>> .../bpf/progs/test_core_reloc_ints.c          |  48 +
>>>> .../bpf/progs/test_core_reloc_kernel.c        |  39 +
>>>> .../bpf/progs/test_core_reloc_mods.c          |  68 ++
>>>> .../bpf/progs/test_core_reloc_nesting.c       |  48 +
>>>> .../bpf/progs/test_core_reloc_primitives.c    |  50 +
>>>> .../bpf/progs/test_core_reloc_ptr_as_arr.c    |  34 +
>>>> 58 files changed, 2527 insertions(+), 47 deletions(-)
>>>> create mode 100644 tools/testing/selftests/bpf/prog_tests/core_reloc.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___diff_arr_dim.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___diff_arr_val_sz.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_non_array.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_too_shallow.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_too_small.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_wrong_val_type1.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_wrong_val_type2.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_flavors.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_flavors__err_wrong_name.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___bool.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_bitfield.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_16.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_32.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_64.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_8.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___reverse_sign.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___anon_embed.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___dup_compat_types.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_array_container.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_array_field.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_dup_incompat_types.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_missing_container.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_missing_field.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_nonstruct_container.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_partial_match_dups.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_too_deep.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___extra_nesting.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___struct_union_mixup.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_enum_def.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_func_proto.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_ptr_type.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_enum.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_int.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_ptr.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ptr_as_arr.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ptr_as_arr___diff_sz.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/core_reloc_types.h
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_flavors.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_ints.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_mods.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_nesting.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_primitives.c
>>>> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_ptr_as_arr.c
>>> 
>>> We have created a lot of small files. Would it be cleaner if we can
>>> somehow put these
>>> data in one file (maybe different sections?).
>> 
>> After reading more, I guess you have tried this and end up with current
>> design: keep most struct defines in core_reloc_types.h.
> 
> Yeah, I have all the definition in one header file, but then I need
> individual combinations as separate BTFs, so I essentially "pick"
> desired types using function declarations. Creating those BTFs by hand
> would be a nightmare to create and maintain.
> 
>> 
>>> 
>>> Alternatively, maybe create a folder for these files:
>>>  tools/testing/selftests/bpf/progs/core/
>> 
>> I guess this would still make it cleaner.
> 
> There is nothing too special about core tests to split them. Also it
> would require Makefile changes and would deviate test_progs
> definitions from analogous test_maps, test_verifier, test_btf, etc, so
> I'm not sure about that. I though about putting those btf__* files
> under separate directory, but I'm on the fence there as well, as I'd
> rather have related files to stay together...
> 

I guess we can defer this decision to later. So I am OK with current
approach for now. 

Thanks,
Song


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-07-30  5:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>

Hi Andy, 

> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Andy, 
> 
>>>>> 
>>>> 
>>>> Well, yes. sys_bpf() is pretty powerful. 
>>>> 
>>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In 
>>>> the meanwhile, such users should not take down the whole system easily
>>>> by accident, e.g., with rm -rf /.
>>> 
>>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
>> 
>> This is a great idea! fscaps + /etc/bpfusers should do the trick. 
> 
> After some discussions and more thinking on this, I have some concerns 
> with the user space only approach.  
> 
> IIUC, your proposal for user space only approach is like: 
> 
> 1. bpftool (and other tools) check /etc/bpfusers and only do 
>   setuid for allowed users:
> 
> 	int main()
> 	{
> 		if (/* uid in /etc/bpfusers */)
> 			setuid(0);
> 		sys_bpf(...);
> 	}
> 
> 2. bpftool (and other tools) is installed with CAP_SETUID:
> 
> 	setcap cap_setuid=e+p /bin/bpftool
> 
> 3. sys admin maintains proper /etc/bpfusers. 
> 
> This approach is not ideal, because we need to trust the tool to give 
> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> or use other root only sys calls after setuid(0). 
> 

I would like more comments on this. 

Currently, bpf permission is more or less "root or nothing", which we 
would like to change. 

The short term goal is to separate bpf from root, in other words, it is 
"all or nothing". Special user space utilities, such as systemd, would
benefit from this. Once this is implemented, systemd can call sys_bpf() 
when it is not running as root. 

In longer term, it may be useful to provide finer grain permission of 
sys_bpf(). For example, sys_bpf() should be aware of containers; and
user may only have access to certain bpf maps. Let's call this 
"fine grain" capability. 

Since we are seeing new use cases every year, we will need many 
iterations to implement the fine grain permission. I think we need an 
API that is flexible enough to cover different types of permission 
control. 

For example, bpf_with_cap() can be flexible:

	bpf_with_cap(cmd, attr, size, perm_fd);

We can get different types of permission via different combinations of 
arguments:

    A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so 
    this is "all or nothing" permission. 

    A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some 
    commands to this specific cgroup. 

Alexei raised another idea in offline discussions: instead of adding
bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
permission for the _next_ sys_bpf() from current task:

    bpf(LOAD_PERM_FD, perm_fd);
    /* the next sys_bpf() uses permission from perm_fd */
    bpf(cmd, attr, size);

This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but 
doesn't require the new sys call. 

For both these ideas, we will start with /dev/bpf. As we grow the
fine grain permission control, fewer users/processes will need access 
to /dev/bpf. 

Please let us know your thought on this. Would this make /dev/bpf 
more reasonable? :-)

A few notes for previous discussions:

1. User space only approach doesn't work, even for "all or nothing" 
   permission control. I expanded the discussion in the previous
   email. Please let me know if I missed anything there. 

2. Permission control only at BPF_PROG_ATTACH time is not sufficient.
   We need permission control during BPF_PROG_LOAD, e.g., is_priv in 
   the verifier. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Tao Ren @ 2019-07-30  5:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andrew Jeffery, openbmc@lists.ozlabs.org
In-Reply-To: <20190730033558.GB20628@lunn.ch>

On 7/29/19 8:35 PM, Andrew Lunn wrote:
> On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
>> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
>> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
>> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
>> mode (different sets of MII Control/Status registers being used), let's
>> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.
> 
> Hi Tao
> 
> What exactly does it get wrong?
> 
>      Thanks
> 	Andrew

Hi Andrew,

BCM54616S is set to RGMII-Fiber (1000Base-X) mode on my platform, and none of the features (1000BaseT/100BaseT/10BaseT) can be detected by genphy_read_abilities(), because the PHY only reports 1000BaseX_Full|Half ability in this mode.


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-30  4:52 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <CA+h21hq1+E6-ScFx425hXwTPTZHTVZbBuAm7RROFZTBOFvD8vQ@mail.gmail.com>

On 7/29/19 6:32 PM, Vladimir Oltean wrote:
> Hi Tao,
> 
> On Tue, 30 Jul 2019 at 03:31, Tao Ren <taoren@fb.com> wrote:
>>
>> Configure the BCM54616S for 1000Base-X mode when "brcm-phy-mode-1000bx"
>> is set in device tree. This is needed when the PHY is used for fiber and
>> backplane connections.
>>
>> The patch is inspired by commit cd9af3dac6d1 ("PHYLIB: Add 1000Base-X
>> support for Broadcom bcm5482").
> 
> As far as I can see, for the commit you referenced,
> PHY_BCM_FLAGS_MODE_1000BX is referenced from nowhere in the entire
> mainline kernel:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_latest_ident_PHY-5FBCM-5FFLAGS-5FMODE-5F1000BX&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=iYElT7HC77pRZ3byVvW8ng&m=gy6Y-3Ylme-_GQcGF4fvOX10irgAT4xh253Weo0np38&s=KL__E2bvsmvUL-hBL9hUmOS5vyPQ92EMj6fEfByn8t8&e= 
> (it is supposed to be put by the MAC driver in phydev->dev_flags prior
> to calling phy_connect). But I don't see the point to this - can't you
> check for phydev->interface == PHY_INTERFACE_MODE_1000BASEX?
> This has the advantage that no MAC driver will need to know that it's
> talking to a Broadcom PHY. Additionally, no custom DT bindings are
> needed.
> Also, for backplane connections you probably want 1000Base-KX which
> has its own AN/LT, not plain 1000Base-X.

Thank you Vladimir for the quick review!
Perhaps I misunderstood the purpose of phydev->interface, and I thought it was usually used to defined the interface between MAC and PHY. For example, if I need to pass both "rgmii-id" and "1000base-x" from MAC to PHY driver, what would be the preferred way?

>> Signed-off-by: Tao Ren <taoren@fb.com>
>> ---
>>  drivers/net/phy/broadcom.c | 58 +++++++++++++++++++++++++++++++++++---
>>  include/linux/brcmphy.h    |  4 +--
>>  2 files changed, 56 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
>> index 2b4e41a9d35a..6c22ac3a844b 100644
>> --- a/drivers/net/phy/broadcom.c
>> +++ b/drivers/net/phy/broadcom.c
>> @@ -383,9 +383,9 @@ static int bcm5482_config_init(struct phy_device *phydev)
>>                 /*
>>                  * Select 1000BASE-X register set (primary SerDes)
>>                  */
>> -               reg = bcm_phy_read_shadow(phydev, BCM5482_SHD_MODE);
>> -               bcm_phy_write_shadow(phydev, BCM5482_SHD_MODE,
>> -                                    reg | BCM5482_SHD_MODE_1000BX);
>> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>>
>>                 /*
>>                  * LED1=ACTIVITYLED, LED3=LINKSPD[2]
>> @@ -451,6 +451,34 @@ static int bcm5481_config_aneg(struct phy_device *phydev)
>>         return ret;
>>  }
>>
>> +static int bcm54616s_config_init(struct phy_device *phydev)
>> +{
>> +       int err, reg;
>> +       struct device_node *np = phydev->mdio.dev.of_node;
>> +
>> +       err = bcm54xx_config_init(phydev);
>> +
>> +       if (of_property_read_bool(np, "brcm-phy-mode-1000bx")) {
>> +               /* Select 1000BASE-X register set. */
>> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>> +
>> +               /* Auto-negotiation doesn't seem to work quite right
>> +                * in this mode, so we disable it and force it to the
>> +                * right speed/duplex setting.  Only 'link status'
>> +                * is important.
>> +                */
>> +               phydev->autoneg = AUTONEG_DISABLE;
>> +               phydev->speed = SPEED_1000;
>> +               phydev->duplex = DUPLEX_FULL;
>> +
> 
> 1000Base-X AN does not include speed negotiation, so hardcoding
> SPEED_1000 is probably correct.
> What is wrong with the AN of duplex settings?

FULL_DUPLEX bit is set on my platform by default. Let me enable AN and test it out; will share you results tomorrow.

>> +               phydev->dev_flags |= PHY_BCM_FLAGS_MODE_1000BX;
>> +       }
>> +
>> +       return err;
>> +}
>> +
>>  static int bcm54616s_config_aneg(struct phy_device *phydev)
>>  {
>>         int ret;
>> @@ -464,6 +492,27 @@ static int bcm54616s_config_aneg(struct phy_device *phydev)
>>         return ret;
>>  }
>>
>> +static int bcm54616s_read_status(struct phy_device *phydev)
>> +{
>> +       int ret;
>> +
>> +       ret = genphy_read_status(phydev);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       if (phydev->dev_flags & PHY_BCM_FLAGS_MODE_1000BX) {
>> +               /* Only link status matters for 1000Base-X mode, so force
>> +                * 1000 Mbit/s full-duplex status.
>> +                */
>> +               if (phydev->link) {
>> +                       phydev->speed = SPEED_1000;
>> +                       phydev->duplex = DUPLEX_FULL;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>  static int brcm_phy_setbits(struct phy_device *phydev, int reg, int set)
>>  {
>>         int val;
>> @@ -651,8 +700,9 @@ static struct phy_driver broadcom_drivers[] = {
>>         .phy_id_mask    = 0xfffffff0,
>>         .name           = "Broadcom BCM54616S",
>>         .features       = PHY_GBIT_FEATURES,
>> -       .config_init    = bcm54xx_config_init,
>> +       .config_init    = bcm54616s_config_init,
>>         .config_aneg    = bcm54616s_config_aneg,
>> +       .read_status    = bcm54616s_read_status,
>>         .ack_interrupt  = bcm_phy_ack_intr,
>>         .config_intr    = bcm_phy_config_intr,
>>  }, {
>> diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
>> index 6db2d9a6e503..82030155558c 100644
>> --- a/include/linux/brcmphy.h
>> +++ b/include/linux/brcmphy.h
>> @@ -200,8 +200,8 @@
>>  #define BCM5482_SHD_SSD                0x14    /* 10100: Secondary SerDes control */
>>  #define BCM5482_SHD_SSD_LEDM   0x0008  /* SSD LED Mode enable */
>>  #define BCM5482_SHD_SSD_EN     0x0001  /* SSD enable */
>> -#define BCM5482_SHD_MODE       0x1f    /* 11111: Mode Control Register */
>> -#define BCM5482_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
>> +#define BCM54XX_SHD_MODE       0x1f    /* 11111: Mode Control Register */
>> +#define BCM54XX_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
> 
> These registers are also present on my BCM5464, probably safe to
> assume they're generic for the entire family.
> So if you make the registers definitions common, you can probably make
> the 1000Base-X configuration common as well.

If I understand correctly, your recommendation is to add a common function (such as "bcm54xx_config_1000bx") so it can be used by other BCM chips? Sure, I will take care of it.


Thanks,

Tao

^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-30  4:03 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <a0b26e4b-e288-cf44-049a-7d0b7f5696eb@gmail.com>



On 2019/7/30 4:57, Heiner Kallweit wrote:
> On 29.07.2019 05:59, liuyonglong wrote:
>>
>>
>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>> copper link status should read twice, or it may get a fake link
>>>> up status, and cause up->down->up at the first time when link up.
>>>> This happens more oftem at Realtek phy.
>>>>
>>> This is not correct, there is no fake link up status.
>>> Read the comment in genphy_update_link, only link-down events
>>> are latched. Means if the first read returns link up, then there
>>> is no need for a second read. And in polling mode we don't do a
>>> second read because we want to detect also short link drops.
>>>
>>> It would be helpful if you could describe your actual problem
>>> and whether you use polling or interrupt mode.
>>>
>>
>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
> 
> This should not happen. The PHY indicates link up w/o having aneg finished.
> 
>>
>> According to the datasheet:
>> reg 1.5=0 now, means copper auto-negotiation not complete
>> reg 1.2=1 now, means link is up
>>
>> We can see that, when we read the link up, the auto-negotiation
>> is not complete yet, so the speed is invalid.
>>
>> I don't know why this happen, maybe this state is keep from bios?
>> Or we may do something else in the phy initialize to fix it?
>> And also confuse that why read twice can fix it?
>>
> I suppose that basically any delay would do.
> 
>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>
>>>> I add a fake status read, and can solve this problem.
>>>>
>>>> I also see that in genphy_update_link(), had delete the fake
>>>> read in polling mode, so I don't know whether my solution is
>>>> correct.
>>>>
> 
> Can you test whether the following fixes the issue for you?
> Also it would be interesting which exact PHY models you tested
> and whether you built the respective PHY drivers or whether you
> rely on the genphy driver. Best use the second patch to get the
> needed info. It may make sense anyway to add the call to
> phy_attached_info() to the hns3 driver.
> 

[   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
[   40.932133] hns3 0000:bd:00.3 eth7: net open
[   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
[   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
[   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.

[   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
[   41.563511] hns3 0000:bd:00.2 eth6: net open
[   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
[   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
[   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.

I am using RTL8211F, but you can see that, both genphy driver and
RTL8211F driver have the same issue.

> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 6b5cb87f3..fbecfe210 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>  
>  	linkmode_zero(phydev->lp_advertising);
>  
> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
> +	if (phydev->autoneg == AUTONEG_ENABLE &&
> +	    (phydev->autoneg_complete || phydev->link)) {
>  		if (phydev->is_gigabit_capable) {
>  			lpagb = phy_read(phydev, MII_STAT1000);
>  			if (lpagb < 0)
> 

I have try this patch, have no effect. I suppose that at this time,
the autoneg actually not complete yet.

Maybe the wrong phy state is passed from bios? Invalid speed just
happen at the first time when ethX up, after that, repeat the
ifconfig down/ifconfig up command can not see that again.

So I think the bios should power off the phy(writing reg 1.11 to 1)
before it starts the OS? Or any other way to fix this in the OS?




^ permalink raw reply

* Re: [PATCH v4 13/13] ARM: dts: sunxi: Switch from phy to phy-handle
From: Chen-Yu Tsai @ 2019-07-30  3:51 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Mark Rutland, Rob Herring, Frank Rowand, David S . Miller,
	Maxime Coquelin, Alexandre Torgue, netdev, linux-arm-kernel,
	devicetree, linux-stm32, Maxime Chevallier, Antoine Ténart,
	Andrew Lunn, Florian Fainelli, Heiner Kallweit
In-Reply-To: <a1a33392c64c71099021fb49cc811a30790d40a8.1561649505.git-series.maxime.ripard@bootlin.com>

On Thu, Jun 27, 2019 at 11:32 PM Maxime Ripard
<maxime.ripard@bootlin.com> wrote:
>
> The phy device tree property has been deprecated in favor of phy-handle,
> let's replace it.
>
> Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>

This patch breaks Ethernet on all my dwmac-sunxi, i.e. old GMAC, boards, with
the following error messages:

    sun7i-dwmac 1c50000.ethernet eth0: no phy at addr -1
    sun7i-dwmac 1c50000.ethernet eth0: stmmac_open: Cannot attach to
PHY (error: -19)

Reverting this patch fixes it.

It also breaks the A10/A10s, but that's probably because the sun4i-emac
driver doesn't recognize the "phy-handle" property.

ChenYu

^ permalink raw reply

* Re: [PATCH] net: phy: phy_led_triggers: Fix a possible null-pointer dereference in phy_led_trigger_change_speed()
From: David Miller @ 2019-07-30  3:41 UTC (permalink / raw)
  To: andrew; +Cc: baijiaju1990, f.fainelli, hkallweit1, netdev, linux-kernel
In-Reply-To: <20190730033229.GA20628@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Tue, 30 Jul 2019 05:32:29 +0200

> On Tue, Jul 30, 2019 at 10:25:36AM +0800, Jia-Ju Bai wrote:
>> 
>> 
>> On 2019/7/29 21:45, Andrew Lunn wrote:
>> >On Mon, Jul 29, 2019 at 05:24:24PM +0800, Jia-Ju Bai wrote:
>> >>In phy_led_trigger_change_speed(), there is an if statement on line 48
>> >>to check whether phy->last_triggered is NULL:
>> >>     if (!phy->last_triggered)
>> >>
>> >>When phy->last_triggered is NULL, it is used on line 52:
>> >>     led_trigger_event(&phy->last_triggered->trigger, LED_OFF);
>> >>
>> >>Thus, a possible null-pointer dereference may occur.
>> >>
>> >>To fix this bug, led_trigger_event(&phy->last_triggered->trigger,
>> >>LED_OFF) is called when phy->last_triggered is not NULL.
>> >>
>> >>This bug is found by a static analysis tool STCheck written by us.
>> >Who is 'us'?
>> 
>> Me and my colleague...
> 
> Well, we can leave it very vague, giving no idea who 'us' is. But
> often you want to name the company behind it, or the university, or
> the sponsor, etc.

I agree, if you are going to mention that there is a tool you should be
clear exactly who and what organization are behind it.

Thank you.

^ permalink raw reply

* RE: [PATCH net-next v3 2/2] qed: Add driver API for flashing the config attributes.
From: Sudarsana Reddy Kalluru @ 2019-07-30  3:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Michal Kalderon, Ariel Elior
In-Reply-To: <20190729.110342.703558396264560468.davem@davemloft.net>

> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Monday, July 29, 2019 11:34 PM
> To: Sudarsana Reddy Kalluru <skalluru@marvell.com>
> Cc: netdev@vger.kernel.org; Michal Kalderon <mkalderon@marvell.com>;
> Ariel Elior <aelior@marvell.com>
> Subject: Re: [PATCH net-next v3 2/2] qed: Add driver API for flashing the
> config attributes.
> 
> From: Sudarsana Reddy Kalluru <skalluru@marvell.com>
> Date: Sat, 27 Jul 2019 18:55:49 -0700
> 
> > @@ -2268,6 +2330,9 @@ static int qed_nvm_flash(struct qed_dev *cdev,
> const char *name)
> >  			rc = qed_nvm_flash_image_access(cdev, &data,
> >  							&check_resp);
> >  			break;
> > +		case QED_NVM_FLASH_CMD_NVM_CFG_ID:
> > +			rc = qed_nvm_flash_cfg_write(cdev, &data);
> > +			break;

> >  		default:
> >  			DP_ERR(cdev, "Unknown command %08x\n",
> cmd_type);
> 
> I don't see how any existing portable interface can cause this new code to
> actually be used.
> 
> You have to explain this to me.
The API qed_nvm_flash() is used to flash the user provided data (e.g., Management FW) to the required partitions of the adapter. 
   - Format of the input file would be - file signature info, followed by one or more data sets.
   - Each data set is represented with the header followed by its contents. Header captures info such as command name (e.g., FILE_START), data size etc., which specifies how to handle the data.
The API qed_nvm_flash() validates the user provided input file, parses the data sets and handles each accordingly. Here one of the data sets (preferably the last one) could be nvm-attributes page (with cmd-id = QED_NVM_FLASH_CMD_NVM_CHANGE).  

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Andrew Lunn @ 2019-07-30  3:35 UTC (permalink / raw)
  To: Tao Ren
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean, netdev,
	linux-kernel, Andrew Jeffery, openbmc
In-Reply-To: <20190730002532.85509-1-taoren@fb.com>

On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
> mode (different sets of MII Control/Status registers being used), let's
> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.

Hi Tao

What exactly does it get wrong?

     Thanks
	Andrew

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox