Netdev List
 help / color / mirror / Atom feed
* Re: [patch iproute2 1/2] devlink: introduce cmdline option to switch to a different namespace
From: Jiri Pirko @ 2019-07-30  6:07 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, netdev, davem, jakub.kicinski,
	sthemmin, mlxsw
In-Reply-To: <d590ac7b-9dc7-41e2-e411-79ac4654c709@gmail.com>

Mon, Jul 29, 2019 at 10:21:11PM CEST, dsahern@gmail.com wrote:
>On 7/27/19 4:21 AM, Jiri Pirko wrote:
>>>> diff --git a/devlink/devlink.c b/devlink/devlink.c
>>>> index d8197ea3a478..9242cc05ad0c 100644
>>>> --- a/devlink/devlink.c
>>>> +++ b/devlink/devlink.c
>>>> @@ -32,6 +32,7 @@
>>>>  #include "mnlg.h"
>>>>  #include "json_writer.h"
>>>>  #include "utils.h"
>>>> +#include "namespace.h"
>>>>  
>>>>  #define ESWITCH_MODE_LEGACY "legacy"
>>>>  #define ESWITCH_MODE_SWITCHDEV "switchdev"
>>>> @@ -6332,7 +6333,7 @@ static int cmd_health(struct dl *dl)
>>>>  static void help(void)
>>>>  {
>>>>  	pr_err("Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }\n"
>>>> -	       "       devlink [ -f[orce] ] -b[atch] filename\n"
>>>> +	       "       devlink [ -f[orce] ] -b[atch] filename -N[etns]
>>>>  netnsname\n"
>>>
>>> 'ip' uses lower-case n for this; why not be consistent?
>> 
>> Because "n" is taken :/
>
>that's really unfortunate. commands within a package should have similar
>syntax and -n/N are backwards between ip/tc and devlink. That's the
>stuff that drives users crazy.

I agree. But this particular ship has sailed :(

^ permalink raw reply

* Re: [patch net-next 0/3] net: devlink: Finish network namespace support
From: Jiri Pirko @ 2019-07-30  6:08 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, davem, jakub.kicinski, sthemmin, mlxsw
In-Reply-To: <eebd6bc7-6466-2c93-4077-72a39f3b8596@gmail.com>

Mon, Jul 29, 2019 at 10:17:25PM CEST, dsahern@gmail.com wrote:
>On 7/27/19 3:44 AM, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Devlink from the beginning counts with network namespaces, but the
>> instances has been fixed to init_net. The first patch allows user
>> to move existing devlink instances into namespaces:
>> 
>
>so you intend for an asic, for example, to have multiple devlink
>instances where each instance governs a set of related ports (e.g.,
>ports that share a set of hardware resources) and those instances can be
>managed from distinct network namespaces?

No, no multiple devlink instances for asic intended.

>

^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: Heiner Kallweit @ 2019-07-30  6:08 UTC (permalink / raw)
  To: liuyonglong, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <1d4be6ad-ffe6-2325-ceab-9f35da617ee9@huawei.com>

On 30.07.2019 06:03, liuyonglong wrote:
> 
> 
> On 2019/7/30 4:57, Heiner Kallweit wrote:
>> On 29.07.2019 05:59, liuyonglong wrote:
>>>
>>>
>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>> copper link status should read twice, or it may get a fake link
>>>>> up status, and cause up->down->up at the first time when link up.
>>>>> This happens more oftem at Realtek phy.
>>>>>
>>>> This is not correct, there is no fake link up status.
>>>> Read the comment in genphy_update_link, only link-down events
>>>> are latched. Means if the first read returns link up, then there
>>>> is no need for a second read. And in polling mode we don't do a
>>>> second read because we want to detect also short link drops.
>>>>
>>>> It would be helpful if you could describe your actual problem
>>>> and whether you use polling or interrupt mode.
>>>>
>>>
>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>
>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>
>>>
>>> According to the datasheet:
>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>> reg 1.2=1 now, means link is up
>>>
>>> We can see that, when we read the link up, the auto-negotiation
>>> is not complete yet, so the speed is invalid.
>>>
>>> I don't know why this happen, maybe this state is keep from bios?
>>> Or we may do something else in the phy initialize to fix it?
>>> And also confuse that why read twice can fix it?
>>>
>> I suppose that basically any delay would do.
>>
>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>
>>>>> I add a fake status read, and can solve this problem.
>>>>>
>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>> read in polling mode, so I don't know whether my solution is
>>>>> correct.
>>>>>
>>
>> Can you test whether the following fixes the issue for you?
>> Also it would be interesting which exact PHY models you tested
>> and whether you built the respective PHY drivers or whether you
>> rely on the genphy driver. Best use the second patch to get the
>> needed info. It may make sense anyway to add the call to
>> phy_attached_info() to the hns3 driver.
>>
> 
> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
> [   40.932133] hns3 0000:bd:00.3 eth7: net open
> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
> 
> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
> [   41.563511] hns3 0000:bd:00.2 eth6: net open
> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
> 
> I am using RTL8211F, but you can see that, both genphy driver and
> RTL8211F driver have the same issue.
> 
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 6b5cb87f3..fbecfe210 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>  
>>  	linkmode_zero(phydev->lp_advertising);
>>  
>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>> +	    (phydev->autoneg_complete || phydev->link)) {
>>  		if (phydev->is_gigabit_capable) {
>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>  			if (lpagb < 0)
>>
> 
> I have try this patch, have no effect. I suppose that at this time,
> the autoneg actually not complete yet.
> 
> Maybe the wrong phy state is passed from bios? Invalid speed just
> happen at the first time when ethX up, after that, repeat the
> ifconfig down/ifconfig up command can not see that again.
> 
> So I think the bios should power off the phy(writing reg 1.11 to 1)
> before it starts the OS? Or any other way to fix this in the OS?
> 
To get a better idea of whats going on it would be good to see a full
MDIO trace. Can you enable MDIO tracing via the following sysctl file
/sys/kernel/debug/tracing/events/mdio/enable
and provide the generated trace?

Due to polling mode each second entries will be generated, so you
better stop network after the issue occurred.

> 
> 
> 

Heiner

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-30  6:27 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190729175136.GA28572@splinter>

The 07/29/2019 20:51, Ido Schimmel wrote:
> Can you please clarify what you're trying to achieve? I just read the
> thread again and my impression is that you're trying to locally receive
> packets with a certain link layer multicast address.
Yes. The thread is also a bit confusing because we half way through realized
that we misunderstood how the multicast packets should be handled (sorry about
that). To begin with we had a driver where multicast packets was only copied to
the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
how other drivers are doing it, so we changed the driver to include the CPU in
the default multicast flood-mask.

This changes the objective a bit. To begin with we needed to get more packets to
the CPU (which could have been done using tc ingress rules and a trap action).

Now after we changed the driver, we realized that we need something to limit the
flooding of certain L2 multicast packets. This is the new problem we are trying
to solve!

Example: Say we have a bridge with 4 slave interfaces, then we want to install a
forwarding rule saying that packets to a given L2-multicast MAC address, should
only be flooded to 2 of the 4 ports.

(instead of adding rules to get certain packets to the CPU, we are now adding
other rules to prevent other packets from going to the CPU and other ports where
they are not needed/wanted).

This is exactly the same thing as IGMP snooping does dynamically, but only for
IP multicast.

The "bridge mdb" allow users to manually/static add/del a port to a multicast
group, but still it operates on IP multicast address (not L2 multicast
addresses).

> Nik suggested SIOCADDMULTI.
It is not clear to me how this should be used to limit the flooding, maybe we
can make some hacks, but as far as I understand the intend of this is maintain
the list of addresses an interface should receive. I'm not sure this should
influence how for forwarding decisions are being made.

> and I suggested a tc filter to get the packet to the CPU.
The TC solution is a good solution to the original problem where wanted to copy
more frames to the CPU. But we were convinced that this is not the right
approach, and that the CPU by default should receive all multicast packets, and
we should instead try to find a way to limit the flooding of certain frames as
an optimization.

> If you now want to limit the ports to which this packet is flooded, then
> you can use tc filters in *software*:
> 
> # tc qdisc add dev eth2 clsact
> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> 	dst_mac 01:21:6C:00:00:01 action drop
Yes. This can work in the SW bridge.

> If you want to forward the packet in hardware and locally receive it,
> you can chain several mirred action and then a trap action.
I'm not I fully understand how this should be done, but it does sound like it
becomes quite complicated. Also, as far as I understand it will mean that we
will be using TCAM/ACL resources to do something that could have been done with
a simple MAC entry.

> Both options avoid HW egress ACLs which your design does not support.
True, but what is wrong with expanding the functionality of the normal
forwarding/MAC operations to allow multiple destinations?

It is not an uncommon feature (I just browsed the manual of some common L2
switches and they all has this feature).

It seems to fit nicely into the existing user-interface:

bridge fdb add    01:21:6C:00:00:01 port eth0
bridge fdb append 01:21:6C:00:00:01 port eth1

It seems that it can be added to the existing implementation with out adding
significant complexity.

It will be easy to offload in HW.

I do not believe that it will be a performance issue, if this is a concern then
we may have to do a bit of benchmarking, or we can make it a configuration
option.

Long story short, we (Horatiu and I) learned a lot from the discussion here, and
I think we should try do a new patch with the learning we got. Then it is easier
to see what it actually means to the exiting code, complexity, exiting drivers,
performance, default behavioral, backwards compatibly, and other valid concerns.

If the patch is no good, and cannot be fixed, then we will go back and look
further into alternative solutions.

-- 
/Allan



^ permalink raw reply

* Re: [PATCH net-next] can: fix ioctl function removal
From: Marc Kleine-Budde @ 2019-07-30  6:30 UTC (permalink / raw)
  To: Oliver Hartkopp, davem, netdev; +Cc: linux-can, kernel test robot
In-Reply-To: <20190729204056.2976-1-socketcan@hartkopp.net>


[-- Attachment #1.1: Type: text/plain, Size: 1183 bytes --]

On 7/29/19 10:40 PM, Oliver Hartkopp wrote:
> Commit 60649d4e0af ("can: remove obsolete empty ioctl() handler") replaced the
> almost empty can_ioctl() function with sock_no_ioctl() which always returns
> -EOPNOTSUPP.
> 
> Even though we don't have any ioctl() functions on socket/network layer we need
> to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
> to the network driver layer.
> 
> This patch fixes the wrong return codes in the CAN network layer protocols.
> 
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Fixes: 60649d4e0af ("can: remove obsolete empty ioctl() handler")
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>

David Miller has already applied the patch.

| commit 473d924d7d46cb57aa4c1863261d18366af345af
| Author: Oliver Hartkopp <socketcan@hartkopp.net>
| Date:   Mon Jul 29 22:40:56 2019 +0200

Thanks,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-30  6:35 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <5087ee34-5776-f02b-d7e5-bce005ba3b92@gmail.com>

:/sys/kernel/debug/tracing$ cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 45/45   #P:128
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
    kworker/64:2-1028  [064] ....   172.295687: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.295726: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.296902: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.296938: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.321213: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.343209: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.343245: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.343882: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.343918: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.362658: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.385961: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.385996: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.386646: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.386681: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.411286: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x00 val:0x1040
    kworker/64:2-1028  [064] ....   172.433225: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x02 val:0x001c
    kworker/64:2-1028  [064] ....   172.433260: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x03 val:0xc916
    kworker/64:2-1028  [064] ....   172.433887: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
    kworker/64:2-1028  [064] ....   172.433922: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0f val:0x2000
    kworker/64:2-1028  [064] ....   172.452862: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
        ifconfig-1324  [011] ....   177.325585: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
  kworker/u257:0-8     [012] ....   177.325642: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x04 val:0x01e1
  kworker/u257:0-8     [012] ....   177.325654: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x04 val:0x05e1
  kworker/u257:0-8     [012] ....   177.325708: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [012] ....   177.325744: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [012] ....   177.325779: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
  kworker/u257:0-8     [012] ....   177.325788: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1240
  kworker/u257:0-8     [012] ....   177.325843: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x798d
  kworker/u257:0-8     [003] ....   178.360488: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   179.384479: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   180.408477: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
  kworker/u257:0-8     [000] ....   181.432474: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79a9
  kworker/u257:0-8     [000] ....   181.432510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x7800
  kworker/u257:0-8     [000] ....   181.432546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [000] ....   181.432582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
  kworker/u257:0-8     [000] ....   182.456510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   182.456546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x4800
  kworker/u257:0-8     [000] ....   182.456582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
  kworker/u257:0-8     [000] ....   182.456618: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
  kworker/u257:0-8     [001] ....   183.480476: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   184.504478: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   185.528486: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
  kworker/u257:0-8     [000] ....   186.552475: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
        ifconfig-1327  [011] ....   187.196036: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
        ifconfig-1327  [011] ....   187.196046: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1840



On 2019/7/30 14:08, Heiner Kallweit wrote:
> On 30.07.2019 06:03, liuyonglong wrote:
>>
>>
>> On 2019/7/30 4:57, Heiner Kallweit wrote:
>>> On 29.07.2019 05:59, liuyonglong wrote:
>>>>
>>>>
>>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>>> copper link status should read twice, or it may get a fake link
>>>>>> up status, and cause up->down->up at the first time when link up.
>>>>>> This happens more oftem at Realtek phy.
>>>>>>
>>>>> This is not correct, there is no fake link up status.
>>>>> Read the comment in genphy_update_link, only link-down events
>>>>> are latched. Means if the first read returns link up, then there
>>>>> is no need for a second read. And in polling mode we don't do a
>>>>> second read because we want to detect also short link drops.
>>>>>
>>>>> It would be helpful if you could describe your actual problem
>>>>> and whether you use polling or interrupt mode.
>>>>>
>>>>
>>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>>
>>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>>
>>>>
>>>> According to the datasheet:
>>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>>> reg 1.2=1 now, means link is up
>>>>
>>>> We can see that, when we read the link up, the auto-negotiation
>>>> is not complete yet, so the speed is invalid.
>>>>
>>>> I don't know why this happen, maybe this state is keep from bios?
>>>> Or we may do something else in the phy initialize to fix it?
>>>> And also confuse that why read twice can fix it?
>>>>
>>> I suppose that basically any delay would do.
>>>
>>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>
>>>>>> I add a fake status read, and can solve this problem.
>>>>>>
>>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>>> read in polling mode, so I don't know whether my solution is
>>>>>> correct.
>>>>>>
>>>
>>> Can you test whether the following fixes the issue for you?
>>> Also it would be interesting which exact PHY models you tested
>>> and whether you built the respective PHY drivers or whether you
>>> rely on the genphy driver. Best use the second patch to get the
>>> needed info. It may make sense anyway to add the call to
>>> phy_attached_info() to the hns3 driver.
>>>
>>
>> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
>> [   40.932133] hns3 0000:bd:00.3 eth7: net open
>> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
>> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
>> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>
>> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
>> [   41.563511] hns3 0000:bd:00.2 eth6: net open
>> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
>> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
>> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
>> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
>>
>> I am using RTL8211F, but you can see that, both genphy driver and
>> RTL8211F driver have the same issue.
>>
>>>
>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>> index 6b5cb87f3..fbecfe210 100644
>>> --- a/drivers/net/phy/phy_device.c
>>> +++ b/drivers/net/phy/phy_device.c
>>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>>  
>>>  	linkmode_zero(phydev->lp_advertising);
>>>  
>>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>>> +	    (phydev->autoneg_complete || phydev->link)) {
>>>  		if (phydev->is_gigabit_capable) {
>>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>>  			if (lpagb < 0)
>>>
>>
>> I have try this patch, have no effect. I suppose that at this time,
>> the autoneg actually not complete yet.
>>
>> Maybe the wrong phy state is passed from bios? Invalid speed just
>> happen at the first time when ethX up, after that, repeat the
>> ifconfig down/ifconfig up command can not see that again.
>>
>> So I think the bios should power off the phy(writing reg 1.11 to 1)
>> before it starts the OS? Or any other way to fix this in the OS?
>>
> To get a better idea of whats going on it would be good to see a full
> MDIO trace. Can you enable MDIO tracing via the following sysctl file
> /sys/kernel/debug/tracing/events/mdio/enable
> and provide the generated trace?
> 
> Due to polling mode each second entries will be generated, so you
> better stop network after the issue occurred.
> 
>>
>>
>>
> 
> Heiner
> 
> .
> 


^ permalink raw reply

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-30  6:39 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <03708d00-a8d9-4a9d-4188-9fe0e38de2b8@huawei.com>



On 2019/7/30 14:35, liuyonglong wrote:
> :/sys/kernel/debug/tracing$ cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 45/45   #P:128
> #
> #                              _-----=> irqs-off
> #                             / _----=> need-resched
> #                            | / _---=> hardirq/softirq
> #                            || / _--=> preempt-depth
> #                            ||| /     delay
> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> #              | |       |   ||||       |         |
>     kworker/64:2-1028  [064] ....   172.295687: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.295726: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.296902: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.296938: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.321213: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.343209: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.343245: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.343882: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.343918: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.362658: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.385961: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.385996: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.386646: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.386681: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.411286: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x00 val:0x1040
>     kworker/64:2-1028  [064] ....   172.433225: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x02 val:0x001c
>     kworker/64:2-1028  [064] ....   172.433260: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x03 val:0xc916
>     kworker/64:2-1028  [064] ....   172.433887: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>     kworker/64:2-1028  [064] ....   172.433922: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0f val:0x2000
>     kworker/64:2-1028  [064] ....   172.452862: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>         ifconfig-1324  [011] ....   177.325585: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>   kworker/u257:0-8     [012] ....   177.325642: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x04 val:0x01e1
>   kworker/u257:0-8     [012] ....   177.325654: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x04 val:0x05e1
>   kworker/u257:0-8     [012] ....   177.325708: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [012] ....   177.325744: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [012] ....   177.325779: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>   kworker/u257:0-8     [012] ....   177.325788: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1240
>   kworker/u257:0-8     [012] ....   177.325843: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x798d
>   kworker/u257:0-8     [003] ....   178.360488: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   179.384479: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   180.408477: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x7989
>   kworker/u257:0-8     [000] ....   181.432474: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79a9
>   kworker/u257:0-8     [000] ....   181.432510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x7800
>   kworker/u257:0-8     [000] ....   181.432546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [000] ....   181.432582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
>   kworker/u257:0-8     [000] ....   182.456510: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   182.456546: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0a val:0x4800
>   kworker/u257:0-8     [000] ....   182.456582: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>   kworker/u257:0-8     [000] ....   182.456618: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x05 val:0xc1e1
>   kworker/u257:0-8     [001] ....   183.480476: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   184.504478: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   185.528486: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>   kworker/u257:0-8     [000] ....   186.552475: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>         ifconfig-1327  [011] ....   187.196036: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>         ifconfig-1327  [011] ....   187.196046: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1840
> 

Add dmesg below:

[  177.325619] hns3 0000:bd:00.1 eth5: net open
[  177.325859] hns3 0000:bd:00.1: invalid speed (-1)
[  177.330180] 8021q: adding VLAN 0 to HW filter on device eth5
[  177.334569] hns3 0000:bd:00.1 eth5: failed to adjust link.
[  177.345674] IPv6: ADDRCONF(NETDEV_CHANGE): eth5: link becomes ready
[  178.021616] hns3 0000:bd:00.1 eth5: link up
[  178.808459] hns3 0000:bd:00.1 eth5: link down
[  182.808452] hns3 0000:bd:00.1 eth5: link up
[  187.191563] hns3 0000:bd:00.1 eth5: net stop
[  187.196049] hns3 0000:bd:00.1 eth5: link down


> 
> 
> On 2019/7/30 14:08, Heiner Kallweit wrote:
>> On 30.07.2019 06:03, liuyonglong wrote:
>>>
>>>
>>> On 2019/7/30 4:57, Heiner Kallweit wrote:
>>>> On 29.07.2019 05:59, liuyonglong wrote:
>>>>>
>>>>>
>>>>> On 2019/7/27 2:14, Heiner Kallweit wrote:
>>>>>> On 26.07.2019 11:53, Yonglong Liu wrote:
>>>>>>> According to the datasheet of Marvell phy and Realtek phy, the
>>>>>>> copper link status should read twice, or it may get a fake link
>>>>>>> up status, and cause up->down->up at the first time when link up.
>>>>>>> This happens more oftem at Realtek phy.
>>>>>>>
>>>>>> This is not correct, there is no fake link up status.
>>>>>> Read the comment in genphy_update_link, only link-down events
>>>>>> are latched. Means if the first read returns link up, then there
>>>>>> is no need for a second read. And in polling mode we don't do a
>>>>>> second read because we want to detect also short link drops.
>>>>>>
>>>>>> It would be helpful if you could describe your actual problem
>>>>>> and whether you use polling or interrupt mode.
>>>>>>
>>>>>
>>>>> [   44.498633] hns3 0000:bd:00.1 eth5: net open
>>>>> [   44.504273] hns3 0000:bd:00.1: reg=0x1, data=0x79ad -> called from phy_start_aneg
>>>>> [   44.532348] hns3 0000:bd:00.1: reg=0x1, data=0x798d -> called from phy_state_machine,update link.
>>>>
>>>> This should not happen. The PHY indicates link up w/o having aneg finished.
>>>>
>>>>>
>>>>> According to the datasheet:
>>>>> reg 1.5=0 now, means copper auto-negotiation not complete
>>>>> reg 1.2=1 now, means link is up
>>>>>
>>>>> We can see that, when we read the link up, the auto-negotiation
>>>>> is not complete yet, so the speed is invalid.
>>>>>
>>>>> I don't know why this happen, maybe this state is keep from bios?
>>>>> Or we may do something else in the phy initialize to fix it?
>>>>> And also confuse that why read twice can fix it?
>>>>>
>>>> I suppose that basically any delay would do.
>>>>
>>>>> [   44.554063] hns3 0000:bd:00.1: invalid speed (-1)
>>>>> [   44.560412] hns3 0000:bd:00.1 eth5: failed to adjust link.
>>>>> [   45.194870] hns3 0000:bd:00.1 eth5: link up
>>>>> [   45.574095] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>>> [   46.150051] hns3 0000:bd:00.1 eth5: link down
>>>>> [   46.598074] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x7989
>>>>> [   47.622075] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79a9
>>>>> [   48.646077] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>> [   48.934050] hns3 0000:bd:00.1 eth5: link up
>>>>> [   49.702140] hns3 0000:bd:00.1: phyid=3, reg=0x1, data=0x79ad
>>>>>
>>>>>>> I add a fake status read, and can solve this problem.
>>>>>>>
>>>>>>> I also see that in genphy_update_link(), had delete the fake
>>>>>>> read in polling mode, so I don't know whether my solution is
>>>>>>> correct.
>>>>>>>
>>>>
>>>> Can you test whether the following fixes the issue for you?
>>>> Also it would be interesting which exact PHY models you tested
>>>> and whether you built the respective PHY drivers or whether you
>>>> rely on the genphy driver. Best use the second patch to get the
>>>> needed info. It may make sense anyway to add the call to
>>>> phy_attached_info() to the hns3 driver.
>>>>
>>>
>>> [   40.100716] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: attached PHY driver [RTL8211F Gigabit Ethernet] (mii_bus:phy_addr=mii-0000:bd:00.3:07, irq=POLL)
>>> [   40.932133] hns3 0000:bd:00.3 eth7: net open
>>> [   40.932458] hns3 0000:bd:00.3: invalid speed (-1)
>>> [   40.932541] 8021q: adding VLAN 0 to HW filter on device eth7
>>> [   40.937149] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>
>>> [   40.662050] Generic PHY mii-0000:bd:00.2:05: attached PHY driver [Generic PHY] (mii_bus:phy_addr=mii-0000:bd:00.2:05, irq=POLL)
>>> [   41.563511] hns3 0000:bd:00.2 eth6: net open
>>> [   41.563853] hns3 0000:bd:00.2: invalid speed (-1)
>>> [   41.563943] 8021q: adding VLAN 0 to HW filter on device eth6
>>> [   41.568578] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
>>> [   41.568898] hns3 0000:bd:00.2 eth6: failed to adjust link.
>>>
>>> I am using RTL8211F, but you can see that, both genphy driver and
>>> RTL8211F driver have the same issue.
>>>
>>>>
>>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>>> index 6b5cb87f3..fbecfe210 100644
>>>> --- a/drivers/net/phy/phy_device.c
>>>> +++ b/drivers/net/phy/phy_device.c
>>>> @@ -1807,7 +1807,8 @@ int genphy_read_status(struct phy_device *phydev)
>>>>  
>>>>  	linkmode_zero(phydev->lp_advertising);
>>>>  
>>>> -	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
>>>> +	if (phydev->autoneg == AUTONEG_ENABLE &&
>>>> +	    (phydev->autoneg_complete || phydev->link)) {
>>>>  		if (phydev->is_gigabit_capable) {
>>>>  			lpagb = phy_read(phydev, MII_STAT1000);
>>>>  			if (lpagb < 0)
>>>>
>>>
>>> I have try this patch, have no effect. I suppose that at this time,
>>> the autoneg actually not complete yet.
>>>
>>> Maybe the wrong phy state is passed from bios? Invalid speed just
>>> happen at the first time when ethX up, after that, repeat the
>>> ifconfig down/ifconfig up command can not see that again.
>>>
>>> So I think the bios should power off the phy(writing reg 1.11 to 1)
>>> before it starts the OS? Or any other way to fix this in the OS?
>>>
>> To get a better idea of whats going on it would be good to see a full
>> MDIO trace. Can you enable MDIO tracing via the following sysctl file
>> /sys/kernel/debug/tracing/events/mdio/enable
>> and provide the generated trace?
>>
>> Due to polling mode each second entries will be generated, so you
>> better stop network after the issue occurred.
>>
>>>
>>>
>>>
>>
>> Heiner
>>
>> .
>>
> 
> 
> .
> 


^ permalink raw reply

* [PATCH net-next v2] can: fix ioctl function removal
From: Oliver Hartkopp @ 2019-07-30  6:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-can, sfr, Oliver Hartkopp, kernel test robot

Commit 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler") replaced
the almost empty can_ioctl() function with sock_no_ioctl() which always returns
-EOPNOTSUPP.

Even though we don't have any ioctl() functions on socket/network layer we need
to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
to the network driver layer.

This patch fixes the wrong return codes in the CAN network layer protocols.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
---

 v2: Changed SHA1 tag to be a least 12 digits long

 net/can/bcm.c | 9 ++++++++-
 net/can/raw.c | 9 ++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index 8da986b19d88..bf1d0bbecec8 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,6 +1680,13 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
+int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+			 unsigned long arg)
+{
+	/* no ioctls for socket layer -> hand it down to NIC layer */
+	return -ENOIOCTLCMD;
+}
+
 static const struct proto_ops bcm_ops = {
 	.family        = PF_CAN,
 	.release       = bcm_release,
@@ -1689,7 +1696,7 @@ static const struct proto_ops bcm_ops = {
 	.accept        = sock_no_accept,
 	.getname       = sock_no_getname,
 	.poll          = datagram_poll,
-	.ioctl         = sock_no_ioctl,
+	.ioctl         = bcm_sock_no_ioctlcmd,
 	.gettstamp     = sock_gettstamp,
 	.listen        = sock_no_listen,
 	.shutdown      = sock_no_shutdown,
diff --git a/net/can/raw.c b/net/can/raw.c
index ff720272f7b7..da386f1fa815 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,6 +837,13 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
+int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+			 unsigned long arg)
+{
+	/* no ioctls for socket layer -> hand it down to NIC layer */
+	return -ENOIOCTLCMD;
+}
+
 static const struct proto_ops raw_ops = {
 	.family        = PF_CAN,
 	.release       = raw_release,
@@ -846,7 +853,7 @@ static const struct proto_ops raw_ops = {
 	.accept        = sock_no_accept,
 	.getname       = raw_getname,
 	.poll          = datagram_poll,
-	.ioctl         = sock_no_ioctl,
+	.ioctl         = raw_sock_no_ioctlcmd,
 	.gettstamp     = sock_gettstamp,
 	.listen        = sock_no_listen,
 	.shutdown      = sock_no_shutdown,
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next v2] can: fix ioctl function removal
From: Oliver Hartkopp @ 2019-07-30  6:48 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-can, sfr, kernel test robot, Marc Kleine-Budde
In-Reply-To: <20190730064333.1581-1-socketcan@hartkopp.net>

According to Marc the original patch has already been applied by Dave.

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=473d924d7d46cb57aa4c1863261d18366af345af

Thanks for the support & sorry for the noise!

Best regards,
Oliver

On 30/07/2019 08.43, Oliver Hartkopp wrote:
> Commit 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler") replaced
> the almost empty can_ioctl() function with sock_no_ioctl() which always returns
> -EOPNOTSUPP.
> 
> Even though we don't have any ioctl() functions on socket/network layer we need
> to return -ENOIOCTLCMD to be able to forward ioctl commands like SIOCGIFINDEX
> to the network driver layer.
> 
> This patch fixes the wrong return codes in the CAN network layer protocols.
> 
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Fixes: 60649d4e0af6c26b ("can: remove obsolete empty ioctl() handler")
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
> ---
> 
>   v2: Changed SHA1 tag to be a least 12 digits long
> 
>   net/can/bcm.c | 9 ++++++++-
>   net/can/raw.c | 9 ++++++++-
>   2 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index 8da986b19d88..bf1d0bbecec8 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1680,6 +1680,13 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>   	return size;
>   }
>   
> +int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +			 unsigned long arg)
> +{
> +	/* no ioctls for socket layer -> hand it down to NIC layer */
> +	return -ENOIOCTLCMD;
> +}
> +
>   static const struct proto_ops bcm_ops = {
>   	.family        = PF_CAN,
>   	.release       = bcm_release,
> @@ -1689,7 +1696,7 @@ static const struct proto_ops bcm_ops = {
>   	.accept        = sock_no_accept,
>   	.getname       = sock_no_getname,
>   	.poll          = datagram_poll,
> -	.ioctl         = sock_no_ioctl,
> +	.ioctl         = bcm_sock_no_ioctlcmd,
>   	.gettstamp     = sock_gettstamp,
>   	.listen        = sock_no_listen,
>   	.shutdown      = sock_no_shutdown,
> diff --git a/net/can/raw.c b/net/can/raw.c
> index ff720272f7b7..da386f1fa815 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -837,6 +837,13 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>   	return size;
>   }
>   
> +int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +			 unsigned long arg)
> +{
> +	/* no ioctls for socket layer -> hand it down to NIC layer */
> +	return -ENOIOCTLCMD;
> +}
> +
>   static const struct proto_ops raw_ops = {
>   	.family        = PF_CAN,
>   	.release       = raw_release,
> @@ -846,7 +853,7 @@ static const struct proto_ops raw_ops = {
>   	.accept        = sock_no_accept,
>   	.getname       = raw_getname,
>   	.poll          = datagram_poll,
> -	.ioctl         = sock_no_ioctl,
> +	.ioctl         = raw_sock_no_ioctlcmd,
>   	.gettstamp     = sock_gettstamp,
>   	.listen        = sock_no_listen,
>   	.shutdown      = sock_no_shutdown,
> 

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Ido Schimmel @ 2019-07-30  7:06 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190730062721.p4vrxo5sxbtulkrx@lx-anielsen.microsemi.net>

On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
> The 07/29/2019 20:51, Ido Schimmel wrote:
> > Can you please clarify what you're trying to achieve? I just read the
> > thread again and my impression is that you're trying to locally receive
> > packets with a certain link layer multicast address.
> Yes. The thread is also a bit confusing because we half way through realized
> that we misunderstood how the multicast packets should be handled (sorry about
> that). To begin with we had a driver where multicast packets was only copied to
> the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
> how other drivers are doing it, so we changed the driver to include the CPU in
> the default multicast flood-mask.

OK, so what prevents you from removing all other ports from the
flood-mask and letting the CPU handle the flooding? Then you can install
software tc filters to limit the flooding.

> This changes the objective a bit. To begin with we needed to get more packets to
> the CPU (which could have been done using tc ingress rules and a trap action).
> 
> Now after we changed the driver, we realized that we need something to limit the
> flooding of certain L2 multicast packets. This is the new problem we are trying
> to solve!
> 
> Example: Say we have a bridge with 4 slave interfaces, then we want to install a
> forwarding rule saying that packets to a given L2-multicast MAC address, should
> only be flooded to 2 of the 4 ports.
> 
> (instead of adding rules to get certain packets to the CPU, we are now adding
> other rules to prevent other packets from going to the CPU and other ports where
> they are not needed/wanted).
> 
> This is exactly the same thing as IGMP snooping does dynamically, but only for
> IP multicast.
> 
> The "bridge mdb" allow users to manually/static add/del a port to a multicast
> group, but still it operates on IP multicast address (not L2 multicast
> addresses).
> 
> > Nik suggested SIOCADDMULTI.
> It is not clear to me how this should be used to limit the flooding, maybe we
> can make some hacks, but as far as I understand the intend of this is maintain
> the list of addresses an interface should receive. I'm not sure this should
> influence how for forwarding decisions are being made.
> 
> > and I suggested a tc filter to get the packet to the CPU.
> The TC solution is a good solution to the original problem where wanted to copy
> more frames to the CPU. But we were convinced that this is not the right
> approach, and that the CPU by default should receive all multicast packets, and
> we should instead try to find a way to limit the flooding of certain frames as
> an optimization.

This can still work. In Linux, ingress tc filters are executed before the
bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
eth2:

# tc filter add dev eth0 ingress pref 1 flower skip_sw \
	dst_mac 01:21:6C:00:00:01 action trap
# tc filter add dev eth2 egress pref 1 flower skip_hw \
	dst_mac 01:21:6C:00:00:01 action drop

The first filter is only present in HW ('skip_sw') and should result in
your HW passing you the sole copy of the packet.

The second filter is only present in SW ('skip_hw', not using HW egress
ACL that you don't have) and drops the packet after it was flooded by
the SW bridge.

As I mentioned earlier, you can install the filter once in your HW and
share it between different ports using a shared block. This means you
only consume one TCAM entry.

Note that this allows you to keep flooding all other multicast packets
in HW.

> > If you now want to limit the ports to which this packet is flooded, then
> > you can use tc filters in *software*:
> > 
> > # tc qdisc add dev eth2 clsact
> > # tc filter add dev eth2 egress pref 1 flower skip_hw \
> > 	dst_mac 01:21:6C:00:00:01 action drop
> Yes. This can work in the SW bridge.
> 
> > If you want to forward the packet in hardware and locally receive it,
> > you can chain several mirred action and then a trap action.
> I'm not I fully understand how this should be done, but it does sound like it
> becomes quite complicated. Also, as far as I understand it will mean that we
> will be using TCAM/ACL resources to do something that could have been done with
> a simple MAC entry.
> 
> > Both options avoid HW egress ACLs which your design does not support.
> True, but what is wrong with expanding the functionality of the normal
> forwarding/MAC operations to allow multiple destinations?
> 
> It is not an uncommon feature (I just browsed the manual of some common L2
> switches and they all has this feature).
> 
> It seems to fit nicely into the existing user-interface:
> 
> bridge fdb add    01:21:6C:00:00:01 port eth0
> bridge fdb append 01:21:6C:00:00:01 port eth1

Wouldn't it be better to instead extend the MDB entries so that they are
either keyed by IP or MAC? I believe FDB should remain as unicast-only.
As a bonus, existing drivers could benefit from it, as MDB entries are
already notified by MAC.

> 
> It seems that it can be added to the existing implementation with out adding
> significant complexity.
> 
> It will be easy to offload in HW.
> 
> I do not believe that it will be a performance issue, if this is a concern then
> we may have to do a bit of benchmarking, or we can make it a configuration
> option.
> 
> Long story short, we (Horatiu and I) learned a lot from the discussion here, and
> I think we should try do a new patch with the learning we got. Then it is easier
> to see what it actually means to the exiting code, complexity, exiting drivers,
> performance, default behavioral, backwards compatibly, and other valid concerns.
> 
> If the patch is no good, and cannot be fixed, then we will go back and look
> further into alternative solutions.

Overall, I tend to agree with Nik. I think your use case is too specific
to justify the amount of changes you want to make in the bridge driver.
We also provided other alternatives. That being said, you're more than
welcome to send the patches and we can continue the discussion then.

^ permalink raw reply

* [PATCH net v2] ipvs: Improve robustness to the ipvs sysctl
From: hujunwei @ 2019-07-30  7:11 UTC (permalink / raw)
  To: wensong, horms, pablo, kadlec, fw, davem, Julian Anastasov,
	Florian Westphal
  Cc: netdev@vger.kernel.org, lvs-devel, netfilter-devel, coreteam,
	Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <1997375e-815d-137f-20c9-0829a8587ee9@huawei.com>

From: Junwei Hu <hujunwei4@huawei.com>

The ipvs module parse the user buffer and save it to sysctl,
then check if the value is valid. invalid value occurs
over a period of time.
Here, I add a variable, struct ctl_table tmp, used to read
the value from the user buffer, and save only when it is valid.
I delete proc_do_sync_mode and use extra1/2 in table for the
proc_dointvec_minmax call.

Fixes: f73181c8288f ("ipvs: add support for sync threads")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
---
V1->V2:
- delete proc_do_sync_mode and use proc_dointvec_minmax call.
---
 net/netfilter/ipvs/ip_vs_ctl.c | 69 +++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 060565e..7aed7b0 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1737,12 +1737,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int val = *valp;
 	int rc;

-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
 	if (write && (*valp != val)) {
-		if ((*valp < 0) || (*valp > 3)) {
-			/* Restore the correct value */
+		if (val < 0 || val > 3) {
+			rc = -EINVAL;
+		} else {
 			*valp = val;
-		} else {
 			update_defense_level(ipvs);
 		}
 	}
@@ -1756,33 +1762,20 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int *valp = table->data;
 	int val[2];
 	int rc;
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = table->maxlen,
+		.mode = table->mode,
+	};

-	/* backup the value first */
 	memcpy(val, valp, sizeof(val));
-
-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
-	if (write && (valp[0] < 0 || valp[1] < 0 ||
-	    (valp[0] >= valp[1] && valp[1]))) {
-		/* Restore the correct value */
-		memcpy(valp, val, sizeof(val));
-	}
-	return rc;
-}
-
-static int
-proc_do_sync_mode(struct ctl_table *table, int write,
-		     void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-	int *valp = table->data;
-	int val = *valp;
-	int rc;
-
-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
-	if (write && (*valp != val)) {
-		if ((*valp < 0) || (*valp > 1)) {
-			/* Restore the correct value */
-			*valp = val;
-		}
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+	if (write) {
+		if (val[0] < 0 || val[1] < 0 ||
+		    (val[0] >= val[1] && val[1]))
+			rc = -EINVAL;
+		else
+			memcpy(valp, val, sizeof(val));
 	}
 	return rc;
 }
@@ -1795,12 +1788,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 	int val = *valp;
 	int rc;

-	rc = proc_dointvec(table, write, buffer, lenp, ppos);
+	struct ctl_table tmp = {
+		.data = &val,
+		.maxlen = sizeof(int),
+		.mode = table->mode,
+	};
+
+	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
 	if (write && (*valp != val)) {
-		if (*valp < 1 || !is_power_of_2(*valp)) {
-			/* Restore the correct value */
+		if (val < 1 || !is_power_of_2(val))
+			rc = -EINVAL;
+		else
 			*valp = val;
-		}
 	}
 	return rc;
 }
@@ -1860,7 +1859,9 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
 		.procname	= "sync_version",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_do_sync_mode,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	},
 	{
 		.procname	= "sync_ports",
-- 
1.7.12.4


^ permalink raw reply related

* Re: KASAN: use-after-free Read in psi_task_change
From: syzbot @ 2019-07-30  7:13 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, linux-kernel, netdev, syzkaller-bugs,
	tglx
In-Reply-To: <000000000000df9d48058e9228cd@google.com>

syzbot has bisected this bug to:

commit e9db4ef6bf4ca9894bb324c76e01b8f1a16b2650
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Sat Jun 30 13:17:47 2018 +0000

     bpf: sockhash fix omitted bucket lock in sock_close

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10f4840c600000
start commit:   bed38c3e Merge tag 'powerpc-5.3-2' of git://git.kernel.org..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=12f4840c600000
console output: https://syzkaller.appspot.com/x/log.txt?x=14f4840c600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9aec8cb13b5f7389
dashboard link: https://syzkaller.appspot.com/bug?extid=f17ba6f9b8d9cc0498d0
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10dc7b34600000

Reported-by: syzbot+f17ba6f9b8d9cc0498d0@syzkaller.appspotmail.com
Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH] net: phy: phy_led_triggers: Fix a possible null-pointer dereference in phy_led_trigger_change_speed()
From: Jia-Ju Bai @ 2019-07-30  8:03 UTC (permalink / raw)
  To: David Miller, andrew; +Cc: f.fainelli, hkallweit1, netdev, linux-kernel
In-Reply-To: <20190729.204113.316505378355498068.davem@davemloft.net>



On 2019/7/30 11:41, David Miller wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Tue, 30 Jul 2019 05:32:29 +0200
>
>> On Tue, Jul 30, 2019 at 10:25:36AM +0800, Jia-Ju Bai wrote:
>>>
>>> On 2019/7/29 21:45, Andrew Lunn wrote:
>>>> On Mon, Jul 29, 2019 at 05:24:24PM +0800, Jia-Ju Bai wrote:
>>>>> In phy_led_trigger_change_speed(), there is an if statement on line 48
>>>>> to check whether phy->last_triggered is NULL:
>>>>>      if (!phy->last_triggered)
>>>>>
>>>>> When phy->last_triggered is NULL, it is used on line 52:
>>>>>      led_trigger_event(&phy->last_triggered->trigger, LED_OFF);
>>>>>
>>>>> Thus, a possible null-pointer dereference may occur.
>>>>>
>>>>> To fix this bug, led_trigger_event(&phy->last_triggered->trigger,
>>>>> LED_OFF) is called when phy->last_triggered is not NULL.
>>>>>
>>>>> This bug is found by a static analysis tool STCheck written by us.
>>>> Who is 'us'?
>>> Me and my colleague...
>> Well, we can leave it very vague, giving no idea who 'us' is. But
>> often you want to name the company behind it, or the university, or
>> the sponsor, etc.
> I agree, if you are going to mention that there is a tool you should be
> clear exactly who and what organization are behind it

Thanks for the advice.
I will add my organization in the patch.


Best wishes,
Jia-Ju Bai

^ permalink raw reply

* [PATCH v2] net: phy: phy_led_triggers: Fix a possible null-pointer dereference in phy_led_trigger_change_speed()
From: Jia-Ju Bai @ 2019-07-30  8:08 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem; +Cc: netdev, linux-kernel, Jia-Ju Bai

In phy_led_trigger_change_speed(), there is an if statement on line 48
to check whether phy->last_triggered is NULL: 
    if (!phy->last_triggered)

When phy->last_triggered is NULL, it is used on line 52:
    led_trigger_event(&phy->last_triggered->trigger, LED_OFF);

Thus, a possible null-pointer dereference may occur.

To fix this bug, led_trigger_event(&phy->last_triggered->trigger,
LED_OFF) is called when phy->last_triggered is not NULL.

This bug is found by a static analysis tool STCheck written by
the OSLAB group in Tsinghua University.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Add the organization of the tool's authors.
  Thank David and Andrew for helpful advice.

---
 drivers/net/phy/phy_led_triggers.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy_led_triggers.c b/drivers/net/phy/phy_led_triggers.c
index b86a4b2116f8..59a94e07e7c5 100644
--- a/drivers/net/phy/phy_led_triggers.c
+++ b/drivers/net/phy/phy_led_triggers.c
@@ -48,8 +48,9 @@ void phy_led_trigger_change_speed(struct phy_device *phy)
 		if (!phy->last_triggered)
 			led_trigger_event(&phy->led_link_trigger->trigger,
 					  LED_FULL);
+		else
+			led_trigger_event(&phy->last_triggered->trigger, LED_OFF);
 
-		led_trigger_event(&phy->last_triggered->trigger, LED_OFF);
 		led_trigger_event(&plt->trigger, LED_FULL);
 		phy->last_triggered = plt;
 	}
-- 
2.17.0


^ permalink raw reply related

* net: ipv6: Fix a bug in ndisc_send_ns when netdev only has a global address
From: Mark Smith @ 2019-07-30  8:15 UTC (permalink / raw)
  To: suyj.fnst, netdev

Hi,

I'm not subscribed to the Linux netdev mailing list, so I can't
directly reply to the patch email.

This patch is not the correct solution to this issue.

Per RFC 4291 "IP Version 6 Addressing Architecture", all IPv6
interfaces are required to have Link-Local addresses, so therefore
there should always be a link-local address available to use as the
source address for an ND NS.

"2.1. Addressing Model"

...

"All interfaces are required to have at least one Link-Local unicast
   address (see Section 2.8 for additional required addresses)."

I have submitted a more specific bug regarding no Link-Local
address/prefix on the Linux kernel loopback interface through RedHat
bugzilla as I use Fedora 30, however it doesn't seem to have been
looked at yet.

"Loopback network interface does not have a Link Local address,
contrary to RFC 4291"
https://bugzilla.redhat.com/show_bug.cgi?id=1706709


Thanks very much,
Mark.

^ permalink raw reply

* Re: [PATCH] drivers: net: wireless: rsi: return explicit error values
From: Kalle Valo @ 2019-07-30  8:29 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: linux-kernel, amitkarwar, siva8118, linux-wireless, netdev
In-Reply-To: <1564413872-10720-1-git-send-email-info@metux.net>

"Enrico Weigelt, metux IT consult" <info@metux.net> writes:

> From: Enrico Weigelt <info@metux.net>
>
> Explicitly return constants instead of variable (and rely on
> it to be explicitly initialized), if the value is supposed
> to be fixed anyways. Align it with the rest of the driver,
> which does it the same way.
>
> Signed-off-by: Enrico Weigelt <info@metux.net>
> ---
>  drivers/net/wireless/rsi/rsi_91x_sdio.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

The prefix should be just "rsi:", I'll fix that.

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-30  8:30 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190730070626.GA508@splinter>

The 07/30/2019 10:06, Ido Schimmel wrote:
> On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
> > The 07/29/2019 20:51, Ido Schimmel wrote:
> > > Can you please clarify what you're trying to achieve? I just read the
> > > thread again and my impression is that you're trying to locally receive
> > > packets with a certain link layer multicast address.
> > Yes. The thread is also a bit confusing because we half way through realized
> > that we misunderstood how the multicast packets should be handled (sorry about
> > that). To begin with we had a driver where multicast packets was only copied to
> > the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
> > how other drivers are doing it, so we changed the driver to include the CPU in
> > the default multicast flood-mask.
> OK, so what prevents you from removing all other ports from the
> flood-mask and letting the CPU handle the flooding? Then you can install
> software tc filters to limit the flooding.
I do not have the bandwidth to forward the multicast traffic in the CPU.

It will also cause enormous latency on the forwarding of L2 multicast packets.

> > This changes the objective a bit. To begin with we needed to get more packets to
> > the CPU (which could have been done using tc ingress rules and a trap action).
> > 
> > Now after we changed the driver, we realized that we need something to limit the
> > flooding of certain L2 multicast packets. This is the new problem we are trying
> > to solve!
> > 
> > Example: Say we have a bridge with 4 slave interfaces, then we want to install a
> > forwarding rule saying that packets to a given L2-multicast MAC address, should
> > only be flooded to 2 of the 4 ports.
> > 
> > (instead of adding rules to get certain packets to the CPU, we are now adding
> > other rules to prevent other packets from going to the CPU and other ports where
> > they are not needed/wanted).
> > 
> > This is exactly the same thing as IGMP snooping does dynamically, but only for
> > IP multicast.
> > 
> > The "bridge mdb" allow users to manually/static add/del a port to a multicast
> > group, but still it operates on IP multicast address (not L2 multicast
> > addresses).
> > 
> > > Nik suggested SIOCADDMULTI.
> > It is not clear to me how this should be used to limit the flooding, maybe we
> > can make some hacks, but as far as I understand the intend of this is maintain
> > the list of addresses an interface should receive. I'm not sure this should
> > influence how for forwarding decisions are being made.
> > 
> > > and I suggested a tc filter to get the packet to the CPU.
> > The TC solution is a good solution to the original problem where wanted to copy
> > more frames to the CPU. But we were convinced that this is not the right
> > approach, and that the CPU by default should receive all multicast packets, and
> > we should instead try to find a way to limit the flooding of certain frames as
> > an optimization.
> 
> This can still work. In Linux, ingress tc filters are executed before the
> bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
> performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
> you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
> eth2:
> 
> # tc filter add dev eth0 ingress pref 1 flower skip_sw \
> 	dst_mac 01:21:6C:00:00:01 action trap
> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> 	dst_mac 01:21:6C:00:00:01 action drop
> 
> The first filter is only present in HW ('skip_sw') and should result in
> your HW passing you the sole copy of the packet.
Agree.

> The second filter is only present in SW ('skip_hw', not using HW egress
> ACL that you don't have) and drops the packet after it was flooded by
> the SW bridge.
Agree.

> As I mentioned earlier, you can install the filter once in your HW and
> share it between different ports using a shared block. This means you
> only consume one TCAM entry.
> 
> Note that this allows you to keep flooding all other multicast packets
> in HW.
Yes, but the frames we want to limit the flood-mask on are the exact frames
which occurs at a very high rate, and where latency is important.

I really do not consider it as an option to forward this in SW, when it is
something that can easily be offloaded in HW.

> > > If you now want to limit the ports to which this packet is flooded, then
> > > you can use tc filters in *software*:
> > > 
> > > # tc qdisc add dev eth2 clsact
> > > # tc filter add dev eth2 egress pref 1 flower skip_hw \
> > > 	dst_mac 01:21:6C:00:00:01 action drop
> > Yes. This can work in the SW bridge.
> > 
> > > If you want to forward the packet in hardware and locally receive it,
> > > you can chain several mirred action and then a trap action.
> > I'm not I fully understand how this should be done, but it does sound like it
> > becomes quite complicated. Also, as far as I understand it will mean that we
> > will be using TCAM/ACL resources to do something that could have been done with
> > a simple MAC entry.
> > 
> > > Both options avoid HW egress ACLs which your design does not support.
> > True, but what is wrong with expanding the functionality of the normal
> > forwarding/MAC operations to allow multiple destinations?
> > 
> > It is not an uncommon feature (I just browsed the manual of some common L2
> > switches and they all has this feature).
> > 
> > It seems to fit nicely into the existing user-interface:
> > 
> > bridge fdb add    01:21:6C:00:00:01 port eth0
> > bridge fdb append 01:21:6C:00:00:01 port eth1
> 
> Wouldn't it be better to instead extend the MDB entries so that they are
> either keyed by IP or MAC? I believe FDB should remain as unicast-only.

You might be right, it was not clear to me which of the two would fit the
purpose best.

From a user-space iproute2 perspective I prefer using the "bridge fdb" command
as it already supports the needed syntax, and I do not think it will be too
pretty if we squeeze this into the "bridge mdb" command syntax.

But that does not mean that it need to go into the FDB database in the
implementation.

Last evening when I looked at it again, I was considering keeping the
net_bridge_fdb_entry structure as is, and add a new hashtable with the
following:

struct net_bridge_fdbmc_entry {
	struct rhash_head		rhnode;
	struct net_bridge_fdbmc_ports   *dst;

	struct net_bridge_fdb_key	key;
	struct hlist_node		fdb_node;
	unsigned char			offloaded:1;

	struct rcu_head			rcu;
};

If we go with this approach then we can look at the MAC address and see if it is
a unicast which will cause a lookup in the fdb, l3-multicast (33:33:* or
01:00:5e:*) which will cause a lookup in the mdb, or finally a fdbmc which will
need to do a lookup in this new hashtable.

Alternative it would be like this:

struct net_bridge_fdb_entry {
	struct rhash_head		rhnode;
	union net_bridge_port_or_list	*dst;

	struct net_bridge_fdb_key	key;
	struct hlist_node		fdb_node;
	unsigned char			is_local:1,
					is_static:1,
					is_sticky:1,
					added_by_user:1,
					added_by_external_learn:1,
					offloaded:1;
					multi_dst:1;

	/* write-heavy members should not affect lookups */
	unsigned long			updated ____cacheline_aligned_in_smp;
	unsigned long			used;

	struct rcu_head			rcu;
};

Both solutions should require fairly few changes, and should not cause any
measurable performance hit.

Making it fit into the net_bridge_mdb_entry seems to be harder.

> As a bonus, existing drivers could benefit from it, as MDB entries are already
> notified by MAC.
Not sure I follow. When FDB entries are added, it also generates notification
events.

> > It seems that it can be added to the existing implementation with out adding
> > significant complexity.
> > 
> > It will be easy to offload in HW.
> > 
> > I do not believe that it will be a performance issue, if this is a concern then
> > we may have to do a bit of benchmarking, or we can make it a configuration
> > option.
> > 
> > Long story short, we (Horatiu and I) learned a lot from the discussion here, and
> > I think we should try do a new patch with the learning we got. Then it is easier
> > to see what it actually means to the exiting code, complexity, exiting drivers,
> > performance, default behavioral, backwards compatibly, and other valid concerns.
> > 
> > If the patch is no good, and cannot be fixed, then we will go back and look
> > further into alternative solutions.
> Overall, I tend to agree with Nik. I think your use case is too specific
> to justify the amount of changes you want to make in the bridge driver.
> We also provided other alternatives. That being said, you're more than
> welcome to send the patches and we can continue the discussion then.
Okay, good to know. I'm not sure I agree that the alternative solutions really
solves the issue this is trying to solve, nor do I agree that this is specific
to our needs.

But lets take a look at a new patch, and see what is the amount of changes we
are talking about. Without having the patch it is really hard to know for sure.

-- 
/Allan


^ permalink raw reply

* Re: [PATCH v2 0/2] mmc: core: Fix Marvell WiFi reset by adding SDIO API to replug card
From: Andreas Fenkart @ 2019-07-30  8:46 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Ulf Hansson, Kalle Valo, Adrian Hunter, Ganapathi Bhat,
	linux-wireless, Brian Norris, Amitkumar Karwar, linux-rockchip,
	Wolfram Sang, Nishant Sarmukadam, netdev, Avri Altman, linux-mmc,
	David Miller, Xinming Hu, linux-kernel, Thomas Gleixner,
	Kate Stewart
In-Reply-To: <20190722193939.125578-1-dianders@chromium.org>

Hi Douglas,

Am Mo., 22. Juli 2019 um 21:41 Uhr schrieb Douglas Anderson
<dianders@chromium.org>:
>
> As talked about in the thread at:
>
> http://lkml.kernel.org/r/CAD=FV=X7P2F1k_zwHc0mbtfk55-rucTz_GoDH=PL6zWqKYcpuw@mail.gmail.com
>
> ...when the Marvell WiFi card tries to reset itself it kills
> Bluetooth.  It was observed that we could re-init the card properly by
> unbinding / rebinding the host controller.  It was also observed that
> in the downstream Chrome OS codebase the solution used was
> mmc_remove_host() / mmc_add_host(), which is similar to the solution
> in this series.
>
> So far I've only done testing of this series using the reset test
> source that can be simulated via sysfs.  Specifically I ran this test:
>
> for i in $(seq 1000); do
>   echo "LOOP $i --------"
>   echo 1 > /sys/kernel/debug/mwifiex/mlan0/reset
>
>   while true; do
>     if ! ping -w15 -c1 "${GW}" >/dev/null 2>&1; then
>       fail=$(( fail + 1 ))
>       echo "Fail WiFi ${fail}"
>       if [[ ${fail} == 3 ]]; then
>         exit 1
>       fi
>     else
>       fail=0
>       break
>     fi
>   done
>
>   hciconfig hci0 down
>   sleep 1
>   if ! hciconfig hci0 up; then
>     echo "Fail BT"
>     exit 1
>   fi
> done
>
> I ran this several times and got several hundred iterations each
> before a failure.  When I saw failures:
>
> * Once I saw a "Fail BT"; manually resetting the card again fixed it.
>   I didn't give it time to see if it would have detected this
>   automatically.
> * Once I saw the ping fail because (for some reason) my device only
>   got an IPv6 address from my router and the IPv4 ping failed.  I
>   changed my script to use 'ping6' to see if that would help.
> * Once I saw the ping fail because the higher level network stack
>   ("shill" in my case) seemed to crash.  A few minutes later the
>   system recovered itself automatically.  https://crbug.com/984593 if
>   you want more details.
> * Sometimes while I was testing I saw "Fail WiFi 1" indicating a
>   transitory failure.  Usually this was an association failure, but in
>   one case I saw the device do "Firmware wakeup failed" after I
>   triggered the reset.  This caused the driver to trigger a re-reset
>   of itself which eventually recovered things.  This was good because
>   it was an actual test of the normal reset flow (not the one
>   triggered via sysfs).

This error triggers something. I remember that when I was working on
suspend-to-ram feature, we had problems to wake up the firmware
reliable. I found this patch in one of my old 3.13 tree

    the missing bit -- ugly hack to force cmd52 before cmd53.
---
 drivers/mmc/host/omap_hsmmc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/mmc/host/omap_hsmmc.c b/drivers/mmc/host/omap_hsmmc.c
index fb24a006080f..710d0bdf39e5 100644
--- a/drivers/mmc/host/omap_hsmmc.c
+++ b/drivers/mmc/host/omap_hsmmc.c
@@ -2372,6 +2372,12 @@ static int omap_hsmmc_suspend(struct device *dev)
        if (host->flags & HSMMC_SWAKEUP_QUIRK)
                disable_irq(host->gpio_sdio_irq);

+       /*
+        * force a polling cycle after resume.
+        * will issue cmd52, not cmd53 straight away
+        */
+       omap_hsmmc_enable_sdio_irq(host->mmc, false);
+
        if (host->dbclk)
                clk_disable_unprepare(host->dbclk);

>
> Changes in v2:
> - s/routnine/routine (Brian Norris, Matthias Kaehlcke).
> - s/contining/containing (Matthias Kaehlcke).
> - Add Matthias Reviewed-by tag.
> - Removed clear_bit() calls and old comment (Brian Norris).
> - Explicit CC of Andreas Fenkart.
> - Explicit CC of Brian Norris.
> - Add "Fixes" pointing at the commit Brian talked about.
> - Add Brian's Reviewed-by tag.
>
> Douglas Anderson (2):
>   mmc: core: Add sdio_trigger_replug() API
>   mwifiex: Make use of the new sdio_trigger_replug() API to reset
>
>  drivers/mmc/core/core.c                     | 28 +++++++++++++++++++--
>  drivers/mmc/core/sdio_io.c                  | 20 +++++++++++++++
>  drivers/net/wireless/marvell/mwifiex/sdio.c | 16 +-----------
>  include/linux/mmc/host.h                    | 15 ++++++++++-
>  include/linux/mmc/sdio_func.h               |  2 ++
>  5 files changed, 63 insertions(+), 18 deletions(-)
>
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply related

* [patch net-next v2 0/3] net: devlink: Finish network namespace support
From: Jiri Pirko @ 2019-07-30  8:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw

From: Jiri Pirko <jiri@mellanox.com>

Devlink from the beginning counts with network namespaces, but the
instances has been fixed to init_net. The first patch allows user
to move existing devlink instances into namespaces:

$ devlink dev
netdevsim/netdevsim1
$ ip netns add ns1
$ devlink dev set netdevsim/netdevsim1 netns ns1
$ devlink -N ns1 dev
netdevsim/netdevsim1

The last patch allows user to create new netdevsim instance directly
inside network namespace of a caller.

Jiri Pirko (3):
  net: devlink: allow to change namespaces
  net: devlink: export devlink net set/get helpers
  netdevsim: create devlink and netdev instances in namespace

 drivers/net/netdevsim/bus.c       |   1 +
 drivers/net/netdevsim/dev.c       |  17 ++--
 drivers/net/netdevsim/netdev.c    |   4 +-
 drivers/net/netdevsim/netdevsim.h |   8 +-
 include/net/devlink.h             |   3 +
 include/uapi/linux/devlink.h      |   4 +
 net/core/devlink.c                | 129 ++++++++++++++++++++++++++++--
 7 files changed, 152 insertions(+), 14 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [patch net-next v2 2/3] net: devlink: export devlink net set/get helpers
From: Jiri Pirko @ 2019-07-30  8:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Allow drivers to set/get net struct for devlink instance. Set is only
allowed for newly allocated devlink instance.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  3 +++
 net/core/devlink.c    | 18 ++++++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index bc36f942a7d5..98b89eabd73a 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -35,6 +35,7 @@ struct devlink {
 	struct device *dev;
 	possible_net_t _net;
 	struct mutex lock;
+	bool registered;
 	char priv[0] __aligned(NETDEV_ALIGN);
 };
 
@@ -591,6 +592,8 @@ static inline struct devlink *netdev_to_devlink(struct net_device *dev)
 
 struct ib_device;
 
+struct net *devlink_net(const struct devlink *devlink);
+void devlink_net_set(struct devlink *devlink, struct net *net);
 struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size);
 int devlink_register(struct devlink *devlink, struct device *dev);
 void devlink_unregister(struct devlink *devlink);
diff --git a/net/core/devlink.c b/net/core/devlink.c
index e1cbfd90f788..fc364bdb9cf2 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -92,16 +92,25 @@ static LIST_HEAD(devlink_list);
  */
 static DEFINE_MUTEX(devlink_mutex);
 
-static struct net *devlink_net(const struct devlink *devlink)
+struct net *devlink_net(const struct devlink *devlink)
 {
 	return read_pnet(&devlink->_net);
 }
+EXPORT_SYMBOL_GPL(devlink_net);
 
-static void devlink_net_set(struct devlink *devlink, struct net *net)
+static void __devlink_net_set(struct devlink *devlink, struct net *net)
 {
 	write_pnet(&devlink->_net, net);
 }
 
+void devlink_net_set(struct devlink *devlink, struct net *net)
+{
+	if (WARN_ON(devlink->registered))
+		return;
+	__devlink_net_set(devlink, net);
+}
+EXPORT_SYMBOL_GPL(devlink_net_set);
+
 static struct devlink *devlink_get_from_attrs(struct net *net,
 					      struct nlattr **attrs)
 {
@@ -696,7 +705,7 @@ static void devlink_netns_change(struct devlink *devlink, struct net *net)
 	if (net_eq(devlink_net(devlink), net))
 		return;
 	devlink_notify(devlink, DEVLINK_CMD_DEL);
-	devlink_net_set(devlink, net);
+	__devlink_net_set(devlink, net);
 	devlink_notify(devlink, DEVLINK_CMD_NEW);
 }
 
@@ -5603,7 +5612,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	if (!devlink)
 		return NULL;
 	devlink->ops = ops;
-	devlink_net_set(devlink, &init_net);
+	__devlink_net_set(devlink, &init_net);
 	INIT_LIST_HEAD(&devlink->port_list);
 	INIT_LIST_HEAD(&devlink->sb_list);
 	INIT_LIST_HEAD_RCU(&devlink->dpipe_table_list);
@@ -5627,6 +5636,7 @@ int devlink_register(struct devlink *devlink, struct device *dev)
 {
 	mutex_lock(&devlink_mutex);
 	devlink->dev = dev;
+	devlink->registered = true;
 	list_add_tail(&devlink->list, &devlink_list);
 	devlink_notify(devlink, DEVLINK_CMD_NEW);
 	mutex_unlock(&devlink_mutex);
-- 
2.21.0


^ permalink raw reply related

* [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-07-30  8:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

All devlink instances are created in init_net and stay there for a
lifetime. Allow user to be able to move devlink instances into
namespaces.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
v1->v2:
- change the check for multiple attributes
- add warnon in case there is no attribute passed
---
 include/uapi/linux/devlink.h |   4 ++
 net/core/devlink.c           | 113 ++++++++++++++++++++++++++++++++++-
 2 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index ffc993256527..95f0a1edab99 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -348,6 +348,10 @@ enum devlink_attr {
 	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u16 */
 	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u16 */
 
+	DEVLINK_ATTR_NETNS_FD,			/* u32 */
+	DEVLINK_ATTR_NETNS_PID,			/* u32 */
+	DEVLINK_ATTR_NETNS_ID,			/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 4f40aeace902..e1cbfd90f788 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -439,8 +439,16 @@ static void devlink_nl_post_doit(const struct genl_ops *ops,
 {
 	struct devlink *devlink;
 
-	devlink = devlink_get_from_info(info);
-	if (~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
+	/* When devlink changes netns, it would not be found
+	 * by devlink_get_from_info(). So try if it is stored first.
+	 */
+	if (ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEVLINK) {
+		devlink = info->user_ptr[0];
+	} else {
+		devlink = devlink_get_from_info(info);
+		WARN_ON(IS_ERR(devlink));
+	}
+	if (!IS_ERR(devlink) && ~ops->internal_flags & DEVLINK_NL_FLAG_NO_LOCK)
 		mutex_unlock(&devlink->lock);
 	mutex_unlock(&devlink_mutex);
 }
@@ -645,6 +653,71 @@ static int devlink_nl_cmd_get_doit(struct sk_buff *skb, struct genl_info *info)
 	return genlmsg_reply(msg, info);
 }
 
+static struct net *devlink_netns_get(struct sk_buff *skb,
+				     struct devlink *devlink,
+				     struct genl_info *info)
+{
+	struct nlattr *netns_pid_attr = info->attrs[DEVLINK_ATTR_NETNS_PID];
+	struct nlattr *netns_fd_attr = info->attrs[DEVLINK_ATTR_NETNS_FD];
+	struct nlattr *netns_id_attr = info->attrs[DEVLINK_ATTR_NETNS_ID];
+	struct net *net;
+
+	if (!!netns_pid_attr + !!netns_fd_attr + !!netns_id_attr > 1) {
+		NL_SET_ERR_MSG(info->extack, "multiple netns identifying attributes specified");
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (netns_pid_attr) {
+		net = get_net_ns_by_pid(nla_get_u32(netns_pid_attr));
+	} else if (netns_fd_attr) {
+		net = get_net_ns_by_fd(nla_get_u32(netns_fd_attr));
+	} else if (netns_id_attr) {
+		net = get_net_ns_by_id(sock_net(skb->sk),
+				       nla_get_u32(netns_id_attr));
+		if (!net)
+			net = ERR_PTR(-EINVAL);
+	} else {
+		WARN_ON(1);
+		net = ERR_PTR(-EINVAL);
+	}
+	if (IS_ERR(net)) {
+		NL_SET_ERR_MSG(info->extack, "Unknown network namespace");
+		return ERR_PTR(-EINVAL);
+	}
+	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
+		put_net(net);
+		return ERR_PTR(-EPERM);
+	}
+	return net;
+}
+
+static void devlink_netns_change(struct devlink *devlink, struct net *net)
+{
+	if (net_eq(devlink_net(devlink), net))
+		return;
+	devlink_notify(devlink, DEVLINK_CMD_DEL);
+	devlink_net_set(devlink, net);
+	devlink_notify(devlink, DEVLINK_CMD_NEW);
+}
+
+static int devlink_nl_cmd_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct devlink *devlink = info->user_ptr[0];
+
+	if (info->attrs[DEVLINK_ATTR_NETNS_PID] ||
+	    info->attrs[DEVLINK_ATTR_NETNS_FD] ||
+	    info->attrs[DEVLINK_ATTR_NETNS_ID]) {
+		struct net *net;
+
+		net = devlink_netns_get(skb, devlink, info);
+		if (IS_ERR(net))
+			return PTR_ERR(net);
+		devlink_netns_change(devlink, net);
+		put_net(net);
+	}
+	return 0;
+}
+
 static int devlink_nl_cmd_get_dumpit(struct sk_buff *msg,
 				     struct netlink_callback *cb)
 {
@@ -5184,6 +5257,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER] = { .type = NLA_U8 },
 	[DEVLINK_ATTR_FLASH_UPDATE_FILE_NAME] = { .type = NLA_NUL_STRING },
 	[DEVLINK_ATTR_FLASH_UPDATE_COMPONENT] = { .type = NLA_NUL_STRING },
+	[DEVLINK_ATTR_NETNS_PID] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_NETNS_FD] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_NETNS_ID] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -5195,6 +5271,13 @@ static const struct genl_ops devlink_nl_ops[] = {
 		.internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
 		/* can be retrieved by unprivileged users */
 	},
+	{
+		.cmd = DEVLINK_CMD_SET,
+		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
+		.doit = devlink_nl_cmd_set_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
+	},
 	{
 		.cmd = DEVLINK_CMD_PORT_GET,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
@@ -6955,9 +7038,33 @@ int devlink_compat_switch_id_get(struct net_device *dev,
 	return 0;
 }
 
+static void __net_exit devlink_pernet_exit(struct net *net)
+{
+	struct devlink *devlink;
+
+	mutex_lock(&devlink_mutex);
+	list_for_each_entry(devlink, &devlink_list, list)
+		if (net_eq(devlink_net(devlink), net))
+			devlink_netns_change(devlink, &init_net);
+	mutex_unlock(&devlink_mutex);
+}
+
+static struct pernet_operations __net_initdata devlink_pernet_ops = {
+	.exit = devlink_pernet_exit,
+};
+
 static int __init devlink_init(void)
 {
-	return genl_register_family(&devlink_nl_family);
+	int err;
+
+	err = genl_register_family(&devlink_nl_family);
+	if (err)
+		goto out;
+	err = register_pernet_device(&devlink_pernet_ops);
+
+out:
+	WARN_ON(err);
+	return err;
 }
 
 subsys_initcall(devlink_init);
-- 
2.21.0


^ permalink raw reply related

* [patch net-next v2 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jiri Pirko @ 2019-07-30  8:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

When user does create new netdevsim instance using sysfs bus file,
create the devlink instance and related netdev instance in the namespace
of the caller.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
v1->v2:
- remove net_namespace.h include and forward decralared net struct
- add comment to initial_net pointer
---
 drivers/net/netdevsim/bus.c       |  1 +
 drivers/net/netdevsim/dev.c       | 17 +++++++++++------
 drivers/net/netdevsim/netdev.c    |  4 +++-
 drivers/net/netdevsim/netdevsim.h |  8 +++++++-
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
index 1a0ff3d7747b..6aeed0c600f8 100644
--- a/drivers/net/netdevsim/bus.c
+++ b/drivers/net/netdevsim/bus.c
@@ -283,6 +283,7 @@ nsim_bus_dev_new(unsigned int id, unsigned int port_count)
 	nsim_bus_dev->dev.bus = &nsim_bus;
 	nsim_bus_dev->dev.type = &nsim_bus_dev_type;
 	nsim_bus_dev->port_count = port_count;
+	nsim_bus_dev->initial_net = current->nsproxy->net_ns;
 
 	err = device_register(&nsim_bus_dev->dev);
 	if (err)
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index c5c417a3c0ce..685dd21f5500 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -268,7 +268,8 @@ static const struct devlink_ops nsim_dev_devlink_ops = {
 };
 
 static struct nsim_dev *
-nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev, unsigned int port_count)
+nsim_dev_create(struct net *net, struct nsim_bus_dev *nsim_bus_dev,
+		unsigned int port_count)
 {
 	struct nsim_dev *nsim_dev;
 	struct devlink *devlink;
@@ -277,6 +278,7 @@ nsim_dev_create(struct nsim_bus_dev *nsim_bus_dev, unsigned int port_count)
 	devlink = devlink_alloc(&nsim_dev_devlink_ops, sizeof(*nsim_dev));
 	if (!devlink)
 		return ERR_PTR(-ENOMEM);
+	devlink_net_set(devlink, net);
 	nsim_dev = devlink_priv(devlink);
 	nsim_dev->nsim_bus_dev = nsim_bus_dev;
 	nsim_dev->switch_id.id_len = sizeof(nsim_dev->switch_id.id);
@@ -335,7 +337,7 @@ static void nsim_dev_destroy(struct nsim_dev *nsim_dev)
 	devlink_free(devlink);
 }
 
-static int __nsim_dev_port_add(struct nsim_dev *nsim_dev,
+static int __nsim_dev_port_add(struct net *net, struct nsim_dev *nsim_dev,
 			       unsigned int port_index)
 {
 	struct nsim_dev_port *nsim_dev_port;
@@ -361,7 +363,7 @@ static int __nsim_dev_port_add(struct nsim_dev *nsim_dev,
 	if (err)
 		goto err_dl_port_unregister;
 
-	nsim_dev_port->ns = nsim_create(nsim_dev, nsim_dev_port);
+	nsim_dev_port->ns = nsim_create(net, nsim_dev, nsim_dev_port);
 	if (IS_ERR(nsim_dev_port->ns)) {
 		err = PTR_ERR(nsim_dev_port->ns);
 		goto err_port_debugfs_exit;
@@ -404,17 +406,19 @@ static void nsim_dev_port_del_all(struct nsim_dev *nsim_dev)
 
 int nsim_dev_probe(struct nsim_bus_dev *nsim_bus_dev)
 {
+	struct net *initial_net = nsim_bus_dev->initial_net;
 	struct nsim_dev *nsim_dev;
 	int i;
 	int err;
 
-	nsim_dev = nsim_dev_create(nsim_bus_dev, nsim_bus_dev->port_count);
+	nsim_dev = nsim_dev_create(initial_net, nsim_bus_dev,
+				   nsim_bus_dev->port_count);
 	if (IS_ERR(nsim_dev))
 		return PTR_ERR(nsim_dev);
 	dev_set_drvdata(&nsim_bus_dev->dev, nsim_dev);
 
 	for (i = 0; i < nsim_bus_dev->port_count; i++) {
-		err = __nsim_dev_port_add(nsim_dev, i);
+		err = __nsim_dev_port_add(initial_net, nsim_dev, i);
 		if (err)
 			goto err_port_del_all;
 	}
@@ -449,13 +453,14 @@ int nsim_dev_port_add(struct nsim_bus_dev *nsim_bus_dev,
 		      unsigned int port_index)
 {
 	struct nsim_dev *nsim_dev = dev_get_drvdata(&nsim_bus_dev->dev);
+	struct net *net = devlink_net(priv_to_devlink(nsim_dev));
 	int err;
 
 	mutex_lock(&nsim_dev->port_list_lock);
 	if (__nsim_dev_port_lookup(nsim_dev, port_index))
 		err = -EEXIST;
 	else
-		err = __nsim_dev_port_add(nsim_dev, port_index);
+		err = __nsim_dev_port_add(net, nsim_dev, port_index);
 	mutex_unlock(&nsim_dev->port_list_lock);
 	return err;
 }
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0740940f41b1..25c7de7a4a31 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -280,7 +280,8 @@ static void nsim_setup(struct net_device *dev)
 }
 
 struct netdevsim *
-nsim_create(struct nsim_dev *nsim_dev, struct nsim_dev_port *nsim_dev_port)
+nsim_create(struct net *net, struct nsim_dev *nsim_dev,
+	    struct nsim_dev_port *nsim_dev_port)
 {
 	struct net_device *dev;
 	struct netdevsim *ns;
@@ -290,6 +291,7 @@ nsim_create(struct nsim_dev *nsim_dev, struct nsim_dev_port *nsim_dev_port)
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 
+	dev_net_set(dev, net);
 	ns = netdev_priv(dev);
 	ns->netdev = dev;
 	ns->nsim_dev = nsim_dev;
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 79c05af2a7c0..9563acb61b03 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -74,8 +74,11 @@ struct netdevsim {
 	struct nsim_ipsec ipsec;
 };
 
+struct net;
+
 struct netdevsim *
-nsim_create(struct nsim_dev *nsim_dev, struct nsim_dev_port *nsim_dev_port);
+nsim_create(struct net *net, struct nsim_dev *nsim_dev,
+	    struct nsim_dev_port *nsim_dev_port);
 void nsim_destroy(struct netdevsim *ns);
 
 #ifdef CONFIG_BPF_SYSCALL
@@ -213,6 +216,9 @@ struct nsim_bus_dev {
 	struct device dev;
 	struct list_head list;
 	unsigned int port_count;
+	struct net *initial_net; /* Purpose of this is to carry net pointer
+				  * during the probe time only.
+				  */
 	unsigned int num_vfs;
 	struct nsim_vf_config *vfconfigs;
 };
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Nikolay Aleksandrov @ 2019-07-30  8:58 UTC (permalink / raw)
  To: Allan W. Nielsen, Ido Schimmel
  Cc: Horatiu Vultur, roopa, davem, bridge, netdev, linux-kernel
In-Reply-To: <20190730083027.biuzy7h5dbq7pik3@lx-anielsen.microsemi.net>

On 30/07/2019 11:30, Allan W. Nielsen wrote:
> The 07/30/2019 10:06, Ido Schimmel wrote:
>> On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
>>> The 07/29/2019 20:51, Ido Schimmel wrote:
>>>> Can you please clarify what you're trying to achieve? I just read the
>>>> thread again and my impression is that you're trying to locally receive
>>>> packets with a certain link layer multicast address.
>>> Yes. The thread is also a bit confusing because we half way through realized
>>> that we misunderstood how the multicast packets should be handled (sorry about
>>> that). To begin with we had a driver where multicast packets was only copied to
>>> the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
>>> how other drivers are doing it, so we changed the driver to include the CPU in
>>> the default multicast flood-mask.
>> OK, so what prevents you from removing all other ports from the
>> flood-mask and letting the CPU handle the flooding? Then you can install
>> software tc filters to limit the flooding.
> I do not have the bandwidth to forward the multicast traffic in the CPU.
> 
> It will also cause enormous latency on the forwarding of L2 multicast packets.
> 
>>> This changes the objective a bit. To begin with we needed to get more packets to
>>> the CPU (which could have been done using tc ingress rules and a trap action).
>>>
>>> Now after we changed the driver, we realized that we need something to limit the
>>> flooding of certain L2 multicast packets. This is the new problem we are trying
>>> to solve!
>>>
>>> Example: Say we have a bridge with 4 slave interfaces, then we want to install a
>>> forwarding rule saying that packets to a given L2-multicast MAC address, should
>>> only be flooded to 2 of the 4 ports.
>>>
>>> (instead of adding rules to get certain packets to the CPU, we are now adding
>>> other rules to prevent other packets from going to the CPU and other ports where
>>> they are not needed/wanted).
>>>
>>> This is exactly the same thing as IGMP snooping does dynamically, but only for
>>> IP multicast.
>>>
>>> The "bridge mdb" allow users to manually/static add/del a port to a multicast
>>> group, but still it operates on IP multicast address (not L2 multicast
>>> addresses).
>>>
>>>> Nik suggested SIOCADDMULTI.
>>> It is not clear to me how this should be used to limit the flooding, maybe we
>>> can make some hacks, but as far as I understand the intend of this is maintain
>>> the list of addresses an interface should receive. I'm not sure this should
>>> influence how for forwarding decisions are being made.
>>>
>>>> and I suggested a tc filter to get the packet to the CPU.
>>> The TC solution is a good solution to the original problem where wanted to copy
>>> more frames to the CPU. But we were convinced that this is not the right
>>> approach, and that the CPU by default should receive all multicast packets, and
>>> we should instead try to find a way to limit the flooding of certain frames as
>>> an optimization.
>>
>> This can still work. In Linux, ingress tc filters are executed before the
>> bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
>> performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
>> you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
>> eth2:
>>
>> # tc filter add dev eth0 ingress pref 1 flower skip_sw \
>> 	dst_mac 01:21:6C:00:00:01 action trap
>> # tc filter add dev eth2 egress pref 1 flower skip_hw \
>> 	dst_mac 01:21:6C:00:00:01 action drop
>>
>> The first filter is only present in HW ('skip_sw') and should result in
>> your HW passing you the sole copy of the packet.
> Agree.
> 
>> The second filter is only present in SW ('skip_hw', not using HW egress
>> ACL that you don't have) and drops the packet after it was flooded by
>> the SW bridge.
> Agree.
> 
>> As I mentioned earlier, you can install the filter once in your HW and
>> share it between different ports using a shared block. This means you
>> only consume one TCAM entry.
>>
>> Note that this allows you to keep flooding all other multicast packets
>> in HW.
> Yes, but the frames we want to limit the flood-mask on are the exact frames
> which occurs at a very high rate, and where latency is important.
> 
> I really do not consider it as an option to forward this in SW, when it is
> something that can easily be offloaded in HW.
> 
>>>> If you now want to limit the ports to which this packet is flooded, then
>>>> you can use tc filters in *software*:
>>>>
>>>> # tc qdisc add dev eth2 clsact
>>>> # tc filter add dev eth2 egress pref 1 flower skip_hw \
>>>> 	dst_mac 01:21:6C:00:00:01 action drop
>>> Yes. This can work in the SW bridge.
>>>
>>>> If you want to forward the packet in hardware and locally receive it,
>>>> you can chain several mirred action and then a trap action.
>>> I'm not I fully understand how this should be done, but it does sound like it
>>> becomes quite complicated. Also, as far as I understand it will mean that we
>>> will be using TCAM/ACL resources to do something that could have been done with
>>> a simple MAC entry.
>>>
>>>> Both options avoid HW egress ACLs which your design does not support.
>>> True, but what is wrong with expanding the functionality of the normal
>>> forwarding/MAC operations to allow multiple destinations?
>>>
>>> It is not an uncommon feature (I just browsed the manual of some common L2
>>> switches and they all has this feature).
>>>
>>> It seems to fit nicely into the existing user-interface:
>>>
>>> bridge fdb add    01:21:6C:00:00:01 port eth0
>>> bridge fdb append 01:21:6C:00:00:01 port eth1
>>
>> Wouldn't it be better to instead extend the MDB entries so that they are
>> either keyed by IP or MAC? I believe FDB should remain as unicast-only.
> 
> You might be right, it was not clear to me which of the two would fit the
> purpose best.
> 
> From a user-space iproute2 perspective I prefer using the "bridge fdb" command
> as it already supports the needed syntax, and I do not think it will be too
> pretty if we squeeze this into the "bridge mdb" command syntax.
> 

MDB is a much better fit as Ido already suggested. FDB should remain unicast
and mixing them is not a good idea, we already have a good ucast/mcast separation
and we'd like to keep it that way.

> But that does not mean that it need to go into the FDB database in the
> implementation.
> 
> Last evening when I looked at it again, I was considering keeping the
> net_bridge_fdb_entry structure as is, and add a new hashtable with the
> following:
> 
> struct net_bridge_fdbmc_entry {
> 	struct rhash_head		rhnode;
> 	struct net_bridge_fdbmc_ports   *dst;
> 
> 	struct net_bridge_fdb_key	key;
> 	struct hlist_node		fdb_node;
> 	unsigned char			offloaded:1;
> 
> 	struct rcu_head			rcu;
> };
> 

What would the notification for this look like ?

> If we go with this approach then we can look at the MAC address and see if it is
> a unicast which will cause a lookup in the fdb, l3-multicast (33:33:* or
> 01:00:5e:*) which will cause a lookup in the mdb, or finally a fdbmc which will
> need to do a lookup in this new hashtable.

That sounds wrong, you will change the current default behaviour of flooding these
packets. This will have to be well hidden behind a new option and enabled only on user
request.

> 
> Alternative it would be like this:
> 
> struct net_bridge_fdb_entry {
> 	struct rhash_head		rhnode;
> 	union net_bridge_port_or_list	*dst;
> 
> 	struct net_bridge_fdb_key	key;
> 	struct hlist_node		fdb_node;
> 	unsigned char			is_local:1,
> 					is_static:1,
> 					is_sticky:1,
> 					added_by_user:1,
> 					added_by_external_learn:1,
> 					offloaded:1;
> 					multi_dst:1;
> 
> 	/* write-heavy members should not affect lookups */
> 	unsigned long			updated ____cacheline_aligned_in_smp;
> 	unsigned long			used;
> 
> 	struct rcu_head			rcu;
> };
> 
> Both solutions should require fairly few changes, and should not cause any
> measurable performance hit.
> 

You'll have to convert these bits to use the proper atomic bitops if you go with
the second solution. That has to be done even today, but the second case would
make it a must.

> Making it fit into the net_bridge_mdb_entry seems to be harder.
> 

But it is the correct abstraction from bridge POV, so please stop trying to change
the FDB code and try to keep to the multicast code.

>> As a bonus, existing drivers could benefit from it, as MDB entries are already
>> notified by MAC.
> Not sure I follow. When FDB entries are added, it also generates notification
> events.
> 

Could you please show fdb event with multiple ports ?

>>> It seems that it can be added to the existing implementation with out adding
>>> significant complexity.
>>>
>>> It will be easy to offload in HW.
>>>
>>> I do not believe that it will be a performance issue, if this is a concern then
>>> we may have to do a bit of benchmarking, or we can make it a configuration
>>> option.
>>>
>>> Long story short, we (Horatiu and I) learned a lot from the discussion here, and
>>> I think we should try do a new patch with the learning we got. Then it is easier
>>> to see what it actually means to the exiting code, complexity, exiting drivers,
>>> performance, default behavioral, backwards compatibly, and other valid concerns.
>>>
>>> If the patch is no good, and cannot be fixed, then we will go back and look
>>> further into alternative solutions.
>> Overall, I tend to agree with Nik. I think your use case is too specific
>> to justify the amount of changes you want to make in the bridge driver.
>> We also provided other alternatives. That being said, you're more than
>> welcome to send the patches and we can continue the discussion then.
> Okay, good to know. I'm not sure I agree that the alternative solutions really
> solves the issue this is trying to solve, nor do I agree that this is specific
> to our needs.
> 
> But lets take a look at a new patch, and see what is the amount of changes we
> are talking about. Without having the patch it is really hard to know for sure.
> 

Please keep in mind that this case is the exception, not the norm, thus it should
not under any circumstance affect the standard deployments.

^ permalink raw reply

* [patch iproute2-next v2 1/2] devlink: introduce cmdline option to switch to a different namespace
From: Jiri Pirko @ 2019-07-30  8:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 devlink/devlink.c  | 12 ++++++++++--
 man/man8/devlink.8 |  4 ++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index d8197ea3a478..9242cc05ad0c 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -32,6 +32,7 @@
 #include "mnlg.h"
 #include "json_writer.h"
 #include "utils.h"
+#include "namespace.h"
 
 #define ESWITCH_MODE_LEGACY "legacy"
 #define ESWITCH_MODE_SWITCHDEV "switchdev"
@@ -6332,7 +6333,7 @@ static int cmd_health(struct dl *dl)
 static void help(void)
 {
 	pr_err("Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }\n"
-	       "       devlink [ -f[orce] ] -b[atch] filename\n"
+	       "       devlink [ -f[orce] ] -b[atch] filename -N[etns] netnsname\n"
 	       "where  OBJECT := { dev | port | sb | monitor | dpipe | resource | region | health }\n"
 	       "       OPTIONS := { -V[ersion] | -n[o-nice-names] | -j[son] | -p[retty] | -v[erbose] }\n");
 }
@@ -6478,6 +6479,7 @@ int main(int argc, char **argv)
 		{ "json",		no_argument,		NULL, 'j' },
 		{ "pretty",		no_argument,		NULL, 'p' },
 		{ "verbose",		no_argument,		NULL, 'v' },
+		{ "Netns",		required_argument,	NULL, 'N' },
 		{ NULL, 0, NULL, 0 }
 	};
 	const char *batch_file = NULL;
@@ -6493,7 +6495,7 @@ int main(int argc, char **argv)
 		return EXIT_FAILURE;
 	}
 
-	while ((opt = getopt_long(argc, argv, "Vfb:njpv",
+	while ((opt = getopt_long(argc, argv, "Vfb:njpvN:",
 				  long_options, NULL)) >= 0) {
 
 		switch (opt) {
@@ -6519,6 +6521,12 @@ int main(int argc, char **argv)
 		case 'v':
 			dl->verbose = true;
 			break;
+		case 'N':
+			if (netns_switch(optarg)) {
+				ret = EXIT_FAILURE;
+				goto dl_free;
+			}
+			break;
 		default:
 			pr_err("Unknown option.\n");
 			help();
diff --git a/man/man8/devlink.8 b/man/man8/devlink.8
index 13d4dcd908b3..9fc9b034eefe 100644
--- a/man/man8/devlink.8
+++ b/man/man8/devlink.8
@@ -51,6 +51,10 @@ When combined with -j generate a pretty JSON output.
 .BR "\-v" , " --verbose"
 Turn on verbose output.
 
+.TP
+.BR "\-N", " \-Netns " <NETNSNAME>
+Switches to the specified network namespace.
+
 .SS
 .I OBJECT
 
-- 
2.21.0


^ permalink raw reply related

* [patch iproute2-next v2 2/2] devlink: add support for network namespace change
From: Jiri Pirko @ 2019-07-30  8:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 devlink/devlink.c            | 54 +++++++++++++++++++++++++++++++++++-
 include/uapi/linux/devlink.h |  4 +++
 man/man8/devlink-dev.8       | 12 ++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 9242cc05ad0c..a39bd8771d8b 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -235,6 +235,7 @@ static void ifname_map_free(struct ifname_map *ifname_map)
 #define DL_OPT_HEALTH_REPORTER_NAME	BIT(27)
 #define DL_OPT_HEALTH_REPORTER_GRACEFUL_PERIOD	BIT(27)
 #define DL_OPT_HEALTH_REPORTER_AUTO_RECOVER	BIT(28)
+#define DL_OPT_NETNS	BIT(29)
 
 struct dl_opts {
 	uint32_t present; /* flags of present items */
@@ -271,6 +272,8 @@ struct dl_opts {
 	const char *reporter_name;
 	uint64_t reporter_graceful_period;
 	bool reporter_auto_recover;
+	bool netns_is_pid;
+	uint32_t netns;
 };
 
 struct dl {
@@ -1331,6 +1334,22 @@ static int dl_argv_parse(struct dl *dl, uint32_t o_required,
 			if (err)
 				return err;
 			o_found |= DL_OPT_HEALTH_REPORTER_AUTO_RECOVER;
+		} else if (dl_argv_match(dl, "netns") &&
+			(o_all & DL_OPT_NETNS)) {
+			const char *netns_str;
+
+			dl_arg_inc(dl);
+			err = dl_argv_str(dl, &netns_str);
+			if (err)
+				return err;
+			opts->netns = netns_get_fd(netns_str);
+			if (opts->netns < 0) {
+				err = dl_argv_uint32_t(dl, &opts->netns);
+				if (err)
+					return err;
+				opts->netns_is_pid = true;
+			}
+			o_found |= DL_OPT_NETNS;
 		} else {
 			pr_err("Unknown option \"%s\"\n", dl_argv(dl));
 			return -EINVAL;
@@ -1444,7 +1463,11 @@ static void dl_opts_put(struct nlmsghdr *nlh, struct dl *dl)
 	if (opts->present & DL_OPT_HEALTH_REPORTER_AUTO_RECOVER)
 		mnl_attr_put_u8(nlh, DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER,
 				opts->reporter_auto_recover);
-
+	if (opts->present & DL_OPT_NETNS)
+		mnl_attr_put_u32(nlh,
+				 opts->netns_is_pid ? DEVLINK_ATTR_NETNS_PID :
+						      DEVLINK_ATTR_NETNS_FD,
+				 opts->netns);
 }
 
 static int dl_argv_parse_put(struct nlmsghdr *nlh, struct dl *dl,
@@ -1499,6 +1522,7 @@ static bool dl_dump_filter(struct dl *dl, struct nlattr **tb)
 static void cmd_dev_help(void)
 {
 	pr_err("Usage: devlink dev show [ DEV ]\n");
+	pr_err("       devlink dev set DEV netns { PID | NAME | ID }\n");
 	pr_err("       devlink dev eswitch set DEV [ mode { legacy | switchdev } ]\n");
 	pr_err("                               [ inline-mode { none | link | network | transport } ]\n");
 	pr_err("                               [ encap { disable | enable } ]\n");
@@ -2551,6 +2575,31 @@ static int cmd_dev_show(struct dl *dl)
 	return err;
 }
 
+static void cmd_dev_set_help(void)
+{
+	pr_err("Usage: devlink dev set DEV netns { PID | NAME | ID }\n");
+}
+
+static int cmd_dev_set(struct dl *dl)
+{
+	struct nlmsghdr *nlh;
+	int err;
+
+	if (dl_argv_match(dl, "help") || dl_no_arg(dl)) {
+		cmd_dev_set_help();
+		return 0;
+	}
+
+	nlh = mnlg_msg_prepare(dl->nlg, DEVLINK_CMD_SET,
+			       NLM_F_REQUEST | NLM_F_ACK);
+
+	err = dl_argv_parse_put(nlh, dl, DL_OPT_HANDLE, DL_OPT_NETNS);
+	if (err)
+		return err;
+
+	return _mnlg_socket_sndrcv(dl->nlg, nlh, NULL, NULL);
+}
+
 static void cmd_dev_reload_help(void)
 {
 	pr_err("Usage: devlink dev reload [ DEV ]\n");
@@ -2747,6 +2796,9 @@ static int cmd_dev(struct dl *dl)
 		   dl_argv_match(dl, "list") || dl_no_arg(dl)) {
 		dl_arg_inc(dl);
 		return cmd_dev_show(dl);
+	} else if (dl_argv_match(dl, "set")) {
+		dl_arg_inc(dl);
+		return cmd_dev_set(dl);
 	} else if (dl_argv_match(dl, "eswitch")) {
 		dl_arg_inc(dl);
 		return cmd_dev_eswitch(dl);
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index fc195cbd66f4..bc1869993e20 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -348,6 +348,10 @@ enum devlink_attr {
 	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u16 */
 	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u16 */
 
+	DEVLINK_ATTR_NETNS_FD,			/* u32 */
+	DEVLINK_ATTR_NETNS_PID,			/* u32 */
+	DEVLINK_ATTR_NETNS_ID,			/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/man/man8/devlink-dev.8 b/man/man8/devlink-dev.8
index 1804463b2321..0e1a5523fa7b 100644
--- a/man/man8/devlink-dev.8
+++ b/man/man8/devlink-dev.8
@@ -25,6 +25,13 @@ devlink-dev \- devlink device configuration
 .ti -8
 .B devlink dev help
 
+.ti -8
+.BR "devlink dev set"
+.IR DEV
+.RI "[ "
+.BI "netns { " PID " | " NAME " | " ID " }
+.RI "]"
+
 .ti -8
 .BR "devlink dev eswitch set"
 .IR DEV
@@ -92,6 +99,11 @@ Format is:
 .in +2
 BUS_NAME/BUS_ADDRESS
 
+.SS devlink dev set  - sets devlink device attributes
+
+.TP
+.BI "netns { " PID " | " NAME " | " ID " }
+
 .SS devlink dev eswitch show - display devlink device eswitch attributes
 .SS devlink dev eswitch set  - sets devlink device eswitch attributes
 
-- 
2.21.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox