Netdev List
 help / color / mirror / Atom feed
* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-21 18:20 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180621165913.7e3f4faa.cohuck@redhat.com>

On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> On Wed, 20 Jun 2018 22:48:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> > > In any case, I'm not sure anymore why we'd want the extra uuid.  
> > 
> > It's mostly so we can have e.g. multiple devices with same MAC
> > (which some people seem to want in order to then use
> > then with different containers).
> > 
> > But it is also handy for when you assign a PF, since then you
> > can't set the MAC.
> > 
> 
> OK, so what about the following:
> 
> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>   that we have a new uuid field in the virtio-net config space
> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>   offer VIRTIO_NET_F_STANDBY_UUID if set
> - when configuring, set the property to the group UUID of the vfio-pci
>   device
> - in the guest, use the uuid from the virtio-net device's config space
>   if applicable; else, fall back to matching by MAC as done today
> 
> That should work for all virtio transports.

True. I'm a bit unhappy that it's virtio net specific though
since down the road I expect we'll have a very similar feature
for scsi (and maybe others).

But we do not have a way to have fields that are portable
both across devices and transports, and I think it would
be a useful addition. How would this work though? Any idea?

-- 
MST

^ permalink raw reply

* Re: [PATCH] net: nixge: Add __packed attribute to DMA descriptor struct
From: Moritz Fischer @ 2018-06-21 18:30 UTC (permalink / raw)
  To: David Miller; +Cc: f.fainelli, mdf, keescook, netdev, linux-kernel
In-Reply-To: <20180620.073750.642289685695664600.davem@davemloft.net>

Hi David,

On Wed, Jun 20, 2018 at 07:37:50AM +0900, David Miller wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
> Date: Tue, 19 Jun 2018 10:13:55 -0700
> 
> > How could padding be inserted given than all of the structure members
> > are naturally aligned (all u32 type). Compiler bug?
> 
> Agreed, this looks completely unnecessary.
> 
> __packed should only be used when absolutely necessary because using
> it generates less efficient code on some architectures.

Thanks for your input, will fix with the whole series when I submit it.

- Moritz

^ permalink raw reply

* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 18:34 UTC (permalink / raw)
  To: Vadim Pasternak, Guenter Roeck
  Cc: davem@davemloft.net, netdev@vger.kernel.org, jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37531F5503D85EB153A6C672A2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>

> Hi Andrew,

Adding Guenter Roeck, the HWMON maintainer.
 
> The temperature of each individual module can be obtained
> through ethtool.

You mean via --module-info?

FYI: I plan to add hwmon support to the kernel SFP code. So if you
ever decide to swap to the kernel SFP code, not your own, the raw
temperatures will be exported.

> The worst temperature is necessary for the system cooling
> control decision.

I would expect the system cooling would understand that.

> Up to 64 SFP/QSFP modules could be connected to the system.
> Some of them could cooper modules, which doesn't provide
> temperature measurement.

SFP modules are hot-plugable. So i would also expect the hwmon devices
to hotplug. If there is no sensor, then there is no hwmon device... If
there is no hwmon device, it plays no part in the thermal control
loop.

> Some of them could be optical modules, providing untrusted
> temperature measurement, which could impact thermal
> control of the system.

Why would you not trust it? Are you saying some modules simply have
broken temperature sensors? Do you have a whitelist/blacklist of
modules?

> Also optical modules could be from the different vendors,  and
> this is real situation, when, f.e. one module has the warning and
> critical thresholds 75C and 85C, while another 70C and 80C.

But hwmon exports both the actual temperature and the alarm
temperatures. I would expect the thermal control code to use all this
information when making its decisions, not just the current
temperature.

> So, nominal temperature is not the case here, we should know the
> "worst" value for the thermal control decision.

What it sounds like to me is you are working around problems in the
thermal control by fudging the raw temperatures. That is the wrong
thing to do. hwmon should export the raw data, and you should fix the
thermal control code to use it correctly.

	Andrew

^ permalink raw reply

* [PATCH] net: phy: Allow compile test of GPIO consumers if !GPIOLIB
From: Geert Uytterhoeven @ 2018-06-21 18:58 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Geert Uytterhoeven

The GPIO subsystem provides dummy GPIO consumer functions if GPIOLIB is
not enabled. Hence drivers that depend on GPIOLIB, but use GPIO consumer
functionality only, can still be compiled if GPIOLIB is not enabled.

Relax the dependency on GPIOLIB if COMPILE_TEST is enabled, where
appropriate.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
---
v3:
  - Rebased,

v2:
  - Add Acked-by.
---
 drivers/net/phy/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 343989f9f9d981e2..ceede09a28459f45 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -92,7 +92,8 @@ config MDIO_CAVIUM
 
 config MDIO_GPIO
 	tristate "GPIO lib-based bitbanged MDIO buses"
-	depends on MDIO_BITBANG && GPIOLIB
+	depends on MDIO_BITBANG
+	depends on GPIOLIB || COMPILE_TEST
 	---help---
 	  Supports GPIO lib-based MDIO busses.
 
-- 
2.17.1

^ permalink raw reply related

* RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-21 19:17 UTC (permalink / raw)
  To: Andrew Lunn, Guenter Roeck
  Cc: davem@davemloft.net, netdev@vger.kernel.org, jiri@resnulli.us
In-Reply-To: <20180621183440.GA10038@lunn.ch>



> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, June 21, 2018 9:35 PM
> To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck <linux@roeck-
> us.net>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
> 
> > Hi Andrew,
> 
> Adding Guenter Roeck, the HWMON maintainer.
> 
> > The temperature of each individual module can be obtained through
> > ethtool.
> 
> You mean via --module-info?

Yes.

> 
> FYI: I plan to add hwmon support to the kernel SFP code. So if you ever decide to
> swap to the kernel SFP code, not your own, the raw temperatures will be
> exported.
> 

Not sure it'll work for us, since we read SFP/QSFP ports through our SW/FW
interface.
But would be nice if you can provide some reference to this code.

> > The worst temperature is necessary for the system cooling control
> > decision.
> 
> I would expect the system cooling would understand that.
> 

In thermal zone infrastructure there is one temperature input.
How you can consider 64+ different inputs?

> > Up to 64 SFP/QSFP modules could be connected to the system.
> > Some of them could cooper modules, which doesn't provide temperature
> > measurement.
> 
> SFP modules are hot-plugable. So i would also expect the hwmon devices to
> hotplug. If there is no sensor, then there is no hwmon device... If there is no
> hwmon device, it plays no part in the thermal control loop.
> 
> > Some of them could be optical modules, providing untrusted temperature
> > measurement, which could impact thermal control of the system.
> 
> Why would you not trust it? Are you saying some modules simply have broken
> temperature sensors? Do you have a whitelist/blacklist of modules?
> 

We are reading temperature info through the firmware.
In case of "broken" module (module is supposed to be capable of
reading temperature, but returns some non-valid code), we'll get
some error code.

> > Also optical modules could be from the different vendors,  and this is
> > real situation, when, f.e. one module has the warning and critical
> > thresholds 75C and 85C, while another 70C and 80C.
> 
> But hwmon exports both the actual temperature and the alarm temperatures. I
> would expect the thermal control code to use all this information when making
> its decisions, not just the current temperature.
> 

All information is used, but the decision to increase FAN speed is taken
based on the worst measure, which is logical.

> > So, nominal temperature is not the case here, we should know the
> > "worst" value for the thermal control decision.
> 
> What it sounds like to me is you are working around problems in the thermal
> control by fudging the raw temperatures. That is the wrong thing to do. hwmon
> should export the raw data, and you should fix the thermal control code to use it
> correctly.

By default we are using kernel step-wise thermal algorithm, considering
all the module and ASIC ambient sensors temperature. This is not working
around. In thermal zone we have one PWM control and cumulative temperature
from the modules and ASIC. And it gives stable and correct results.

> 
> 	Andrew

^ permalink raw reply

* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 19:49 UTC (permalink / raw)
  To: Vadim Pasternak
  Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
	jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37537B5DCD0D607DFB7C7099A2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>

On Thu, Jun 21, 2018 at 07:17:03PM +0000, Vadim Pasternak wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Thursday, June 21, 2018 9:35 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck <linux@roeck-
> > us.net>
> > Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> > Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> > port temperature reading
> > 
> > > Hi Andrew,
> > 
> > Adding Guenter Roeck, the HWMON maintainer.
> > 
> > > The temperature of each individual module can be obtained through
> > > ethtool.
> > 
> > You mean via --module-info?
> 
> Yes.
> 
> > 
> > FYI: I plan to add hwmon support to the kernel SFP code. So if you ever decide to
> > swap to the kernel SFP code, not your own, the raw temperatures will be
> > exported.
> > 
> 
> Not sure it'll work for us, since we read SFP/QSFP ports through our SW/FW
> interface.

Can you make fake i2c busses? Pass the i2c transactions to the
firmware?

> But would be nice if you can provide some reference to this code.

drivers/net/phy/sfp.c

> 
> > > The worst temperature is necessary for the system cooling control
> > > decision.
> > 
> > I would expect the system cooling would understand that.
> > 
> 
> In thermal zone infrastructure there is one temperature input.
> How you can consider 64+ different inputs?

I've never used the thermal zone code. But i've used boards with 4
sensors spread around it. If the thermal zone code could not support
that, i would be surprised.

[Goes away and reads https://www.kernel.org/doc/Documentation/thermal/sysfs-api.txt]

So it sounds like, one zone is one sensor. So you actually have
hot-plugable zones, up to 64 of them. It also looks like you can bind
a zone to a cooling device. There does not seem to be a 1:1
mapping. So you should be able to bind 64 zones to one fan. Or if you
have multiple fans, bind a zone to the nearest fan.

But as i said, i'm no expert on this. You really should be posting
these patches on the hwmon list and the linux-pm list. The netdev list
does not have the needed specialist. Once Rui Zhang, Eduardo Valentin,
and Guenter Roack have given them Acked-by, David Miller can then
merge them.

      Andrew

^ permalink raw reply

* Re: Route fallback issue
From: Julian Anastasov @ 2018-06-21 19:57 UTC (permalink / raw)
  To: Grant Taylor; +Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <0a920d2d-4e97-284b-9aad-54cf75bcb755@spamtrap.tnetconsulting.net>


	Hello,

On Wed, 20 Jun 2018, Grant Taylor wrote:

> On 06/20/2018 01:00 PM, Julian Anastasov wrote:
> > You can also try alternative routes.
> 
> "Alternative routes"?  I can't say as I've heard that description as a
> specific technique / feature / capability before.
> 
> Is that it's official name?

	I think so

> Where can I find out more about it?

	You can search on net. I have some old docs on
these issues, they should be actual:

http://ja.ssi.bg/dgd-usage.txt

> > But as the kernel supports only default alternative routes, you can put them
> > in their own table:
> 
> I don't know that that is the case any more.
> 
> I was able to issue the following commands without a problem:
> 
> # ip route append 192.0.2.128/26 via 192.0.2.62
> # ip route append 192.0.2.128/26 via 192.0.2.126
> 
> I crated two network namespaces and had a pair of vEths between them
> (192.0.2.0/26 and 192.0.2.64/26).  I added a dummy network to each NetNS
> (192.0.2.128/26 and 192.0.2.192/26).
> 
> I ran the following commands while a persistent ping was running from one
> NetNS to the IP on the other's dummy0 interface:
> 
> # ip link set ns2b up && ip route append 192.0.2.192/26 via 192.0.2.126 && ip
> link set ns2a down
> (pause and watch things)
> # ip link set ns2a up && ip route append 192.0.2.192/26 via 192.0.2.62 && ip
> link set ns2b down
> (pause and watch things)
> 
> I could iterate between the two above commands and pings continued to work.
> 
> So, I think that it's now possible to use "alternate routes" (new to me) on
> specific prefixes in addition to the default.  Thus there is no longer any
> need for a separate table and the associated IP rule.

	Not true. net/ipv4/fib_semantics.c:fib_select_path()
calls fib_select_default() only when prefixlen = 0 (default route).
Otherwise, only the first route will be considered.

	fib_select_default() is the function that decides which
nexthop is reachable and whether to contact it. It uses the ARP
state via fib_detect_death(). That is all code that is behind this
feature called "alternative routes": the kernel selects one
based on nexthop's ARP state. Routes with different metric are
considered only when the routes with lower metric are removed.

> I'm running kernel version 4.9.76.
> 
> I did go ahead and set net.ipv4.conf.ns2b.ignore_routes_with_linkdown to 1.
> 
> for i in /proc/sys/net/ipv4/conf/*/ignore_routes_with_linkdown; do echo 1 >
> $i; done

	IIRC, this flag invalidates nexthops depending on
the link state. If your link is always UP it does not help
much. If you rely on user space tool, you can check the state
of the desired hops: device link state, your gateway to
ISP, one or more gateways in the ISP network which you
consider permanent part of the path via this ISP.

> Doing that dropped the number of dropped pings from 60 ~ 90 (1 / second) to 0
> ~ 5 (1 / second).  (Rarely, maybe 1 out of 20 flips, would it take upwards of
> 10 pings / seconds.)
> 
> > # Alternative routes use same metric!!!
> > ip route append default via 192.168.1.254 dev eno1 table 100
> > ip route append default via 192.168.2.254 dev eno2 table 100
> > ip rule add prio 100 to 172.16.0.0/12 table 100
> 
> I did have to "append" the route.  I couldn't just "add" the route. When I
> tried to "add" the second route, I got an error about the route already
> existing.  Using "append" instead of "add" with everything else the same
> worked just fine.
> 
> Note:  I did go ahead and remove the single route that was added via "add" and
> used "append" for both.

	First route can be created with 'add' but all next
alternative routes can be added only with "append". If you
successfully add them with "add" it means they are not
alternatives to the first one, they are not considered at all.

Regards

^ permalink raw reply

* RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-21 20:02 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
	jiri@resnulli.us
In-Reply-To: <20180621194917.GC10038@lunn.ch>



> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, June 21, 2018 10:49 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: Guenter Roeck <linux@roeck-us.net>; davem@davemloft.net;
> netdev@vger.kernel.org; jiri@resnulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
> 
> On Thu, Jun 21, 2018 at 07:17:03PM +0000, Vadim Pasternak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > > Sent: Thursday, June 21, 2018 9:35 PM
> > > To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck
> > > <linux@roeck- us.net>
> > > Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> > > Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment
> > > module for port temperature reading
> > >
> > > > Hi Andrew,
> > >
> > > Adding Guenter Roeck, the HWMON maintainer.
> > >
> > > > The temperature of each individual module can be obtained through
> > > > ethtool.
> > >
> > > You mean via --module-info?
> >
> > Yes.
> >
> > >
> > > FYI: I plan to add hwmon support to the kernel SFP code. So if you
> > > ever decide to swap to the kernel SFP code, not your own, the raw
> > > temperatures will be exported.
> > >
> >
> > Not sure it'll work for us, since we read SFP/QSFP ports through our
> > SW/FW interface.
> 
> Can you make fake i2c busses? Pass the i2c transactions to the firmware?

Theoretically yes.
But have well-defined SW/FW interface, working over PCI and FW at
ASIC end implements I2C master.
> 
> > But would be nice if you can provide some reference to this code.
> 
> drivers/net/phy/sfp.c
> 

Thank you.

> >
> > > > The worst temperature is necessary for the system cooling control
> > > > decision.
> > >
> > > I would expect the system cooling would understand that.
> > >
> >
> > In thermal zone infrastructure there is one temperature input.
> > How you can consider 64+ different inputs?
> 
> I've never used the thermal zone code. But i've used boards with 4 sensors
> spread around it. If the thermal zone code could not support that, i would be
> surprised.
> 
> [Goes away and reads
> https://www.kernel.org/doc/Documentation/thermal/sysfs-api.txt]
> 
> So it sounds like, one zone is one sensor. So you actually have hot-plugable
> zones, up to 64 of them. It also looks like you can bind a zone to a cooling
> device. There does not seem to be a 1:1 mapping. So you should be able to bind
> 64 zones to one fan. Or if you have multiple fans, bind a zone to the nearest fan.
> 

It means I will have 64 thermal zones for each module (actually for the
some coming new systems will support 128 modules, plus thermal zone
for ASIC ambient temperatures.
And each zone will try to control same PWM.
As I result PWM will be extremely jumpy and non-effective.

> But as i said, i'm no expert on this. You really should be posting these patches on
> the hwmon list and the linux-pm list. The netdev list does not have the needed
> specialist. Once Rui Zhang, Eduardo Valentin, and Guenter Roack have given
> them Acked-by, David Miller can then merge them.

Thanks,
Vadim.

> 
>       Andrew

^ permalink raw reply

* [PATCH] net: bridge: fix potential null pointer dereference on return from br_port_get_rtnl()
From: Garry McNulty @ 2018-06-21 20:14 UTC (permalink / raw)
  To: netdev; +Cc: stephen, davem, jiri, nikolay, bridge, linux-kernel,
	Garry McNulty

br_port_get_rtnl() can return NULL if the network device is not a bridge
port (IFF_BRIDGE_PORT flag not set). br_port_slave_changelink() and
br_port_fill_slave_info() callbacks dereference this pointer without
checking. Currently this is not a problem because slave devices always
set this flag. Add null check in case these conditions ever change.

Detected by CoverityScan, CID 1339613 ("Dereference null return value")

Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
---
 net/bridge/br_netlink.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 9f5eb05b0373..b3ad135b7157 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -947,13 +947,14 @@ static int br_port_slave_changelink(struct net_device *brdev,
 				    struct netlink_ext_ack *extack)
 {
 	struct net_bridge *br = netdev_priv(brdev);
+	struct net_bridge_port *p = br_port_get_rtnl(dev);
 	int ret;
 
-	if (!data)
+	if (!data || !p)
 		return 0;
 
 	spin_lock_bh(&br->lock);
-	ret = br_setport(br_port_get_rtnl(dev), data);
+	ret = br_setport(p, data);
 	spin_unlock_bh(&br->lock);
 
 	return ret;
@@ -963,7 +964,9 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
 				   const struct net_device *brdev,
 				   const struct net_device *dev)
 {
-	return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
+	struct net_bridge_port *p = br_port_get_rtnl(dev);
+
+	return p ? br_port_fill_attrs(skb, p) : -EINVAL;
 }
 
 static size_t br_port_get_slave_size(const struct net_device *brdev,
-- 
2.14.4

^ permalink raw reply related

* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 20:18 UTC (permalink / raw)
  To: Vadim Pasternak
  Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
	jiri@resnulli.us
In-Reply-To: <HE1PR0502MB3753C8AF457059B3D95FA35CA2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>

> It means I will have 64 thermal zones for each module (actually for the
> some coming new systems will support 128 modules, plus thermal zone
> for ASIC ambient temperatures.
> And each zone will try to control same PWM.
> As I result PWM will be extremely jumpy and non-effective.

Hi Vadim

Please repost to the correct lists, and ask the experts how this
should be done. netdev is not the right place to discuss this.

       Andrew

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Don Bollinger @ 2018-06-21 20:28 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Florian Fainelli, don
In-Reply-To: <20180621081127.GA31742@lunn.ch>

On Thu, Jun 21, 2018 at 10:11:27AM +0200, Andrew Lunn wrote:
> > I'm trying to figure out how the netdev environment works on large
> > switches.  I can't imagine that the kernel network stack is involved in
> > any part of the data plane. 
> 
> Hi Don
> 
> It is involved in the slow path. I.e. packets from the host out
> network ports. BPDU, IGMP, ARP, ND, etc. It can also be involved when
> the hardware is missing features. Also, for communication with the
> host itself.
> 
> What is more important is it is the control plane. Want to bridge two
> ports together?  You create a software bridge, add the two ports, and
> then offload it to the hardware. The kernel STP code in the software
> bridge then does the state tracking, blocked, learning, forwarding
> etc. Need RSTP? Just run the standard RSTP daemon on the software
> bridge interface.
> 
> Basically, use the Linux network stack as is, and offload what you can
> to the hardware. That means you keep all your existing user space
> network tools, daemons, SNMP agents, etc. They all just work, because
> the kernel APIs remain the same, independent of if you have a switch,
> or just a bunch of networks cards.
> 
> > Can you point me to any conference slides,
> > or design docs or other documentation that describes a netdev
> > implementation on Trident II switch silicon?  Or any other switch that
> > supports 128 x 10G (or 32 x 40G) ports?
> 
> Look at past netdev conference. In particular, talks given by
> Mellanox, Cumulus, and Netronome. You can also see there drivers in

Thanks.  I found a slide deck from Cumulus at
www.slideshare.net/CumulusNetworks/webinarlinux-networking-is-awesome

I think this connects the dots between our worlds.  It turns out that
optoe actually is derived from the Cumulus sff_8436 driver, which they
use to access QSFP devices.  It was the best available, but had an
experimental implementation of SFP (didn't work yet).  They actually use
the at24 driver for SFP.  Optoe is actually architecturally identical to
the Cumulus implementation.  It does not use the SFP framework, but it 
does interface with their Linux network stack via ethtool, etc.  In slide
10 of the deck they explicitly call out device drivers, saying "Innovation
and change here is good."

> drivers/ethernet/{mellonex|netronome}. These devices however tend to
> go for firmware to control the PHYs, not the Linux network stack.
> drivers/net/dsa covers SOHO switches, so up to 10 ports, mostly 1G,
> some 10G ports. There is a lot of industry involved thin this segment,
> trains, planes, busses, plant automation, etc, and some WiFi and STP.
> Switches with DSA drivers make use of Linux to control the PHYs, not
> firmware.

Again, optoe does not control the PHYs.  It only access the EEPROMs (on
the PHYs).  It does not touch any of the electrical pins.  It can provide 
the EEPROM access to any component that wants that access, including sfp.c.

> SFP are also slowly starting to enter the consumer market, with
> products like the Clearfog, MACCHIATObin, and there are a few
> industrial boards with SOHO switches with SFP sockets or SFF
> modules. These are what are driving in kernel SFP support.

Got it.  I'm targeting a different market, with a different
architecture.  In this architecture it makes more sense to separate the
EEPROM access from the IO pins control.

> > Also, I see the sfp.c code.  Is there any QSFP code?  I'd like to see
> > how that code compares to the sfp.c code.
> 
> Currently, none that i know of. SFP+ is the limit so far. Mainly
> because SoC/SOHO switches currently top out at 10G, so SFP+ is
> sufficient.

SFP+ is not sufficient for another market, which is using Linux to
manage larger switches.  These switches all have some QSFP ports, many
of them have exclusively QSFP ports.  I have some useful code for those
environments.

> > optoe can provide access, through the SFP code, to the rest of the EEPROM
> > address space.  It can also provide access to the QSFP EEPROM.  I would
> > like to collaborate on the integration, that would fit my objective of
> > making more EEPROM space accessible on more platforms and distros.
> > 
> > However, you don't want me to make the changes to SFP myself.  I don't
> > have any hardware or OS environment that currently supports that code.
> > The cost and learning curve exceed my resources.  I *think* the changes
> > to the SFP code would be small, but I would need someone who understands
> > the code and can test it to actually make and submit the changes.
> 
> So i have been thinking about this some more. But given your lack of
> resources, i'm guessing this won't actually work for you. But it is
> probably the correct way forwards.
> 
> The basic problem the systems you are talking about is that they don't
> have a network interface per port. So they cannot use ethtool
> --module-info, which IS the defined API to get access to SFP
> data. Adding another API is probably not going to get accepted.

Got it.  I don't think I'm adding another API.  Note that Cumulus is
using the same architecture as optoe, and providing all the expected
linux services, including ethtool --module-info.  They are accessing the
module-info data throug ioctl, which opens the device file provided by
their driver and reads/writes the appropriate location.  Optoe works the
same way.

> However, the current ethtool API is ioctl based. The network stack is
> moving away from that, replacing it with netlink sockets. All IP
> configuration is now via netlink. All bridge configuration is by
> netlink, etc. So there is a desire to move ethtool to netlink.
> 
> This move makes the API more flexible. By default, you would expect
> the replacement implementation for --module-info to pass an ifindex,
> or maybe an interface name. However, it could be made to optionally
> take an i2c bus number. That could then go direct to the SFP code,
> rather than go via the MAC driver. That would give evil user space,
> proprietary, binary blob drivers access to SFP via the standard
> supported kernel API, using the standard supported kernel SFP driver.

Here's what I have in mind...  struct sfp in sfp.c has a read and write:

   int (*read) struct sfp *, bool, u8, void *, size_t);
   int (*write) struct sfp *, bool, u8, void *, size_t);

These are instantiated with:

   sfp->read = sfp_i2c_read;
   sfp->write = sfp_i2c_write;

So to insert optoe into this stack, we would need to add an i2c_client
to struct sfp:

   struct i2c_client *i2c_client;

We would need to initialize that i2c_client in sfp_i2c_configure:

   board_info = alloc_board_info(sfp);
   sfp->i2c_client = i2c_new_device(i2c, board_info);

We need to write the brief routine that allocates a struct_i2c_board_info,
and stuffs the necessary data into it.  I'm assuming that data can come
from sfp.  The data required includes an appropriate name for this device,
and whether it is an SFP or QSFP device.  (When QSFP is added to sfp.c,
we can add a flag to struct sfp.  Your stack will have to know which it
is anyway to know where the necessary data is in the EEPROM.)

Finally, we replace the body of sfp_i2c_{read, write} with a callback to
optoe.  All of the necessary data is already in the parameters to
sfp_i2c_{read, write}.

> 
> But that requires you roll up your sleeves and get stuck in to this
> conversion work.

I'm offering an improvement to sfp.c.  The improvement is access to more
pages of SFP EEPROM, and access to QSFP EEPROMs.  It comes in the form of
a specialized EEPROM driver custom built for {Q}SFP devices.  I'm also
offering to help integrate that driver into sfp.c.  I can modify optoe
to accomodate sfp.c, I can recommend how to instantiate and call it. I am
not going to be able to spend the time and money required to modify and
test sfp.c.  I'm pretty sure you can do it MUCH faster, and MUCH better
than I can.

> 
> But you say you work for a fibre module company. Do they produce
> SFP/SFP+ modules? You could get one of the supported consumer boards
> with an SFP/SFP+ socket and test your modules work properly. Build out

Unfortunately, that isn't going to happen on their dime.  Their dimes
are running out for this kind of work.

> the SFP code. It has been on my TODO list to add HWMON support for the
> temperature sensors, etc.

Huh.  Just read Documentation/hwmon/sysfs-interface.  Looks like a good
way to deliver that EEPROM data.  Wish I'd found that two years ago when
there were a few more dimes available.

Don

^ permalink raw reply

* Re: [PATCH net] bpf: enforce correct alignment for instructions
From: Daniel Borkmann @ 2018-06-21 21:03 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, edumazet; +Cc: netdev, kafai, ast
In-Reply-To: <e1aec35e-957d-c38d-0bc5-d39c20773136@gmail.com>

On 06/21/2018 06:08 AM, Eric Dumazet wrote:
> On 06/20/2018 08:46 PM, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Wed, 20 Jun 2018 17:24:09 -0700
>>
>>> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
>>> breaking powerpc BPF badly, since instructions need to be word aligned.
>>>
>>> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> I'll apply this directly, thanks Eric.
> 
> Thanks David :)

Sigh, sorry for the breakage, looks like I got fooled by x86 gcc.

struct bpf_binary_header {
        u16                        pages;                /*     0     2 */
        u16                        locked:1;             /*     2:15  2 */

        /* XXX 15 bits hole, try to pack */

        u8                         image[0];             /*     4     0 */

        /* size: 4, cachelines: 1, members: 3 */
        /* bit holes: 1, sum bit holes: 15 bits */
        /* last cacheline: 4 bytes */
};

Thanks Eric!

^ permalink raw reply

* Re: Route fallback issue
From: Grant Taylor @ 2018-06-21 21:08 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <alpine.LFD.2.20.1806212218070.2159@ja.home.ssi.bg>

[-- Attachment #1: Type: text/plain, Size: 6350 bytes --]

On 06/21/2018 01:57 PM, Julian Anastasov wrote:
> Hello,

Hi.

> I think so

Okay.

I'll do some more digging.

> You can search on net. I have some old docs on these issues, they should 
> be actual:
> 
> http://ja.ssi.bg/dgd-usage.txt

"DGD" or "Dead Gateway Detection" sounds very familiar.  I referenced it 
in an earlier reply.

I distinctly remember DGD not behaving satisfactorily years ago.  Where 
unsatisfactorily was something like 90 seconds (or more) to recover. 
Which actually matches what I was getting without the 
ignore_routes_with_linkdown=1 setting that David A. mentioned.

With ignore_routes_with_linkdown=1 things behaved much better.

> Not true. net/ipv4/fib_semantics.c:fib_select_path() calls 
> fib_select_default() only when prefixlen = 0 (default route).

Okay....  My testing last night disagrees with you.  Specifically, I was 
able to add a alternate routes to the same prefix, 192.0.2.128/26. 
There was not any default gateway configured on any of the NetNSs.  So 
everything was using routes for locally attacked or the two added via 
"ip route append".

What am I misinterpreting?  Or where are we otherwise talking past each 
other?

> Otherwise, only the first route will be considered.

"only the first route" almost sounds like something akin to Equal Cost 
Multi Path.

I was not expecting "alternative routes" to use more than one route at a 
time, equally or otherwise.  I was wanting for the kernel to fall back 
to an alternate route / gateway / path in the event that the one that 
was being used became unusable / unreachable.

So what should "Alternative Routes" do?  How does this compare / 
contract to E.C.M.P. or D.G.D.

> fib_select_default() is the function that decides which nexthop 
> is reachable and whether to contact it. It uses the ARP state via 
> fib_detect_death(). That is all code that is behind this feature called 
> "alternative routes": the kernel selects one based on nexthop's ARP 
> state.

Please confirm that you aren't entering / referring to E.C.M.P. 
territory when you say "nexthop".  I think that you are not, but I want 
to ask and be sure, particularly seeing as how things are very closely 
related.

It sounds like you're referring to literally the router that is the next 
hop in the path.  I.e. the device on the other end of the wire.

I'll have to find, read, and try to grok the code to have a better idea. 
  That being said, it looks like (based on the name) that 
fib_select_default() deals with the default route.  The testing I did 
last night, and positive results, indicate that the kernel did what I 
wanted it to do.  (See above about D.G.D. vs E.C.M.P.)

So, it seems as if something about alternative routes worked using 
non-default routes.  I have no way of knowing if it was the code that 
we're talking about, or something else that produced the results.  Given 
the way I did the test (specific prefixes, non-default, routes being 
appended with no other routes) worked the way that I would have thought 
that a feature that uses alternative routes (or historically D.G.D.) 
would have worked.

The following ping works just fine as I bounce interfaces on NS1.

ns2# ping -I 192.0.2.254 192.0.2.129

I can confirm that traffic is moving back and forth between the vEth 
links between the NetNSs.  Granted, the traffic sticks to one vEth 
interface until it goes away.

I can shut down ns2a on NS1 so that ns1a sees loss of link but but stays 
up on NS2, and traffic moves to vEth-B.

I can then open up ns2a on NS1 so that ns1a sees link on NS2, and 
re-append the route on NS1.

I can then shut down ns2b on NS1 so that ns1b sees loss of link but 
stays up on NS2, and traffic moves to vEth-A.

I can then open up ns2b on NS1 so that ns1b sees link on NS2, and 
re-append the route on NS1.

NS2 behaves exactly as I would hope.  Traffic will move from the down 
interface to the remaining up interface.  Back and forth, no problem.

I don't know where the disconnect is, but I feel like there is one.

> Routes with different metric are considered only when the routes with 
> lower metric are removed.

I agree with the statement.  What I question is where metric came into 
play here.  All of the routes had the same (default) metric.  None of 
the routes I tested had different metrics.

ns1# ip route show
192.0.2.0/26 dev ns2a proto kernel scope link src 192.0.2.1
192.0.2.64/26 dev ns2b proto kernel scope link src 192.0.2.65
192.0.2.128/26 dev dummy0 proto kernel scope link src 192.0.2.129
192.0.2.192/26 via 192.0.2.62 dev ns2a
192.0.2.192/26 via 192.0.2.126 dev ns2b

ns2# ip route show
192.0.2.0/26 dev ns1a proto kernel scope link src 192.0.2.62
192.0.2.64/26 dev ns1b proto kernel scope link src 192.0.2.126
192.0.2.128/26 via 192.0.2.65 dev ns1b
192.0.2.128/26 via 192.0.2.1 dev ns1a
192.0.2.192/26 dev dummy0 proto kernel scope link src 192.0.2.254

> IIRC, this flag invalidates nexthops depending on the link state. If 
> your link is always UP it does not help much.

That's what I gathered.  So things like DSL & cable modems or other L2 
bridging devices might not drop the link when their circuit drops.

This is also why I asked the follow up questions to David's email.

I want to do some testing to see if fib_multipath_use_neigh alters this 
behavior at all.  I'm hoping that it will invalidate an alternate route 
if the MAC is not resolvable even if the physical link stays up.

Sure, the ARP cache may have a 30 ~ 120 second timeout before triggering 
this behavior.  But having that timeout and starting to use an 
alternative route is considerably better than not using an alternative 
route.

> If you rely on user space tool, you can check the state of the desired 
> hops: device link state, your gateway to ISP, one or more gateways in the 
> ISP network which you consider permanent part of the path via this ISP.

This is what I have thought about doing previously.

> First route can be created with 'add' but all next alternative routes 
> can be added only with "append". If you successfully add them with 
> "add" it means they are not alternatives to the first one, they are not 
> considered at all.

ACK



-- 
Grant. . . .
unix || die


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3982 bytes --]

^ permalink raw reply

* Re: [PATCH bpf 0/2] tools: bpftool: small fixes for error handling in prog load
From: Daniel Borkmann @ 2018-06-21 21:10 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: netdev, oss-drivers
In-Reply-To: <20180620184246.18672-1-jakub.kicinski@netronome.com>

On 06/20/2018 08:42 PM, Jakub Kicinski wrote:
> Hi!
> 
> Two small fixes for error handling in bpftool prog load, first patch
> removes a duplicated message, second one frees resources correctly.
> Multiple error messages break JSON:
> 
> # bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a
> {
>     "error": "can't pin the object (/sys/fs/bpf/a): File exists"
> },{
>     "error": "failed to pin program"
> }	
> 
> Jakub Kicinski (2):
>   tools: bpftool: remove duplicated error message on prog load
>   tools: bpftool: remember to close the libbpf object on prog load error
> 
>  tools/bpf/bpftool/prog.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 

Applied to bpf, thanks Jakub!

^ permalink raw reply

* [PATCH net] net/packet: fix use-after-free
From: Eric Dumazet @ 2018-06-21 21:16 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, Willem de Bruijn

We should put copy_skb in receive_queue only after
a successful call to virtio_net_hdr_from_skb().

syzbot report :

BUG: KASAN: use-after-free in __skb_unlink include/linux/skbuff.h:1843 [inline]
BUG: KASAN: use-after-free in __skb_dequeue include/linux/skbuff.h:1863 [inline]
BUG: KASAN: use-after-free in skb_dequeue+0x16a/0x180 net/core/skbuff.c:2815
Read of size 8 at addr ffff8801b044ecc0 by task syz-executor217/4553

CPU: 0 PID: 4553 Comm: syz-executor217 Not tainted 4.18.0-rc1+ #111
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 __skb_unlink include/linux/skbuff.h:1843 [inline]
 __skb_dequeue include/linux/skbuff.h:1863 [inline]
 skb_dequeue+0x16a/0x180 net/core/skbuff.c:2815
 skb_queue_purge+0x26/0x40 net/core/skbuff.c:2852
 packet_set_ring+0x675/0x1da0 net/packet/af_packet.c:4331
 packet_release+0x630/0xd90 net/packet/af_packet.c:2991
 __sock_release+0xd7/0x260 net/socket.c:603
 sock_close+0x19/0x20 net/socket.c:1186
 __fput+0x35b/0x8b0 fs/file_table.c:209
 ____fput+0x15/0x20 fs/file_table.c:243
 task_work_run+0x1ec/0x2a0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x1b08/0x2750 kernel/exit.c:865
 do_group_exit+0x177/0x440 kernel/exit.c:968
 __do_sys_exit_group kernel/exit.c:979 [inline]
 __se_sys_exit_group kernel/exit.c:977 [inline]
 __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4448e9
Code: Bad RIP value.
RSP: 002b:00007ffd5f777ca8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004448e9
RDX: 00000000004448e9 RSI: 000000000000fcfb RDI: 0000000000000001
RBP: 00000000006cf018 R08: 00007ffd0000a45b R09: 0000000000000000
R10: 00007ffd5f777e48 R11: 0000000000000202 R12: 00000000004021f0
R13: 0000000000402280 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 4553:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
 skb_clone+0x1f5/0x500 net/core/skbuff.c:1282
 tpacket_rcv+0x28f7/0x3200 net/packet/af_packet.c:2221
 deliver_skb net/core/dev.c:1925 [inline]
 deliver_ptype_list_skb net/core/dev.c:1940 [inline]
 __netif_receive_skb_core+0x1bfb/0x3680 net/core/dev.c:4611
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 netif_receive_skb_internal+0x12e/0x7d0 net/core/dev.c:4767
 netif_receive_skb+0xbf/0x420 net/core/dev.c:4791
 tun_rx_batched.isra.55+0x4ba/0x8c0 drivers/net/tun.c:1571
 tun_get_user+0x2af1/0x42f0 drivers/net/tun.c:1981
 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:2009
 call_write_iter include/linux/fs.h:1795 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x6c6/0x9f0 fs/read_write.c:487
 vfs_write+0x1f8/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4553:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
 kfree_skbmem+0x154/0x230 net/core/skbuff.c:582
 __kfree_skb net/core/skbuff.c:642 [inline]
 kfree_skb+0x1a5/0x580 net/core/skbuff.c:659
 tpacket_rcv+0x189e/0x3200 net/packet/af_packet.c:2385
 deliver_skb net/core/dev.c:1925 [inline]
 deliver_ptype_list_skb net/core/dev.c:1940 [inline]
 __netif_receive_skb_core+0x1bfb/0x3680 net/core/dev.c:4611
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 netif_receive_skb_internal+0x12e/0x7d0 net/core/dev.c:4767
 netif_receive_skb+0xbf/0x420 net/core/dev.c:4791
 tun_rx_batched.isra.55+0x4ba/0x8c0 drivers/net/tun.c:1571
 tun_get_user+0x2af1/0x42f0 drivers/net/tun.c:1981
 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:2009
 call_write_iter include/linux/fs.h:1795 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x6c6/0x9f0 fs/read_write.c:487
 vfs_write+0x1f8/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801b044ecc0
 which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 0 bytes inside of
 232-byte region [ffff8801b044ecc0, ffff8801b044eda8)
The buggy address belongs to the page:
page:ffffea0006c11380 count:1 mapcount:0 mapping:ffff8801d9be96c0 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0006c17988 ffff8801d9bec248 ffff8801d9be96c0
raw: 0000000000000000 ffff8801b044e040 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801b044eb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8801b044ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>ffff8801b044ec80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                           ^
 ffff8801b044ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801b044ed80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc

Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 50809748c1279ea17b7499acbec5699443804f64..ff8e7e245c375504d3fca3293c50518831816bcd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2262,6 +2262,13 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		if (po->stats.stats1.tp_drops)
 			status |= TP_STATUS_LOSING;
 	}
+
+	if (do_vnet &&
+	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
+				    sizeof(struct virtio_net_hdr),
+				    vio_le(), true, 0))
+		goto drop_n_account;
+
 	po->stats.stats1.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@@ -2269,15 +2276,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);
 
-	if (do_vnet) {
-		if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
-					    sizeof(struct virtio_net_hdr),
-					    vio_le(), true, 0)) {
-			spin_lock(&sk->sk_receive_queue.lock);
-			goto drop_n_account;
-		}
-	}
-
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
 	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply related

* [PATCH 12/26] Convert net_namespace to new IDA API
From: Matthew Wilcox @ 2018-06-21 21:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox, David S. Miller, Kirill Tkhai, Andrei Vagin,
	Eric W. Biederman, Eric Dumazet, Florian Westphal, netdev
In-Reply-To: <20180621212835.5636-1-willy@infradead.org>

Signed-off-by: Matthew Wilcox <willy@infradead.org>
---
 net/core/net_namespace.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..f447cebdcea3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -973,22 +973,18 @@ static int register_pernet_operations(struct list_head *list,
 	int error;
 
 	if (ops->id) {
-again:
-		error = ida_get_new_above(&net_generic_ids, MIN_PERNET_OPS_ID, ops->id);
-		if (error < 0) {
-			if (error == -EAGAIN) {
-				ida_pre_get(&net_generic_ids, GFP_KERNEL);
-				goto again;
-			}
+		error = ida_alloc_min(&net_generic_ids, MIN_PERNET_OPS_ID,
+				GFP_KERNEL);
+		if (error < 0)
 			return error;
-		}
+		*ops->id = error;
 		max_gen_ptrs = max(max_gen_ptrs, *ops->id + 1);
 	}
 	error = __register_pernet_operations(list, ops);
 	if (error) {
 		rcu_barrier();
 		if (ops->id)
-			ida_remove(&net_generic_ids, *ops->id);
+			ida_free(&net_generic_ids, *ops->id);
 	}
 
 	return error;
@@ -999,7 +995,7 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 	__unregister_pernet_operations(ops);
 	rcu_barrier();
 	if (ops->id)
-		ida_remove(&net_generic_ids, *ops->id);
+		ida_free(&net_generic_ids, *ops->id);
 }
 
 /**
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-21 21:59 UTC (permalink / raw)
  To: Okash Khawaja
  Cc: Daniel Borkmann, Martin KaFai Lau, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180620203703.101156292@fb.com>

On Wed, 20 Jun 2018 13:30:53 -0700, Okash Khawaja wrote:
> $ sudo bpftool map dump -p id 14
> [{
>         "key": 0
>     },{
>         "value": {
>             "m": 1,
>             "n": 2,
>             "o": "c",
>             "p": [15,16,17,18,15,16,17,18
>             ],
>             "q": [[25,26,27,28,25,26,27,28
>                 ],[35,36,37,38,35,36,37,38
>                 ],[45,46,47,48,45,46,47,48
>                 ],[55,56,57,58,55,56,57,58
>                 ]
>             ],
>             "r": 1,
>             "s": 0x7ffff6f70568,
>             "t": {
>                 "x": 5,
>                 "y": 10
>             },
>             "u": 100,
>             "v": 20,
>             "w1": 0x7,
>             "w2": 0x3
>         }
>     }
> ]

I don't think this format is okay, JSON output is an API you shouldn't
break.  You can change the non-JSON output whatever way you like, but
JSON must remain backwards compatible.

The dump today has object per entry, e.g.:

{
        "key":["0x00","0x00","0x00","0x00",
        ],
        "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
        ]
}

This format must remain, you may only augment it with new fields.  E.g.:

{
        "key":["0x00","0x00","0x00","0x00",
        ],
	"key_struct":{
		"index":0
	},
        "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
        ],
	"value_struct":{
		"src_ip":2,
		"dst_ip:0
	}
}

The name XYZ_struct may not be the best, perhaps you can come up with a
better one?  

Does that make sense?  Am I missing what you're doing here?

One process note - please make sure you run checkpatch.pl --strict on
bpftool patches before posting.

Thanks for working on this!

^ permalink raw reply

* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Guenter Roeck @ 2018-06-21 22:06 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vadim Pasternak, davem@davemloft.net, netdev@vger.kernel.org,
	jiri@resnulli.us
In-Reply-To: <20180621183440.GA10038@lunn.ch>

On Thu, Jun 21, 2018 at 08:34:40PM +0200, Andrew Lunn wrote:
> > Hi Andrew,
> 
> Adding Guenter Roeck, the HWMON maintainer.
>  
> > The temperature of each individual module can be obtained
> > through ethtool.
> 
> You mean via --module-info?
> 
> FYI: I plan to add hwmon support to the kernel SFP code. So if you
> ever decide to swap to the kernel SFP code, not your own, the raw
> temperatures will be exported.
> 
As should be. Unless adjustments are made by the hardware (eg a thermal
diode temperature offset register), all adjustments should be made in
userspace.

> > The worst temperature is necessary for the system cooling
> > control decision.
> 
> I would expect the system cooling would understand that.
> 
> > Up to 64 SFP/QSFP modules could be connected to the system.
> > Some of them could cooper modules, which doesn't provide
> > temperature measurement.
> 
> SFP modules are hot-plugable. So i would also expect the hwmon devices
> to hotplug. If there is no sensor, then there is no hwmon device... If
> there is no hwmon device, it plays no part in the thermal control
> loop.
> 
One hardware monitoring device per SFP, and I would assume that the
hwmon device for an SFP is only instantiated if a thermal sensor
is present.

> > Some of them could be optical modules, providing untrusted
> > temperature measurement, which could impact thermal
> > control of the system.
> 
> Why would you not trust it? Are you saying some modules simply have
> broken temperature sensors? Do you have a whitelist/blacklist of
> modules?
> 
> > Also optical modules could be from the different vendors,  and
> > this is real situation, when, f.e. one module has the warning and
> > critical thresholds 75C and 85C, while another 70C and 80C.
> 
> But hwmon exports both the actual temperature and the alarm
> temperatures. I would expect the thermal control code to use all this
> information when making its decisions, not just the current
> temperature.
> 
The respective information would either be provided by hardware
and reported to userspace, or userspace needs to determine the limits
and write them into the hardware. Either case, that is only relevant
if the hardware has limit registers. Otherwise all limits can and
should be handled in the thermal subsystem.

> > So, nominal temperature is not the case here, we should know the
> > "worst" value for the thermal control decision.
> 
> What it sounds like to me is you are working around problems in the
> thermal control by fudging the raw temperatures. That is the wrong
> thing to do. hwmon should export the raw data, and you should fix the
> thermal control code to use it correctly.
> 
Agreed. From the context it sounds like there should be some kind of
temperature aggregator which would probably reside in the thermal
subsystem (definitely not in hwmon).

I have not seen any hwmon specific patches. For new drivers,
please use [devm_]hwmon_device_register_with_info().

Guenter

^ permalink raw reply

* Re: [PATCH net] bpf: enforce correct alignment for instructions
From: David Miller @ 2018-06-21 22:10 UTC (permalink / raw)
  To: daniel; +Cc: eric.dumazet, edumazet, netdev, kafai, ast
In-Reply-To: <e57bc889-bf89-fafc-217c-fb3c57af2003@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 21 Jun 2018 23:03:09 +0200

> On 06/21/2018 06:08 AM, Eric Dumazet wrote:
>> On 06/20/2018 08:46 PM, David Miller wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>> Date: Wed, 20 Jun 2018 17:24:09 -0700
>>>
>>>> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>>> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
>>>> breaking powerpc BPF badly, since instructions need to be word aligned.
>>>>
>>>> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>
>>> I'll apply this directly, thanks Eric.
>> 
>> Thanks David :)
> 
> Sigh, sorry for the breakage, looks like I got fooled by x86 gcc.
> 
> struct bpf_binary_header {
>         u16                        pages;                /*     0     2 */
>         u16                        locked:1;             /*     2:15  2 */

Note that you can also just make locked a plan u16 for now until you
need more flag bits, the code generated will be more efficient
especially on non-x86.

^ permalink raw reply

* Re: [PATCH net] net/packet: fix use-after-free
From: David Miller @ 2018-06-21 22:18 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, willemb
In-Reply-To: <20180621211602.62011-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 21 Jun 2018 14:16:02 -0700

> We should put copy_skb in receive_queue only after
> a successful call to virtio_net_hdr_from_skb().
> 
> syzbot report :
 ...
> Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Cc: Willem de Bruijn <willemb@google.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [PATCH] net: bridge: fix potential null pointer dereference on return from br_port_get_rtnl()
From: David Miller @ 2018-06-21 22:20 UTC (permalink / raw)
  To: garrmcnu; +Cc: netdev, stephen, jiri, nikolay, bridge, linux-kernel
In-Reply-To: <20180621201427.4961-1-garrmcnu@gmail.com>

From: Garry McNulty <garrmcnu@gmail.com>
Date: Thu, 21 Jun 2018 21:14:27 +0100

> br_port_get_rtnl() can return NULL if the network device is not a bridge
> port (IFF_BRIDGE_PORT flag not set). br_port_slave_changelink() and
> br_port_fill_slave_info() callbacks dereference this pointer without
> checking. Currently this is not a problem because slave devices always
> set this flag. Add null check in case these conditions ever change.
> 
> Detected by CoverityScan, CID 1339613 ("Dereference null return value")
> 
> Signed-off-by: Garry McNulty <garrmcnu@gmail.com>

I don't think this is reasonable.

The bridge code will never, ever, install a slave that doesn't have
that bit set.  It's the most fundamental aspect of how these objects
are managed.

^ permalink raw reply

* Re: [PATCH net v2] cls_flower: fix use after free in flower S/W path
From: David Miller @ 2018-06-21 22:25 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, jhs, xiyou.wangcong, jiri, marcelo.leitner, paulb
In-Reply-To: <fd96de4e9dc358e3982922ae681fdb1b9d8ae72a.1529603970.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 21 Jun 2018 20:02:16 +0200

> If flower filter is created without the skip_sw flag, fl_mask_put()
> can race with fl_classify() and we can destroy the mask rhashtable
> while a lookup operation is accessing it.
 ...
> Fix the above waiting for a RCU grace period before destroying the
> rhashtable: we need to use tcf_queue_work(), as rhashtable_destroy()
> must run in process context, as pointed out by Cong Wang.
> 
> v1 -> v2: use tcf_queue_work to run rhashtable_destroy().
> 
> Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH bpf] tools/bpf: fix test_sockmap failure
From: Daniel Borkmann @ 2018-06-21 22:33 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180621170220.3055491-1-yhs@fb.com>

On 06/21/2018 07:02 PM, Yonghong Song wrote:
> On one of our production test machine, when running
> bpf selftest test_sockmap, I got the following error:
>   # sudo ./test_sockmap
>   libbpf: failed to create map (name: 'sock_map'): Operation not permitted
>   libbpf: failed to load object 'test_sockmap_kern.o'
>   libbpf: Can't get the 0th fd from program sk_skb1: only -1 instances
>   ......
>   load_bpf_file: (-1) Operation not permitted
>   ERROR: (-1) load bpf failed
> 
> The error is due to not-big-enough rlimit
>   struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
> 
> The test already includes "bpf_rlimit.h", which sets current
> and max rlimit to RLIM_INFINITY. Let us just use it.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>

Agree, applied to bpf, thanks Yonghong!

^ permalink raw reply

* Re: [PATCH] selftests: bpf: notification about privilege required to run test_kmod.sh testing script
From: Daniel Borkmann @ 2018-06-21 22:33 UTC (permalink / raw)
  To: Jeffrin Jose T, ast, shuah; +Cc: netdev, linux-kernel, linux-kselftest
In-Reply-To: <20180621170020.9073-1-ahiliation@gmail.com>

On 06/21/2018 07:00 PM, Jeffrin Jose T wrote:
> The test_kmod.sh script require root privilege for the successful
> execution of the test.
> 
> This patch is to notify the user about the privilege the script
> demands for the successful execution of the test.
> 
> Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com>

Applied to bpf, thanks Jeffrin!

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-21 22:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180621145935.41ff8974@cakuba.netronome.com>

On Thu, Jun 21, 2018 at 02:59:35PM -0700, Jakub Kicinski wrote:
> On Wed, 20 Jun 2018 13:30:53 -0700, Okash Khawaja wrote:
> > $ sudo bpftool map dump -p id 14
> > [{
> >         "key": 0
> >     },{
> >         "value": {
> >             "m": 1,
> >             "n": 2,
> >             "o": "c",
> >             "p": [15,16,17,18,15,16,17,18
> >             ],
> >             "q": [[25,26,27,28,25,26,27,28
> >                 ],[35,36,37,38,35,36,37,38
> >                 ],[45,46,47,48,45,46,47,48
> >                 ],[55,56,57,58,55,56,57,58
> >                 ]
> >             ],
> >             "r": 1,
> >             "s": 0x7ffff6f70568,
> >             "t": {
> >                 "x": 5,
> >                 "y": 10
> >             },
> >             "u": 100,
> >             "v": 20,
> >             "w1": 0x7,
> >             "w2": 0x3
> >         }
> >     }
> > ]
> 
> I don't think this format is okay, JSON output is an API you shouldn't
> break.  You can change the non-JSON output whatever way you like, but
> JSON must remain backwards compatible.
> 
> The dump today has object per entry, e.g.:
> 
> {
>         "key":["0x00","0x00","0x00","0x00",
>         ],
>         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
>         ]
> }
> 
> This format must remain, you may only augment it with new fields.  E.g.:
> 
> {
>         "key":["0x00","0x00","0x00","0x00",
>         ],
> 	"key_struct":{
> 		"index":0
> 	},
>         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
>         ],
> 	"value_struct":{
> 		"src_ip":2,
> 		"dst_ip:0
> 	}
> }
I am not sure how useful to have both "key|value" and "(key|value)_struct"
while most people would prefer "key_struct"/"value_struct" if it is
available.

How about introducing a new option, like "-b", to print the
map with BTF (if available) such that it won't break the existing
one (-j or -p) while the "-b" output can keep using the "key"
and "value".

The existing json can be kept as is.

> 
> The name XYZ_struct may not be the best, perhaps you can come up with a
> better one?  
> 
> Does that make sense?  Am I missing what you're doing here?
> 
> One process note - please make sure you run checkpatch.pl --strict on
> bpftool patches before posting.
> 
> Thanks for working on this!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox