Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v4 6/6] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Vivien Didelot @ 2019-08-26 17:50 UTC (permalink / raw)
  To: Marek Behun; +Cc: Andrew Lunn, netdev, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826193125.4c94662e@nic.cz>

On Mon, 26 Aug 2019 19:31:25 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> On Mon, 26 Aug 2019 17:38:30 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > +static int mv88e6xxx_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
> > > +				    phy_interface_t mode, bool allow_over_2500,
> > > +				    bool make_cmode_writable)  
> > 
> > I don't like these two parameters. The caller of this function can do
> > the check for allow_over_2500 and error out before calling this.
> > 
> > Is make_cmode_writable something that could be done once at probe and
> > then forgotten about? Or is it needed before every write? At least
> > move it into the specific port_set_cmode() that requires it.
> > 
> > Thanks
> > 	Andrew
> 
> Btw those two additional parameters were modeled as in
>   static int mv88e6xxx_port_set_speed(struct mv88e6xxx_chip *chip,
>                                       int port, int speed, bool alt_bit,
>                                       bool force_bit);

"AltSpeed" and "ForceSpeed" are two bits found in the MAC Control Register
of certain switch models, which can be set by this private helper.

I don't think that is the case for "allow_over_2500" and "make_cmode_writable".

^ permalink raw reply

* Re: [PATCH net-next v4 6/6] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Marek Behun @ 2019-08-26 17:52 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: Andrew Lunn, netdev, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826134418.GB29480@t480s.localdomain>

On Mon, 26 Aug 2019 13:44:18 -0400
Vivien Didelot <vivien.didelot@gmail.com> wrote:

> > It can be done once at probe. At first I thought about doing this in
> > setup_errata, but this is not an erratum. So shall I create a new
> > method for this in chip operations structure? Something like
> > port_additional_setup() ?  
> 
> No. Those "setup" or "config" functions are likely to do everything and
> become a mess, thus unmaintainable. Operations must be specific.

What about Andrew's complaint, then, abouth the two additional
parameters?

^ permalink raw reply

* [PATCH v1 net-next] net: phy: mdio_bus: make mdiobus_scan also cover PHY that only talks C45
From: Voon Weifeng @ 2019-08-27  1:52 UTC (permalink / raw)
  To: David S. Miller, Maxime Coquelin
  Cc: netdev, linux-kernel, Jose Abreu, Andrew Lunn, Florian Fainelli,
	Heiner Kallweit, Ong Boon Leong, Voon Weifeng

From: Ong Boon Leong <boon.leong.ong@intel.com>

Make mdiobus_scan() to try harder to look for any PHY that only talks C45.

Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index bd04fe762056..30dbc48b4c7e 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -525,8 +525,12 @@ struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr)
 	int err;
 
 	phydev = get_phy_device(bus, addr, false);
-	if (IS_ERR(phydev))
-		return phydev;
+	if (IS_ERR(phydev)) {
+		/* Try C45 to ensure we don't miss PHY that only talks C45 */
+		phydev = get_phy_device(bus, addr, true);
+		if (IS_ERR(phydev))
+			return phydev;
+	}
 
 	/*
 	 * For DT, see if the auto-probed phy has a correspoding child
-- 
1.9.1


^ permalink raw reply related

* Re: [PATCH bpf-next v2 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Jonathan Lemon @ 2019-08-26 17:54 UTC (permalink / raw)
  To: Björn Töpel
  Cc: ast, daniel, netdev, Björn Töpel, magnus.karlsson,
	magnus.karlsson, bpf, syzbot+c82697e3043781e08802, hdanton,
	i.maximets
In-Reply-To: <20190826061053.15996-3-bjorn.topel@gmail.com>



On 25 Aug 2019, at 23:10, Björn Töpel wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The state variable was read, and written outside the control mutex
> (struct xdp_sock, mutex), without proper barriers and {READ,
> WRITE}_ONCE correctness.
>
> In this commit this issue is addressed, and the state member is now
> used a point of synchronization whether the socket is setup correctly
> or not.
>
> This also fixes a race, found by syzcaller, in xsk_poll() where umem
> could be accessed when stale.
>
> Suggested-by: Hillf Danton <hdanton@sina.com>
> Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
> Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP 
> rings")
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Jonathan Lemon @ 2019-08-26 17:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Ilya Maximets, Björn Töpel, ast, daniel, netdev,
	magnus.karlsson, magnus.karlsson, bpf,
	syzbot+c82697e3043781e08802, hdanton
In-Reply-To: <1b780dd4-227f-64c4-260d-9e819ba7081f@intel.com>



On 26 Aug 2019, at 9:34, Björn Töpel wrote:

> On 2019-08-26 17:24, Ilya Maximets wrote:
>> This changes the error code a bit.
>> Previously:
>>     umem exists + xs unbound    --> EINVAL
>>     no umem     + xs unbound    --> EBADF
>>     xs bound to different dev/q --> EINVAL
>>
>> With this change:
>>     umem exists + xs unbound    --> EBADF
>>     no umem     + xs unbound    --> EBADF
>>     xs bound to different dev/q --> EINVAL
>>
>> Just a note. Not sure if this is important.
>>
>
> Note that this is for *shared* umem, so it's very seldom used. Still,
> you're right, that strictly this is an uapi break, but I'd vote for the
> change still. I find it hard to see that anyone relies on EINVAL/EBADF
> for shared umem bind.
>
> Opinions? :-)

I'd agree - if it isn't documented somewhere, it's not an API break. :)
-- 
Jonathan

^ permalink raw reply

* [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Marek Behún @ 2019-08-26 17:59 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	Marek Behún
In-Reply-To: <20190826134418.GB29480@t480s.localdomain>

Currently we support SERDES on the Topaz family in a limited way: no
IRQs and the cmode is not writable, thus the mode is determined by
strapping pins.

Marvell's examples though show how to make cmode writable on port 5 and
support SGMII autonegotiation. It is done by writing hidden registers,
for which we already have code.

This patch adds support for making the cmode for the SERDES port
writable on the Topaz family, at setup, by calling a new chip operation,
port_setup_extra. This operation is implemented for Topaz.

Then it enables cmode setting and SERDES IRQs for Topaz.

Tested on Turris Mox.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 15 ++++++++
 drivers/net/dsa/mv88e6xxx/chip.h |  3 ++
 drivers/net/dsa/mv88e6xxx/port.c | 65 ++++++++++++++++++++++++++++++--
 drivers/net/dsa/mv88e6xxx/port.h |  5 +++
 4 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 202ccce65b1c..9fee2cfe469f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2121,6 +2121,13 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
 	chip->ports[port].chip = chip;
 	chip->ports[port].port = port;
 
+	/* Some ports may need extra setup to be used as desired */
+	if (chip->info->ops->port_setup_extra) {
+		err = chip->info->ops->port_setup_extra(chip, port);
+		if (err)
+			return err;
+	}
+
 	/* MAC Forcing register: don't force link, speed, duplex or flow control
 	 * state to any particular values on physical ports, but force the CPU
 	 * port and all DSA ports to their maximum bandwidth and full duplex.
@@ -2913,7 +2920,9 @@ static const struct mv88e6xxx_ops mv88e6141_ops = {
 	.port_disable_pri_override = mv88e6xxx_port_disable_pri_override,
 	.port_link_state = mv88e6352_port_link_state,
 	.port_get_cmode = mv88e6352_port_get_cmode,
+	.port_set_cmode = mv88e6341_port_set_cmode,
 	.port_setup_message_port = mv88e6xxx_setup_message_port,
+	.port_setup_extra = mv88e6341_port_setup_extra,
 	.stats_snapshot = mv88e6390_g1_stats_snapshot,
 	.stats_set_histogram = mv88e6095_g1_stats_set_histogram,
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
@@ -2929,6 +2938,8 @@ static const struct mv88e6xxx_ops mv88e6141_ops = {
 	.vtu_loadpurge = mv88e6352_g1_vtu_loadpurge,
 	.serdes_power = mv88e6390_serdes_power,
 	.serdes_get_lane = mv88e6341_serdes_get_lane,
+	.serdes_irq_setup = mv88e6390_serdes_irq_setup,
+	.serdes_irq_free = mv88e6390_serdes_irq_free,
 	.gpio_ops = &mv88e6352_gpio_ops,
 	.phylink_validate = mv88e6341_phylink_validate,
 };
@@ -3608,7 +3619,9 @@ static const struct mv88e6xxx_ops mv88e6341_ops = {
 	.port_disable_pri_override = mv88e6xxx_port_disable_pri_override,
 	.port_link_state = mv88e6352_port_link_state,
 	.port_get_cmode = mv88e6352_port_get_cmode,
+	.port_set_cmode = mv88e6341_port_set_cmode,
 	.port_setup_message_port = mv88e6xxx_setup_message_port,
+	.port_setup_extra = mv88e6341_port_setup_extra,
 	.stats_snapshot = mv88e6390_g1_stats_snapshot,
 	.stats_set_histogram = mv88e6095_g1_stats_set_histogram,
 	.stats_get_sset_count = mv88e6320_stats_get_sset_count,
@@ -3624,6 +3637,8 @@ static const struct mv88e6xxx_ops mv88e6341_ops = {
 	.vtu_loadpurge = mv88e6352_g1_vtu_loadpurge,
 	.serdes_power = mv88e6390_serdes_power,
 	.serdes_get_lane = mv88e6341_serdes_get_lane,
+	.serdes_irq_setup = mv88e6390_serdes_irq_setup,
+	.serdes_irq_free = mv88e6390_serdes_irq_free,
 	.gpio_ops = &mv88e6352_gpio_ops,
 	.avb_ops = &mv88e6390_avb_ops,
 	.ptp_ops = &mv88e6352_ptp_ops,
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 15d0c9f00f54..905e4136bccf 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -397,6 +397,9 @@ struct mv88e6xxx_ops {
 	int (*port_disable_pri_override)(struct mv88e6xxx_chip *chip, int port);
 	int (*port_setup_message_port)(struct mv88e6xxx_chip *chip, int port);
 
+	/* Some ports may need extra setup to be used as desired */
+	int (*port_setup_extra)(struct mv88e6xxx_chip *chip, int port);
+
 	/* CMODE control what PHY mode the MAC will use, eg. SGMII, RGMII, etc.
 	 * Some chips allow this to be configured on specific ports.
 	 */
diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 7183c94a92ec..6886accbcb60 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -392,8 +392,35 @@ phy_interface_t mv88e6390x_port_max_speed_mode(int port)
 	return PHY_INTERFACE_MODE_NA;
 }
 
-int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
-			      phy_interface_t mode)
+/* Port 5 on Topaz is a SERDES port as ports 9 and 10 on Peridot family,
+   but on Topaz it's cmode is not writable by default. This hidden register
+   configuration makes it writable. */
+int mv88e6341_port_setup_extra(struct mv88e6xxx_chip *chip, int port)
+{
+	int err, addr;
+	u16 reg, bits;
+
+	if (port != 5)
+		return 0;
+
+	addr = chip->info->port_base_addr + port;
+
+	err = mv88e6xxx_port_hidden_read(chip, 0x7, addr, 0, &reg);
+	if (err)
+		return err;
+
+	bits = MV88E6341_PORT_RESERVED_1A_FORCE_CMODE |
+	       MV88E6341_PORT_RESERVED_1A_SGMII_AN;
+
+	if ((reg & bits) == bits)
+		return 0;
+
+	reg |= bits;
+	return mv88e6xxx_port_hidden_write(chip, 0x7, addr, 0, reg);
+}
+
+static int mv88e6xxx_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
+				    phy_interface_t mode)
 {
 	s8 lane;
 	u16 cmode;
@@ -484,9 +511,41 @@ int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
 	return 0;
 }
 
+int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
+			      phy_interface_t mode)
+{
+	if (port != 9 && port != 10)
+		return -EOPNOTSUPP;
+
+	return mv88e6xxx_port_set_cmode(chip, port, mode);
+}
+
 int mv88e6390_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
 			     phy_interface_t mode)
 {
+	if (port != 9 && port != 10)
+		return -EOPNOTSUPP;
+
+	switch (mode) {
+	case PHY_INTERFACE_MODE_NA:
+		return 0;
+	case PHY_INTERFACE_MODE_XGMII:
+	case PHY_INTERFACE_MODE_XAUI:
+	case PHY_INTERFACE_MODE_RXAUI:
+		return -EINVAL;
+	default:
+		break;
+	}
+
+	return mv88e6xxx_port_set_cmode(chip, port, mode);
+}
+
+int mv88e6341_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
+			     phy_interface_t mode)
+{
+	if (port != 5)
+		return -EOPNOTSUPP;
+
 	switch (mode) {
 	case PHY_INTERFACE_MODE_NA:
 		return 0;
@@ -498,7 +557,7 @@ int mv88e6390_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
 		break;
 	}
 
-	return mv88e6390x_port_set_cmode(chip, port, mode);
+	return mv88e6xxx_port_set_cmode(chip, port, mode);
 }
 
 int mv88e6185_port_get_cmode(struct mv88e6xxx_chip *chip, int port, u8 *cmode)
diff --git a/drivers/net/dsa/mv88e6xxx/port.h b/drivers/net/dsa/mv88e6xxx/port.h
index 6d7a067da0f5..eb19e568ad9b 100644
--- a/drivers/net/dsa/mv88e6xxx/port.h
+++ b/drivers/net/dsa/mv88e6xxx/port.h
@@ -269,6 +269,8 @@
 #define MV88E6XXX_PORT_RESERVED_1A_BLOCK_SHIFT	10
 #define MV88E6XXX_PORT_RESERVED_1A_CTRL_PORT	0x04
 #define MV88E6XXX_PORT_RESERVED_1A_DATA_PORT	0x05
+#define MV88E6341_PORT_RESERVED_1A_FORCE_CMODE	0x8000
+#define MV88E6341_PORT_RESERVED_1A_SGMII_AN	0x2000
 
 int mv88e6xxx_port_read(struct mv88e6xxx_chip *chip, int port, int reg,
 			u16 *val);
@@ -334,6 +336,9 @@ int mv88e6097_port_pause_limit(struct mv88e6xxx_chip *chip, int port, u8 in,
 			       u8 out);
 int mv88e6390_port_pause_limit(struct mv88e6xxx_chip *chip, int port, u8 in,
 			       u8 out);
+int mv88e6341_port_setup_extra(struct mv88e6xxx_chip *chip, int port);
+int mv88e6341_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
+			     phy_interface_t mode);
 int mv88e6390_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
 			     phy_interface_t mode);
 int mv88e6390x_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Marek Behun @ 2019-08-26 18:03 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826175920.21043-1-marek.behun@nic.cz>

What about this?

It adds a new chip operation (I know Vivien said not to, but I was
doing it already) port_setup_extra, and implements it for Topaz.

Also it changes the mv88e6xxx_port_set_cmode so that it does not use
those 2 additional parameters.

Should I rewrite it so that only the second change is applied?

Should I send a new series, v5, today? I think that I once read David
complain that he does not like if new versions of patch series are sent
at the same day.

Marek

^ permalink raw reply

* Re: [PATCH net-next v3 06/10] net: sched: conditionally obtain rtnl lock in cls hw offloads API
From: Jiri Pirko @ 2019-08-26 18:13 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-7-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:45:02PM CEST, vladbu@mellanox.com wrote:
>In order to remove dependency on rtnl lock from offloads code of
>classifiers, take rtnl lock conditionally before executing driver
>callbacks. Only obtain rtnl lock if block is bound to devices that require
>it.
>
>Block bind/unbind code is rtnl-locked and obtains block->cb_lock while
>holding rtnl lock. Obtain locks in same order in tc_setup_cb_*() functions
>to prevent deadlock.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH v1 net-next] net: phy: mdio_bus: make mdiobus_scan also cover PHY that only talks C45
From: Florian Fainelli @ 2019-08-26 18:27 UTC (permalink / raw)
  To: Voon Weifeng, David S. Miller, Maxime Coquelin
  Cc: netdev, linux-kernel, Jose Abreu, Andrew Lunn, Heiner Kallweit,
	Ong Boon Leong
In-Reply-To: <1566870769-9967-1-git-send-email-weifeng.voon@intel.com>

On 8/26/19 6:52 PM, Voon Weifeng wrote:
> From: Ong Boon Leong <boon.leong.ong@intel.com>
> 
> Make mdiobus_scan() to try harder to look for any PHY that only talks C45.
If you are not using Device Tree or ACPI, and you are letting the MDIO
bus be scanned, it sounds like there should be a way for you to provide
a hint as to which addresses should be scanned (that's
mii_bus::phy_mask) and possibly enhance that with a mask of possible C45
devices?

> 
> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index bd04fe762056..30dbc48b4c7e 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -525,8 +525,12 @@ struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr)
>  	int err;
>  
>  	phydev = get_phy_device(bus, addr, false);
> -	if (IS_ERR(phydev))
> -		return phydev;
> +	if (IS_ERR(phydev)) {
> +		/* Try C45 to ensure we don't miss PHY that only talks C45 */
> +		phydev = get_phy_device(bus, addr, true);
> +		if (IS_ERR(phydev))
> +			return phydev;
> +	}
>  
>  	/*
>  	 * For DT, see if the auto-probed phy has a correspoding child
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Vivien Didelot @ 2019-08-26 18:28 UTC (permalink / raw)
  To: Marek Behun; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826200315.0e080172@nic.cz>

On Mon, 26 Aug 2019 20:03:15 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> What about this?
> 
> It adds a new chip operation (I know Vivien said not to, but I was
> doing it already) port_setup_extra, and implements it for Topaz.

So what feedback do you expect exactly? That is *exactly* what I told
you I did not want. What's gonna be added in those "port_setup_extra"
implementations next? And from where should they be called exactly?

Ask yourself what is the single task achieved by this function, and name this
operation accordingly. It seems to change the CMODE to be writable, only
supported by certain switch models right? So in addition to port_get_cmode
and port_set_cmode, you can add port_set_cmode_writable, and call it right
before or after port_set_cmode in mv88e6xxx_port_setup_mac.

Also please address the last comment I made in v3 in the new series.

Thanks,

	Vivien

^ permalink raw reply

* Re: [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Marek Behun @ 2019-08-26 18:36 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826142809.GC9628@t480s.localdomain>

On Mon, 26 Aug 2019 14:28:09 -0400
Vivien Didelot <vivien.didelot@gmail.com> wrote:

> Ask yourself what is the single task achieved by this function, and name this
> operation accordingly. It seems to change the CMODE to be writable, only
> supported by certain switch models right? So in addition to port_get_cmode
> and port_set_cmode, you can add port_set_cmode_writable, and call it right
> before or after port_set_cmode in mv88e6xxx_port_setup_mac.

Andrew's complaint was also about this function being called every time
cmode is to be changed. The cmode does need to be made writable only
once. In this sense it does make sense to put into into
mv88e6xxx_setup_port.

> Also please address the last comment I made in v3 in the new series.

I shall.

^ permalink raw reply

* Re: [PATCH v6 4/4] net: phy: realtek: Add LED configuration support for RTL8211E
From: Matthias Kaehlcke @ 2019-08-26 18:40 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Doug Anderson, Pavel Machek, David S . Miller, Rob Herring,
	Mark Rutland, Andrew Lunn, Heiner Kallweit, netdev, devicetree,
	LKML
In-Reply-To: <f1fd7aba-b36f-8cc4-9ed7-9977c0912b9d@gmail.com>

On Fri, Aug 23, 2019 at 12:58:09PM -0700, Florian Fainelli wrote:
> On 8/16/19 3:39 PM, Doug Anderson wrote:
> > Hi,
> > 
> > On Fri, Aug 16, 2019 at 3:12 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 8/16/19 2:27 PM, Matthias Kaehlcke wrote:
> >>> On Fri, Aug 16, 2019 at 10:13:42PM +0200, Pavel Machek wrote:
> >>>> On Tue 2019-08-13 12:11:47, Matthias Kaehlcke wrote:
> >>>>> Add a .config_led hook which is called by the PHY core when
> >>>>> configuration data for a PHY LED is available. Each LED can be
> >>>>> configured to be solid 'off, solid 'on' for certain (or all)
> >>>>> link speeds or to blink on RX/TX activity.
> >>>>>
> >>>>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> >>>>
> >>>> THis really needs to go through the LED subsystem,
> >>>
> >>> Sorry, I used what get_maintainers.pl threw at me, I should have
> >>> manually cc-ed the LED list.
> >>>
> >>>> and use the same userland interfaces as the rest of the system.
> >>>
> >>> With the PHY maintainers we discussed to define a binding that is
> >>> compatible with that of the LED one, to have the option to integrate
> >>> it with the LED subsystem later. The integration itself is beyond the
> >>> scope of this patchset.
> >>>
> >>> The PHY LED configuration is a low priority for the project I'm
> >>> working on. I wanted to make an attempt to upstream it and spent
> >>> already significantly more time on it than planned, if integration
> >>> with the LED framework now is a requirement please consider this
> >>> series abandonded.
> >>
> >> While I have an appreciation for how hard it can be to work in a
> >> corporate environment while doing upstream first and working with
> >> virtually unbounded goals (in time or scope) due to maintainers and
> >> reviewers, that kind of statement can hinder your ability to establish
> >> trust with peers in the community as it can be read as take it or leave it.
> > 
> > You think so?  I feel like Matthias is simply expressing the reality
> > of the situation here and I'd rather see a statement like this posted
> > than the series just silently dropped.  Communication is good.
> > 
> > In general on Chrome OS we don't spent lots of time tweaking with
> > Ethernet and even less time tweaking with Ethernet on ARM boards where
> > you might need a binding like this, so it's pretty hard to justify up
> > the management chain spending massive amounts of resources on it.  In
> > this case we have two existing ARM boards which we're trying to uprev
> > from 3.14 to 4.19 which were tweaking the Ethernet driver in some
> > downstream code.  We thought it would be nice to try to come up with a
> > solution that could land upstream, which is usually what we try to do
> > in these cases.
> > 
> > Normally if there is some major architecture needed that can't fit in
> > the scope of a project, we would do a downstream solution for the
> > project and then fork off the task (maybe by a different Engineer or a
> > contractor) to get a solution that can land upstream.  ...but in this
> > case it seems hard to justify because it's unlikely we would need it
> > again anytime remotely soon.
> > 
> > So I guess the alternatives to what Matthias did would have been:
> > 
> > A) Don't even try to upstream.  Seems worse.  At least this way
> > there's something a future person can start from and the discussion is
> > rolling.
> > 
> > B) Keep spending tons of time on something even though management
> > doesn't want him to.  Seems worse.
> > 
> > C) Spend his nights and weekends working on this.  Seems worse.
> > 
> > D) Silently stop working on it without saying "I'm going to stop".  Seems worse.
> > 
> > ...unless you have a brilliant "E)" I think what Matthias did here is
> > exactly right.
> 
> I must apologize for making that statement since it was not fair to
> Matthias, and he has been clear about how much time he can spend on that
> specific, please accept my apologies for that.
> 
> Having had many recent encounters with various people not driving
> projects to completion lately (not specifically within Linux), it looks
> like I am overly sensitive about flagging words and patch status that
> may fall within that lexicon. The choice of word is what triggered me.

No worries, I understand that it can be frustrating if you repeatedly
experience that projects remain unfinished.

Hopefully this series can be revived eventually when somebody finds
the time to work on the integration with the LED framework.

^ permalink raw reply

* Re: [PATCH net 1/2] openvswitch: Properly set L4 keys on "later" IP fragments.
From: Justin Pettit @ 2019-08-26 18:42 UTC (permalink / raw)
  To: Pravin Shelar, David Miller
  Cc: Linux Kernel Network Developers, Joe Stringer, Greg Rose
In-Reply-To: <CAOrHB_BKd=QKR_ScO+r7qAyZaniEbFur+iup1iXbtiycaFawjw@mail.gmail.com>

> On Aug 25, 2019, at 1:40 PM, Pravin Shelar <pshelar@ovn.org> wrote:
> 
> Actually I am not sure about this change. caller of this function
> (ovs_ct_execute()) does skb-pull and push of L2 header, calling
> ovs_flow_key_update() is not safe here, it expect skb data to point to
> L2 header.

Thanks for the feedback, Pravin and David.  Greg Rose has a revised version that will address the issues you raised and also make it so that we don't bother re-extracting the L2 headers.  He'll hopefully get that out today once we've done some internal testing on it.

--Justin

^ permalink raw reply

* Re: [PATCH v1 net-next] net: stmmac: Add support for MDIO interrupts
From: Andrew Lunn @ 2019-08-26 18:47 UTC (permalink / raw)
  To: Voon Weifeng
  Cc: David S. Miller, Maxime Coquelin, netdev, linux-kernel,
	Jose Abreu, Giuseppe Cavallaro, Alexandre Torgue, Ong Boon Leong
In-Reply-To: <1566870320-9825-1-git-send-email-weifeng.voon@intel.com>

On Tue, Aug 27, 2019 at 09:45:20AM +0800, Voon Weifeng wrote:
> From: "Chuah, Kim Tatt" <kim.tatt.chuah@intel.com>
> 
> DW EQoS v5.xx controllers added capability for interrupt generation
> when MDIO interface is done (GMII Busy bit is cleared).
> This patch adds support for this interrupt on supported HW to avoid
> polling on GMII Busy bit.
> 
> stmmac_mdio_read() & stmmac_mdio_write() will sleep until wake_up() is
> called by the interrupt handler.

Hi Voon

I _think_ there are some order of operation issues here. The mdiobus
is registered in the probe function. As soon as of_mdiobus_register()
is called, the MDIO bus must work. At that point MDIO read/writes can
start to happen.

As far as i can see, the interrupt handler is only requested in
stmmac_open(). So it seems like any MDIO operations after probe, but
before open are going to fail?

Thanks
       Andrew

^ permalink raw reply

* Re: [PATCH RFC] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Vivien Didelot @ 2019-08-26 18:53 UTC (permalink / raw)
  To: Marek Behun; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826203614.6f9f6a8d@nic.cz>

On Mon, 26 Aug 2019 20:36:14 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> > Ask yourself what is the single task achieved by this function, and name this
> > operation accordingly. It seems to change the CMODE to be writable, only
> > supported by certain switch models right? So in addition to port_get_cmode
> > and port_set_cmode, you can add port_set_cmode_writable, and call it right
> > before or after port_set_cmode in mv88e6xxx_port_setup_mac.
> 
> Andrew's complaint was also about this function being called every time
> cmode is to be changed. The cmode does need to be made writable only
> once. In this sense it does make sense to put into into
> mv88e6xxx_setup_port.

mv88e6xxx_port_setup_mac is called by mv88e6xxx_setup_port as expected and also
.phylink_mac_config. I don't think they are called that often and both deal
with configuration, so I'd prefer to keep this consistent and group the two
operations together in mv88e6xxx_port_setup_mac, if that's good for Andrew too.


Thanks,

	Vivien

^ permalink raw reply

* Re: [PATCH v1 net-next] net: phy: mdio_bus: make mdiobus_scan also cover PHY that only talks C45
From: Andrew Lunn @ 2019-08-26 18:54 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Voon Weifeng, David S. Miller, Maxime Coquelin, netdev,
	linux-kernel, Jose Abreu, Heiner Kallweit, Ong Boon Leong
In-Reply-To: <e9ece5ad-a669-6d6b-d050-c633cad15476@gmail.com>

On Mon, Aug 26, 2019 at 11:27:53AM -0700, Florian Fainelli wrote:
> On 8/26/19 6:52 PM, Voon Weifeng wrote:
> > From: Ong Boon Leong <boon.leong.ong@intel.com>
> > 
> > Make mdiobus_scan() to try harder to look for any PHY that only talks C45.
> If you are not using Device Tree or ACPI, and you are letting the MDIO
> bus be scanned, it sounds like there should be a way for you to provide
> a hint as to which addresses should be scanned (that's
> mii_bus::phy_mask) and possibly enhance that with a mask of possible C45
> devices?

Yes, i don't like this unconditional c45 scanning. A lot of MDIO bus
drivers don't look for the MII_ADDR_C45. They are going to do a C22
transfer, and maybe not mask out the MII_ADDR_C45 from reg, causing an
invalid register write. Bad things can then happen.

With DT and ACPI, we have an explicit indication that C45 should be
used, so we know on this platform C45 is safe to use. We need
something similar when not using DT or ACPI.

	  Andrew

^ permalink raw reply

* [PATCH] net/hamradio/6pack: Fix the size of a sk_buff used in 'sp_bump()'
From: Christophe JAILLET @ 2019-08-26 19:02 UTC (permalink / raw)
  To: ajk, davem
  Cc: linux-hams, netdev, linux-kernel, kernel-janitors,
	Christophe JAILLET

We 'allocate' 'count' bytes here. In fact, 'dev_alloc_skb' already add some
extra space for padding, so a bit more is allocated.

However, we use 1 byte for the KISS command, then copy 'count' bytes, so
count+1 bytes.

Explicitly allocate and use 1 more byte to be safe.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
This patch should be safe, be however may no be the correct way to fix the
"buffer overflow". Maybe, the allocated size is correct and we should have:
   memcpy(ptr, sp->cooked_buf + 1, count - 1);
or
   memcpy(ptr, sp->cooked_buf + 1, count - 1sp->rcount);

I've not dig deep enough to understand the link betwwen 'rcount' and
how 'cooked_buf' is used.
---
 drivers/net/hamradio/6pack.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 331c16d30d5d..23281aeeb222 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -344,10 +344,10 @@ static void sp_bump(struct sixpack *sp, char cmd)
 
 	sp->dev->stats.rx_bytes += count;
 
-	if ((skb = dev_alloc_skb(count)) == NULL)
+	if ((skb = dev_alloc_skb(count + 1)) == NULL)
 		goto out_mem;
 
-	ptr = skb_put(skb, count);
+	ptr = skb_put(skb, count + 1);
 	*ptr++ = cmd;	/* KISS command */
 
 	memcpy(ptr, sp->cooked_buf + 1, count);
-- 
2.20.1


^ permalink raw reply related

* [PATCH 04/10] mm/oom_debug: Add ARP and ND Table Summary usage
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron, David S. Miller, netdev
In-Reply-To: <20190826193638.6638-1-echron@arista.com>

Adds config options and code to support printing ARP Table usage and or
Neighbour Discovery Table usage when an OOM event occurs. This summarized
information provides the memory usage for each table when configured.

Configuring these two OOM Debug Options
---------------------------------------
Two OOM debug options: CONFIG_DEBUG_OOM_ARP_TBL, CONFIG_DEBUG_OOM_ND_TBL
To get the output for both tables they both must be configured.
The ARP Table uses the CONFIG_DEBUG_OOM_ARP_TBL kernel config option
and the ND Table uses the CONFIG_DEBUG_OOM_ND_TBL kernel config option
both of which are found in the kernel config under the entries:
Kernel hacking, Memory Debugging, OOM Debugging entry. The ARP Table and
ND Table are configured there with the options: DEBUG_OOM_ARP_TBL and
DEBUG_OOM_ND_TBL respectively.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
The oom debugfs base directory is found at: /sys/kernel/debug/oom.
The oom debugfs for this option are: arp_table_summary_ and
nd_table_summary_ and there is just one enable file for each.

Either option may be disabled or re-enabled using the debugfs entry for
the OOM debug option. The debugfs file to enable the ARP Table option
is found at: /sys/kernel/debug/oom/arp_table_summary_enabled
Similarly, the debugfs file to enable the ND Table option is found at:
/sys/kernel/debug/oom/nd_table_summary_enabled
For either option their enabled file's value determines whether the
facility is enabled or disabled for that option. A value of 1 is enabled
(default) and a value of 0 is disabled. When configured the default
setting is set to enabled. Each option will produce 1 line of output.

Content and format of ARP and Neighbour Discovery Tables Summary Output
-----------------------------------------------------------------------
  One line of output each for ARP and ND that includes:
  - Table name
  - Table size (max # entries)
  - Key Length
  - Entry Size
  - Number of Entries
  - Last Flush (in seconds)
  - hash grows
  - entry allocations
  - entry destroys
  - Number lookups
  - Number of lookup hits
  - Resolution failures
  - Garbage Collection Forced Runs
  - Table Full
  - Proxy Queue Length

Sample Output:
-------------
Here is sample output for both the ARP table and ND table:

Jul 23 23:26:34 yuorsystem kernel: neighbour: Table: arp_tbl size:   256
 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s
 hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199
 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0

Jul 23 23:26:34 yuorsystem kernel: neighbour: Table:  nd_tbl size:   128
 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s
 hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0
 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0


Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
 include/net/neighbour.h | 12 +++++++
 mm/Kconfig.debug        | 26 ++++++++++++++
 mm/oom_kill_debug.c     | 38 ++++++++++++++++++++
 net/core/neighbour.c    | 78 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 154 insertions(+)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 50a67bd6a434..35fdecff2724 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -569,4 +569,16 @@ static inline void neigh_update_is_router(struct neighbour *neigh, u32 flags,
 		*notify = 1;
 	}
 }
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+/**
+ * Routine used to print arp table and neighbour table statistics.
+ * Output goes to dmesg along with all the other OOM related messages
+ * when the config options DEBUG_OOM_ARP_TBL and DEBUG_ND_TBL are set to
+ * yes, for the ARP table and Neighbour discovery table respectively.
+ */
+extern void neightbl_print_stats(const char * const tblname,
+				 struct neigh_table * const neightable);
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
+
 #endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fcbc5f9aa146..fe4bb5ce0a6d 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -163,3 +163,29 @@ config DEBUG_OOM_TASKS_SUMMARY
 	  A value of 1 is enabled (default) and a value of 0 is disabled.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_ARP_TBL
+	bool "Debug OOM ARP Table"
+	depends on DEBUG_OOM
+	help
+	  When enabled, documents kernel memory usage by the ARP Table
+	  entries at the time of an OOM event. Output is one line of
+	  summarzied ARP Table usage. If configured it is enabled/disabled
+	  by setting the enabled file entry in the debugfs OOM interface
+	  at: /sys/kernel/debug/oom/arp_table_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
+
+config DEBUG_OOM_ND_TBL
+	bool "Debug OOM ND Table"
+	depends on DEBUG_OOM
+	help
+	  When enabled, documents kernel memory usage by the ND Table
+	  entries at the time of an OOM event. Output is one line of
+	  summarzied ND Table usage. If configured it is enabled/disabled
+	  by setting the enabled file entry in the debugfs OOM interface
+	  at: /sys/kernel/debug/oom/nd_table_summary_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index 395b3307f822..c4a9117633fd 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -156,6 +156,16 @@
 #include <linux/sched/stat.h>
 #endif
 
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+#include <net/arp.h>
+#endif
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/ndisc.h>
+#endif
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+#include <net/neighbour.h>
+#endif
+
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
 
@@ -192,6 +202,18 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "tasks_summary_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+	{
+		.option_name	= "arp_table_summary_",
+		.support_tpercent = false,
+	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+	{
+		.option_name	= "nd_table_summary_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -203,6 +225,12 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 	TASKS_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ARP_TBL
+	ARP_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ND_TBL
+	ND_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -351,6 +379,16 @@ u32 oom_kill_debug_oom_event_is(void)
 		oom_kill_debug_system_summary_prt();
 #endif
 
+#if defined(CONFIG_INET) && defined(CONFIG_DEBUG_OOM_ARP_TBL)
+	if (oom_kill_debug_enabled(ARP_STATE))
+		neightbl_print_stats("arp_tbl", &arp_tbl);
+#endif
+
+#if defined(CONFIG_IPV6) && defined(CONFIG_DEBUG_OOM_ND_TBL)
+	if (oom_kill_debug_enabled(ND_STATE))
+		neightbl_print_stats("nd_tbl", &nd_tbl);
+#endif
+
 #ifdef CONFIG_DEBUG_OOM_TASKS_SUMMARY
 	if (oom_kill_debug_enabled(TASKS_STATE))
 		oom_kill_debug_tasks_summary_print();
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f79e61c570ea..9f5a579542a9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3735,3 +3735,81 @@ static int __init neigh_init(void)
 }
 
 subsys_initcall(neigh_init);
+
+#if defined(CONFIG_DEBUG_OOM_ARP_TBL) || defined(CONFIG_DEBUG_OOM_ND_TBL)
+void neightbl_print_stats(const char * const tblname,
+			  struct neigh_table * const tbl)
+{
+	struct neigh_hash_table *nht;
+	struct ndt_stats ndst;
+	u32 now;
+	u32 flush_delta;
+	u32 tblsize;
+	u16 key_len;
+	u16 entry_size;
+	u32 entries;
+	u32 last_flush;    /* delta to now in msecs */
+	u32 hash_shift;
+	u32 proxy_qlen;
+	int cpu;
+
+	read_lock_bh(&tbl->lock);
+	now = jiffies;
+	flush_delta = now - tbl->last_flush;
+
+	key_len = tbl->key_len;
+	if (tbl->entry_size)
+		entry_size = tbl->entry_size;
+	else
+		entry_size = ALIGN(offsetof(struct neighbour, primary_key) +
+				   key_len, NEIGH_PRIV_ALIGN);
+
+	entries = atomic_read(&tbl->entries);
+	if (entries == 0)
+		goto out_tbl_unlock;
+
+	/* last flush was last_flush seconds ago */
+	last_flush = jiffies_to_msecs(flush_delta) / 1000;
+	proxy_qlen = tbl->proxy_queue.qlen;
+
+	rcu_read_lock_bh();
+	nht = rcu_dereference_bh(tbl->nht);
+	if (nht)
+		hash_shift = nht->hash_shift + 1;
+	rcu_read_unlock_bh();
+	if (!nht)
+		goto out_tbl_unlock;
+
+	memset(&ndst, 0, sizeof(ndst));
+	for_each_possible_cpu(cpu) {
+		struct neigh_statistics *st;
+
+		st = per_cpu_ptr(tbl->stats, cpu);
+		ndst.ndts_allocs		+= st->allocs;
+		ndst.ndts_destroys		+= st->destroys;
+		ndst.ndts_hash_grows		+= st->hash_grows;
+		ndst.ndts_res_failed		+= st->res_failed;
+		ndst.ndts_lookups		+= st->lookups;
+		ndst.ndts_hits			+= st->hits;
+		ndst.ndts_periodic_gc_runs	+= st->periodic_gc_runs;
+		ndst.ndts_forced_gc_runs	+= st->forced_gc_runs;
+		ndst.ndts_table_fulls		+= st->table_fulls;
+	}
+
+	read_unlock_bh(&tbl->lock);
+	tblsize = (1 << hash_shift) * sizeof(struct neighbour *);
+	if (tblsize > PAGE_SIZE)
+		tblsize = get_order(tblsize);
+
+	pr_info("Table:%7s size:%5u keyLen:%2hu entrySize:%3hu entries:%5u lastFlush:%5us hGrows:%5llu allocs:%5llu destroys:%5llu lookups:%5llu hits:%5llu resFailed:%5llu gcRuns/Forced:%3llu / %2llu tblFull:%2llu proxyQlen:%2u\n",
+		tblname, tblsize, key_len, entry_size, entries, last_flush,
+		ndst.ndts_hash_grows, ndst.ndts_allocs, ndst.ndts_destroys,
+		ndst.ndts_lookups, ndst.ndts_hits, ndst.ndts_res_failed,
+		ndst.ndts_periodic_gc_runs, ndst.ndts_forced_gc_runs,
+		ndst.ndts_table_fulls, proxy_qlen);
+	return;
+
+out_tbl_unlock:
+	read_unlock_bh(&tbl->lock);
+}
+#endif /* CONFIG_DEBUG_OOM_ARP_TBL || CONFIG_DEBUG_OOM_ND_TBL */
-- 
2.20.1


^ permalink raw reply related

* [PATCH 10/10] mm/oom_debug: Add Enhanced Process Print Information
From: Edward Chron @ 2019-08-26 19:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, David Rientjes,
	Tetsuo Handa, Shakeel Butt, linux-mm, linux-kernel, colona,
	Edward Chron, David S. Miller, netdev
In-Reply-To: <20190826193638.6638-1-echron@arista.com>

Add OOM Debug code that prints additional detailed information about
users processes that were considered for OOM killing for any print
selected processes. The information is displayed for each user process
that OOM prints in the output.

This supplemental per user process information is very helpful for
determing how process memory is used to allow OOM event root cause
identifcation that might not otherwise be possible.

Configuring Enhanced Process Print Information
----------------------------------------------
The DEBUG_OOM_ENHANCED_PROCESS_PRINT is the config entry defined for
this OOM Debug option.  This option is dependent on the OOM Debug
option DEBUG_OOM_SELECT_PROCESS which adds code to allow processes
that are considered for OOM kill to be selectively printed, only
printing processes that use a specified minimum amount of memory.

The kernel configuration entry for this option can be found in the
config file at: Kernel hacking, Memory Debugging, Debug OOM,
Debug OOM Process Selection, Debug OOM Enhanced Process Print.
Both Debug OOM Process Selection and Debug OOM Enhanced Process Print
entries must be selected.

Dynamic disable or re-enable this OOM Debug option
--------------------------------------------------
This option may be disabled or re-enabled using the debugfs entry for
this OOM debug option. The debugfs file to enable this entry is found at:
/sys/kernel/debug/oom/process_enhanced_print_enabled where the enabled
file's value determines whether the facility is enabled or disabled.
A value of 1 is enabled (default) and a value of 0 is disabled.

Content and format of process record and Task state headers
-----------------------------------------------------------
Each OOM process entry printed include memory information about the
process. Memory usage is specified in KiB for memory values instead of
pages. Each entry includes the following fields:
pid, ppid, ruid, euid, tgid, State (S), the oom_score_adjust (Adjust),
task comm value (name), and also memory values (all in KB): VmemKiB,
MaxRssKiB, CurRssKiB, PteKiB, SwapKiB, socket pages (SockKiB), LibKiB,
TextPgKiB, HeapPgKiB, StackKiB, FileKiB and shared memory (ShmemKiB).
Counts of page reads (ReadPgs) and page faults (FaultPgs) are included.

Sample Output
-------------
OOM Process select print headers and line of process enhanced output:

Aug  6 09:37:21 egc103 kernel: Tasks state (memory values in KiB):
Aug  6 09:37:21 egc103 kernel: [  pid  ]    ppid    ruid    euid
    tgid S  utimeSec  stimeSec   VmemKiB MaxRssKiB CurRssKiB
    PteKiB   SwapKiB   SockKiB     LibKiB   TextKiB   HeapKiB
  StackKiB   FileKiB  ShmemKiB   ReadPgs  FaultPgs   LockKiB
 PinnedKiB Adjust name

Aug  6 09:37:21 egc103 kernel: [   7707]    7553   10383   10383
    7707 S     0.132     0.350   1056804   1054040   1052796
      2092         0         0       1944       684   1052860
       136         4         0         0         0         0
         0   1000 oomprocs


Signed-off-by: Edward Chron <echron@arista.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
 mm/Kconfig.debug    |  23 +++++
 mm/oom_kill.c       |  23 ++++-
 mm/oom_kill_debug.c | 236 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill_debug.h |   5 +
 4 files changed, 285 insertions(+), 2 deletions(-)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 4414e46f72c6..2bc843727968 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -320,3 +320,26 @@ config DEBUG_OOM_PROCESS_SELECT_PRINT
 	  print limit value of 10 or 1% of memory.
 
 	  If unsure, say N.
+
+config DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	bool "Debug OOM Enhanced Process Print"
+	depends on DEBUG_OOM_PROCESS_SELECT_PRINT
+	help
+	  Each OOM process entry printed include memory information about
+	  the process. Memory usage is specified in KiB (KB) for memory
+	  values, not pages. Each entry includes the following fields:
+	  pid, ppid, ruid, euid, tgid, State (S), utime in seconds,
+	  stime in seconds, the number of read pages (ReadPgs), number of
+	  page faults (FaultPgs), the number of lock pages (LockPgs), the
+	  oom_score_adjust value (Adjust), memory percentage used (MemPct),
+	  oom_score (Score), task comm value (name), and also memory values
+	  (all in KB): VmemKiB, MaxRssKiB, CurRssKiB, PteKiB, SwapKiB,
+	  socket pages (SockKiB), LibKiB, TextPgKiB, HeapPgKiB, StackKiB,
+	  FileKiB and shared memory pages (ShmemKiB).
+
+	  If the option is configured it is enabled/disabled by setting
+	  the value of the file entry in the debugfs OOM interface at:
+	  /sys/kernel/debug/oom/process_enhanced_print_enabled
+	  A value of 1 is enabled (default) and a value of 0 is disabled.
+
+	  If unsure, say N.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index cbea289c6345..cf37caea9c5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,6 +417,13 @@ static int dump_task(struct task_struct *p, void *arg)
 	}
 #endif
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	if (oom_kill_debug_enhanced_process_print_enabled()) {
+		dump_task_prt(task, rsspgs, swappgs, pgtbl);
+		task_unlock(task);
+		return 1;
+	}
+#endif
 	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
 		task->tgid, task->mm->total_vm, rsspgs, pgtbl, swappgs,
@@ -426,6 +433,19 @@ static int dump_task(struct task_struct *p, void *arg)
 	return 1;
 }
 
+static void dump_tasks_headers(void)
+{
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	if (oom_kill_debug_enhanced_process_print_enabled()) {
+		pr_info("Tasks state (memory values in KiB):\n");
+		pr_info("[  pid  ]    ppid    ruid    euid    tgid S  utimeSec  stimeSec   VmemKiB MaxRssKiB CurRssKiB    PteKiB   SwapKiB   SockKiB     LibKiB   TextKiB   HeapKiB  StackKiB   FileKiB  ShmemKiB     ReadPgs    FaultPgs   LockKiB PinnedKiB Adjust name\n");
+		return;
+	}
+#endif
+	pr_info("Tasks state (memory values in pages):\n");
+	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
 /**
@@ -443,8 +463,7 @@ static void dump_tasks(struct oom_control *oc)
 	u32 total = 0;
 	u32 prted = 0;
 
-	pr_info("Tasks state (memory values in pages):\n");
-	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+	dump_tasks_headers();
 
 #ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
 	oc->minpgs = oom_kill_debug_min_task_pages(oc->totalpages);
diff --git a/mm/oom_kill_debug.c b/mm/oom_kill_debug.c
index ad937b3d59f3..467f7add4397 100644
--- a/mm/oom_kill_debug.c
+++ b/mm/oom_kill_debug.c
@@ -171,6 +171,12 @@
 #ifdef CONFIG_DEBUG_OOM_VMALLOC_SELECT_PRINT
 #include <linux/vmalloc.h>
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+#include <linux/fdtable.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <linux/sched/cputime.h>
+#endif
 
 #define OOMD_MAX_FNAME 48
 #define OOMD_MAX_OPTNAME 32
@@ -250,6 +256,12 @@ static struct oom_debug_option oom_debug_options_table[] = {
 		.option_name	= "slab_enhanced_print_",
 		.support_tpercent = false,
 	},
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	{
+		.option_name	= "process_enhanced_print_",
+		.support_tpercent = false,
+	},
 #endif
 	{}
 };
@@ -282,6 +294,9 @@ enum oom_debug_options_index {
 #endif
 #ifdef CONFIG_DEBUG_OOM_ENHANCED_SLAB_PRINT
 	ENHANCED_SLAB_STATE,
+#endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+	ENHANCED_PROCESS_STATE,
 #endif
 	OUT_OF_BOUNDS
 };
@@ -365,6 +380,12 @@ bool oom_kill_debug_enhanced_slab_print_information_enabled(void)
 	return oom_kill_debug_enabled(ENHANCED_SLAB_STATE);
 }
 #endif
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+bool oom_kill_debug_enhanced_process_print_enabled(void)
+{
+	return oom_kill_debug_enabled(ENHANCED_PROCESS_STATE);
+}
+#endif
 
 #ifdef CONFIG_DEBUG_OOM_SYSTEM_STATE
 /*
@@ -513,6 +534,221 @@ u32 oom_kill_debug_oom_event_is(void)
 	return oom_kill_debug_oom_events;
 }
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+/*
+ *  Account for socket(s) buffer memory in use by a task.
+ *  A task may have one or more sockets consuming socket buffer space.
+ *  Account for how much socket space each task has in use.
+ */
+static unsigned long account_for_socket_buffers(struct task_struct *task,
+						char *incomplete)
+{
+	unsigned long sockpgs = 0;
+	struct files_struct *files = task->files;
+	struct fdtable *fdt;
+	struct file **fds;
+	int openfilecount;
+	struct inode *inode;
+	struct socket *sock;
+	struct sock *sk;
+	unsigned long bytes;
+	int fdtsize;
+	int i;
+
+	/* Just to make sure the fds don't get closed */
+	atomic_inc(&files->count);
+	/* Make a best effort, but no reason to get hung up here */
+	if (!spin_trylock(&files->file_lock)) {
+		*incomplete = '*';
+		atomic_dec(&files->count);
+		return 0;
+	}
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	fdtsize = fdt->max_fds;
+	/* Determine how many words we need to check for open files */
+	for (i = fdtsize / BITS_PER_LONG; i > 0; ) {
+		if (fdt->open_fds[--i])
+			break;
+	}
+	openfilecount = (i + 1) * BITS_PER_LONG;  // Check each fd in the word
+	fds = fdt->fd;
+	for (i = openfilecount; i != 0; i--) {
+		struct file *fp = *fds++;
+
+		if (fp) {
+			/* Any continue case doesn't need to be counted */
+			if (fp->f_path.dentry == NULL)
+				continue;
+			inode = fp->f_path.dentry->d_inode;
+			if (inode == NULL || !S_ISSOCK(inode->i_mode))
+				continue;
+			sock = fp->private_data;
+			if (sock == NULL)
+				continue;
+			sk = sock->sk;
+			if (sk == NULL)
+				continue;
+			bytes = roundup(sk->sk_rcvbuf, PAGE_SIZE);
+			sockpgs = bytes / PAGE_SIZE;
+			bytes = roundup(sk->sk_sndbuf, PAGE_SIZE);
+			sockpgs += bytes / PAGE_SIZE;
+		}
+	}
+	rcu_read_unlock();
+
+	spin_unlock(&files->file_lock);
+	/* We're done looking at the fds */
+	atomic_dec(&files->count);
+
+	return sockpgs;
+}
+
+static u64 power10(u32 index)
+{
+	static u64 pwr10[11] = {1, 10, 100, 1000, 10000, 100000, 1000000,
+				10000000, 100000000, 1000000000,
+				10000000000};
+
+	return pwr10[index];
+}
+
+static u32 num_digits(u64 num)
+{
+	u32 i;
+
+	for (i = 1; i < 11; ++i) {
+		if (power10(i) > num)
+			return i;
+	}
+	return i;
+}
+
+static void digits_and_fraction(u64 num, u32 *p_digits, u32 *p_frac, u32 chars)
+{
+	*p_digits = num_digits(num);
+	// Allow for decimal place for fractional output
+	if (chars - 1 > *p_digits)
+		*p_frac = chars - 1 - *p_digits;
+	else
+		*p_frac = 0;
+}
+
+#define MAX_NUM_FIELD_SIZE	10
+/*
+ * Format timespec into seconds and possibly fraction, must fit in 9 bytes.
+ * Linux kernel doesn't support floating point so format as best we can.
+ * With 9 digits in seconds convers 31.7 years and where we can we provide
+ * fractions of a second up to miliseconds.
+ */
+static void timespec_format(u64 nsecs_time, char *p_time, size_t time_size)
+{
+	struct timespec64 tspec = ns_to_timespec64(nsecs_time);
+	u32 digits, fracs, bytes, min;
+	u64 fraction;
+
+	digits_and_fraction(tspec.tv_sec, &digits, &fracs, time_size);
+
+	bytes = sprintf(p_time, "%llu", tspec.tv_sec);
+
+	if (fracs > 0) {
+		u32 frsize = num_digits(tspec.tv_nsec);
+
+		p_time += bytes;
+		if (frsize >= 3) {
+			if (fracs >= 3)
+				min = frsize - 3;
+			else if (fracs >= 2)
+				min = frsize - 2;
+			else
+				min = frsize - 1;
+		} else if (frsize >= 2) {
+			if (fracs >= 2)
+				min = frsize - 2;
+			else
+				min = frsize - 1;
+		} else {
+			min = frsize - 1;
+		}
+		fraction = tspec.tv_nsec / power10(min);
+		sprintf(p_time, ".%llu", fraction);
+	}
+}
+
+/*
+ * Format utime, stime in seconds and possibly fractions, must fit in 9 bytes.
+ */
+static void time_format(struct task_struct *task, char *p_utime, char *p_stime)
+{
+	size_t num_size = MAX_NUM_FIELD_SIZE;
+	u64 utime, stime;
+
+	task_cputime_adjusted(task, &utime, &stime);
+	memset(p_utime, 0, num_size);
+	timespec_format(utime, p_utime, num_size - 1);
+	memset(p_stime, 0, num_size);
+	timespec_format(stime, p_stime, num_size - 1);
+}
+
+/* task_index_to_char kernel function is missing options so use this */
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+static const char task_to_char[] = TASK_STATE_TO_CHAR_STR;
+static const char get_task_state(struct task_struct *p_task, ulong state)
+{
+	int bit = state ? __ffs(state) + 1 : 0;
+
+	if (p_task->tgid == 0)
+		return 'I';
+	return bit < sizeof(task_to_char) - 1 ? task_to_char[bit] : '?';
+}
+
+/*
+ * Code that prints the information about the specified task.
+ * Assumes task lock is held at entry.
+ */
+void dump_task_prt(struct task_struct *task,
+		   unsigned long rsspg, unsigned long swappg,
+		   unsigned long pgtbl)
+{
+	char c_utime[MAX_NUM_FIELD_SIZE], c_stime[MAX_NUM_FIELD_SIZE];
+	unsigned long vmkb, sockkb, text, maxrsspg, pgtblpg;
+	unsigned long libkb, textkb, pgtblkb;
+	struct mm_struct *mm;
+	char incomp = ' ';
+	kuid_t ruid, euid;
+	char tstate;
+
+	mm = task->mm;
+	maxrsspg = rsspg;
+	pgtblpg = pgtbl >> PAGE_SHIFT;
+	ruid = __task_cred(task)->uid;
+	euid = __task_cred(task)->euid;
+	vmkb = K(mm->total_vm);
+	if (maxrsspg < mm->hiwater_rss)
+		maxrsspg = mm->hiwater_rss;
+	sockkb = K(account_for_socket_buffers(task, &incomp));
+	text = (PAGE_ALIGN(mm->end_code) -
+		 (mm->start_code & PAGE_MASK));
+	text = min(text, mm->exec_vm << PAGE_SHIFT);
+	textkb = text >> 10;
+	libkb = ((mm->exec_vm << PAGE_SHIFT) - text) >> 10;
+	pgtblkb = pgtbl >> 10;
+	tstate = get_task_state(task, task->state);
+	time_format(task, c_utime, c_stime);
+
+	pr_info("[%7d] %7d %7d %7d %7d %c %9s %9s %9lu %9lu %9lu %9lu %9ld %9lu%c %9lu %9lu %9lu %9lu %9lu %9lu %11lu %11lu %9lu %9llu  %5hd %s\n",
+		task->pid, task_ppid_nr(task), ruid.val, euid.val, task->tgid,
+		tstate, c_utime, c_stime, vmkb, K(maxrsspg), K(rsspg), pgtblkb,
+		K(swappg), sockkb, incomp, libkb, textkb, K(mm->data_vm),
+		K(mm->stack_vm), K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)), task->signal->cmaj_flt,
+		task->signal->cmin_flt,
+		K(mm->locked_vm), K((u64)atomic64_read(&mm->pinned_vm)),
+		task->signal->oom_score_adj, task->comm);
+}
+#endif /* CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT */
+
 static void __init oom_debug_init(void)
 {
 	/* Ensure we have a debugfs oom root directory */
diff --git a/mm/oom_kill_debug.h b/mm/oom_kill_debug.h
index a39bc275980e..faebb4c6097c 100644
--- a/mm/oom_kill_debug.h
+++ b/mm/oom_kill_debug.h
@@ -9,6 +9,11 @@
 #ifndef __MM_OOM_KILL_DEBUG_H__
 #define __MM_OOM_KILL_DEBUG_H__
 
+#ifdef CONFIG_DEBUG_OOM_ENHANCED_PROCESS_PRINT
+extern bool oom_kill_debug_enhanced_process_print_enabled(void);
+extern void dump_task_prt(struct task_struct *task, unsigned long rsspg,
+			  unsigned long swappg, unsigned long pgtbl);
+#endif
 #ifdef CONFIG_DEBUG_OOM_PROCESS_SELECT_PRINT
 extern unsigned long oom_kill_debug_min_task_pages(unsigned long totalpages);
 #endif
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v1 net-next 4/4] net: stmmac: setup higher frequency clk support for EHL & TGL
From: Florian Fainelli @ 2019-08-26 19:55 UTC (permalink / raw)
  To: Voon Weifeng, David S. Miller, Maxime Coquelin
  Cc: netdev, linux-kernel, Jose Abreu, Giuseppe Cavallaro, Andrew Lunn,
	Alexandre Torgue, Ong Boon Leong
In-Reply-To: <1566869891-29239-5-git-send-email-weifeng.voon@intel.com>

On 8/26/19 6:38 PM, Voon Weifeng wrote:
> EHL DW EQOS is running on a 200MHz clock. Setting up stmmac-clk,
> ptp clock and ptp_max_adj to 200MHz.
> 
> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 21 +++++++++++++++++++++
>  drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c |  3 +++
>  include/linux/stmmac.h                           |  1 +
>  3 files changed, 25 insertions(+)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> index e969dc9bb9f0..20906287b6d4 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> @@ -9,6 +9,7 @@
>    Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
>  *******************************************************************************/
>  
> +#include <linux/clk-provider.h>
>  #include <linux/pci.h>
>  #include <linux/dmi.h>
>  
> @@ -174,6 +175,19 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
>  	plat->axi->axi_blen[1] = 8;
>  	plat->axi->axi_blen[2] = 16;
>  
> +	plat->ptp_max_adj = plat->clk_ptp_rate;
> +
> +	/* Set system clock */
> +	plat->stmmac_clk = clk_register_fixed_rate(&pdev->dev,
> +						   "stmmac-clk", NULL, 0,
> +						   plat->clk_ptp_rate);
> +
> +	if (IS_ERR(plat->stmmac_clk)) {
> +		dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
> +		plat->stmmac_clk = NULL;

Don't you need to propagate at least EPROBE_DEFER here?
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Jason Baron @ 2019-08-26 19:56 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Eric Dumazet,
	Vladimir Rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>



On 8/26/19 12:19 PM, Eric Dumazet wrote:
> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
> 
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
> 
> This patch makes sure we remove this empty skb so that
> Jason code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
> 

Makes sense, thanks. I think this check for empty queue could also
benefit from and update to use tcp_write_queue_empty(). I will send a
follow-up for that.

Thanks,

-Jason



> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 77b485d60b9d0e00edc4e2f0d6c5bb3a9460b23b..61082065b26a068975c411b74eb46739ab0632ca 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -935,6 +935,22 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>  	return mss_now;
>  }
>  
> +/* In some cases, both sendpage() and sendmsg() could have added
> + * an skb to the write queue, but failed adding payload on it.
> + * We need to remove it to consume less memory, but more
> + * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
> + * users.
> + */
> +static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (skb && !skb->len) {
> +		tcp_unlink_write_queue(skb, sk);
> +		if (tcp_write_queue_empty(sk))
> +			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> +		sk_wmem_free_skb(sk, skb);
> +	}
> +}
> +
>  ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>  			 size_t size, int flags)
>  {
> @@ -1064,6 +1080,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>  	return copied;
>  
>  do_error:
> +	tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
>  	if (copied)
>  		goto out;
>  out_err:
> @@ -1388,18 +1405,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	sock_zerocopy_put(uarg);
>  	return copied + copied_syn;
>  
> +do_error:
> +	skb = tcp_write_queue_tail(sk);
>  do_fault:
> -	if (!skb->len) {
> -		tcp_unlink_write_queue(skb, sk);
> -		/* It is the one place in all of TCP, except connection
> -		 * reset, where we can be unlinking the send_head.
> -		 */
> -		if (tcp_write_queue_empty(sk))
> -			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
> -		sk_wmem_free_skb(sk, skb);
> -	}
> +	tcp_remove_empty_skb(sk, skb);
>  
> -do_error:
>  	if (copied + copied_syn)
>  		goto out;
>  out_err:
> 

^ permalink raw reply

* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: Eric Dumazet @ 2019-08-26 20:04 UTC (permalink / raw)
  To: Jason Baron, Eric Dumazet, David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Vladimir Rutsky
In-Reply-To: <58ae43f8-21e7-f08f-2632-ce567661d301@akamai.com>



On 8/26/19 9:56 PM, Jason Baron wrote:
> 
> 
> On 8/26/19 12:19 PM, Eric Dumazet wrote:
>> Vladimir Rutsky reported stuck TCP sessions after memory pressure
>> events. Edge Trigger epoll() user would never receive an EPOLLOUT
>> notification allowing them to retry a sendmsg().
>>
>> Jason tested the case of sk_stream_alloc_skb() returning NULL,
>> but there are other paths that could lead both sendmsg() and sendpage()
>> to return -1 (EAGAIN), with an empty skb queued on the write queue.
>>
>> This patch makes sure we remove this empty skb so that
>> Jason code can detect that the queue is empty, and
>> call sk->sk_write_space(sk) accordingly.
>>
> 
> Makes sense, thanks. I think this check for empty queue could also
> benefit from and update to use tcp_write_queue_empty(). I will send a
> follow-up for that.
>

My plan (for net-next) is to submit something more like this one instead.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 051ef10374f6..908dbe91e04b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1068,7 +1068,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
                goto out;
 out_err:
        /* make sure we wake any epoll edge trigger waiter */
-       if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+       if (unlikely(tcp_rtx_and_write_queues_empty(sk) &&
                     err == -EAGAIN)) {
                sk->sk_write_space(sk);
                tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
@@ -1407,7 +1407,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
        sock_zerocopy_put_abort(uarg, true);
        err = sk_stream_error(sk, flags, err);
        /* make sure we wake any epoll edge trigger waiter */
-       if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+       if (unlikely(tcp_rtx_and_write_queues_empty(sk) &&
                     err == -EAGAIN)) {
                sk->sk_write_space(sk);
                tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);


^ permalink raw reply related

* Re: [PATCH v1 net-next 4/4] net: stmmac: setup higher frequency clk support for EHL & TGL
From: Andrew Lunn @ 2019-08-26 20:13 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Voon Weifeng, David S. Miller, Maxime Coquelin, netdev,
	linux-kernel, Jose Abreu, Giuseppe Cavallaro, Alexandre Torgue,
	Ong Boon Leong
In-Reply-To: <7d43e0c6-6f51-0d71-0af8-89f22b0234f9@gmail.com>

On Mon, Aug 26, 2019 at 12:55:31PM -0700, Florian Fainelli wrote:
> On 8/26/19 6:38 PM, Voon Weifeng wrote:
> > EHL DW EQOS is running on a 200MHz clock. Setting up stmmac-clk,
> > ptp clock and ptp_max_adj to 200MHz.
> > 
> > Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
> > Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 21 +++++++++++++++++++++
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c |  3 +++
> >  include/linux/stmmac.h                           |  1 +
> >  3 files changed, 25 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > index e969dc9bb9f0..20906287b6d4 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
> > @@ -9,6 +9,7 @@
> >    Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> >  *******************************************************************************/
> >  
> > +#include <linux/clk-provider.h>
> >  #include <linux/pci.h>
> >  #include <linux/dmi.h>
> >  
> > @@ -174,6 +175,19 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
> >  	plat->axi->axi_blen[1] = 8;
> >  	plat->axi->axi_blen[2] = 16;
> >  
> > +	plat->ptp_max_adj = plat->clk_ptp_rate;
> > +
> > +	/* Set system clock */
> > +	plat->stmmac_clk = clk_register_fixed_rate(&pdev->dev,
> > +						   "stmmac-clk", NULL, 0,
> > +						   plat->clk_ptp_rate);
> > +
> > +	if (IS_ERR(plat->stmmac_clk)) {
> > +		dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
> > +		plat->stmmac_clk = NULL;
> 
> Don't you need to propagate at least EPROBE_DEFER here?

Hi Florian

Isn't a fixed rate clock a complete fake. There is no hardware behind
it. So can it return EPROBE_DEFER?

    Andrew

^ permalink raw reply

* Re: [net-next 4/8] net/mlx5e: Add device out of buffer counter
From: Saeed Mahameed @ 2019-08-26 20:14 UTC (permalink / raw)
  To: jakub.kicinski@netronome.com
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Moshe Shemesh
In-Reply-To: <20190823111601.012fabf4@cakuba.netronome.com>

On Fri, 2019-08-23 at 11:16 -0700, Jakub Kicinski wrote:
> On Fri, 23 Aug 2019 06:00:45 +0000, Saeed Mahameed wrote:
> > On Thu, 2019-08-22 at 18:33 -0700, Jakub Kicinski wrote:
> > > On Thu, 22 Aug 2019 23:35:52 +0000, Saeed Mahameed wrote:  
> > > > From: Moshe Shemesh <moshe@mellanox.com>
> > > > 
> > > > Added the following packets drop counter:
> > > > Device out of buffer - counts packets which were dropped due to
> > > > full
> > > > device internal receive queue.
> > > > This counter will be shown on ethtool as a new counter called
> > > > dev_out_of_buffer.
> > > > The counter is read from FW by command QUERY_VNIC_ENV.
> > > > 
> > > > Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
> > > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>  
> > > 
> > > Sounds like rx_fifo_errors, no? Doesn't rx_fifo_errors count RX
> > > overruns?  
> > 
> > No, that is port buffer you are looking for and we got that fully
> > covered in mlx5. this is different.
> > 
> > This new counter is deep into the HW data path pipeline and it
> > covers
> > very rare and complex scenarios that got only recently introduced
> > with
> > swichdev mode and "some" lately added tunnels offloads that are
> > routed
> > between VFs/PFs.
> > 
> > Normally the HW is lossless once the packet passes port buffers
> > into
> > the data plane pipeline, let's call that "fast lane", BUT for sriov
> > configurations with switchdev mode enabled and some special hand
> > crafted tc tunnel offloads that requires hairpin between VFs/PFs,
> > the
> > hw might decide to send some traffic to a "service lane" which is
> > still
> > fast path but unlike the "fast lane" it handles traffic through "HW
> > internal" receive and send queues (just like we do with hairpin)
> > that
> > might drop packets. the whole thing is transparent to driver and it
> > is
> > HW implementation specific.
> 
> I see thanks for the explanation and sorry for the delayed response.
> Would it perhaps make sense to indicate the hairpin in the name?

We had some internal discussion and we couldn't come up with the
perfect name :)

hairpin is just an implementation detail, we don't want to exclusively
bind this counter to hairpin only flows, the problem is not with
hairpin, the actual problem is due to the use of internal RQs, for now
it only happens with "hairpin like" flows, but tomorrow it can happen
with a different scenario but same root cause (the use of internal
RQs), we want to have one counter to count internal drops due to
internal use of internal RQs.

so how about:
dev_internal_rq_oob: Device Internal RQ out of buffer
dev_internal_out_of_res: Device Internal out of resources (more generic
? too generic ?)

Any suggestion that you provide will be more than welcome.

> dev_out_of_buffer is quite a generic name, and there seems to be no
> doc, nor does the commit message explains it as well as you have..

Regarding documentation:
All mlx5 ethool counters are documented here
https://community.mellanox.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

once we decide on the name, will add the new counter to the doc.




^ permalink raw reply

* Re: [PATCH] net/mlx5: fix a -Wstringop-truncation warning
From: Saeed Mahameed @ 2019-08-26 20:18 UTC (permalink / raw)
  To: cai@lca.pw, davem@davemloft.net
  Cc: Eran Ben Elisha, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Feras Daoud,
	leon@kernel.org, Moshe Shemesh
In-Reply-To: <20190823.151809.442664848668672070.davem@davemloft.net>

On Fri, 2019-08-23 at 15:18 -0700, David Miller wrote:
> Saeed, I assume I'll get this from you.

Yes, i will handle it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox