Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next v4 1/2] dt-bindings: net: snps,dwmac: update reg minItems maxItems
From: Neil Armstrong @ 2019-08-20  7:57 UTC (permalink / raw)
  To: davem, robh+dt
  Cc: Neil Armstrong, martin.blumenstingl, devicetree, netdev,
	linux-amlogic, linux-arm-kernel, linux-kernel, Rob Herring,
	Maxime Ripard
In-Reply-To: <20190820075742.14857-1-narmstrong@baylibre.com>

The Amlogic Meson DWMAC glue bindings needs a second reg cells for the
glue registers, thus update the reg minItems/maxItems to allow more
than a single reg cell.

Also update the allwinner,sun7i-a20-gmac.yaml derivative schema to specify
maxItems to 1.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
---
 .../devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml      | 3 +++
 Documentation/devicetree/bindings/net/snps,dwmac.yaml          | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml
index 06b1cc8bea14..ef446ae166f3 100644
--- a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml
+++ b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml
@@ -17,6 +17,9 @@ properties:
   compatible:
     const: allwinner,sun7i-a20-gmac
 
+  reg:
+    maxItems: 1
+
   interrupts:
     maxItems: 1
 
diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
index 76fea2be66ac..4377f511a51d 100644
--- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml
+++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
@@ -61,7 +61,8 @@ properties:
         - snps,dwxgmac-2.10
 
   reg:
-    maxItems: 1
+    minItems: 1
+    maxItems: 2
 
   interrupts:
     minItems: 1
-- 
2.22.0


^ permalink raw reply related

* [PATCH net-next v4 0/2] dt-bindings: net: meson-dwmac: convert to yaml
From: Neil Armstrong @ 2019-08-20  7:57 UTC (permalink / raw)
  To: davem, robh+dt
  Cc: Neil Armstrong, martin.blumenstingl, devicetree, netdev,
	linux-amlogic, linux-arm-kernel, linux-kernel

This patchsets converts the Amlogic Meson DWMAC glue bindings over to
YAML schemas using the already converted dwmac bindings.

The first patch is needed because the Amlogic glue needs a supplementary
reg cell to access the DWMAC glue registers.

Changes since v3:
- Specified net-next target tree

Changes since v2:
- Added review tags
- Updated allwinner,sun7i-a20-gmac.yaml reg maxItems

Neil Armstrong (2):
  dt-bindings: net: snps,dwmac: update reg minItems maxItems
  dt-bindings: net: meson-dwmac: convert to yaml

 .../net/allwinner,sun7i-a20-gmac.yaml         |   3 +
 .../bindings/net/amlogic,meson-dwmac.yaml     | 113 ++++++++++++++++++
 .../devicetree/bindings/net/meson-dwmac.txt   |  71 -----------
 .../devicetree/bindings/net/snps,dwmac.yaml   |   8 +-
 4 files changed, 123 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml
 delete mode 100644 Documentation/devicetree/bindings/net/meson-dwmac.txt

-- 
2.22.0


^ permalink raw reply

* Re: [PATCH net-next 0/6] Dynamic toggling of vlan_filtering for SJA1105 DSA
From: Vivien Didelot @ 2019-08-20  8:04 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: f.fainelli, andrew, idosch, roopa, nikolay, davem, netdev,
	Vladimir Oltean
In-Reply-To: <20190820000002.9776-1-olteanv@gmail.com>

On Tue, 20 Aug 2019 02:59:56 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> This patchset addresses a few limitations in DSA and the bridge core
> that made it impossible for this sequence of commands to work:
> 
>   ip link add name br0 type bridge
>   ip link set dev swp2 master br0
>   echo 1 > /sys/class/net/br0/bridge/vlan_filtering
> 
> Only this sequence was previously working:
> 
>   ip link add name br0 type bridge vlan_filtering 1
>   ip link set dev swp2 master br0

This is not quite true, these sequences of commands do "work". What I see
though is that with the first sequence, the PVID 1 won't be programmed in
the hardware. But the second sequence does program the hardware.

But following bridge members will be correctly programmed with the VLAN
though. The sequence below programs the hardware with VLAN 1 for swp3 as
well as CPU and DSA ports, but not for swp2:

    ip link add name br0 type bridge
    ip link set dev swp2 master br0
    echo 1 > /sys/class/net/br0/bridge/vlan_filtering
    ip link set dev swp3 master br0

This is unfortunately also true for any 802.1Q VLANs. For example, only VID
43 is programmed with the following sequence, but not VID 1 and VID 42:

    ip link add name br0 type bridge
    ip link set dev swp2 master br0
    bridge vlan add dev swp2 vid 42
    echo 1 > /sys/class/net/br0/bridge/vlan_filtering
    bridge vlan add dev swp2 vid 43

So I understand that because VLANs are not propagated by DSA to the hardware
when VLAN filtering is disabled, a port may not be programmed with its
bridge's default PVID, and this is causing a problem for tag_8021q.

Please reword so that we understand better what is the issue being fixed here.

> 
> On SJA1105, the situation is further complicated by the fact that
> toggling vlan_filtering is causing a switch reset. However, the hardware
> state restoration logic is already there in the driver. It is a matter
> of the layers above which need a few fixups.
> 
> Also see this discussion thread:
> https://www.spinics.net/lists/netdev/msg581042.html
> 
> Patch 1/6 is not functionally related but also related to dsa_8021q
> handling of VLANs and this is a good opportunity to bring up the subject
> for discussion.

So please send 1/6 as a separate patch and bring up the discussion there.

Thanks,

	Vivien

^ permalink raw reply

* Re: [PATCH v5 15/17] mfd: ioc3: Add driver for SGI IOC3 chip
From: Thomas Bogendoerfer @ 2019-08-20  8:17 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: Jonathan Corbet, Ralf Baechle, Paul Burton, James Hogan,
	Dmitry Torokhov, Lee Jones, David S. Miller, Srinivas Kandagatla,
	Alessandro Zummo, Greg Kroah-Hartman, Jiri Slaby,
	Evgeniy Polyakov, linux-doc, linux-kernel, linux-mips,
	linux-input, netdev, linux-rtc, linux-serial
In-Reply-To: <20190820062308.GK3545@piout.net>

On Tue, 20 Aug 2019 08:23:08 +0200
Alexandre Belloni <alexandre.belloni@bootlin.com> wrote:
> On 19/08/2019 18:31:38+0200, Thomas Bogendoerfer wrote:
> > diff --git a/drivers/mfd/ioc3.c b/drivers/mfd/ioc3.c
> > new file mode 100644
> > index 000000000000..5bcb3461a189
> > --- /dev/null
> > +++ b/drivers/mfd/ioc3.c
> > @@ -0,0 +1,586 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * SGI IOC3 multifunction device driver
> > + *
> > + * Copyright (C) 2018, 2019 Thomas Bogendoerfer <tbogendoerfer@suse.de>
> > + *
> > + * Based on work by:
> > + *   Stanislaw Skowronek <skylark@unaligned.org>
> > + *   Joshua Kinard <kumba@gentoo.org>
> > + *   Brent Casavant <bcasavan@sgi.com> - IOC4 master driver
> > + *   Pat Gefre <pfg@sgi.com> - IOC3 serial port IRQ demuxer
> > + */
> > +
> > +#include <linux/delay.h>
> > +#include <linux/errno.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/mfd/core.h>
> > +#include <linux/module.h>
> > +#include <linux/pci.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/platform_data/sgi-w1.h>
> > +#include <linux/rtc/ds1685.h>
> I don't think this include is necessary.

you are right. I'll move it to the patch where IP30 systemboard gets added.

Thanks,
Thomas.

-- 
SUSE Linux GmbH
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH v2 11/11] vsock_test: wait for the remote to close the connection
From: Stefan Hajnoczi @ 2019-08-20  8:28 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, kvm, Dexuan Cui, virtualization, David S. Miller,
	Jorgen Hansen, linux-kernel
In-Reply-To: <20190801152541.245833-12-sgarzare@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1678 bytes --]

On Thu, Aug 01, 2019 at 05:25:41PM +0200, Stefano Garzarella wrote:
> +/* Wait for the remote to close the connection */
> +void vsock_wait_remote_close(int fd)
> +{
> +	struct epoll_event ev;
> +	int epollfd, nfds;
> +
> +	epollfd = epoll_create1(0);
> +	if (epollfd == -1) {
> +		perror("epoll_create1");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ev.events = EPOLLRDHUP | EPOLLHUP;
> +	ev.data.fd = fd;
> +	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
> +		perror("epoll_ctl");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	nfds = epoll_wait(epollfd, &ev, 1, TIMEOUT * 1000);
> +	if (nfds == -1) {
> +		perror("epoll_wait");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (nfds == 0) {
> +		fprintf(stderr, "epoll_wait timed out\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	assert(nfds == 1);
> +	assert(ev.events & (EPOLLRDHUP | EPOLLHUP));
> +	assert(ev.data.fd == fd);
> +
> +	close(epollfd);
> +}

Please use timeout_begin()/timeout_end() so that the test cannot hang.

> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> index 64adf45501ca..a664675bec5a 100644
> --- a/tools/testing/vsock/vsock_test.c
> +++ b/tools/testing/vsock/vsock_test.c
> @@ -84,6 +84,11 @@ static void test_stream_client_close_server(const struct test_opts *opts)
>  
>  	control_expectln("CLOSED");
>  
> +	/* Wait for the remote to close the connection, before check
> +	 * -EPIPE error on send.
> +	 */
> +	vsock_wait_remote_close(fd);

Is control_expectln("CLOSED") still necessary now that we're waiting for
the poll event?  The control message was an attempt to wait until the
other side closed the socket.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2 10/11] vsock_test: skip read() in test_stream*close tests on a VMCI host
From: Stefan Hajnoczi @ 2019-08-20  8:32 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, kvm, Dexuan Cui, virtualization, David S. Miller,
	Jorgen Hansen, linux-kernel
In-Reply-To: <20190801152541.245833-11-sgarzare@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 612 bytes --]

On Thu, Aug 01, 2019 at 05:25:40PM +0200, Stefano Garzarella wrote:
> When VMCI transport is used, if the guest closes a connection,
> all data is gone and EOF is returned, so we should skip the read
> of data written by the peer before closing the connection.

All transports should aim for identical semantics.  I think virtio-vsock
should behave the same as VMCI since userspace applications should be
transport-independent.

Let's view this as a vsock bug.  Is it feasible to change the VMCI
behavior so it's more like TCP sockets?  If not, let's change the
virtio-vsock behavior to be compatible with VMCI.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net-next v3 3/4] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	David S. Miller
In-Reply-To: <20190820084833.6019-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
Changes in v3:
 - mv88e6xxx_smi_indirect_write: forward ptp_sts only on the last write
 
 drivers/net/dsa/mv88e6xxx/chip.h |  2 ++
 drivers/net/dsa/mv88e6xxx/ptp.c  | 11 +++++++----
 drivers/net/dsa/mv88e6xxx/smi.c  |  7 ++++++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index a406be2f5652..1bfde0d8a5a3 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -275,6 +275,8 @@ struct mv88e6xxx_chip {
 	struct ptp_clock_info	ptp_clock_info;
 	struct delayed_work	tai_event_work;
 	struct ptp_pin_desc	pin_config[MV88E6XXX_MAX_GPIO];
+	struct ptp_system_timestamp *ptp_sts;
+
 	u16 trig_config;
 	u16 evcap_config;
 	u16 enable_count;
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index 073cbd0bb91b..cf6e52ee9e0a 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 	return 0;
 }
 
-static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
-				 struct timespec64 *ts)
+static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
+				  struct timespec64 *ts,
+				  struct ptp_system_timestamp *sts)
 {
 	struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
 	u64 ns;
 
 	mv88e6xxx_reg_lock(chip);
+	chip->ptp_sts = sts;
 	ns = timecounter_read(&chip->tstamp_tc);
+	chip->ptp_sts = NULL;
 	mv88e6xxx_reg_unlock(chip);
 
 	*ts = ns_to_timespec64(ns);
@@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
 	struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
 	struct timespec64 ts;
 
-	mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
+	mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
 
 	schedule_delayed_work(&chip->overflow_work,
 			      MV88E6XXX_TAI_OVERFLOW_PERIOD);
@@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
 	chip->ptp_clock_info.max_adj    = MV88E6XXX_MAX_ADJ_PPB;
 	chip->ptp_clock_info.adjfine	= mv88e6xxx_ptp_adjfine;
 	chip->ptp_clock_info.adjtime	= mv88e6xxx_ptp_adjtime;
-	chip->ptp_clock_info.gettime64	= mv88e6xxx_ptp_gettime;
+	chip->ptp_clock_info.gettimex64	= mv88e6xxx_ptp_gettimex;
 	chip->ptp_clock_info.settime64	= mv88e6xxx_ptp_settime;
 	chip->ptp_clock_info.enable	= ptp_ops->ptp_enable;
 	chip->ptp_clock_info.verify	= ptp_ops->ptp_verify;
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index 282fe08db050..1b5f78662158 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -45,7 +45,8 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
 {
 	int ret;
 
-	ret = mdiobus_write_nested(chip->bus, dev, reg, data);
+	ret = mdiobus_write_sts_nested(chip->bus, dev, reg, data,
+				       chip->ptp_sts);
 	if (ret < 0)
 		return ret;
 
@@ -130,6 +131,7 @@ static int mv88e6xxx_smi_indirect_read(struct mv88e6xxx_chip *chip,
 static int mv88e6xxx_smi_indirect_write(struct mv88e6xxx_chip *chip,
 					int dev, int reg, u16 data)
 {
+	struct ptp_system_timestamp *ptp_sts = chip->ptp_sts;
 	int err;
 
 	err = mv88e6xxx_smi_direct_wait(chip, chip->sw_addr,
@@ -137,11 +139,14 @@ static int mv88e6xxx_smi_indirect_write(struct mv88e6xxx_chip *chip,
 	if (err)
 		return err;
 
+	chip->ptp_sts = NULL;
 	err = mv88e6xxx_smi_direct_write(chip, chip->sw_addr,
 					 MV88E6XXX_SMI_DATA, data);
 	if (err)
 		return err;
 
+	/* Capture the PTP system timestamps only on *this* write operation */
+	chip->ptp_sts = ptp_sts;
 	err = mv88e6xxx_smi_direct_write(chip, chip->sw_addr,
 					 MV88E6XXX_SMI_CMD,
 					 MV88E6XXX_SMI_CMD_BUSY |
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v3 4/4] net: fec: add support for PTP system timestamping for MDIO devices
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Fugang Duan, Vladimir Oltean, David S. Miller
In-Reply-To: <20190820084833.6019-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.

The ptp_read_system_*ts functions already check the ptp_sts pointer.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c01d3ec3e9af..dd1253683ac0 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1815,10 +1815,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 	reinit_completion(&fep->mdio_done);
 
 	/* start a write op */
+	ptp_read_system_prets(bus->ptp_sts);
 	writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
 		FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
 		FEC_MMFR_TA | FEC_MMFR_DATA(value),
 		fep->hwp + FEC_MII_DATA);
+	ptp_read_system_postts(bus->ptp_sts);
 
 	/* wait for end of transfer */
 	time_left = wait_for_completion_timeout(&fep->mdio_done,
@@ -1956,7 +1958,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	struct device_node *node;
 	int err = -ENXIO;
-	u32 mii_speed, holdtime;
+	u32 mii_speed, mii_period, holdtime;
 
 	/*
 	 * The i.MX28 dual fec interfaces are not equal.
@@ -1993,6 +1995,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	 * document.
 	 */
 	mii_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ipg), 5000000);
+	mii_period = div_u64((u64)mii_speed * 2 * NSEC_PER_SEC, clk_get_rate(fep->clk_ipg));
 	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		mii_speed--;
 	if (mii_speed > 63) {
@@ -2034,6 +2037,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 		pdev->name, fep->dev_id + 1);
 	fep->mii_bus->priv = fep;
 	fep->mii_bus->parent = &pdev->dev;
+	fep->mii_bus->flags = MII_BUS_F_PTP_STS_SUPPORTED;
+	fep->mii_bus->ptp_sts_offset = 32 * mii_period;
 
 	node = of_get_child_by_name(pdev->dev.of_node, "mdio");
 	err = of_mdiobus_register(fep->mii_bus, node);
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Florian Fainelli, Heiner Kallweit, Vladimir Oltean,
	David S. Miller
In-Reply-To: <20190820084833.6019-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

The slow MDIO access introduces quite a big offset (~13us) to the PTP
system time synchronisation. With this patch the driver has the possibility
to set the correct offset which can then be compensated.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/phy/mdio_bus.c | 12 ++++++++++++
 include/linux/phy.h        |  8 ++++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 4dba2714495e..50a37cf46f96 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -739,6 +739,18 @@ int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
 	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
 		ptp_read_system_postts(sts);
 
+	/* PTP offset compensation:
+	 * After the MDIO access is completed (from the chip perspective), the
+	 * switch chip will snapshot the PHC timestamp. To make sure our system
+	 * timestamp corresponds to the PHC timestamp, we have to add the
+	 * duration of this MDIO access to sts->post_ts. Linuxptp's phc2sys
+	 * takes the average of pre_ts and post_ts to calculate the final
+	 * system timestamp. With this in mind, we have to add ptp_sts_offset
+	 * twice to post_ts, in order to not introduce an constant time offset.
+	 */
+	if (sts)
+		timespec64_add_ns(&sts->post_ts, 2 * bus->ptp_sts_offset);
+
 	return retval;
 }
 EXPORT_SYMBOL(__mdiobus_write_sts);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 0b33662e0320..615df9c7f2c3 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -283,8 +283,16 @@ struct mii_bus {
 	 * The ptp_read_system_*ts functions already check the ptp_sts pointer.
 	 * The MII_BUS_F_PTP_STS_SUPPORTED-bit must be set in flags, when the
 	 * MDIO bus driver takes the timestamps as described above.
+	 *
+	 * @ptp_sts_offset: This is the compensation offset for the system
+	 * timestamp which is introduced by the slow MDIO access duration. An
+	 * MDIO access consists of 32 clock cycles. Usually the MDIO bus runs
+	 * at ~2.5MHz, so we have to compensate ~12800ns offset.
+	 * Set the ptp_sts_offset to the exact duration of one MDIO frame
+	 * (= 32 * clock-period) in nano-seconds.
 	 */
 	struct ptp_system_timestamp *ptp_sts;
+	u32 ptp_sts_offset;
 };
 
 #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v3 1/4] net: mdio: add support for passing a PTP system timestamp to the mii_bus driver
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Florian Fainelli, Heiner Kallweit, Vladimir Oltean,
	David S. Miller
In-Reply-To: <20190820084833.6019-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.

This patch adds the required infrastructure to solve the problem described
above.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
Changes in v2:
 - Removed mdiobus_write_sts as there was no user
 - Removed ptp_sts_supported-boolean and introduced flags instead

 drivers/net/phy/mdio_bus.c | 76 ++++++++++++++++++++++++++++++++++++++
 include/linux/mdio.h       |  5 +++
 include/linux/phy.h        | 34 +++++++++++++++++
 3 files changed, 115 insertions(+)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index bd04fe762056..4dba2714495e 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -34,6 +34,7 @@
 #include <linux/phy.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
+#include <linux/ptp_clock_kernel.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/mdio.h>
@@ -697,6 +698,81 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
 }
 EXPORT_SYMBOL(mdiobus_write);
 
+/**
+ * __mdiobus_write_sts - Unlocked version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * Write a MDIO bus register and request the MDIO bus driver to take the
+ * system timestamps when sts-pointer is valid. When the bus driver doesn't
+ * support this, the timestamps are taken in this function instead.
+ *
+ * In order to improve the synchronisation precision of phc2sys (from
+ * the linuxptp project) for devices like switches which are attached
+ * to the MDIO bus, it is necessary the get the system timestamps as
+ * close as possible to the access which causes the PTP timestamp
+ * register to be snapshotted in the switch hardware. Usually this is
+ * triggered by an MDIO write access, the snapshotted timestamp is then
+ * transferred by several MDIO reads.
+ *
+ * Caller must hold the mdio bus lock.
+ *
+ * NOTE: MUST NOT be called from interrupt context.
+ */
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			struct ptp_system_timestamp *sts)
+{
+	int retval;
+
+	WARN_ON_ONCE(!mutex_is_locked(&bus->mdio_lock));
+
+	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
+		ptp_read_system_prets(sts);
+
+	bus->ptp_sts = sts;
+	retval = __mdiobus_write(bus, addr, regnum, val);
+	bus->ptp_sts = NULL;
+
+	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
+		ptp_read_system_postts(sts);
+
+	return retval;
+}
+EXPORT_SYMBOL(__mdiobus_write_sts);
+
+/**
+ * mdiobus_write_sts_nested - Nested version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * In case of nested MDIO bus access avoid lockdep false positives by
+ * using mutex_lock_nested().
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			     struct ptp_system_timestamp *sts)
+{
+	int retval;
+
+	BUG_ON(in_interrupt());
+
+	mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
+	retval = __mdiobus_write_sts(bus, addr, regnum, val, sts);
+	mutex_unlock(&bus->mdio_lock);
+
+	return retval;
+}
+EXPORT_SYMBOL(mdiobus_write_sts_nested);
+
 /**
  * mdio_bus_match - determine if given MDIO driver supports the given
  *		    MDIO device
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index e8242ad88c81..482388341c7b 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -9,6 +9,7 @@
 #include <uapi/linux/mdio.h>
 #include <linux/mod_devicetable.h>
 
+struct ptp_system_timestamp;
 struct gpio_desc;
 struct mii_bus;
 
@@ -305,11 +306,15 @@ static inline void mii_10gbt_stat_mod_linkmode_lpa_t(unsigned long *advertising,
 
 int __mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
 int __mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			struct ptp_system_timestamp *sts);
 
 int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
 int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum);
 int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
 int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			     struct ptp_system_timestamp *sts);
 
 int mdiobus_register_device(struct mdio_device *mdiodev);
 int mdiobus_unregister_device(struct mdio_device *mdiodev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index d26779f1fb6b..0b33662e0320 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -205,6 +205,13 @@ struct device;
 struct phylink;
 struct sk_buff;
 
+/* MII-bus flags:
+ * @MII_BUS_F_PTP_STS_SUPPORTED: The driver supports PTP system timestamping
+ */
+enum mii_bus_flags {
+	MII_BUS_F_PTP_STS_SUPPORTED = BIT(0)
+};
+
 /*
  * The Bus class for PHYs.  Devices which provide access to
  * PHYs should register using this structure
@@ -252,7 +259,34 @@ struct mii_bus {
 	int reset_delay_us;
 	/* RESET GPIO descriptor pointer */
 	struct gpio_desc *reset_gpiod;
+
+	/* Feature flags */
+	u32 flags;
+
+	/* PTP system timestamping support
+	 *
+	 * In order to improve the synchronisation precision of phc2sys (from
+	 * the linuxptp project) for devices like switches which are attached
+	 * to the MDIO bus, it is necessary the get the system timestamps as
+	 * close as possible to the access which causes the PTP timestamp
+	 * register to be snapshotted in the switch hardware. Usually this is
+	 * triggered by an MDIO write access, the snapshotted timestamp is then
+	 * transferred by several MDIO reads.
+	 *
+	 * The switch driver can use mdio_write_sts*() to pass through the
+	 * system timestamp pointer @ptp_sts to the MDIO bus driver. The bus
+	 * driver simply has to do the following calls in its write handler:
+	 *	ptp_read_system_prets(bus->ptp_sts);
+	 *	writel(value, mdio-register)
+	 *	ptp_read_system_postts(bus->ptp_sts);
+	 *
+	 * The ptp_read_system_*ts functions already check the ptp_sts pointer.
+	 * The MII_BUS_F_PTP_STS_SUPPORTED-bit must be set in flags, when the
+	 * MDIO bus driver takes the timestamps as described above.
+	 */
+	struct ptp_system_timestamp *ptp_sts;
 };
+
 #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
 
 struct mii_bus *mdiobus_alloc_size(size_t);
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v3 0/4] Improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Hubert Feurstein @ 2019-08-20  8:48 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Vivien Didelot, Florian Fainelli, Heiner Kallweit,
	Vladimir Oltean, Fugang Duan, David S. Miller

From: Hubert Feurstein <h.feurstein@gmail.com>

Changelog:
 v3: mv88e6xxx_smi_indirect_write: forward ptp_sts only on the last write
     Copied Miroslav Lichvar because of PTP offset compensation patch
 v2: Added patch for PTP offset compensation
     Removed mdiobus_write_sts as there was no user
     Removed ptp_sts_supported-boolean and introduced flags instead

With this patchset the phc2sys synchronisation precision improved to +/-555ns on
an IMX6DL with an MV88E6220 switch attached.

This patchset takes into account the comments from the following discussions:
- https://lkml.org/lkml/2019/8/2/1364
- https://lkml.org/lkml/2019/8/5/169

Patch 01 adds the required infrastructure in the MDIO layer.
Patch 02 adds additional PTP offset compensation.
Patch 03 adds support for the PTP_SYS_OFFSET_EXTENDED ioctl in the mv88e6xxx driver.
Patch 04 adds support for the PTP system timestamping in the imx-fec driver.

The following tests show the improvement caused by each patch. The system clock 
precision was set to 15ns instead of 333ns (as described in https://lkml.org/lkml/2019/8/2/1364).

Without this patchset applied, the phc2sys synchronisation performance was very 
poor:

  offset: min -27120 max 28840 mean 2.44 stddev 8040.78 count 1236
  delay:  min 282103 max 386385 mean 352568.03 stddev 27814.27 count 1236
  (test runtime 20 minutes)

Results after appling patch 01-03:

  offset: min -12316 max 13314 mean -9.38 stddev 4274.82 count 1022
  delay:  min 69977 max 96266 mean 87939.04 stddev 6466.17 count 1022
  (test runtime 16 minutes)

Results after appling patch 04:

  offset: min -788 max 528 mean -0.06 stddev 185.02 count 7171
  delay:  min 1773 max 2031 mean 1909.43 stddev 33.74 count 7171
  (test runtime 119 minutes)

Hubert Feurstein (4):
  net: mdio: add support for passing a PTP system timestamp to the
    mii_bus driver
  net: mdio: add PTP offset compensation to mdiobus_write_sts
  net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
  net: fec: add support for PTP system timestamping for MDIO devices

 drivers/net/dsa/mv88e6xxx/chip.h          |  2 +
 drivers/net/dsa/mv88e6xxx/ptp.c           | 11 +--
 drivers/net/dsa/mv88e6xxx/smi.c           |  7 +-
 drivers/net/ethernet/freescale/fec_main.c |  7 +-
 drivers/net/phy/mdio_bus.c                | 88 +++++++++++++++++++++++
 include/linux/mdio.h                      |  5 ++
 include/linux/phy.h                       | 42 +++++++++++
 7 files changed, 156 insertions(+), 6 deletions(-)

-- 
2.22.1


^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-20  8:58 UTC (permalink / raw)
  To: Parav Pandit, Alex Williamson, Jiri Pirko, David S . Miller,
	Kirti Wankhede, Cornelia Huck
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia@nvidia.com, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866148ABA3C4E48E73E95FCD1AD0@AM0PR05MB4866.eurprd05.prod.outlook.com>

+ Dave.

Hi Jiri, Dave, Alex, Kirti, Cornelia,

Please provide your feedback on it, how shall we proceed?

Short summary of requirements.
For a given mdev (mediated device [1]), there is one representor netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
And there is one netdevice for the actual mdev when mdev is probed.

(a) representor netdev and devlink port should be able derive phys_port_name().
So that representor netdev name can be built deterministically across reboots.

(b) for mdev's netdevice, mdev's device should have an attribute.
This attribute can be used by udev rules/systemd or something else to rename netdev name deterministically.

(c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ in drivers, uapi, netlink, boot config area and more.
Changing IFNAMSIZ for a mdev bus doesn't really look reasonable option to me.

Hence, I would like to discuss below options.

Option-1: mdev index
Introduce an optional mdev index/handle as u32 during mdev create time.
User passes mdev index/handle as input.

phys_port_name=mIndex=m%u
mdev_index will be available in sysfs as mdev attribute for udev to name the mdev's netdev.

example mdev create command:
UUID=$(uuidgen)
echo $UUID index=10 > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
example netdevs:
repnetdev=ens2f0_m10	/*ens2f0 is parent PF's netdevice */
mdev_netdev=enm10

Pros:
1. mdevctl and any other existing tools are unaffected.
2. netdev stack, ovs and other switching platforms are unaffected.
3. achieves unique phys_port_name for representor netdev
4. achieves unique mdev eth netdev name for the mdev using udev/systemd extension.
5. Aligns well with mdev and netdev subsystem and similar to existing sriov bdf's.

Option-2: shorter mdev name
Extend mdev to have shorter mdev device name in addition to UUID.
such as 'foo', 'bar'.
Mdev will continue to have UUID.
phys_port_name=mdev_name

Pros:
1. All same as option-1, except mdevctl needs upgrade for newer usage.
It is common practice to upgrade iproute2 package along with the kernel.
Similar practice to be done with mdevctl.
2. Newer users of mdevctl who wants to work with non_UUID names, will use newer mdevctl/tools.
Cons:
1. Dual naming scheme of mdev might affect some of the existing tools.
It's unclear how/if it actually affects.
mdevctl [2] is very recently developed and can be enhanced for dual naming scheme.

Option-3: mdev uuid alias
Instead of shorter mdev name or mdev index, have alpha-numeric name alias.
Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
example mdev create command:
UUID=$(uuidgen)
echo $UUID alias=foo > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
example netdevs:
examle netdevs:
repnetdev = ens2f0_mfoo
mdev_netdev=enmfoo

Pros:
1. All same as option-1.
2. Doesn't affect existing mdev naming scheme.
Cons:
1. Index scheme of option-1 is better which can number large number of mdevs with fewer characters, simplifying the management tool.

Option-4: extend IFNAMESZ to be 64 bytes Extended IFNAMESZ from 16 to 64 bytes phys_port_name=mdev_UUID_string mdev_netdev_name=enmUUID

Pros:
1. Doesn't require mdev extension
Cons:
1. netdev stack, driver, uapi, user space, boot config wide changes
2. Possible user space extensions who assumed name size being 16 characters
3. Single device type demands namesize change for all netdev types

[1] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
[2] https://github.com/mdevctl/mdevctl

Regards,
Parav Pandit

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Parav Pandit
> Sent: Wednesday, August 14, 2019 9:51 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko <jiri@mellanox.com>;
> netdev@vger.kernel.org
> Subject: RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> 
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, August 14, 2019 8:28 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko
> > <jiri@mellanox.com>; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >
> > On Wed, 14 Aug 2019 13:45:49 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > Sent: Wednesday, August 14, 2019 6:39 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti Wankhede
> > > > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > > > kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko
> > > > <jiri@mellanox.com>; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > On Wed, 14 Aug 2019 12:27:01 +0000 Parav Pandit
> > > > <parav@mellanox.com> wrote:
> > > >
> > > > > + Jiri, + netdev
> > > > > To get perspective on the ndo->phys_port_name for the
> > > > > representor netdev
> > > > of mdev.
> > > > >
> > > > > Hi Cornelia,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > > > Sent: Wednesday, August 14, 2019 1:32 PM
> > > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti
> > > > > > Wankhede <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > > > > > kernel@vger.kernel.org; cjia@nvidia.com
> > > > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > > > >
> > > > > > On Wed, 14 Aug 2019 05:54:36 +0000 Parav Pandit
> > > > > > <parav@mellanox.com> wrote:
> > > > > >
> > > > > > > > > I get that part. I prefer to remove the UUID itself from
> > > > > > > > > the structure and therefore removing this API makes lot
> > > > > > > > > more
> > sense?
> > > > > > > >
> > > > > > > > Mdev and support tools around mdev are based on UUIDs
> > > > > > > > because it's
> > > > > > defined
> > > > > > > > in the documentation.
> > > > > > > When we introduce newer device naming scheme, it will update
> > > > > > > the
> > > > > > documentation also.
> > > > > > > May be that is the time to move to .rst format too.
> > > > > >
> > > > > > You are aware that there are existing tools that expect a uuid
> > > > > > naming scheme, right?
> > > > > >
> > > > > Yes, Alex mentioned too.
> > > > > The good tool that I am aware of is [1], which is 4 months old.
> > > > > Not sure if it is
> > > > part of any distros yet.
> > > > >
> > > > > README also says, that it is in 'early in development. So we
> > > > > have scope to
> > > > improve it for non UUID names, but lets discuss that more below.
> > > >
> > > > The up-to-date reference for mdevctl is
> > > > https://github.com/mdevctl/mdevctl. There is currently an effort
> > > > to get this packaged in Fedora.
> > > >
> > > Awesome.
> > >
> > > > >
> > > > > > >
> > > > > > > > I don't think it's as simple as saying "voila, UUID
> > > > > > > > dependencies are removed, users are free to use arbitrary
> > > > > > > > strings".  We'd need to create some kind of naming policy,
> > > > > > > > what characters are allows so that we can potentially
> > > > > > > > expand the creation parameters as has been proposed a
> > > > > > > > couple times, how do we deal with collisions and races,
> > > > > > > > and why should we make such a change when a UUID is a
> > > > > > > > perfectly reasonable devices name.  Thanks,
> > > > > > > >
> > > > > > > Sure, we should define a policy on device naming to be more
> relaxed.
> > > > > > > We have enough examples in-kernel.
> > > > > > > Few that I am aware of are netdev (vxlan, macvlan, ipvlan,
> > > > > > > lot more), rdma
> > > > > > etc which has arbitrary device names and ID based device names.
> > > > > > >
> > > > > > > Collisions and race is already taken care today in the mdev core.
> > > > > > > Same
> > > > > > unique device names continue.
> > > > > >
> > > > > > I'm still completely missing a rationale _why_ uuids are
> > > > > > supposedly bad/restricting/etc.
> > > > > There is nothing bad about uuid based naming.
> > > > > Its just too long name to derive phys_port_name of a netdev.
> > > > > In details below.
> > > > >
> > > > > For a given mdev of networking type, we would like to have
> > > > > (a) representor netdevice [2]
> > > > > (b) associated devlink port [3]
> > > > >
> > > > > Currently these representor netdevice exist only for the PCIe SR-IOV
> VFs.
> > > > > It is further getting extended for mdev without SR-IOV.
> > > > >
> > > > > Each of the devlink port is attached to representor netdevice [4].
> > > > >
> > > > > This netdevice phys_port_name should be a unique derived from
> > > > > some
> > > > property of mdev.
> > > > > Udev/systemd uses phys_port_name to derive unique representor
> > > > > netdev
> > > > name.
> > > > > This netdev name is further use by orchestration and switching
> > > > > software in
> > > > user space.
> > > > > One such distro supported switching software is ovs [4], which
> > > > > relies on the
> > > > persistent device name of the representor netdevice.
> > > >
> > > > Ok, let me rephrase this to check that I understand this correctly.
> > > > I'm not sure about some of the terms you use here (even after
> > > > looking at the linked doc/code), but that's probably still ok.
> > > >
> > > > We want to derive an unique (and probably persistent?) netdev name
> > > > so that userspace can refer to a representor netdevice. Makes sense.
> > > > For generating that name, udev uses the phys_port_name (which
> > > > represents the devlink port, IIUC). Also makes sense.
> > > >
> > > You understood it correctly.
> > >
> > > > >
> > > > > phys_port_name has limitation to be only 15 characters long.
> > > > > UUID doesn't fit in phys_port_name.
> > > >
> > > > Understood. But why do we need to derive the phys_port_name from
> > > > the mdev device name? This netdevice use case seems to be just one
> > > > use case for using mdev devices? If this is a specialized mdev
> > > > type for this setup, why not just expose a shorter identifier via an extra
> attribute?
> > > >
> > > Representor netdev, represents mdev's switch port (like PCI SRIOV
> > > VF's switch
> > port).
> > > So user must be able to relate this two objects in similar manner as
> > > SRIOV
> > VFs.
> > > Phys_port_name is derived from the PCI PF and VF numbering scheme.
> > > Similarly mdev's such port should be derived from mdev's
> id/name/attribute.
> > >
> > > > > Longer UUID names are creating snow ball effect, not just in
> > > > > networking stack
> > > > but many user space tools too.
> > > >
> > > > This snowball effect mainly comes from the device name ->
> > > > phys_port_name setup, IIUC.
> > > >
> > > Right.
> > >
> > > > > (as opposed to recently introduced mdevctl, are they more mdev
> > > > > tools which has dependency on UUID name?)
> > > >
> > > > I am aware that people have written scripts etc. to manage their mdevs.
> > > > Given that the mdev infrastructure has been around for quite some
> > > > time, I'd say the chance of some of those scripts relying on uuid
> > > > names is
> > non-zero.
> > > >
> > > Ok. but those scripts have never managed networking devices.
> > > So those scripts won't break because they will always create mdev
> > > devices
> > using UUID.
> > > When they use these new networking devices, they need more things
> > > than
> > their scripts.
> > > So user space upgrade for such mixed mode case is reasonable.
> >
> > Tools like mdevctl are agnostic of the type of mdev device they're
> > managing, it shouldn't matter than they've never managed a networking
> > mdev previously, it follows the standards of mdev management.
> >
> > > > >
> > > > > Instead of mdev subsystem creating such effect, one option we
> > > > > are
> > > > considering is to have shorter mdev names.
> > > > > (Similar to netdev, rdma, nvme devices).
> > > > > Such as mdev1, mdev2000 etc.
> >
> > Note that these are kernel generated names, as are the other examples.
> No. I probably gave the wrong examples.
> Mdev user provided names can be 'foo', 'bar', 'foo1'.
> 
> > In the case of mdev, the user is providing the UUID, which becomes the
> > device name.  When a user writes to the create attribute, there needs
> > to be determinism that the user can identify the device they created
> > vs another that may have been created concurrently.  I don't see that
> > we can put users in the path of managing device instance numbers.
> No. Its just user provided names.
> 
> >
> > > > > Second option I was considering is to have an optional alias for
> > > > > UUID based
> > > > mdev.
> > > > > This name alias is given at time of mdev creation.
> > > > > Devlink port's phys_port_name is derived out of this shorter
> > > > > mdev name
> > > > alias.
> > > > > This way, mdev remains to be UUID based with optional extension.
> > > > > However, I prefer first option to relax mdev naming scheme.
> > > >
> > > > Actually, I think that second option makes much more sense, as you
> > > > avoid potentially breaking existing tooling.
> > > Let's first understand of what exactly will break with existing tool
> > > if they see non_uuid based device.
> >
> > Do we really want a mixed namespace of device names, some UUID, some...
> > something else?  That seems like a mess.
> >
> So you prefer alias as an attribute? If so, it should be an optional additional
> parameter during create time, because it is desired to not invent new callbacks
> for such attributes setting and (and rewrite them).
> 
> > > Existing tooling continue to work with UUID devices.
> > > Do you have example of what can break if they see non_uuid based
> > > device name? I think you are clear, but to be sure, UUID based
> > > creation will continue to be there. Optionally mdev will be created
> > > with alpha-numeric string, if we don't it as additional attribute.
> >
> > I'm not onboard with a UUID being just one of the possible naming
> > strings via which we can create mdev devices.  I think that becomes
> > untenable for userspace.  I don't think a sufficient argument has been
> > made against the alias approach, which seems to keep the UUID as a
> > canonical name, providing a consistent namespace, augmented with user or
> kernel provided short alias.
> > Thanks,
> >
> If I understand you correctly, you prefer alias name approach to keep UUID
> naming scheme intact in mdev?
> 
> > Alex
> >
> > > > >
> > > > > > We want to uniquely identify a device, across different types
> > > > > > of vendor drivers. An uuid is a unique identifier and even a
> > > > > > well-defined one. Tools (e.g. mdevctl) are relying on it for
> > > > > > mdev devices
> > > > today.
> > > > > >
> > > > > > What is the problem you're trying to solve?
> > > > > Unique device naming is still achieved without UUID scheme by
> > > > > various
> > > > subsystems in kernel using alpha-numeric string.
> > > > > Having such string based continue to provide unique names.
> > > > >
> > > > > I hope I described the problem and two solutions above.
> > > > >
> > > > > [1] https://github.com/awilliam/mdevctl
> > > > > [2]
> > > > > https://elixir.bootlin.com/linux/v5.3-rc4/source/drivers/net/eth
> > > > > er
> > > > > net/
> > > > > mellanox/mlx5/core/en_rep.c [3]
> > > > > http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > > > > [4]
> > > > > https://elixir.bootlin.com/linux/v5.3-rc4/source/net/core/devlink.
> > > > > c#L6
> > > > > 921
> > > > > [5] https://www.openvswitch.org/
> > > > >
> > >


^ permalink raw reply

* Re: [PATCH -next] bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
From: Dan Carpenter @ 2019-08-20  8:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: YueHaibing, magnus.karlsson, jonathan.lemon, ast, daniel, kafai,
	songliubraving, yhs, john.fastabend, netdev, bpf, kernel-janitors
In-Reply-To: <93fafdab-8fb3-0f2b-8f36-0cf297db3cd9@intel.com>

On Tue, Aug 20, 2019 at 09:28:26AM +0200, Björn Töpel wrote:
> For future patches: Prefix AF_XDP socket work with "xsk:" and use "PATCH
> bpf-next" to let the developers know what tree you're aiming for.

There are over 300 trees in linux-next.  It impossible to try remember
everyone's trees.  No one else has this requirement.

Maybe add it as an option to get_maintainer.pl --tree <hash> then that
would be very easy.

regards,
dan carpenter


^ permalink raw reply

* [PATCH net] gve: Copy and paste buy in gve_get_stats()
From: Dan Carpenter @ 2019-08-20  9:00 UTC (permalink / raw)
  To: Catherine Sullivan
  Cc: Sagi Shahar, Jon Olson, David S. Miller, Willem de Bruijn,
	Luigi Rizzo, Chuhong Yuan, netdev, kernel-janitors

There is a copy and paste error so we have "rx" where "tx" was intended
in the priv->tx[] array.

Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 drivers/net/ethernet/google/gve/gve_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 497298752381..aca95f64bde8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -50,7 +50,7 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 				  u64_stats_fetch_begin(&priv->tx[ring].statss);
 				s->tx_packets += priv->tx[ring].pkt_done;
 				s->tx_bytes += priv->tx[ring].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+			} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
 						       start));
 		}
 	}
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH 2/3] macb: Update compatibility string for SiFive FU540-C000
From: Andreas Schwab @ 2019-08-20  9:10 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Nicolas Ferre, David Miller, Yash Shah, Rob Herring, netdev,
	devicetree, linux-kernel@vger.kernel.org List, linux-riscv,
	Mark Rutland, Palmer Dabbelt, Albert Ou, Petr Štetiar,
	Sachin Ghadi
In-Reply-To: <alpine.DEB.2.21.9999.1908131142150.5033@viisi.sifive.com>

On Aug 13 2019, Paul Walmsley <paul.walmsley@sifive.com> wrote:

> Dave, Nicolas,
>
> On Mon, 22 Jul 2019, Yash Shah wrote:
>
>> On Fri, Jul 19, 2019 at 5:36 PM <Nicolas.Ferre@microchip.com> wrote:
>> >
>> > On 19/07/2019 at 13:10, Yash Shah wrote:
>> > > Update the compatibility string for SiFive FU540-C000 as per the new
>> > > string updated in the binding doc.
>> > > Reference: https://lkml.org/lkml/2019/7/17/200
>> >
>> > Maybe referring to lore.kernel.org is better:
>> > https://lore.kernel.org/netdev/CAJ2_jOFEVZQat0Yprg4hem4jRrqkB72FKSeQj4p8P5KA-+rgww@mail.gmail.com/
>> 
>> Sure. Will keep that in mind for future reference.
>> 
>> >
>> > > Signed-off-by: Yash Shah <yash.shah@sifive.com>
>> >
>> > Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
>> 
>> Thanks.
>
> Am assuming you'll pick this up for the -net tree for v5.4-rc1 or earlier.
> If not, please let us know.

This is still missing in v5.4-rc5, which means that networking is broken.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:10 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190819173810.GK2588@breakpoint.cc>

Thanks for your response.


> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Monday, August 19, 2019 11:08 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: Help needed - Kernel lockup while running ipsec
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > Hi
> >
> > With kernel 4.14.122, I am getting a kernel softlockup while running single
> static ipsec tunnel.
> > The problem reproduces mostly after running 8-10 hours of ipsec encap
> test (on my dual core arm board).
> >
> > I found that in function xfrm_policy_lookup_bytype(), the policy in variable
> 'ret' shows refcnt=0 under problem situation.
> > This creates an infinite loop in  xfrm_policy_lookup_bytype() and hence the
> lockup.
> >
> > Can some body please provide me pointers about 'refcnt'?
> > Is it legitimate for 'refcnt' to become '0'? Under what condition can it
> become '0'?
> 
> Yes, when policy is destroyed and the last user calls
> xfrm_pol_put() which will invoke call_rcu to free the structure.

It seems that policy reference count never gets decremented during packet ipsec encap.
It is getting incremented for every frame that hits the policy.
In setkey -DP output, I see refcnt to be wrapping around after '0'.

Is this designed to be like this or is it weird?


^ permalink raw reply

* Re: [PATCH 2/2] dt-bindings: can: flexcan: add can wakeup property
From: Sean Nyekjaer @ 2019-08-20  9:10 UTC (permalink / raw)
  To: linux-can, mkl; +Cc: Rob Herring, netdev, robh+dt, devicetree
In-Reply-To: <20190429173930.GA11283@bogus>



On 29/04/2019 19.39, Rob Herring wrote:
> On Tue,  9 Apr 2019 10:39:49 +0200, Sean Nyekjaer wrote:
>> add wakeup-source boolean property.
>>
>> Signed-off-by: Sean Nyekjaer <sean@geanix.com>
>> ---
>>   Documentation/devicetree/bindings/net/can/fsl-flexcan.txt | 2 ++
>>   1 file changed, 2 insertions(+)
>>
> 
> Reviewed-by: Rob Herring <robh@kernel.org>
>
This doesn't seem to be applied. PATCH 1/2 in this series is applied.

In any case which repo does this patch belong to?

/Sean

^ permalink raw reply

* [PATCH v2 net] gve: Copy and paste bug in gve_get_stats()
From: Dan Carpenter @ 2019-08-20  9:11 UTC (permalink / raw)
  To: Catherine Sullivan
  Cc: Sagi Shahar, Jon Olson, David S. Miller, Willem de Bruijn,
	Luigi Rizzo, Chuhong Yuan, netdev, kernel-janitors
In-Reply-To: <20190820090053.GA24410@mwanda>

There is a copy and paste error so we have "rx" where "tx" was intended
in the priv->tx[] array.

Fixes: f5cedc84a30d ("gve: Add transmit and receive support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
v2: fix a typo in the subject: buy -> bug (Thanks Walter Harms)

 drivers/net/ethernet/google/gve/gve_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 497298752381..aca95f64bde8 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -50,7 +50,7 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
 				  u64_stats_fetch_begin(&priv->tx[ring].statss);
 				s->tx_packets += priv->tx[ring].pkt_done;
 				s->tx_bytes += priv->tx[ring].bytes_done;
-			} while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+			} while (u64_stats_fetch_retry(&priv->tx[ring].statss,
 						       start));
 		}
 	}
-- 
2.20.1

^ permalink raw reply related

* Re: general protection fault in xsk_poll
From: Björn Töpel @ 2019-08-20  9:17 UTC (permalink / raw)
  To: Hillf Danton, syzbot
  Cc: ast, bpf, daniel, davem, hawk, jakub.kicinski, john.fastabend,
	jonathan.lemon, kafai, linux-kernel, magnus.karlsson, netdev,
	songliubraving, syzkaller-bugs, xdp-newbies, yhs
In-Reply-To: <20190820033154.9112-1-hdanton@sina.com>

On 2019-08-20 05:31, Hillf Danton wrote:
> 
> On Mon, 19 Aug 2019 18:18:06 -0700
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    da657043 Add linux-next specific files for 20190819
>> git tree:       linux-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=16af124c600000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=739a9b3ab3d8c770
>> dashboard link: https://syzkaller.appspot.com/bug?extid=c82697e3043781e08802
>> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=109e1922600000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1445bf02600000
>>
>> The bug was bisected to:
>>
>> commit 77cd0d7b3f257fd0e3096b4fdcff1a7d38e99e10
>> Author: Magnus Karlsson <magnus.karlsson@intel.com>
>> Date:   Wed Aug 14 07:27:17 2019 +0000
>>
>>       xsk: add support for need_wakeup flag in AF_XDP rings
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=15e1ea4c600000
>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=17e1ea4c600000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=13e1ea4c600000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
>> Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
>>
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 1 PID: 7959 Comm: syz-executor611 Not tainted 5.3.0-rc5-next-20190819 #68
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:xsk_poll+0x95/0x540 net/xdp/xsk.c:386
>> Code: 80 3c 02 00 0f 85 70 04 00 00 4c 8b a3 88 04 00 00 48 b8 00 00 00 00
>> 00 fc ff df 49 8d bc 24 96 00 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48
>> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 bf 03 00 00
>> RSP: 0018:ffff8880926f7850 EFLAGS: 00010207
>> RAX: dffffc0000000000 RBX: ffff88809a141700 RCX: ffffffff859b07aa
>> RDX: 0000000000000012 RSI: ffffffff859b07c4 RDI: 0000000000000096
>> RBP: ffff8880926f7880 R08: ffff88809698a580 R09: ffffed1013428329
>> R10: ffffed1013428328 R11: ffff88809a141947 R12: 0000000000000000
>> R13: 0000000000000304 R14: ffff888095d4d840 R15: ffff888092bdd020
>> FS:  0000555557529880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000020000280 CR3: 0000000098281000 CR4: 00000000001406e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>    sock_poll+0x15e/0x480 net/socket.c:1256
>>    vfs_poll include/linux/poll.h:90 [inline]
>>    do_pollfd fs/select.c:859 [inline]
>>    do_poll fs/select.c:907 [inline]
>>    do_sys_poll+0x7c2/0xde0 fs/select.c:1001
>>    __do_sys_ppoll fs/select.c:1101 [inline]
>>    __se_sys_ppoll fs/select.c:1081 [inline]
>>    __x64_sys_ppoll+0x259/0x310 fs/select.c:1081
>>    do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> RIP: 0033:0x440159
>> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
>> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
>> ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
>> RSP: 002b:00007ffd9fbd16e8 EFLAGS: 00000246 ORIG_RAX: 000000000000010f
>> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440159
>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000020000280
>> RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
>> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004019e0
>> R13: 0000000000401a70 R14: 0000000000000000 R15: 0000000000000000
>> Modules linked in:
>> ---[ end trace da907175426b4065 ]---
> 
> Add umem check.
> 
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -381,9 +381,9 @@ static unsigned int xsk_poll(struct file
>   	struct sock *sk = sock->sk;
>   	struct xdp_sock *xs = xdp_sk(sk);
>   	struct net_device *dev = xs->dev;
> -	struct xdp_umem *umem = xs->umem;
> +	struct xdp_umem *umem = READ_ONCE(xs->umem);
>   
> -	if (umem->need_wakeup)
> +	if (umem && umem->need_wakeup)
>   		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
>   						umem->need_wakeup);
>   
> --
> 

Thanks!

What do you think about making it a bit more generic, like:

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ee4428a892fa..08bed5e92af4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -356,13 +356,20 @@ static int xsk_generic_xmit(struct sock *sk, 
struct msghdr *m,
  	return err;
  }

+static bool xsk_is_bound(struct xdp_sock *xs)
+{
+	struct net_device *dev = READ_ONCE(xs->dev);
+
+	return dev && xs->state == XSK_BOUND;
+}
+
  static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t 
total_len)
  {
  	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
  	struct sock *sk = sock->sk;
  	struct xdp_sock *xs = xdp_sk(sk);

-	if (unlikely(!xs->dev))
+	if (unlikely(!xsk_is_bound(xs)))
  		return -ENXIO;
  	if (unlikely(!(xs->dev->flags & IFF_UP)))
  		return -ENETDOWN;
@@ -383,6 +390,9 @@ static unsigned int xsk_poll(struct file *file, 
struct socket *sock,
  	struct net_device *dev = xs->dev;
  	struct xdp_umem *umem = xs->umem;

+	if (unlikely(!xsk_is_bound(xs)))
+		return mask;
+
  	if (umem->need_wakeup)
  		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
  						umem->need_wakeup);
@@ -417,7 +427,7 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
  {
  	struct net_device *dev = xs->dev;

-	if (!dev || xs->state != XSK_BOUND)
+	if (!xsk_is_bound(xs))
  		return;

  	xs->state = XSK_UNBOUND;

^ permalink raw reply related

* Re: Help needed - Kernel lockup while running ipsec
From: Florian Westphal @ 2019-08-20  9:23 UTC (permalink / raw)
  To: Vakul Garg; +Cc: Florian Westphal, netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620C6E770C97AB14A04A1D98BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>

Vakul Garg <vakul.garg@nxp.com> wrote:
> > > With kernel 4.14.122, I am getting a kernel softlockup while running single
> > static ipsec tunnel.
> > > The problem reproduces mostly after running 8-10 hours of ipsec encap
> > test (on my dual core arm board).
> > >
> > > I found that in function xfrm_policy_lookup_bytype(), the policy in variable
> > 'ret' shows refcnt=0 under problem situation.
> > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and hence the
> > lockup.
> > >
> > > Can some body please provide me pointers about 'refcnt'?
> > > Is it legitimate for 'refcnt' to become '0'? Under what condition can it
> > become '0'?
> > 
> > Yes, when policy is destroyed and the last user calls
> > xfrm_pol_put() which will invoke call_rcu to free the structure.
> 
> It seems that policy reference count never gets decremented during packet ipsec encap.
> It is getting incremented for every frame that hits the policy.
> In setkey -DP output, I see refcnt to be wrapping around after '0'.

Thats a bug.  Does this affect 4.14 only or does this happen on current
tree as well?

^ permalink raw reply

* Re: [PATCH -next] bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
From: Björn Töpel @ 2019-08-20  9:25 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Björn Töpel, YueHaibing, Karlsson, Magnus,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, Netdev,
	bpf, kernel-janitors
In-Reply-To: <20190820085547.GE4451@kadam>

On Tue, 20 Aug 2019 at 10:59, Dan Carpenter <dan.carpenter@oracle.com> wrote:
>
> On Tue, Aug 20, 2019 at 09:28:26AM +0200, Björn Töpel wrote:
> > For future patches: Prefix AF_XDP socket work with "xsk:" and use "PATCH
> > bpf-next" to let the developers know what tree you're aiming for.
>
> There are over 300 trees in linux-next.  It impossible to try remember
> everyone's trees.  No one else has this requirement.
>

Net/bpf are different, and I wanted to point that out to lessen the
burden for the maintainers. It's documented in:

Documentation/bpf/bpf_devel_QA.rst.
Documentation/networking/netdev-FAQ.rst

> Maybe add it as an option to get_maintainer.pl --tree <hash> then that
> would be very easy.
>

Yes, improved tooling would help.


Cheers,
Björn

> regards,
> dan carpenter
>

^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:26 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190820092303.GM2588@breakpoint.cc>



> -----Original Message-----
> From: Florian Westphal <fw@strlen.de>
> Sent: Tuesday, August 20, 2019 2:53 PM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> Subject: Re: Help needed - Kernel lockup while running ipsec
> 
> Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > running single
> > > static ipsec tunnel.
> > > > The problem reproduces mostly after running 8-10 hours of ipsec
> > > > encap
> > > test (on my dual core arm board).
> > > >
> > > > I found that in function xfrm_policy_lookup_bytype(), the policy
> > > > in variable
> > > 'ret' shows refcnt=0 under problem situation.
> > > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and
> > > > hence the
> > > lockup.
> > > >
> > > > Can some body please provide me pointers about 'refcnt'?
> > > > Is it legitimate for 'refcnt' to become '0'? Under what condition
> > > > can it
> > > become '0'?
> > >
> > > Yes, when policy is destroyed and the last user calls
> > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> >
> > It seems that policy reference count never gets decremented during packet
> ipsec encap.
> > It is getting incremented for every frame that hits the policy.
> > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> 
> Thats a bug.  Does this affect 4.14 only or does this happen on current tree
> as well?
 
I am yet to try it on 4.19.
Can you help me with the right fix? Which part of code should it get decremented?
I am not conversant with xfrm code.


^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-20  9:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620487074796FBC015AFD098BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>



> > -----Original Message-----
> > From: Florian Westphal <fw@strlen.de>
> > Sent: Tuesday, August 20, 2019 2:53 PM
> > To: Vakul Garg <vakul.garg@nxp.com>
> > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > Subject: Re: Help needed - Kernel lockup while running ipsec
> >
> > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > running single
> > > > static ipsec tunnel.
> > > > > The problem reproduces mostly after running 8-10 hours of ipsec
> > > > > encap
> > > > test (on my dual core arm board).
> > > > >
> > > > > I found that in function xfrm_policy_lookup_bytype(), the policy
> > > > > in variable
> > > > 'ret' shows refcnt=0 under problem situation.
> > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype() and
> > > > > hence the
> > > > lockup.
> > > > >
> > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > Is it legitimate for 'refcnt' to become '0'? Under what condition
> > > > > can it
> > > > become '0'?
> > > >
> > > > Yes, when policy is destroyed and the last user calls
> > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > >
> > > It seems that policy reference count never gets decremented during
> packet
> > ipsec encap.
> > > It is getting incremented for every frame that hits the policy.
> > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> >
> > Thats a bug.  Does this affect 4.14 only or does this happen on current tree
> > as well?
> 
> I am yet to try it on 4.19.

Correction: I am yet to try it on current tree.

> Can you help me with the right fix? Which part of code should it get
> decremented?
> I am not conversant with xfrm code.


^ permalink raw reply

* [PATCH bpf-next v2 0/5] bpf: list BTF objects loaded on system
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This set adds a new command BPF_BTF_GET_NEXT_ID to the bpf() system call,
adds the relevant API function in libbpf, and uses it in bpftool to list
all BTF objects loaded on the system (and to dump the ids of maps and
programs associated with them, if any).

The main motivation of listing BTF objects is introspection and debugging
purposes. By getting BPF program and map information, it should already be
possible to list all BTF objects associated to at least one map or one
program. But there may be unattached BTF objects, held by a file descriptor
from a user space process only, and we may want to list them too.

As a side note, it also turned useful for examining the BTF objects
attached to offloaded programs, which would not show in program information
because the BTF id is not copied when retrieving such info. A fix is in
progress on that side.

v2:
- Rebase patch with new libbpf function on top of Andrii's changes
  regarding libbpf versioning.

Quentin Monnet (5):
  bpf: add new BPF_BTF_GET_NEXT_ID syscall command
  tools: bpf: synchronise BPF UAPI header with tools
  libbpf: refactor bpf_*_get_next_id() functions
  libbpf: add bpf_btf_get_next_id() to cycle through BTF objects
  tools: bpftool: implement "bpftool btf show|list"

 include/linux/bpf.h                           |   3 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/btf.c                              |   4 +-
 kernel/bpf/syscall.c                          |   4 +
 .../bpf/bpftool/Documentation/bpftool-btf.rst |   7 +
 tools/bpf/bpftool/bash-completion/bpftool     |  20 +-
 tools/bpf/bpftool/btf.c                       | 342 +++++++++++++++++-
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/bpf.c                           |  24 +-
 tools/lib/bpf/bpf.h                           |   1 +
 tools/lib/bpf/libbpf.map                      |   2 +
 11 files changed, 389 insertions(+), 20 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH bpf-next v2 1/5] bpf: add new BPF_BTF_GET_NEXT_ID syscall command
From: Quentin Monnet @ 2019-08-20  9:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190820093154.14042-1-quentin.monnet@netronome.com>

Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/linux/bpf.h      | 3 +++
 include/uapi/linux/bpf.h | 1 +
 kernel/bpf/btf.c         | 4 ++--
 kernel/bpf/syscall.c     | 4 ++++
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 15ae49862b82..5b9d22338606 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,9 @@ struct seq_file;
 struct btf;
 struct btf_type;
 
+extern struct idr btf_idr;
+extern spinlock_t btf_idr_lock;
+
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
 	/* funcs callable from userspace (via syscall) */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0ef594ac3899..8aa6126f0b6e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_BTF_GET_NEXT_ID,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5fcc7a17eb5a..e716a64b2f7f 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -195,8 +195,8 @@
 	     i < btf_type_vlen(struct_type);					\
 	     i++, member++)
 
-static DEFINE_IDR(btf_idr);
-static DEFINE_SPINLOCK(btf_idr_lock);
+DEFINE_IDR(btf_idr);
+DEFINE_SPINLOCK(btf_idr_lock);
 
 struct btf {
 	void *data;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf8052b016e7..c0f62fd67c6b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2884,6 +2884,10 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 		err = bpf_obj_get_next_id(&attr, uattr,
 					  &map_idr, &map_idr_lock);
 		break;
+	case BPF_BTF_GET_NEXT_ID:
+		err = bpf_obj_get_next_id(&attr, uattr,
+					  &btf_idr, &btf_idr_lock);
+		break;
 	case BPF_PROG_GET_FD_BY_ID:
 		err = bpf_prog_get_fd_by_id(&attr);
 		break;
-- 
2.17.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox