Netdev List

* Re: [PATCH net-next 6/8] MIPS: mscc: Add switch to ocelot
From: Andrew Lunn @ 2018-03-23 22:06 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Alexandre Belloni, David S . Miller, Allan Nielsen,
	razvan.stefanescu, po.liu, Thomas Petazzoni, netdev, devicetree,
	linux-kernel, linux-mips, James Hogan
In-Reply-To: <dcac43b7-2eb7-d409-a77c-4f671a8cfc3d@gmail.com>

> > That is the trade off of having a standalone MDIO bus driver.  Maybe
> > add a phandle to the internal MDIO bus? The switch driver could then
> > follow the phandle, and direct connect the internal PHYs?
> 
> This is more or less what patch 7 does, right?

Patch 7 does it in DT. I'm suggesting it could be done in C. It is
hard wired, so there is no need to describe it in DT. Use the phandle
to get the mdio bus, mdiobus_get_phy(, port) to get the phydev and
then use phy_connect().

     Andrew


* Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
From: Alexander Duyck @ 2018-03-23 21:53 UTC (permalink / raw)
  To: Sinan Kaya, intel-wired-lan
  Cc: sulrich, Netdev, Timur Tabi, Jeff Kirsher, linux-arm-msm,
	linux-arm-kernel
In-Reply-To: <1521831180-25014-1-git-send-email-okaya@codeaurora.org>

On Fri, Mar 23, 2018 at 11:52 AM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Code includes wmb() followed by writel() in multiple places. writel()
> already has a barrier on some architectures like arm64.
>
> As a result, the CPU observes two barriers back to back before executing
> the register write.
>
> Since the code already has an explicit barrier call, change writel() to
> writel_relaxed().
>
> I did a regex search for wmb() followed by writel() in each drivers
> directory.
> I scrubbed the ones I care about in this series.
>
> I considered "ease of change", "popular usage" and "performance critical
> path" as the determining criteria for my filtering.
>
> We used relaxed API heavily on ARM for a long time but
> it did not exist on other architectures. For this reason, relaxed
> architectures have been paying double penalty in order to use the common
> drivers.
>
> Now that relaxed API is present on all architectures, we can go and scrub
> all drivers to see what needs to change and what can remain.
>
> We start with mostly used ones and hope to increase the coverage over time.
> It will take a while to cover all drivers.
>
> Feel free to apply patches individually.

I looked over the set and they seem good.

Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>

>
> Changes since v6:
> clean up between 2..6 and then make Alex's changes on 1 and 7:
>     The mmiowb shouldn't be needed for Rx. Only one CPU will be running
>     NAPI for the queue and we will synchronize this with a full writel
>     anyway when we re-enable the interrupts.
>
> Sinan Kaya (7):
>   i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
>   ixgbe: eliminate duplicate barriers on weakly-ordered archs
>   igbvf: eliminate duplicate barriers on weakly-ordered archs
>   igb: eliminate duplicate barriers on weakly-ordered archs
>   fm10k: Eliminate duplicate barriers on weakly-ordered archs
>   ixgbevf: keep writel() closer to wmb()
>   ixgbevf: eliminate duplicate barriers on weakly-ordered archs
>
>  drivers/net/ethernet/intel/fm10k/fm10k_main.c     |  4 ++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c       | 14 ++++++++++----
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c     |  4 ++--
>  drivers/net/ethernet/intel/igb/igb_main.c         |  4 ++--
>  drivers/net/ethernet/intel/igbvf/netdev.c         |  4 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  8 ++++----
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  5 -----
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 11 ++++++++---
>  8 files changed, 30 insertions(+), 24 deletions(-)
>
> --
> 2.7.4
>
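
For readers without the patches at hand, the idea in the cover letter can be modeled in userspace with C11 atomics. This is only an analogy and an assumption on my part: C11 fences do not order MMIO, and ring_model, ring_kick_double() and ring_kick_single() are hypothetical names, not driver code. The release fence stands in for wmb(), the release store for a writel() that carries its own barrier, and the relaxed store for writel_relaxed().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for a DMA descriptor and a tail/doorbell register. */
struct ring_model {
	uint32_t desc;		/* descriptor written by the CPU */
	_Atomic uint32_t tail;	/* models the device tail register */
};

/*
 * Double ordering: an explicit release fence followed by a store that
 * orders again. Models wmb() + writel() where writel() has its own barrier.
 */
static void ring_kick_double(struct ring_model *r, uint32_t d, uint32_t t)
{
	r->desc = d;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->tail, t, memory_order_release);
}

/*
 * Single ordering: the explicit fence alone orders the descriptor write
 * before the tail update. Models wmb() + writel_relaxed().
 */
static void ring_kick_single(struct ring_model *r, uint32_t d, uint32_t t)
{
	r->desc = d;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->tail, t, memory_order_relaxed);
}
```

Both variants publish the descriptor before the tail update; the second one simply avoids paying for the ordering twice.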


* Re: [PATCH net-next 3/8] net: mscc: Add MDIO driver
From: Florian Fainelli @ 2018-03-23 21:51 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips
In-Reply-To: <20180323201117.8416-4-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> Add a driver for the Microsemi MII Management controller (MIIM) found on
> Microsemi SoCs.
> On Ocelot, there are two controllers, one is connected to the internal
> PHYs, the other one can communicate with external PHYs.
> 
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  drivers/net/ethernet/Kconfig          |   1 +
>  drivers/net/ethernet/Makefile         |   1 +
>  drivers/net/ethernet/mscc/Kconfig     |  22 ++++
>  drivers/net/ethernet/mscc/Makefile    |   2 +
>  drivers/net/ethernet/mscc/mscc_miim.c | 210 ++++++++++++++++++++++++++++++++++
>  5 files changed, 236 insertions(+)
>  create mode 100644 drivers/net/ethernet/mscc/Kconfig
>  create mode 100644 drivers/net/ethernet/mscc/Makefile
>  create mode 100644 drivers/net/ethernet/mscc/mscc_miim.c
> 
> diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
> index b6cf4b6962f5..adf643484198 100644
> --- a/drivers/net/ethernet/Kconfig
> +++ b/drivers/net/ethernet/Kconfig
> @@ -115,6 +115,7 @@ source "drivers/net/ethernet/mediatek/Kconfig"
>  source "drivers/net/ethernet/mellanox/Kconfig"
>  source "drivers/net/ethernet/micrel/Kconfig"
>  source "drivers/net/ethernet/microchip/Kconfig"
> +source "drivers/net/ethernet/mscc/Kconfig"
>  source "drivers/net/ethernet/moxa/Kconfig"
>  source "drivers/net/ethernet/myricom/Kconfig"
>  
> diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
> index 3cdf01e96e0b..ed7df22de7ff 100644
> --- a/drivers/net/ethernet/Makefile
> +++ b/drivers/net/ethernet/Makefile
> @@ -56,6 +56,7 @@ obj-$(CONFIG_NET_VENDOR_MEDIATEK) += mediatek/
>  obj-$(CONFIG_NET_VENDOR_MELLANOX) += mellanox/
>  obj-$(CONFIG_NET_VENDOR_MICREL) += micrel/
>  obj-$(CONFIG_NET_VENDOR_MICROCHIP) += microchip/
> +obj-$(CONFIG_NET_VENDOR_MICROSEMI) += mscc/
>  obj-$(CONFIG_NET_VENDOR_MOXART) += moxa/
>  obj-$(CONFIG_NET_VENDOR_MYRI) += myricom/
>  obj-$(CONFIG_FEALNX) += fealnx.o
> diff --git a/drivers/net/ethernet/mscc/Kconfig b/drivers/net/ethernet/mscc/Kconfig
> new file mode 100644
> index 000000000000..2330de6e7bb6
> --- /dev/null
> +++ b/drivers/net/ethernet/mscc/Kconfig
> @@ -0,0 +1,22 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +config NET_VENDOR_MICROSEMI
> +	bool "Microsemi devices"
> +	default y
> +	help
> +	  If you have a network (Ethernet) card belonging to this class, say Y.
> +
> +	  Note that the answer to this question doesn't directly affect the
> +	  kernel: saying N will just cause the configurator to skip all
> +	  the questions about Microsemi devices.
> +
> +if NET_VENDOR_MICROSEMI
> +
> +config MSCC_MIIM
> +	tristate "Microsemi MIIM interface support"
> +	depends on HAS_IOMEM
> +	select PHYLIB
> +	help
> +	  This driver supports the MIIM (MDIO) interface found in the network
> +	  switches of the Microsemi SoCs
> +
> +endif # NET_VENDOR_MICROSEMI
> diff --git a/drivers/net/ethernet/mscc/Makefile b/drivers/net/ethernet/mscc/Makefile
> new file mode 100644
> index 000000000000..4570e8fa4711
> --- /dev/null
> +++ b/drivers/net/ethernet/mscc/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +obj-$(CONFIG_MSCC_MIIM) += mscc_miim.o
> diff --git a/drivers/net/ethernet/mscc/mscc_miim.c b/drivers/net/ethernet/mscc/mscc_miim.c
> new file mode 100644
> index 000000000000..95b8d102c90f
> --- /dev/null
> +++ b/drivers/net/ethernet/mscc/mscc_miim.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +/*
> + * Driver for the MDIO interface of Microsemi network switches.
> + *
> + * Author: Alexandre Belloni <alexandre.belloni@bootlin.com>
> + * Copyright (c) 2017 Microsemi Corporation
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/phy.h>
> +#include <linux/platform_device.h>
> +#include <linux/bitops.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/of_mdio.h>
> +
> +#define MSCC_MIIM_REG_STATUS		0x0
> +#define		MSCC_MIIM_STATUS_STAT_BUSY	BIT(3)
> +#define MSCC_MIIM_REG_CMD		0x8
> +#define		MSCC_MIIM_CMD_OPR_WRITE		BIT(1)
> +#define		MSCC_MIIM_CMD_OPR_READ		BIT(2)
> +#define		MSCC_MIIM_CMD_WRDATA_SHIFT	4
> +#define		MSCC_MIIM_CMD_REGAD_SHIFT	20
> +#define		MSCC_MIIM_CMD_PHYAD_SHIFT	25
> +#define		MSCC_MIIM_CMD_VLD		BIT(31)
> +#define MSCC_MIIM_REG_DATA		0xC
> +#define		MSCC_MIIM_DATA_ERROR		(BIT(16) | BIT(17))
> +
> +#define MSCC_PHY_REG_PHY_CFG	0x0
> +#define		PHY_CFG_PHY_ENA		(BIT(0) | BIT(1) | BIT(2) | BIT(3))
> +#define		PHY_CFG_PHY_COMMON_RESET BIT(4)
> +#define		PHY_CFG_PHY_RESET	(BIT(5) | BIT(6) | BIT(7) | BIT(8))
> +#define MSCC_PHY_REG_PHY_STATUS	0x4
> +
> +struct mscc_miim_dev {
> +	struct mutex lock;
> +	void __iomem *regs;
> +	void __iomem *phy_regs;
> +};
> +
> +static int mscc_miim_wait_ready(struct mii_bus *bus)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	u32 val;
> +
> +	readl_poll_timeout(miim->regs + MSCC_MIIM_REG_STATUS, val,
> +			   !(val & MSCC_MIIM_STATUS_STAT_BUSY), 100, 250000);
> +	if (val & MSCC_MIIM_STATUS_STAT_BUSY)
> +		return -ETIMEDOUT;
> +
> +	return 0;
> +}
> +
> +static int mscc_miim_read(struct mii_bus *bus, int mii_id, int regnum)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	u32 val;
> +	int ret;
> +
> +	mutex_lock(&miim->lock);

What is this lock for, considering that bus->lock should always be
acquired when doing these operations? As Andrew pointed out, it needs to
be initialized with mutex_init(), but you would likely drop it instead.

> +
> +	ret = mscc_miim_wait_ready(bus);
> +	if (ret)
> +		goto out;
> +
> +	writel(MSCC_MIIM_CMD_VLD | (mii_id << MSCC_MIIM_CMD_PHYAD_SHIFT) |
> +	       (regnum << MSCC_MIIM_CMD_REGAD_SHIFT) | MSCC_MIIM_CMD_OPR_READ,
> +	       miim->regs + MSCC_MIIM_REG_CMD);
> +
> +	ret = mscc_miim_wait_ready(bus);
> +	if (ret)
> +		goto out;

Your example had an interrupt specified, can't you use that instead of
polling?
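
To make the polled read sequence quoted above easier to follow (wait for idle, issue the command, wait again, read the data), here is a rough userspace model. It is a sketch under assumptions, not the driver: struct miim_model and its busy_polls countdown are hypothetical, the 5-bit masks on the PHY and register addresses are my addition, and the poll budget of 100 is arbitrary. Unlike the quoted patch, the wait helper returns a status directly instead of re-testing the busy bit after the poll.

```c
#include <errno.h>
#include <stdint.h>

#define MIIM_STAT_BUSY		(1u << 3)
#define MIIM_CMD_OPR_READ	(1u << 2)
#define MIIM_CMD_REGAD_SHIFT	20
#define MIIM_CMD_PHYAD_SHIFT	25
#define MIIM_CMD_VLD		(1u << 31)
#define MIIM_DATA_ERROR		((1u << 16) | (1u << 17))

/* Hypothetical register model; busy_polls simulates controller latency. */
struct miim_model {
	uint32_t cmd;
	uint32_t data;
	unsigned int busy_polls;
};

/* Reports BUSY for the next busy_polls status reads, then idle. */
static uint32_t miim_status(struct miim_model *m)
{
	if (m->busy_polls) {
		m->busy_polls--;
		return MIIM_STAT_BUSY;
	}
	return 0;
}

/* Bounded poll: 0 once the controller is idle, -ETIMEDOUT otherwise. */
static int miim_wait_ready(struct miim_model *m, unsigned int max_polls)
{
	while (max_polls--) {
		if (!(miim_status(m) & MIIM_STAT_BUSY))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Build the read command word from the PHY and register addresses. */
static uint32_t miim_read_cmd(unsigned int phy, unsigned int reg)
{
	return MIIM_CMD_VLD | ((phy & 0x1f) << MIIM_CMD_PHYAD_SHIFT) |
	       ((reg & 0x1f) << MIIM_CMD_REGAD_SHIFT) | MIIM_CMD_OPR_READ;
}

static int miim_read(struct miim_model *m, unsigned int phy, unsigned int reg)
{
	int ret = miim_wait_ready(m, 100);

	if (ret)
		return ret;

	m->cmd = miim_read_cmd(phy, reg);

	ret = miim_wait_ready(m, 100);
	if (ret)
		return ret;

	if (m->data & MIIM_DATA_ERROR)
		return -EIO;

	return m->data & 0xffff;
}
```

An interrupt-driven variant would replace the second wait with a completion signalled from the IRQ handler; that part is not modeled here.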

> +
> +	val = readl(miim->regs + MSCC_MIIM_REG_DATA);
> +	if (val & MSCC_MIIM_DATA_ERROR) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = val & 0xFFFF;
> +out:
> +	mutex_unlock(&miim->lock);
> +	return ret;
> +}
> +
> +static int mscc_miim_write(struct mii_bus *bus, int mii_id,
> +			   int regnum, u16 value)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	int ret;
> +
> +	mutex_lock(&miim->lock);
> +
> +	ret = mscc_miim_wait_ready(bus);
> +	if (ret < 0)
> +		goto out;
> +
> +	writel(MSCC_MIIM_CMD_VLD | (mii_id << MSCC_MIIM_CMD_PHYAD_SHIFT) |
> +	       (regnum << MSCC_MIIM_CMD_REGAD_SHIFT) |
> +	       (value << MSCC_MIIM_CMD_WRDATA_SHIFT) |
> +	       MSCC_MIIM_CMD_OPR_WRITE,
> +	       miim->regs + MSCC_MIIM_REG_CMD);
> +
> +out:
> +	mutex_unlock(&miim->lock);
> +	return ret;
> +}
> +
> +static int mscc_miim_reset(struct mii_bus *bus)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	int i;

unsigned int i

> +
> +	if (miim->phy_regs) {
> +		writel(0, miim->phy_regs + MSCC_PHY_REG_PHY_CFG);
> +		writel(0x1ff, miim->phy_regs + MSCC_PHY_REG_PHY_CFG);
> +		mdelay(500);
> +	}
> +
> +	for (i = 0; i < PHY_MAX_ADDR; i++) {
> +		if (mscc_miim_read(bus, i, MII_PHYSID1) < 0)
> +			bus->phy_mask |= BIT(i);
> +	}

What is this used for? You have an OF MDIO bus which would create a
phy_device for each node specified, is this a similar workaround to what
drivers/net/phy/mdio-bcm-unimac.c has to do? If so, please document it
as such.
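
For context, the quoted loop builds a mask of addresses where no PHY answered, so that the MDIO core ignores them when it scans the bus. A minimal userspace sketch of that logic, assuming a hypothetical probe callback in place of mscc_miim_read():

```c
#include <stdint.h>

#define PHY_MAX_ADDR 32

/* Hypothetical probe callback: returns >= 0 if a PHY answers at addr. */
typedef int (*phy_probe_fn)(unsigned int addr);

/* Set one bit per address that failed to answer, as the quoted loop does. */
static uint32_t build_phy_mask(phy_probe_fn probe)
{
	uint32_t mask = 0;
	unsigned int i;

	for (i = 0; i < PHY_MAX_ADDR; i++) {
		if (probe(i) < 0)
			mask |= UINT32_C(1) << i;
	}

	return mask;
}

/* Example probe: only addresses 0..3 respond, like four internal PHYs. */
static int probe_four_internal(unsigned int addr)
{
	return addr < 4 ? 0x0007 : -1;
}
```

With the example probe, every bit except the low four ends up set, so only addresses 0 to 3 would be scanned.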

Other than that, this looks quite good!
-- 
Florian


* Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
From: Chas Williams @ 2018-03-23 21:50 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: David Miller, netdev, Stephen Hemminger
In-Reply-To: <80b5cdab-9d5e-ef84-7cc0-2fe517001ed3@cumulusnetworks.com>

On Fri, Mar 23, 2018 at 12:20 PM, Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
> On 23/03/18 18:17, David Miller wrote:
>> From: Chas Williams <3chas3@gmail.com>
>> Date: Thu, 22 Mar 2018 11:34:06 -0400
>>
>>> If the bridge is allowing multiple VLANs, some VLANs may have
>>> different MTUs.  Instead of choosing the minimum MTU for the
>>> bridge interface, choose the maximum MTU of the bridge members.
>>> With this the user only needs to set a larger MTU on the member
>>> ports that are participating in the large MTU VLANS.
>>>
>>> Signed-off-by: Chas Williams <3chas3@gmail.com>
>>
>> Applied, thanks.
>>
>
> Argh, this will break on builds without vlans because br->vlan_enabled shouldn't
> be accessed directly. I missed that when reviewing.
> I'll send a follow up fix in a second that uses br_vlan_enabled().
>

Sorry I missed that one.  Thanks for covering!
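
For reference, the behaviour described in the commit message can be sketched in plain C. This is an illustration only: bridge_mtu() and its parameters are hypothetical names, and the 1500-byte fallback for a bridge with no ports is an assumption, not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_MTU 1500	/* assumed fallback when there are no ports */

/*
 * Pick the bridge MTU from its member ports: the maximum when VLAN
 * filtering is enabled, the minimum otherwise.
 */
static unsigned int bridge_mtu(const unsigned int *port_mtu, size_t n,
			       bool vlan_enabled)
{
	unsigned int mtu;
	size_t i;

	if (n == 0)
		return DEFAULT_MTU;

	mtu = port_mtu[0];
	for (i = 1; i < n; i++) {
		if (vlan_enabled ? port_mtu[i] > mtu : port_mtu[i] < mtu)
			mtu = port_mtu[i];
	}

	return mtu;
}
```

The vlan_enabled flag is passed in by the caller here, which mirrors the point of the follow-up fix: the MTU helper should not read the bridge's VLAN state directly.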


* Re: [PATCH net-next 2/8] dt-bindings: net: add DT bindings for Microsemi MIIM
From: Florian Fainelli @ 2018-03-23 21:46 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips,
	Rob Herring
In-Reply-To: <20180323201117.8416-3-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> DT bindings for the Microsemi MII Management Controller found on Microsemi
> SoCs
> 
> Cc: Rob Herring <robh+dt@kernel.org>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  .../devicetree/bindings/net/mscc-miim.txt          | 25 ++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mscc-miim.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/mscc-miim.txt b/Documentation/devicetree/bindings/net/mscc-miim.txt
> new file mode 100644
> index 000000000000..711ac9ab853c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/mscc-miim.txt
> @@ -0,0 +1,25 @@
> +Microsemi MII Management Controller (MIIM) / MDIO
> +=================================================
> +
> +Properties:
> +- compatible: must be "mscc,ocelot-miim"
> +- reg: The base address of the MDIO bus controller register bank. Optionally, a
> +  second register bank can be defined if there is an associated reset register
> +  for internal PHYs
> +- #address-cells: Must be <1>.
> +- #size-cells: Must be <0>.  MDIO addresses have no size component.

Missing interrupt property documentation (sorry), other than that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian


* Re: rcu: Add might_sleep() check to synchronize_rcu()
From: Thomas Gleixner @ 2018-03-23 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul E. McKenney, Peter Zijlstra, Josh Triplett,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, netdev,
	James Chapman
In-Reply-To: <20180323174023.60579f45@gandalf.local.home>

On Fri, 23 Mar 2018, Steven Rostedt wrote:
> On Fri, 23 Mar 2018 22:33:29 +0100 (CET)
> [  150.741223]  [<ffffffff81290287>] synchronize_rcu+0x27/0x90
> [  150.746908]  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
> 
> Looks like __l2tp_session_unhash() is the real culprit here.

Yes. I reported that to netdev already.

> [  150.753281]  [<ffffffff8358891f>] ? __l2tp_session_unhash+0x1bf/0x550
> [  150.759828]  [<ffffffff8114596a>] ? __local_bh_enable_ip+0x6a/0xd0
> [  150.766123]  [<ffffffff8358ddb0>] ? l2tp_udp_encap_recv+0xd90/0xd90
> [  150.772497]  [<ffffffff83588e97>] l2tp_tunnel_closeall+0x1e7/0x3a0
> [  150.778782]  [<ffffffff835897be>] l2tp_tunnel_destruct+0x30e/0x5a0
> [  150.785067]  [<ffffffff8358965a>] ? l2tp_tunnel_destruct+0x1aa/0x5a0
> [  150.791537]  [<ffffffff835894b0>] ? l2tp_tunnel_del_work+0x460/0x460
> [  150.797997]  [<ffffffff82ee8053>] __sk_destruct+0x53/0x570
> [  150.803588]  [<ffffffff81293918>] rcu_process_callbacks+0x898/0x1300
> [  150.810048]  [<ffffffff812939f7>] ? rcu_process_callbacks+0x977/0x1300
> [  150.816684]  [<ffffffff82ee8000>] ? __sk_dst_check+0x240/0x240
> [  150.822625]  [<ffffffff838be5d6>] __do_softirq+0x206/0x951
> [  150.828223]  [<ffffffff81147315>] irq_exit+0x165/0x190
> [  150.833557]  [<ffffffff838bd1eb>] smp_apic_timer_interrupt+0x7b/0xa0
> [  150.840018]  [<ffffffff838b9470>] apic_timer_interrupt+0xa0/0xb0
> [  150.846132]  <EOI> [  150.848166]  [<ffffffff838b6756>] ? native_safe_halt+0x6/0x10
> [  150.854036]  [<ffffffff8123bf2d>] ? trace_hardirqs_on+0xd/0x10
> [  150.859973]  [<ffffffff838b5d85>] default_idle+0x55/0x360
> [  150.865478]  [<ffffffff8106be0a>] arch_cpu_idle+0xa/0x10
> 
> I think you want this instead, as __l2tp_session_unhash is what looks
> like might be hiding the call to synchronize_rcu(). It's not called in
> all instances, and I don't think your patch would have triggered the
> issues before hand. You want this:
>
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 194a7483bb93..857b494bee29 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1677,6 +1677,8 @@ void __l2tp_session_unhash(struct l2tp_session *session)
>  {
>  	struct l2tp_tunnel *tunnel = session->tunnel;
>  
> +	might_sleep();
> +
>  	/* Remove the session from core hashes */
>  	if (tunnel) {
>  		/* Remove from the per-tunnel hash */

That too :)
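
The reason an unconditional check at the top of the function helps can be shown with a small userspace model. Everything here is hypothetical: atomic_depth stands in for the kernel's notion of atomic context, model_might_sleep() for might_sleep(), and the hashed flag for whatever condition leads to synchronize_rcu() being called.

```c
#include <stdbool.h>

static int atomic_depth;		/* > 0 means "in atomic context" */
static unsigned int sleep_violations;	/* number of checks that fired */

/* Models might_sleep(): complain when called from atomic context. */
static void model_might_sleep(void)
{
	if (atomic_depth > 0)
		sleep_violations++;
}

/* Models a sleeping primitive that carries its own check. */
static void model_synchronize(void)
{
	model_might_sleep();
}

/*
 * Models a function that only sometimes reaches the sleeping primitive.
 * The check at the top fires on every call from atomic context, not only
 * on the calls that happen to take the sleeping path.
 */
static void session_unhash_model(bool hashed)
{
	model_might_sleep();

	if (hashed)
		model_synchronize();
}
```

Without the check at the top, a call with hashed == false from atomic context would go unnoticed, which is how such bugs stay hidden until the rare path is taken.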


* Re: [PATCH net-next 6/8] MIPS: mscc: Add switch to ocelot
From: Florian Fainelli @ 2018-03-23 21:44 UTC (permalink / raw)
  To: Andrew Lunn, Alexandre Belloni
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, netdev, devicetree, linux-kernel, linux-mips,
	James Hogan
In-Reply-To: <20180323213344.GV24361@lunn.ch>

On 03/23/2018 02:33 PM, Andrew Lunn wrote:
> On Fri, Mar 23, 2018 at 10:22:30PM +0100, Alexandre Belloni wrote:
>> On 23/03/2018 at 14:17:48 -0700, Florian Fainelli wrote:
>>> On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
>>>> +
>>>> +			phy0: ethernet-phy@0 {
>>>> +				reg = <0>;
>>>> +			};
>>>> +			phy1: ethernet-phy@1 {
>>>> +				reg = <1>;
>>>> +			};
>>>> +			phy2: ethernet-phy@2 {
>>>> +				reg = <2>;
>>>> +			};
>>>> +			phy3: ethernet-phy@3 {
>>>> +				reg = <3>;
>>>> +			};
>>>
>>> These PHYs should be defined at the board DTS level.
>>
>> Those are internal PHYs, present on the SoC, I doubt anyone will have
>> anything different while using the same SoC.
> 
> With DSA, there is no need to list internal PHYs.
> 
> That is the trade off of having a standalone MDIO bus driver.  Maybe
> add a phandle to the internal MDIO bus? The switch driver could then
> follow the phandle, and direct connect the internal PHYs?

This is more or less what patch 7 does, right?
-- 
Florian


* Re: [PATCH net-next 5/8] net: mscc: Add initial Ocelot switch support
From: Florian Fainelli @ 2018-03-23 21:41 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips
In-Reply-To: <20180323201117.8416-6-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> Add a driver for Microsemi Ocelot Ethernet switch support.
> 
> This makes two modules:
> mscc_ocelot_common handles all the common features that don't depend on
> how the switch is integrated in the SoC. Currently, it handles offloading
> bridging to the hardware. ocelot_io.c handles register accesses. This is
> unfortunately needed because the register layout is packed and then depends
> on the number of ports available on the switch. The register definition
> files are automatically generated.
> 
> ocelot_board handles the switch integration on the SoC and on the board.
> 
> Frame injection and extraction to/from the CPU port is currently done using
> register accesses which is quite slow. DMA is possible but the port is not
> able to absorb the whole switch bandwidth.
> 
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>

Random drive-by comments because this is quite a number of lines to review!

Overall, looks quite good for a first version. Out of curiosity, is
there a particular switch test you ran this driver against? LNST?

> +static int ocelot_mact_learn(struct ocelot *ocelot, int port,
> +			     const unsigned char mac[ETH_ALEN],
> +			     unsigned int vid,
> +			     enum macaccess_entry_type type)
> +{
> +	u32 macl = 0, mach = 0;
> +
> +	/* Set the MAC address to learn and the vlan associated in a format
> +	 * understood by the hardware.
> +	 */
> +	mach |= vid    << 16;
> +	mach |= mac[0] << 8;
> +	mach |= mac[1] << 0;
> +	macl |= mac[2] << 24;
> +	macl |= mac[3] << 16;
> +	macl |= mac[4] << 8;
> +	macl |= mac[5] << 0;
> +
> +	ocelot_write(ocelot, macl, ANA_TABLES_MACLDATA);
> +	ocelot_write(ocelot, mach, ANA_TABLES_MACHDATA);

You are repeating this in the function right below, can you factor it
somehow into a common function that this one, and the one right below
could call?
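
One possible shape for such a helper, sketched in userspace C. The name mact_pack() is hypothetical; the bit layout is copied from the quoted code. The casts to uint32_t are my addition: they keep mac[2] << 24 well defined in standard C when the top bit of the byte is set.

```c
#include <stdint.h>

#define ETH_ALEN 6

/*
 * Pack a MAC address and VLAN ID into the two words the quoted code
 * writes to ANA_TABLES_MACHDATA and ANA_TABLES_MACLDATA.
 */
static void mact_pack(const unsigned char mac[ETH_ALEN], unsigned int vid,
		      uint32_t *mach, uint32_t *macl)
{
	*mach = ((uint32_t)vid << 16) |
		((uint32_t)mac[0] << 8) |
		((uint32_t)mac[1] << 0);
	*macl = ((uint32_t)mac[2] << 24) |
		((uint32_t)mac[3] << 16) |
		((uint32_t)mac[4] << 8) |
		((uint32_t)mac[5] << 0);
}
```

Both the learn and forget paths could then call one small function that packs and writes the two registers, leaving only the command differing between them.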

[snip]

> +static void ocelot_port_adjust_link(struct net_device *dev)
> +{

This is fine for now, but I would suggest implementing PHYLINK to be
future proof.

[snip]

> +static int ocelot_port_stop(struct net_device *dev)
> +{
> +	struct ocelot_port *port = netdev_priv(dev);
> +
> +	phy_disconnect(port->phy);
> +
> +	dev->phydev = NULL;

Don't you have anything else to do here, like disabling the port so
that it possibly saves power, aside from the PHY, which will be
suspended here?

[snip]

> +static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct ocelot_port *port = netdev_priv(dev);
> +	struct ocelot *ocelot = port->ocelot;
> +	u32 val, ifh[IFH_LEN];
> +	struct frame_info info = {};
> +	u8 grp = 0; /* Send everything on CPU group 0 */
> +	int i, count, last;

unsigned int for these types.

> +
> +	val = ocelot_read(ocelot, QS_INJ_STATUS);
> +	if (!(val & QS_INJ_STATUS_FIFO_RDY(BIT(grp))) ||
> +	    (val & QS_INJ_STATUS_WMARK_REACHED(BIT(grp))))
> +		return NETDEV_TX_BUSY;
> +
> +	ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) |
> +			 QS_INJ_CTRL_SOF, QS_INJ_CTRL, grp);
> +
> +	info.port = BIT(port->chip_port);
> +	info.cpuq = 0xff;
> +	ocelot_gen_ifh(ifh, &info);
> +
> +	for (i = 0; i < IFH_LEN; i++)
> +		ocelot_write_rix(ocelot, ifh[i], QS_INJ_WR, grp);
> +
> +	count = (skb->len + 3) / 4;
> +	last = skb->len % 4;
> +	for (i = 0; i < count; i++) {
> +		ocelot_write_rix(ocelot, cpu_to_le32(((u32 *)skb->data)[i]),
> +				 QS_INJ_WR, grp);
> +	}
> +
> +	/* Add padding */
> +	while (i < (OCELOT_BUFFER_CELL_SZ / 4)) {
> +		ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp);
> +		i++;
> +	}
> +
> +	/* Indicate EOF and valid bytes in last word */
> +	ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) |
> +			 QS_INJ_CTRL_VLD_BYTES(skb->len < OCELOT_BUFFER_CELL_SZ ? 0 : last) |
> +			 QS_INJ_CTRL_EOF,
> +			 QS_INJ_CTRL, grp);
> +
> +	/* Add dummy CRC */
> +	ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp);
> +	skb_tx_timestamp(skb);
> +
> +	dev->stats.tx_packets++;
> +	dev->stats.tx_bytes += skb->len;
> +	dev_kfree_skb_any(skb);

No interrupt to indicate transmit completion?


> +static int ocelot_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
> +			  struct net_device *dev, const unsigned char *addr,
> +			  u16 vid, u16 flags)
> +{
> +	struct ocelot_port *port = netdev_priv(dev);
> +	struct ocelot *ocelot = port->ocelot;
> +
> +	if (!vid) {
> +		if (!port->vlan_aware)
> +			/* If the bridge is not VLAN aware and no VID was
> +			 * provided, set it to 1 as bridges have a default VID
> +			 * of 1. Otherwise the MAC entry wouldn't match incoming
> +			 * packets as the VID would differ (0 != 1).
> +			 */
> +			vid = 1;
> +		else
> +			/* If the bridge is VLAN aware a VID must be provided as
> +			 * otherwise the learnt entry wouldn't match any frame.
> +			 */
> +			return -EINVAL;
> +	}

So if we are targeting vid = 0 we end-up with vid = 1 possibly?

[snip]

> +static int ocelot_port_attr_stp_state_set(struct ocelot_port *ocelot_port,
> +					  struct switchdev_trans *trans,
> +					  u8 state)
> +{
> +	struct ocelot *ocelot = ocelot_port->ocelot;
> +	u32 port_cfg;
> +	int port, i;
> +
> +	if (switchdev_trans_ph_prepare(trans))
> +		return 0;
> +
> +	if (!(BIT(ocelot_port->chip_port) & ocelot->bridge_mask))
> +		return 0;
> +
> +	port_cfg = ocelot_read_gix(ocelot, ANA_PORT_PORT_CFG,
> +				   ocelot_port->chip_port);
> +
> +	switch (state) {
> +	case BR_STATE_FORWARDING:
> +		ocelot->bridge_fwd_mask |= BIT(ocelot_port->chip_port);
> +		/* Fallthrough */
> +	case BR_STATE_LEARNING:
> +		port_cfg |= ANA_PORT_PORT_CFG_LEARN_ENA;
> +		break;
> +
> +	default:
> +		port_cfg &= ~ANA_PORT_PORT_CFG_LEARN_ENA;
> +		ocelot->bridge_fwd_mask &= ~BIT(ocelot_port->chip_port);

Missing break, even if this is the default case.

> +	}
> +
> +	ocelot_write_gix(ocelot, port_cfg, ANA_PORT_PORT_CFG,
> +			 ocelot_port->chip_port);
> +
> +	/* Apply FWD mask. The loop is needed to add/remove the current port as
> +	 * a source for the other ports.
> +	 */
> +	for (port = 0; port < ocelot->num_phys_ports; port++) {
> +		if (ocelot->bridge_fwd_mask & BIT(port)) {
> +			unsigned long mask = ocelot->bridge_fwd_mask & ~BIT(port);
> +
> +			for (i = 0; i < ocelot->num_phys_ports; i++) {
> +				unsigned long bond_mask = ocelot->lags[i];
> +
> +				if (!bond_mask)
> +					continue;
> +
> +				if (bond_mask & BIT(port)) {
> +					mask &= ~bond_mask;
> +					break;
> +				}
> +			}
> +
> +			ocelot_write_rix(ocelot,
> +					 BIT(ocelot->num_phys_ports) | mask,
> +					 ANA_PGID_PGID, PGID_SRC + port);
> +		} else {
> +			/* Only the CPU port, this is compatible with link
> +			 * aggregation.
> +			 */
> +			ocelot_write_rix(ocelot,
> +					 BIT(ocelot->num_phys_ports),
> +					 ANA_PGID_PGID, PGID_SRC + port);
> +		}

All of this sounds like it should be moved into the br_join/leave, this
does not appear to be the right place to do that.
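
Wherever it ends up living, the mask computation itself is easy to isolate from the register writes. A userspace sketch of the quoted loop, with a hypothetical fixed port count of 4 and a hypothetical function name; the CPU port is bit NUM_PORTS, as in the patch:

```c
#include <stdint.h>

#define NUM_PORTS 4		/* hypothetical number of physical ports */
#define BIT(n) (UINT32_C(1) << (n))

/*
 * For each port, compute the set of ports it may forward to: always the
 * CPU port, plus every other forwarding port that is not part of the
 * same link aggregation group.
 */
static void compute_src_masks(uint32_t fwd_mask,
			      const uint32_t lags[NUM_PORTS],
			      uint32_t src[NUM_PORTS])
{
	unsigned int port, i;

	for (port = 0; port < NUM_PORTS; port++) {
		uint32_t mask = 0;

		if (fwd_mask & BIT(port)) {
			mask = fwd_mask & ~BIT(port);

			/* Drop the members of this port's own bond. */
			for (i = 0; i < NUM_PORTS; i++) {
				if (lags[i] & BIT(port)) {
					mask &= ~lags[i];
					break;
				}
			}
		}

		src[port] = BIT(NUM_PORTS) | mask;
	}
}
```

Keeping this as a pure function means the same code can be called from the STP state handler and from bridge join/leave without duplicating the loop.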

[snip]

> +static int ocelot_port_attr_set(struct net_device *dev,
> +				const struct switchdev_attr *attr,
> +				struct switchdev_trans *trans)
> +{
> +	struct ocelot_port *ocelot_port = netdev_priv(dev);
> +	int err = 0;

Shouldn't this be -EOPNOTSUPP by default, so that the cases below, like
BRIDGE_FLAGS, MROUTER, etc., are properly handled?
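
One reading of this suggestion, as a userspace sketch: start from -EOPNOTSUPP and only clear the error in the cases that actually do something. The enum and function names are hypothetical stand-ins for the switchdev attribute IDs and the driver callback.

```c
#include <errno.h>

/* Hypothetical stand-ins for the SWITCHDEV_ATTR_ID_* values. */
enum attr_id {
	ATTR_STP_STATE,
	ATTR_BRIDGE_FLAGS,
	ATTR_AGEING_TIME,
	ATTR_MROUTER,
	ATTR_MC_DISABLED,
	ATTR_OTHER,
};

static int port_attr_set_model(enum attr_id id)
{
	int err = -EOPNOTSUPP;

	switch (id) {
	case ATTR_STP_STATE:
	case ATTR_AGEING_TIME:
	case ATTR_MC_DISABLED:
		/* The real handler would program the hardware here. */
		err = 0;
		break;
	case ATTR_BRIDGE_FLAGS:
	case ATTR_MROUTER:
		/* Recognized but not implemented: keep -EOPNOTSUPP. */
		break;
	default:
		break;
	}

	return err;
}
```

With this shape, an attribute that is listed but has an empty case no longer reports success to the caller.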

> +
> +	switch (attr->id) {
> +	case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
> +		ocelot_port_attr_stp_state_set(ocelot_port, trans,
> +					       attr->u.stp_state);
> +		break;
> +	case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
> +		break;
> +	case SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME:
> +		ocelot_port_attr_ageing_set(ocelot_port, attr->u.ageing_time);
> +		break;
> +	case SWITCHDEV_ATTR_ID_PORT_MROUTER:
> +		break;
> +	case SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED:
> +		ocelot_port_attr_mc_set(ocelot_port, !attr->u.mc_disabled);
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	return err;
> +}
> +
> +static struct ocelot_multicast *ocelot_multicast_get(struct ocelot *ocelot,
> +						     const unsigned char *addr,
> +						     u16 vid)
> +{
> +	struct ocelot_multicast *mc;
> +
> +	list_for_each_entry(mc, &ocelot->multicast, list) {
> +		if (ether_addr_equal(mc->addr, addr) && mc->vid == vid)
> +			return mc;
> +	}
> +
> +	return NULL;
> +}


> +static irqreturn_t ocelot_xtr_irq_handler(int irq, void *arg)
> +{
> +	struct ocelot *ocelot = arg;
> +	int i = 0, grp = 0;
> +	int err = 0;
> +
> +	if (!(ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)))
> +		return IRQ_NONE;
> +
> +	do {
> +		struct sk_buff *skb;
> +		struct net_device *dev;
> +		u32 *buf;
> +		int sz, len;
> +		u32 ifh[4];
> +		u32 val;
> +		struct frame_info info;
> +
> +		for (i = 0; i < IFH_LEN; i++) {
> +			err = ocelot_rx_frame_word(ocelot, grp, true, &ifh[i]);
> +			if (err != 4)
> +				break;
> +		}

NAPI maybe?

[snip]


> +	ocelot->targets[SYS] = ocelot_io_platform_init(ocelot, pdev, "sys");
> +	if (IS_ERR(ocelot->targets[SYS]))
> +		return PTR_ERR(ocelot->targets[SYS]);

You can clearly do this in a loop instead of repeating this section;
you just need an array of register names to look for.
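
A table-driven version could look roughly like the userspace sketch below. The target list beyond SYS is a guess based on the reg-names in the DT patch, map_fn is a hypothetical stand-in for ocelot_io_platform_init(), and the two fake mappers exist only to exercise the loop.

```c
#include <stddef.h>
#include <string.h>

enum target { SYS, REW, QS, HSIO, TARGET_MAX };

struct target_res {
	enum target id;
	const char *name;
};

static const struct target_res targets_res[] = {
	{ SYS, "sys" },
	{ REW, "rew" },
	{ QS, "qs" },
	{ HSIO, "hsio" },
};

/* Hypothetical mapper: returns NULL when the named region is missing. */
typedef void *(*map_fn)(const char *name);

/* Map every listed region; stop at the first failure. */
static int map_all_targets(void *targets[TARGET_MAX], map_fn map)
{
	size_t i;

	for (i = 0; i < sizeof(targets_res) / sizeof(targets_res[0]); i++) {
		void *p = map(targets_res[i].name);

		if (!p)
			return -1;

		targets[targets_res[i].id] = p;
	}

	return 0;
}

static char fake_region;

/* Fake mapper that always succeeds. */
static void *fake_map_ok(const char *name)
{
	(void)name;
	return &fake_region;
}

/* Fake mapper that fails for "qs" only. */
static void *fake_map_no_qs(const char *name)
{
	return strcmp(name, "qs") ? (void *)&fake_region : NULL;
}
```

Adding a new register target then becomes a one-line change to the table.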

[snip]

> +	if (np) {

Please rework the indentation here, check for !np

> +		for_each_child_of_node(np, portnp) {

Use for_each_available_child_of_node() here, so that specific ports can
be marked as disabled and skipped over accordingly.


[snip]
> +int ocelot_regfields_init(struct ocelot *ocelot,
> +			  const struct reg_field *const regfields)
> +{
> +	int i;

unsigned int i
-- 
Florian


* Re: [PATCH net-next 6/8] MIPS: mscc: Add switch to ocelot
From: Andrew Lunn @ 2018-03-23 21:33 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: Florian Fainelli, David S . Miller, Allan Nielsen,
	razvan.stefanescu, po.liu, Thomas Petazzoni, netdev, devicetree,
	linux-kernel, linux-mips, James Hogan
In-Reply-To: <20180323212230.GA12808@piout.net>

On Fri, Mar 23, 2018 at 10:22:30PM +0100, Alexandre Belloni wrote:
> On 23/03/2018 at 14:17:48 -0700, Florian Fainelli wrote:
> > On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> > > +
> > > +			phy0: ethernet-phy@0 {
> > > +				reg = <0>;
> > > +			};
> > > +			phy1: ethernet-phy@1 {
> > > +				reg = <1>;
> > > +			};
> > > +			phy2: ethernet-phy@2 {
> > > +				reg = <2>;
> > > +			};
> > > +			phy3: ethernet-phy@3 {
> > > +				reg = <3>;
> > > +			};
> > 
> > These PHYs should be defined at the board DTS level.
> 
> Those are internal PHYs, present on the SoC, I doubt anyone will have
> anything different while using the same SoC.

With DSA, there is no need to list internal PHYs.

That is the trade off of having a standalone MDIO bus driver.  Maybe
add a phandle to the internal MDIO bus? The switch driver could then
follow the phandle, and direct connect the internal PHYs?

       Andrew


* Re: [PATCH net-next 5/8] net: mscc: Add initial Ocelot switch support
From: Andrew Lunn @ 2018-03-23 21:25 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Florian Fainelli, netdev, devicetree,
	linux-kernel, linux-mips
In-Reply-To: <20180323201117.8416-6-alexandre.belloni@bootlin.com>

Hi Alexandre

> +static int ocelot_port_bridge_join(struct ocelot_port *ocelot_port,
> +				   struct net_device *bridge)
> +{
> +	struct ocelot *ocelot = ocelot_port->ocelot;
> +
> +	if (!ocelot->bridge_mask) {
> +		ocelot->hw_bridge_dev = bridge;
> +	} else {
> +		if (ocelot->hw_bridge_dev != bridge)
> +			return -ENODEV; //TODO: do something clever here
> +	}
> +
> +	ocelot->bridge_mask |= BIT(ocelot_port->chip_port);
> +
> +	return 0;
> +}

I could be missing something here, but this looks way too simple.

By default, you have individual ports. They can just forward frames
between the host CPU and the port. There is no port-to-port traffic
allowed.

You then create a bridge, and add ports to the bridge. When you add a
port, you need to program the hardware so that it can now forward frames
from this port to the other ports in the same bridge. You also need to
tell the other ports in the same bridge that they can forward frames to
it. You might also need to tell the port which forwarding database it
should use, since each bridge might have its own.
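
As a rough illustration of the bookkeeping described above, here is a userspace model, not a proposal for the driver. The structure, the port count of 4 and all function names are hypothetical. Each port starts standalone, and joining or leaving a bridge recomputes which ports may forward to which.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 4		/* hypothetical port count */
#define NO_BRIDGE (-1)

struct sw_model {
	int bridge_of[NUM_PORTS];	/* bridge id per port, or NO_BRIDGE */
	uint32_t fwd[NUM_PORTS];	/* ports each port may forward to */
};

/* Rebuild every port's forwarding mask from bridge membership. */
static void sw_recompute(struct sw_model *sw)
{
	unsigned int p, q;

	for (p = 0; p < NUM_PORTS; p++) {
		sw->fwd[p] = 0;
		if (sw->bridge_of[p] == NO_BRIDGE)
			continue;

		for (q = 0; q < NUM_PORTS; q++) {
			if (q != p && sw->bridge_of[q] == sw->bridge_of[p])
				sw->fwd[p] |= UINT32_C(1) << q;
		}
	}
}

/* All ports start standalone: no port-to-port forwarding. */
static void sw_init(struct sw_model *sw)
{
	unsigned int p;

	for (p = 0; p < NUM_PORTS; p++)
		sw->bridge_of[p] = NO_BRIDGE;
	sw_recompute(sw);
}

static void sw_join(struct sw_model *sw, unsigned int port, int bridge)
{
	assert(port < NUM_PORTS);
	assert(bridge >= 0);
	sw->bridge_of[port] = bridge;
	sw_recompute(sw);
}

static void sw_leave(struct sw_model *sw, unsigned int port)
{
	assert(port < NUM_PORTS);
	sw->bridge_of[port] = NO_BRIDGE;
	sw_recompute(sw);
}
```

The per-bridge forwarding database mentioned above is not modeled; the bridge id stored per port is where such a selection would hang.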

     Andrew


* Re: [PATCH net-next 6/8] MIPS: mscc: Add switch to ocelot
From: Alexandre Belloni @ 2018-03-23 21:22 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Andrew Lunn, netdev, devicetree, linux-kernel,
	linux-mips, James Hogan
In-Reply-To: <e488fd29-0094-d005-a078-873f6f5add13@gmail.com>

On 23/03/2018 at 14:17:48 -0700, Florian Fainelli wrote:
> On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> > +
> > +			phy0: ethernet-phy@0 {
> > +				reg = <0>;
> > +			};
> > +			phy1: ethernet-phy@1 {
> > +				reg = <1>;
> > +			};
> > +			phy2: ethernet-phy@2 {
> > +				reg = <2>;
> > +			};
> > +			phy3: ethernet-phy@3 {
> > +				reg = <3>;
> > +			};
> 
> These PHYs should be defined at the board DTS level.

Those are internal PHYs, present on the SoC. I doubt anyone will have
anything different while using the same SoC.


-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


* Re: [PATCH net-next 6/8] MIPS: mscc: Add switch to ocelot
From: Florian Fainelli @ 2018-03-23 21:17 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips,
	James Hogan
In-Reply-To: <20180323201117.8416-7-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> Ocelot has an integrated switch, add support for it.
> 
> Cc: James Hogan <jhogan@kernel.org>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  arch/mips/boot/dts/mscc/ocelot.dtsi | 84 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
> 
> diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
> index dd239cab2f9d..22a86373b1c9 100644
> --- a/arch/mips/boot/dts/mscc/ocelot.dtsi
> +++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
> @@ -91,6 +91,69 @@
>  			status = "disabled";
>  		};
>  
> +		switch@1010000 {
> +			#address-cells = <1>;
> +			#size-cells = <0>;
> +			compatible = "mscc,ocelot-switch";
> +			reg = <0x1010000 0x10000>,
> +			      <0x1030000 0x10000>,
> +			      <0x1080000 0x100>,
> +			      <0x10d0000 0x10000>,
> +			      <0x11e0000 0x100>,
> +			      <0x11f0000 0x100>,
> +			      <0x1200000 0x100>,
> +			      <0x1210000 0x100>,
> +			      <0x1220000 0x100>,
> +			      <0x1230000 0x100>,
> +			      <0x1240000 0x100>,
> +			      <0x1250000 0x100>,
> +			      <0x1260000 0x100>,
> +			      <0x1270000 0x100>,
> +			      <0x1280000 0x100>,
> +			      <0x1800000 0x80000>,
> +			      <0x1880000 0x10000>;
> +			reg-names = "sys", "rew", "qs", "hsio", "port0",
> +				    "port1", "port2", "port3", "port4", "port5",
> +				    "port6", "port7", "port8", "port9", "port10",
> +				    "qsys", "ana";
> +			interrupts = <21 22>;
> +			interrupt-names = "xtr", "inj";

See my comment on the binding patch: this should be moved to a ports
subnode so that it conforms to the existing DSA binding, which also makes
it a lot easier to have all ports disabled by default at the .dtsi level by
not defini

> +
> +			port0: port@0 {
> +				reg = <0>;
> +			};
> +			port1: port@1 {
> +				reg = <1>;
> +			};
> +			port2: port@2 {
> +				reg = <2>;
> +			};
> +			port3: port@3 {
> +				reg = <3>;
> +			};
> +			port4: port@4 {
> +				reg = <4>;
> +			};
> +			port5: port@5 {
> +				reg = <5>;
> +			};
> +			port6: port@6 {
> +				reg = <6>;
> +			};
> +			port7: port@7 {
> +				reg = <7>;
> +			};
> +			port8: port@8 {
> +				reg = <8>;
> +			};
> +			port9: port@9 {
> +				reg = <9>;
> +			};
> +			port10: port@10 {
> +				reg = <10>;
> +			};
> +		};
> +
>  		reset@1070008 {
>  			compatible = "mscc,ocelot-chip-reset";
>  			reg = <0x1070008 0x4>;
> @@ -113,5 +176,26 @@
>  				function = "uart2";
>  			};
>  		};
> +
> +		mdio0: mdio@107009c {
> +			#address-cells = <1>;
> +			#size-cells = <0>;
> +			compatible = "mscc,ocelot-miim";
> +			reg = <0x107009c 0x36>, <0x10700f0 0x8>;
> +			interrupts = <14>;

status = "disabled" by default?

> +
> +			phy0: ethernet-phy@0 {
> +				reg = <0>;
> +			};
> +			phy1: ethernet-phy@1 {
> +				reg = <1>;
> +			};
> +			phy2: ethernet-phy@2 {
> +				reg = <2>;
> +			};
> +			phy3: ethernet-phy@3 {
> +				reg = <3>;
> +			};

These PHYs should be defined at the board DTS level.
-- 
Florian


* Re: [PATCH net-next 4/8] dt-bindings: net: add DT bindings for Microsemi Ocelot Switch
From: Florian Fainelli @ 2018-03-23 21:11 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips,
	Rob Herring
In-Reply-To: <20180323201117.8416-5-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> DT bindings for the Ethernet switch found on Microsemi Ocelot platforms.
> 
> Cc: Rob Herring <robh+dt@kernel.org>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  .../devicetree/bindings/net/mscc-ocelot.txt        | 62 ++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mscc-ocelot.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
> new file mode 100644
> index 000000000000..ee092a85b5a0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
> @@ -0,0 +1,62 @@
> +Microsemi Ocelot network Switch
> +===============================
> +
> +The Microsemi Ocelot network switch can be found on Microsemi SoCs (VSC7513,
> +VSC7514)
> +
> +Required properties:
> +- compatible: Should be "mscc,ocelot-switch"
> +- reg: Must contain an (offset, length) pair of the register set for each
> +  entry in reg-names.
> +- reg-names: Must include the following entries:
> +  - "sys"
> +  - "rew"
> +  - "qs"
> +  - "hsio"
> +  - "qsys"
> +  - "ana"
> +  - "portX" with X from 0 to the number of last port index available on that
> +    switch
> +- interrupts: Should contain the switch interrupts for frame extraction and
> +  frame injection
> +- interrupt-names: should contain the interrupt names: "xtr", "inj"

You are not documenting the "ports" subnode(s). Please move the
individual port definitions under a ports subnode, mainly for two reasons:

- it makes it easy at the .dtsi level to have all ports disabled by default

- it makes the binding strictly conform to the DSA binding for Ethernet
  switches, which is good for consistency (both in the parsing code and in
  the representation).
-- 
Florian


* Re: [PATCH net-next 1/8] net: phy: Add initial support for Microsemi Ocelot internal PHYs.
From: Florian Fainelli @ 2018-03-23 21:08 UTC (permalink / raw)
  To: Alexandre Belloni, David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, netdev, devicetree, linux-kernel, linux-mips,
	Raju Lakkaraju
In-Reply-To: <20180323201117.8416-2-alexandre.belloni@bootlin.com>

On 03/23/2018 01:11 PM, Alexandre Belloni wrote:
> Add Microsemi Ocelot internal PHY ids. For now, simply use the genphy
> functions but more features are available.
> 
> Cc: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  drivers/net/phy/mscc.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index 650c2667d523..e1ab3acd1cdb 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -91,6 +91,7 @@ enum rgmii_rx_clock_delay {
>  #define SECURE_ON_PASSWD_LEN_4		  0x4000
>  
>  /* Microsemi PHY ID's */
> +#define PHY_ID_OCELOT			  0x00070540
>  #define PHY_ID_VSC8530			  0x00070560
>  #define PHY_ID_VSC8531			  0x00070570
>  #define PHY_ID_VSC8540			  0x00070760
> @@ -658,6 +659,19 @@ static int vsc85xx_probe(struct phy_device *phydev)
>  
>  /* Microsemi VSC85xx PHYs */
>  static struct phy_driver vsc85xx_driver[] = {
> +{
> +	.phy_id		= PHY_ID_OCELOT,
> +	.name		= "Microsemi OCELOT",
> +	.phy_id_mask    = 0xfffffff0,
> +	.features	= PHY_GBIT_FEATURES,
> +	.soft_reset	= &genphy_soft_reset,
> +	.config_init	= &genphy_config_init,
> +	.config_aneg	= &genphy_config_aneg,
> +	.aneg_done	= &genphy_aneg_done,
> +	.read_status	= &genphy_read_status,
> +	.suspend	= &genphy_suspend,
> +	.resume		= &genphy_resume,

With the exception of config_init(), suspend and resume, everything else
is already the default when you don't provide a callback. To echo what
Andrew already wrote, if the purpose is just to show a nice name and do
nothing else, consider using the Generic PHY driver (default).
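
As an illustration of what "already the default" means, here is a small
userspace sketch of optional callbacks falling back to a generic
implementation. This is a toy model and not the phylib code: struct
toy_phy_driver, struct toy_phy and the toy_* functions are invented for
the example, and only two of the callbacks are modelled.

```c
#include <assert.h>
#include <stddef.h>

struct toy_phy;

/* Hypothetical driver description: every callback is optional. */
struct toy_phy_driver {
	const char *name;
	int (*config_aneg)(struct toy_phy *phy);
	int (*read_status)(struct toy_phy *phy);
};

struct toy_phy {
	const struct toy_phy_driver *drv;
	int generic_calls;	/* counts how often a generic helper ran */
};

static int toy_generic_config_aneg(struct toy_phy *phy)
{
	phy->generic_calls++;
	return 0;
}

static int toy_generic_read_status(struct toy_phy *phy)
{
	phy->generic_calls++;
	return 0;
}

/* Core-side wrappers: use the driver's callback if set, else the generic one. */
static int toy_config_aneg(struct toy_phy *phy)
{
	assert(phy && phy->drv);
	if (phy->drv->config_aneg)
		return phy->drv->config_aneg(phy);
	return toy_generic_config_aneg(phy);
}

static int toy_read_status(struct toy_phy *phy)
{
	assert(phy && phy->drv);
	if (phy->drv->read_status)
		return phy->drv->read_status(phy);
	return toy_generic_read_status(phy);
}
```

With that pattern, a driver entry that sets only a name behaves the same
as one that spells out the generic helpers, which is why listing them
explicitly adds nothing.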
-- 
Florian


* Re: [PATCH net-next 4/8] dt-bindings: net: add DT bindings for Microsemi Ocelot Switch
From: Andrew Lunn @ 2018-03-23 21:01 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Florian Fainelli, netdev, devicetree,
	linux-kernel, linux-mips, Rob Herring
In-Reply-To: <20180323201117.8416-5-alexandre.belloni@bootlin.com>

On Fri, Mar 23, 2018 at 09:11:13PM +0100, Alexandre Belloni wrote:
> DT bindings for the Ethernet switch found on Microsemi Ocelot platforms.
> 
> Cc: Rob Herring <robh+dt@kernel.org>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  .../devicetree/bindings/net/mscc-ocelot.txt        | 62 ++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mscc-ocelot.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
> new file mode 100644
> index 000000000000..ee092a85b5a0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
> @@ -0,0 +1,62 @@
> +Microsemi Ocelot network Switch
> +===============================
> +
> +The Microsemi Ocelot network switch can be found on Microsemi SoCs (VSC7513,
> +VSC7514)
> +
> +Required properties:
> +- compatible: Should be "mscc,ocelot-switch"
> +- reg: Must contain an (offset, length) pair of the register set for each
> +  entry in reg-names.
> +- reg-names: Must include the following entries:
> +  - "sys"
> +  - "rew"
> +  - "qs"
> +  - "hsio"
> +  - "qsys"
> +  - "ana"
> +  - "portX" with X from 0 to the number of last port index available on that
> +    switch
> +- interrupts: Should contain the switch interrupts for frame extraction and
> +  frame injection
> +- interrupt-names: should contain the interrupt names: "xtr", "inj"
> +
> +Example:
> +
> +	switch@1010000 {
> +		#address-cells = <1>;
> +		#size-cells = <0>;
> +		compatible = "mscc,ocelot-switch";
> +		reg = <0x1010000 0x10000>,
> +		      <0x1030000 0x10000>,
> +		      <0x1080000 0x100>,
> +		      <0x10d0000 0x10000>,
> +		      <0x11e0000 0x100>,
> +		      <0x11f0000 0x100>,
> +		      <0x1200000 0x100>,
> +		      <0x1210000 0x100>,
> +		      <0x1220000 0x100>,
> +		      <0x1230000 0x100>,
> +		      <0x1240000 0x100>,
> +		      <0x1250000 0x100>,
> +		      <0x1260000 0x100>,
> +		      <0x1270000 0x100>,
> +		      <0x1280000 0x100>,
> +		      <0x1800000 0x80000>,
> +		      <0x1880000 0x10000>;
> +		reg-names = "sys", "rew", "qs", "hsio", "port0",
> +			    "port1", "port2", "port3", "port4", "port5",
> +			    "port6", "port7", "port8", "port9", "port10",
> +			    "qsys", "ana";
> +		interrupts = <21 22>;
> +		interrupt-names = "xtr", "inj";
> +
> +		port0: port@0 {
> +			reg = <0>;
> +			phy-handle = <&phy0>;
> +		};
> +		port1: port@1 {
> +			reg = <1>;
> +			phy-handle = <&phy1>;
> +		};

Hi Alexandre

Is there anything else in the switch which in the future might need
child nodes? At the moment, you can do
for_each_available_child_of_node() and walk the ports. But if you do
need to add some other sorts of children in the future it gets
messy. With DSA, we have a ports {} container.
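
A rough userspace sketch of why the container helps, using a toy tree
instead of the real OF API (struct toy_node and the toy_* helpers are
made up for this example): once the ports live under their own node,
the walk no longer has to tell ports apart from any other child the
switch node gains later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a device tree node: a name and an array of children. */
struct toy_node {
	const char *name;
	const struct toy_node *children;
	size_t num_children;
};

/* Return the direct child called @name, or NULL if there is none. */
static const struct toy_node *
toy_get_child_by_name(const struct toy_node *parent, const char *name)
{
	assert(parent && name);
	for (size_t i = 0; i < parent->num_children; i++)
		if (strcmp(parent->children[i].name, name) == 0)
			return &parent->children[i];
	return NULL;
}

/* Every child of the "ports" container is a port; siblings are ignored. */
static size_t toy_count_ports(const struct toy_node *sw)
{
	const struct toy_node *ports = toy_get_child_by_name(sw, "ports");

	return ports ? ports->num_children : 0;
}
```

Without the container, the equivalent loop over the switch node would
need a name or compatible check on every child to skip the non-port
ones.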

       Andrew


* Re: [PATCH net-next 3/8] net: mscc: Add MDIO driver
From: Andrew Lunn @ 2018-03-23 20:49 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Florian Fainelli, netdev, devicetree,
	linux-kernel, linux-mips
In-Reply-To: <20180323201117.8416-4-alexandre.belloni@bootlin.com>

On Fri, Mar 23, 2018 at 09:11:12PM +0100, Alexandre Belloni wrote:
> Add a driver for the Microsemi MII Management controller (MIIM) found on
> Microsemi SoCs.
> On Ocelot, there are two controllers, one is connected to the internal
> PHYs, the other one can communicate with external PHYs.

Hi Alexandre

This looks to be standalone. Such drivers we try to put in
drivers/net/phy.

> +static int mscc_miim_read(struct mii_bus *bus, int mii_id, int regnum)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	u32 val;
> +	int ret;
> +
> +	mutex_lock(&miim->lock);

What are you locking against here?

And you don't appear to initialize the mutex anywhere.

> +static int mscc_miim_reset(struct mii_bus *bus)
> +{
> +	struct mscc_miim_dev *miim = bus->priv;
> +	int i;
> +
> +	if (miim->phy_regs) {
> +		writel(0, miim->phy_regs + MSCC_PHY_REG_PHY_CFG);
> +		writel(0x1ff, miim->phy_regs + MSCC_PHY_REG_PHY_CFG);
> +		mdelay(500);
> +	}
> +
> +	for (i = 0; i < PHY_MAX_ADDR; i++) {
> +		if (mscc_miim_read(bus, i, MII_PHYSID1) < 0)
> +			bus->phy_mask |= BIT(i);
> +	}

Why do this? Especially so for the external bus, where the PHYs might
have a GPIO reset line, and won't respond until the gpio is
released. The core code does that just before it scans the bus, or
just before it scans the particular address on the bus, depending on
the scope of the GPIO.

Otherwise, pretty good :-)

	   Andrew


* [PATCH v5 2/2] net: ethernet: nixge: Add support for National Instruments XGE netdev
From: Moritz Fischer @ 2018-03-23 20:41 UTC (permalink / raw)
  To: davem
  Cc: linux-kernel, devicetree, netdev, robh+dt, andrew, f.fainelli,
	Moritz Fischer
In-Reply-To: <20180323204128.29783-1-mdf@kernel.org>

Add support for the National Instruments XGE 1/10G network device.

It uses the EEPROM on the board via NVMEM.

Signed-off-by: Moritz Fischer <mdf@kernel.org>
---

Changes from v4:
- Worked on consistency for constants
- Removed unused constants
- Removed unused includes

Changes from v3:
- Added NIXGE prefix to MDIO constants
- Removed NIXGE_MAX_PHY_ADDR (unused)
- Consistency for NIXGE_MDIO_CXX_READ constants
- Use devm_mdiobus_alloc()
- Fixed order of netdev_unregister() and mdio_bus_unregister()
- Addressed feedback w.r.t. disconnecting the PHY
- Removed now superfluous max_frm_size member
- Fix SPDX vs module license to be 'GPL v2' 
  SPDX-License-Identifier: GPL-2.0

Changes from v2:
- Implement recv side NAPI
- Improved error handling
- Implemented C45 writes
- Added ethtool callbacks & blink functionality
- Improved nixge_ctrl_poll_timeout() macro
- Removed dev_dbg() for mdio accesses
- Added businfo to ethtool drvinfo

Changes from v1:
- Added dependency on ARCH_ZYNQ (Kbuild)
- Removed unused variables
- Use of_phy_connect as suggested
- Removed masking of (un)supported modes
- Added #define for some constants
- Removed empty pm functions
- Reworked mac_address handling
- Made nixge_mdio_*() static (sparse)
- Removed driver version
- Addressed timeout loop
- Addressed return values on timeout

---
 drivers/net/ethernet/Kconfig     |    1 +
 drivers/net/ethernet/Makefile    |    1 +
 drivers/net/ethernet/ni/Kconfig  |   27 +
 drivers/net/ethernet/ni/Makefile |    1 +
 drivers/net/ethernet/ni/nixge.c  | 1309 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 1339 insertions(+)
 create mode 100644 drivers/net/ethernet/ni/Kconfig
 create mode 100644 drivers/net/ethernet/ni/Makefile
 create mode 100644 drivers/net/ethernet/ni/nixge.c

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 074d760a568b..603a5704dab8 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -128,6 +128,7 @@ config FEALNX
 
 source "drivers/net/ethernet/natsemi/Kconfig"
 source "drivers/net/ethernet/netronome/Kconfig"
+source "drivers/net/ethernet/ni/Kconfig"
 source "drivers/net/ethernet/8390/Kconfig"
 
 config NET_NETX
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 135dae67d671..2bfd2eea50bf 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_NET_VENDOR_MYRI) += myricom/
 obj-$(CONFIG_FEALNX) += fealnx.o
 obj-$(CONFIG_NET_VENDOR_NATSEMI) += natsemi/
 obj-$(CONFIG_NET_VENDOR_NETRONOME) += netronome/
+obj-$(CONFIG_NET_VENDOR_NI) += ni/
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_NET_VENDOR_NUVOTON) += nuvoton/
 obj-$(CONFIG_NET_VENDOR_NVIDIA) += nvidia/
diff --git a/drivers/net/ethernet/ni/Kconfig b/drivers/net/ethernet/ni/Kconfig
new file mode 100644
index 000000000000..cd30f7de16de
--- /dev/null
+++ b/drivers/net/ethernet/ni/Kconfig
@@ -0,0 +1,27 @@
+#
+# National Instruments network device configuration
+#
+
+config NET_VENDOR_NI
+	bool "National Instruments Devices"
+	default y
+	---help---
+	  If you have a network (Ethernet) device belonging to this class, say Y.
+
+	  Note that the answer to this question doesn't directly affect the
+	  kernel: saying N will just cause the configurator to skip all
+	  the questions about National Instruments devices.
+	  If you say Y, you will be asked for your specific device in the
+	  following questions.
+
+if NET_VENDOR_NI
+
+config NI_XGE_MANAGEMENT_ENET
+	tristate "National Instruments XGE management enet support"
+	depends on ARCH_ZYNQ
+	select PHYLIB
+	---help---
+	  Simple LAN device for debug or management purposes. Can
+	  support either 10G or 1G PHYs via SFP+ ports.
+
+endif
diff --git a/drivers/net/ethernet/ni/Makefile b/drivers/net/ethernet/ni/Makefile
new file mode 100644
index 000000000000..99c664651c51
--- /dev/null
+++ b/drivers/net/ethernet/ni/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_NI_XGE_MANAGEMENT_ENET) += nixge.o
diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
new file mode 100644
index 000000000000..c71e59b10340
--- /dev/null
+++ b/drivers/net/ethernet/ni/nixge.c
@@ -0,0 +1,1309 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2016-2017, National Instruments Corp.
+ *
+ * Author: Moritz Fischer <mdf@kernel.org>
+ */
+
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/of_address.h>
+#include <linux/of_mdio.h>
+#include <linux/of_net.h>
+#include <linux/of_platform.h>
+#include <linux/of_irq.h>
+#include <linux/skbuff.h>
+#include <linux/phy.h>
+#include <linux/mii.h>
+#include <linux/nvmem-consumer.h>
+#include <linux/ethtool.h>
+#include <linux/iopoll.h>
+
+#define TX_BD_NUM		64
+#define RX_BD_NUM		128
+
+/* Axi DMA Register definitions */
+#define XAXIDMA_TX_CR_OFFSET	0x00 /* Channel control */
+#define XAXIDMA_TX_SR_OFFSET	0x04 /* Status */
+#define XAXIDMA_TX_CDESC_OFFSET	0x08 /* Current descriptor pointer */
+#define XAXIDMA_TX_TDESC_OFFSET	0x10 /* Tail descriptor pointer */
+
+#define XAXIDMA_RX_CR_OFFSET	0x30 /* Channel control */
+#define XAXIDMA_RX_SR_OFFSET	0x34 /* Status */
+#define XAXIDMA_RX_CDESC_OFFSET	0x38 /* Current descriptor pointer */
+#define XAXIDMA_RX_TDESC_OFFSET	0x40 /* Tail descriptor pointer */
+
+#define XAXIDMA_CR_RUNSTOP_MASK	0x1 /* Start/stop DMA channel */
+#define XAXIDMA_CR_RESET_MASK	0x4 /* Reset DMA engine */
+
+#define XAXIDMA_BD_CTRL_LENGTH_MASK	0x007FFFFF /* Requested len */
+#define XAXIDMA_BD_CTRL_TXSOF_MASK	0x08000000 /* First tx packet */
+#define XAXIDMA_BD_CTRL_TXEOF_MASK	0x04000000 /* Last tx packet */
+#define XAXIDMA_BD_CTRL_ALL_MASK	0x0C000000 /* All control bits */
+
+#define XAXIDMA_DELAY_MASK		0xFF000000 /* Delay timeout counter */
+#define XAXIDMA_COALESCE_MASK		0x00FF0000 /* Coalesce counter */
+
+#define XAXIDMA_DELAY_SHIFT		24
+#define XAXIDMA_COALESCE_SHIFT		16
+
+#define XAXIDMA_IRQ_IOC_MASK		0x00001000 /* Completion intr */
+#define XAXIDMA_IRQ_DELAY_MASK		0x00002000 /* Delay interrupt */
+#define XAXIDMA_IRQ_ERROR_MASK		0x00004000 /* Error interrupt */
+#define XAXIDMA_IRQ_ALL_MASK		0x00007000 /* All interrupts */
+
+/* Default TX/RX Threshold and waitbound values for SGDMA mode */
+#define XAXIDMA_DFT_TX_THRESHOLD	24
+#define XAXIDMA_DFT_TX_WAITBOUND	254
+#define XAXIDMA_DFT_RX_THRESHOLD	24
+#define XAXIDMA_DFT_RX_WAITBOUND	254
+
+#define XAXIDMA_BD_STS_ACTUAL_LEN_MASK	0x007FFFFF /* Actual len */
+#define XAXIDMA_BD_STS_COMPLETE_MASK	0x80000000 /* Completed */
+#define XAXIDMA_BD_STS_DEC_ERR_MASK	0x40000000 /* Decode error */
+#define XAXIDMA_BD_STS_SLV_ERR_MASK	0x20000000 /* Slave error */
+#define XAXIDMA_BD_STS_INT_ERR_MASK	0x10000000 /* Internal err */
+#define XAXIDMA_BD_STS_ALL_ERR_MASK	0x70000000 /* All errors */
+#define XAXIDMA_BD_STS_RXSOF_MASK	0x08000000 /* First rx pkt */
+#define XAXIDMA_BD_STS_RXEOF_MASK	0x04000000 /* Last rx pkt */
+#define XAXIDMA_BD_STS_ALL_MASK		0xFC000000 /* All status bits */
+
+#define NIXGE_REG_CTRL_OFFSET	0x4000
+#define NIXGE_REG_INFO		0x00
+#define NIXGE_REG_MAC_CTL	0x04
+#define NIXGE_REG_PHY_CTL	0x08
+#define NIXGE_REG_LED_CTL	0x0c
+#define NIXGE_REG_MDIO_DATA	0x10
+#define NIXGE_REG_MDIO_ADDR	0x14
+#define NIXGE_REG_MDIO_OP	0x18
+#define NIXGE_REG_MDIO_CTRL	0x1c
+
+#define NIXGE_ID_LED_CTL_EN	BIT(0)
+#define NIXGE_ID_LED_CTL_VAL	BIT(1)
+
+#define NIXGE_MDIO_CLAUSE45	BIT(12)
+#define NIXGE_MDIO_CLAUSE22	0
+#define NIXGE_MDIO_OP(n)     (((n) & 0x3) << 10)
+#define NIXGE_MDIO_OP_ADDRESS	0
+#define NIXGE_MDIO_C45_WRITE	BIT(0)
+#define NIXGE_MDIO_C45_READ	(BIT(1) | BIT(0))
+#define NIXGE_MDIO_C22_WRITE	BIT(0)
+#define NIXGE_MDIO_C22_READ	BIT(1)
+#define NIXGE_MDIO_ADDR(n)   (((n) & 0x1f) << 5)
+#define NIXGE_MDIO_MMD(n)    (((n) & 0x1f) << 0)
+
+#define NIXGE_REG_MAC_LSB	0x1000
+#define NIXGE_REG_MAC_MSB	0x1004
+
+/* Packet size info */
+#define NIXGE_HDR_SIZE		14 /* Size of Ethernet header */
+#define NIXGE_TRL_SIZE		4 /* Size of Ethernet trailer (FCS) */
+#define NIXGE_MTU		1500 /* Max MTU of an Ethernet frame */
+#define NIXGE_JUMBO_MTU		9000 /* Max MTU of a jumbo Eth. frame */
+
+#define NIXGE_MAX_FRAME_SIZE	 (NIXGE_MTU + NIXGE_HDR_SIZE + NIXGE_TRL_SIZE)
+#define NIXGE_MAX_JUMBO_FRAME_SIZE \
+	(NIXGE_JUMBO_MTU + NIXGE_HDR_SIZE + NIXGE_TRL_SIZE)
+
+struct nixge_hw_dma_bd {
+	u32 next;
+	u32 reserved1;
+	u32 phys;
+	u32 reserved2;
+	u32 reserved3;
+	u32 reserved4;
+	u32 cntrl;
+	u32 status;
+	u32 app0;
+	u32 app1;
+	u32 app2;
+	u32 app3;
+	u32 app4;
+	u32 sw_id_offset;
+	u32 reserved5;
+	u32 reserved6;
+};
+
+struct nixge_tx_skb {
+	struct sk_buff *skb;
+	dma_addr_t mapping;
+	size_t size;
+	bool mapped_as_page;
+};
+
+struct nixge_priv {
+	struct net_device *ndev;
+	struct napi_struct napi;
+	struct device *dev;
+
+	/* Connection to PHY device */
+	struct device_node *phy_node;
+	phy_interface_t		phy_mode;
+
+	int link;
+	unsigned int speed;
+	unsigned int duplex;
+
+	/* MDIO bus data */
+	struct mii_bus *mii_bus;	/* MII bus reference */
+
+	/* IO registers, dma functions and IRQs */
+	void __iomem *ctrl_regs;
+	void __iomem *dma_regs;
+
+	struct tasklet_struct dma_err_tasklet;
+
+	int tx_irq;
+	int rx_irq;
+	u32 last_link;
+
+	/* Buffer descriptors */
+	struct nixge_hw_dma_bd *tx_bd_v;
+	struct nixge_tx_skb *tx_skb;
+	dma_addr_t tx_bd_p;
+
+	struct nixge_hw_dma_bd *rx_bd_v;
+	dma_addr_t rx_bd_p;
+	u32 tx_bd_ci;
+	u32 tx_bd_tail;
+	u32 rx_bd_ci;
+
+	u32 coalesce_count_rx;
+	u32 coalesce_count_tx;
+};
+
+static void nixge_dma_write_reg(struct nixge_priv *priv, off_t offset, u32 val)
+{
+	writel(val, priv->dma_regs + offset);
+}
+
+static u32 nixge_dma_read_reg(const struct nixge_priv *priv, off_t offset)
+{
+	return readl(priv->dma_regs + offset);
+}
+
+static void nixge_ctrl_write_reg(struct nixge_priv *priv, off_t offset, u32 val)
+{
+	writel(val, priv->ctrl_regs + offset);
+}
+
+static u32 nixge_ctrl_read_reg(struct nixge_priv *priv, off_t offset)
+{
+	return readl(priv->ctrl_regs + offset);
+}
+
+#define nixge_ctrl_poll_timeout(priv, addr, val, cond, sleep_us, timeout_us) \
+	readl_poll_timeout((priv)->ctrl_regs + (addr), (val), (cond), \
+			   (sleep_us), (timeout_us))
+
+#define nixge_dma_poll_timeout(priv, addr, val, cond, sleep_us, timeout_us) \
+	readl_poll_timeout((priv)->dma_regs + (addr), (val), (cond), \
+			   (sleep_us), (timeout_us))
+
+static void nixge_hw_dma_bd_release(struct net_device *ndev)
+{
+	int i;
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	for (i = 0; i < RX_BD_NUM; i++) {
+		dma_unmap_single(ndev->dev.parent, priv->rx_bd_v[i].phys,
+				 NIXGE_MAX_JUMBO_FRAME_SIZE, DMA_FROM_DEVICE);
+		dev_kfree_skb((struct sk_buff *)
+			      (priv->rx_bd_v[i].sw_id_offset));
+	}
+
+	if (priv->rx_bd_v)
+		dma_free_coherent(ndev->dev.parent,
+				  sizeof(*priv->rx_bd_v) * RX_BD_NUM,
+				  priv->rx_bd_v,
+				  priv->rx_bd_p);
+
+	if (priv->tx_skb)
+		devm_kfree(ndev->dev.parent, priv->tx_skb);
+
+	if (priv->tx_bd_v)
+		dma_free_coherent(ndev->dev.parent,
+				  sizeof(*priv->tx_bd_v) * TX_BD_NUM,
+				  priv->tx_bd_v,
+				  priv->tx_bd_p);
+}
+
+static int nixge_hw_dma_bd_init(struct net_device *ndev)
+{
+	u32 cr;
+	int i;
+	struct sk_buff *skb;
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	/* Reset the indexes which are used for accessing the BDs */
+	priv->tx_bd_ci = 0;
+	priv->tx_bd_tail = 0;
+	priv->rx_bd_ci = 0;
+
+	/* Allocate the Tx and Rx buffer descriptors. */
+	priv->tx_bd_v = dma_zalloc_coherent(ndev->dev.parent,
+					    sizeof(*priv->tx_bd_v) * TX_BD_NUM,
+					    &priv->tx_bd_p, GFP_KERNEL);
+	if (!priv->tx_bd_v)
+		goto out;
+
+	priv->tx_skb = devm_kzalloc(ndev->dev.parent,
+				    sizeof(*priv->tx_skb) *
+				    TX_BD_NUM,
+				    GFP_KERNEL);
+	if (!priv->tx_skb)
+		goto out;
+
+	priv->rx_bd_v = dma_zalloc_coherent(ndev->dev.parent,
+					    sizeof(*priv->rx_bd_v) * RX_BD_NUM,
+					    &priv->rx_bd_p, GFP_KERNEL);
+	if (!priv->rx_bd_v)
+		goto out;
+
+	for (i = 0; i < TX_BD_NUM; i++) {
+		priv->tx_bd_v[i].next = priv->tx_bd_p +
+				      sizeof(*priv->tx_bd_v) *
+				      ((i + 1) % TX_BD_NUM);
+	}
+
+	for (i = 0; i < RX_BD_NUM; i++) {
+		priv->rx_bd_v[i].next = priv->rx_bd_p +
+				      sizeof(*priv->rx_bd_v) *
+				      ((i + 1) % RX_BD_NUM);
+
+		skb = netdev_alloc_skb_ip_align(ndev,
+						NIXGE_MAX_JUMBO_FRAME_SIZE);
+		if (!skb)
+			goto out;
+
+		priv->rx_bd_v[i].sw_id_offset = (u32)skb;
+		priv->rx_bd_v[i].phys =
+			dma_map_single(ndev->dev.parent,
+				       skb->data,
+				       NIXGE_MAX_JUMBO_FRAME_SIZE,
+				       DMA_FROM_DEVICE);
+		priv->rx_bd_v[i].cntrl = NIXGE_MAX_JUMBO_FRAME_SIZE;
+	}
+
+	/* Start updating the Rx channel control register */
+	cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+	/* Update the interrupt coalesce count */
+	cr = ((cr & ~XAXIDMA_COALESCE_MASK) |
+	      ((priv->coalesce_count_rx) << XAXIDMA_COALESCE_SHIFT));
+	/* Update the delay timer count */
+	cr = ((cr & ~XAXIDMA_DELAY_MASK) |
+	      (XAXIDMA_DFT_RX_WAITBOUND << XAXIDMA_DELAY_SHIFT));
+	/* Enable coalesce, delay timer and error interrupts */
+	cr |= XAXIDMA_IRQ_ALL_MASK;
+	/* Write to the Rx channel control register */
+	nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET, cr);
+
+	/* Start updating the Tx channel control register */
+	cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+	/* Update the interrupt coalesce count */
+	cr = (((cr & ~XAXIDMA_COALESCE_MASK)) |
+	      ((priv->coalesce_count_tx) << XAXIDMA_COALESCE_SHIFT));
+	/* Update the delay timer count */
+	cr = (((cr & ~XAXIDMA_DELAY_MASK)) |
+	      (XAXIDMA_DFT_TX_WAITBOUND << XAXIDMA_DELAY_SHIFT));
+	/* Enable coalesce, delay timer and error interrupts */
+	cr |= XAXIDMA_IRQ_ALL_MASK;
+	/* Write to the Tx channel control register */
+	nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET, cr);
+
+	/* Populate the tail pointer and bring the Rx Axi DMA engine out of
+	 * halted state. This will make the Rx side ready for reception.
+	 */
+	nixge_dma_write_reg(priv, XAXIDMA_RX_CDESC_OFFSET, priv->rx_bd_p);
+	cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+	nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET,
+			    cr | XAXIDMA_CR_RUNSTOP_MASK);
+	nixge_dma_write_reg(priv, XAXIDMA_RX_TDESC_OFFSET, priv->rx_bd_p +
+			    (sizeof(*priv->rx_bd_v) * (RX_BD_NUM - 1)));
+
+	/* Write to the RS (Run-stop) bit in the Tx channel control register.
+	 * Tx channel is now ready to run, but it will only start
+	 * transmitting once the tail pointer register is written.
+	 */
+	nixge_dma_write_reg(priv, XAXIDMA_TX_CDESC_OFFSET, priv->tx_bd_p);
+	cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+	nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET,
+			    cr | XAXIDMA_CR_RUNSTOP_MASK);
+
+	return 0;
+out:
+	nixge_hw_dma_bd_release(ndev);
+	return -ENOMEM;
+}
+
+static void __nixge_device_reset(struct nixge_priv *priv, off_t offset)
+{
+	u32 status;
+	int err;
+	/* Reset Axi DMA. This would reset NIXGE Ethernet core as well.
+	 * The reset process of Axi DMA takes a while to complete as all
+	 * pending commands/transfers will be flushed or completed during
+	 * this reset process.
+	 */
+	nixge_dma_write_reg(priv, offset, XAXIDMA_CR_RESET_MASK);
+	err = nixge_dma_poll_timeout(priv, offset, status,
+				     !(status & XAXIDMA_CR_RESET_MASK), 10,
+				     1000);
+	if (err)
+		netdev_err(priv->ndev, "%s: DMA reset timeout!\n", __func__);
+}
+
+static void nixge_device_reset(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	__nixge_device_reset(priv, XAXIDMA_TX_CR_OFFSET);
+	__nixge_device_reset(priv, XAXIDMA_RX_CR_OFFSET);
+
+	if (nixge_hw_dma_bd_init(ndev))
+		netdev_err(ndev, "%s: descriptor allocation failed\n",
+			   __func__);
+
+	netif_trans_update(ndev);
+}
+
+static void nixge_handle_link_change(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	struct phy_device *phydev = ndev->phydev;
+
+	if (phydev->link != priv->link || phydev->speed != priv->speed ||
+	    phydev->duplex != priv->duplex) {
+		priv->link = phydev->link;
+		priv->speed = phydev->speed;
+		priv->duplex = phydev->duplex;
+		phy_print_status(phydev);
+	}
+}
+
+static void nixge_tx_skb_unmap(struct nixge_priv *priv,
+			       struct nixge_tx_skb *tx_skb)
+{
+	if (tx_skb->mapping) {
+		if (tx_skb->mapped_as_page)
+			dma_unmap_page(priv->ndev->dev.parent, tx_skb->mapping,
+				       tx_skb->size, DMA_TO_DEVICE);
+		else
+			dma_unmap_single(priv->ndev->dev.parent,
+					 tx_skb->mapping,
+					 tx_skb->size, DMA_TO_DEVICE);
+		tx_skb->mapping = 0;
+	}
+
+	if (tx_skb->skb) {
+		dev_kfree_skb_any(tx_skb->skb);
+		tx_skb->skb = NULL;
+	}
+}
+
+static void nixge_start_xmit_done(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	struct nixge_hw_dma_bd *cur_p;
+	struct nixge_tx_skb *tx_skb;
+	unsigned int status = 0;
+	u32 packets = 0;
+	u32 size = 0;
+
+	cur_p = &priv->tx_bd_v[priv->tx_bd_ci];
+	tx_skb = &priv->tx_skb[priv->tx_bd_ci];
+
+	status = cur_p->status;
+
+	while (status & XAXIDMA_BD_STS_COMPLETE_MASK) {
+		nixge_tx_skb_unmap(priv, tx_skb);
+		cur_p->status = 0;
+
+		size += status & XAXIDMA_BD_STS_ACTUAL_LEN_MASK;
+		packets++;
+
+		++priv->tx_bd_ci;
+		priv->tx_bd_ci %= TX_BD_NUM;
+		cur_p = &priv->tx_bd_v[priv->tx_bd_ci];
+		tx_skb = &priv->tx_skb[priv->tx_bd_ci];
+		status = cur_p->status;
+	}
+
+	ndev->stats.tx_packets += packets;
+	ndev->stats.tx_bytes += size;
+
+	if (packets)
+		netif_wake_queue(ndev);
+}
+
+static int nixge_check_tx_bd_space(struct nixge_priv *priv,
+				   int num_frag)
+{
+	struct nixge_hw_dma_bd *cur_p;
+
+	cur_p = &priv->tx_bd_v[(priv->tx_bd_tail + num_frag) % TX_BD_NUM];
+	if (cur_p->status & XAXIDMA_BD_STS_ALL_MASK)
+		return NETDEV_TX_BUSY;
+	return 0;
+}
+
+static int nixge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	struct nixge_hw_dma_bd *cur_p;
+	struct nixge_tx_skb *tx_skb;
+	dma_addr_t tail_p;
+	skb_frag_t *frag;
+	u32 num_frag;
+	u32 ii;
+
+	num_frag = skb_shinfo(skb)->nr_frags;
+	cur_p = &priv->tx_bd_v[priv->tx_bd_tail];
+	tx_skb = &priv->tx_skb[priv->tx_bd_tail];
+
+	if (nixge_check_tx_bd_space(priv, num_frag)) {
+		if (!netif_queue_stopped(ndev))
+			netif_stop_queue(ndev);
+		return NETDEV_TX_OK;
+	}
+
+	cur_p->phys = dma_map_single(ndev->dev.parent, skb->data,
+				     skb_headlen(skb), DMA_TO_DEVICE);
+	if (dma_mapping_error(ndev->dev.parent, cur_p->phys))
+		goto drop;
+
+	cur_p->cntrl = skb_headlen(skb) | XAXIDMA_BD_CTRL_TXSOF_MASK;
+
+	tx_skb->skb = NULL;
+	tx_skb->mapping = cur_p->phys;
+	tx_skb->size = skb_headlen(skb);
+	tx_skb->mapped_as_page = false;
+
+	for (ii = 0; ii < num_frag; ii++) {
+		++priv->tx_bd_tail;
+		priv->tx_bd_tail %= TX_BD_NUM;
+		cur_p = &priv->tx_bd_v[priv->tx_bd_tail];
+		tx_skb = &priv->tx_skb[priv->tx_bd_tail];
+		frag = &skb_shinfo(skb)->frags[ii];
+
+		cur_p->phys = skb_frag_dma_map(ndev->dev.parent, frag, 0,
+					       skb_frag_size(frag),
+					       DMA_TO_DEVICE);
+		if (dma_mapping_error(ndev->dev.parent, cur_p->phys))
+			goto frag_err;
+
+		cur_p->cntrl = skb_frag_size(frag);
+
+		tx_skb->skb = NULL;
+		tx_skb->mapping = cur_p->phys;
+		tx_skb->size = skb_frag_size(frag);
+		tx_skb->mapped_as_page = true;
+	}
+
+	/* last buffer of the frame */
+	tx_skb->skb = skb;
+
+	cur_p->cntrl |= XAXIDMA_BD_CTRL_TXEOF_MASK;
+	cur_p->app4 = (unsigned long)skb;
+
+	tail_p = priv->tx_bd_p + sizeof(*priv->tx_bd_v) * priv->tx_bd_tail;
+	/* Start the transfer */
+	nixge_dma_write_reg(priv, XAXIDMA_TX_TDESC_OFFSET, tail_p);
+	++priv->tx_bd_tail;
+	priv->tx_bd_tail %= TX_BD_NUM;
+
+	return NETDEV_TX_OK;
+frag_err:
+	for (; ii > 0; ii--) {
+		if (priv->tx_bd_tail)
+			priv->tx_bd_tail--;
+		else
+			priv->tx_bd_tail = TX_BD_NUM - 1;
+
+		tx_skb = &priv->tx_skb[priv->tx_bd_tail];
+		nixge_tx_skb_unmap(priv, tx_skb);
+
+		cur_p = &priv->tx_bd_v[priv->tx_bd_tail];
+		cur_p->status = 0;
+	}
+	dma_unmap_single(priv->ndev->dev.parent,
+			 tx_skb->mapping,
+			 tx_skb->size, DMA_TO_DEVICE);
+drop:
+	ndev->stats.tx_dropped++;
+	return NETDEV_TX_OK;
+}
+
+static int nixge_recv(struct net_device *ndev, int budget)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	struct sk_buff *skb, *new_skb;
+	struct nixge_hw_dma_bd *cur_p;
+	dma_addr_t tail_p = 0;
+	u32 packets = 0;
+	u32 length = 0;
+	u32 size = 0;
+
+	cur_p = &priv->rx_bd_v[priv->rx_bd_ci];
+
+	while ((cur_p->status & XAXIDMA_BD_STS_COMPLETE_MASK &&
+		budget > packets)) {
+		tail_p = priv->rx_bd_p + sizeof(*priv->rx_bd_v) *
+			 priv->rx_bd_ci;
+
+		skb = (struct sk_buff *)(cur_p->sw_id_offset);
+
+		length = cur_p->status & XAXIDMA_BD_STS_ACTUAL_LEN_MASK;
+		if (length > NIXGE_MAX_JUMBO_FRAME_SIZE)
+			length = NIXGE_MAX_JUMBO_FRAME_SIZE;
+
+		dma_unmap_single(ndev->dev.parent, cur_p->phys,
+				 NIXGE_MAX_JUMBO_FRAME_SIZE,
+				 DMA_FROM_DEVICE);
+
+		skb_put(skb, length);
+
+		skb->protocol = eth_type_trans(skb, ndev);
+		skb_checksum_none_assert(skb);
+
+		/* For now mark them as CHECKSUM_NONE since
+		 * we don't have offload capabilities
+		 */
+		skb->ip_summed = CHECKSUM_NONE;
+
+		napi_gro_receive(&priv->napi, skb);
+
+		size += length;
+		packets++;
+
+		new_skb = netdev_alloc_skb_ip_align(ndev,
+						    NIXGE_MAX_JUMBO_FRAME_SIZE);
+		if (!new_skb)
+			return packets;
+
+		cur_p->phys = dma_map_single(ndev->dev.parent, new_skb->data,
+					     NIXGE_MAX_JUMBO_FRAME_SIZE,
+					     DMA_FROM_DEVICE);
+		if (dma_mapping_error(ndev->dev.parent, cur_p->phys)) {
+			/* FIXME: bail out and clean up */
+			netdev_err(ndev, "Failed to map ...\n");
+		}
+		cur_p->cntrl = NIXGE_MAX_JUMBO_FRAME_SIZE;
+		cur_p->status = 0;
+		cur_p->sw_id_offset = (u32)new_skb;
+
+		++priv->rx_bd_ci;
+		priv->rx_bd_ci %= RX_BD_NUM;
+		cur_p = &priv->rx_bd_v[priv->rx_bd_ci];
+	}
+
+	ndev->stats.rx_packets += packets;
+	ndev->stats.rx_bytes += size;
+
+	if (tail_p)
+		nixge_dma_write_reg(priv, XAXIDMA_RX_TDESC_OFFSET, tail_p);
+
+	return packets;
+}
+
+static int nixge_poll(struct napi_struct *napi, int budget)
+{
+	struct nixge_priv *priv = container_of(napi, struct nixge_priv, napi);
+	int work_done;
+	u32 status, cr;
+
+	work_done = 0;
+
+	work_done = nixge_recv(priv->ndev, budget);
+	if (work_done < budget) {
+		napi_complete_done(napi, work_done);
+		status = nixge_dma_read_reg(priv, XAXIDMA_RX_SR_OFFSET);
+
+		if (status & (XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK)) {
+			/* If there's more, reschedule, but clear */
+			nixge_dma_write_reg(priv, XAXIDMA_RX_SR_OFFSET, status);
+			napi_reschedule(napi);
+		} else {
+			/* if not, turn on RX IRQs again ... */
+			cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+			cr |= (XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK);
+			nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET, cr);
+		}
+	}
+
+	return work_done;
+}
+
+static irqreturn_t nixge_tx_irq(int irq, void *_ndev)
+{
+	struct nixge_priv *priv = netdev_priv(_ndev);
+	struct net_device *ndev = _ndev;
+	unsigned int status;
+	u32 cr;
+
+	status = nixge_dma_read_reg(priv, XAXIDMA_TX_SR_OFFSET);
+	if (status & (XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK)) {
+		nixge_dma_write_reg(priv, XAXIDMA_TX_SR_OFFSET, status);
+		nixge_start_xmit_done(priv->ndev);
+		goto out;
+	}
+	if (!(status & XAXIDMA_IRQ_ALL_MASK)) {
+		netdev_err(ndev, "No interrupts asserted in Tx path\n");
+		return IRQ_NONE;
+	}
+	if (status & XAXIDMA_IRQ_ERROR_MASK) {
+		netdev_err(ndev, "DMA Tx error 0x%x\n", status);
+		netdev_err(ndev, "Current BD is at: 0x%x\n",
+			   (priv->tx_bd_v[priv->tx_bd_ci]).phys);
+
+		cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+		/* Disable coalesce, delay timer and error interrupts */
+		cr &= (~XAXIDMA_IRQ_ALL_MASK);
+		/* Write to the Tx channel control register */
+		nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET, cr);
+
+		cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+		/* Disable coalesce, delay timer and error interrupts */
+		cr &= (~XAXIDMA_IRQ_ALL_MASK);
+		/* Write to the Rx channel control register */
+		nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET, cr);
+
+		tasklet_schedule(&priv->dma_err_tasklet);
+		nixge_dma_write_reg(priv, XAXIDMA_TX_SR_OFFSET, status);
+	}
+out:
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t nixge_rx_irq(int irq, void *_ndev)
+{
+	struct nixge_priv *priv = netdev_priv(_ndev);
+	struct net_device *ndev = _ndev;
+	unsigned int status;
+	u32 cr;
+
+	status = nixge_dma_read_reg(priv, XAXIDMA_RX_SR_OFFSET);
+	if (status & (XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK)) {
+		/* Turn off IRQs because NAPI */
+		nixge_dma_write_reg(priv, XAXIDMA_RX_SR_OFFSET, status);
+		cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+		cr &= ~(XAXIDMA_IRQ_IOC_MASK | XAXIDMA_IRQ_DELAY_MASK);
+		nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET, cr);
+
+		if (napi_schedule_prep(&priv->napi))
+			__napi_schedule(&priv->napi);
+		goto out;
+	}
+	if (!(status & XAXIDMA_IRQ_ALL_MASK)) {
+		netdev_err(ndev, "No interrupts asserted in Rx path\n");
+		return IRQ_NONE;
+	}
+	if (status & XAXIDMA_IRQ_ERROR_MASK) {
+		netdev_err(ndev, "DMA Rx error 0x%x\n", status);
+		netdev_err(ndev, "Current BD is at: 0x%x\n",
+			   (priv->rx_bd_v[priv->rx_bd_ci]).phys);
+
+		cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+		/* Disable coalesce, delay timer and error interrupts */
+		cr &= (~XAXIDMA_IRQ_ALL_MASK);
+		/* Finally write to the Tx channel control register */
+		nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET, cr);
+
+		cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+		/* Disable coalesce, delay timer and error interrupts */
+		cr &= (~XAXIDMA_IRQ_ALL_MASK);
+		/* write to the Rx channel control register */
+		nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET, cr);
+
+		tasklet_schedule(&priv->dma_err_tasklet);
+		nixge_dma_write_reg(priv, XAXIDMA_RX_SR_OFFSET, status);
+	}
+out:
+	return IRQ_HANDLED;
+}
+
+static void nixge_dma_err_handler(unsigned long data)
+{
+	struct nixge_priv *lp = (struct nixge_priv *)data;
+	struct nixge_hw_dma_bd *cur_p;
+	struct nixge_tx_skb *tx_skb;
+	u32 cr, i;
+
+	__nixge_device_reset(lp, XAXIDMA_TX_CR_OFFSET);
+	__nixge_device_reset(lp, XAXIDMA_RX_CR_OFFSET);
+
+	for (i = 0; i < TX_BD_NUM; i++) {
+		cur_p = &lp->tx_bd_v[i];
+		tx_skb = &lp->tx_skb[i];
+		nixge_tx_skb_unmap(lp, tx_skb);
+
+		cur_p->phys = 0;
+		cur_p->cntrl = 0;
+		cur_p->status = 0;
+		cur_p->app0 = 0;
+		cur_p->app1 = 0;
+		cur_p->app2 = 0;
+		cur_p->app3 = 0;
+		cur_p->app4 = 0;
+		cur_p->sw_id_offset = 0;
+	}
+
+	for (i = 0; i < RX_BD_NUM; i++) {
+		cur_p = &lp->rx_bd_v[i];
+		cur_p->status = 0;
+		cur_p->app0 = 0;
+		cur_p->app1 = 0;
+		cur_p->app2 = 0;
+		cur_p->app3 = 0;
+		cur_p->app4 = 0;
+	}
+
+	lp->tx_bd_ci = 0;
+	lp->tx_bd_tail = 0;
+	lp->rx_bd_ci = 0;
+
+	/* Start updating the Rx channel control register */
+	cr = nixge_dma_read_reg(lp, XAXIDMA_RX_CR_OFFSET);
+	/* Update the interrupt coalesce count */
+	cr = ((cr & ~XAXIDMA_COALESCE_MASK) |
+	      (XAXIDMA_DFT_RX_THRESHOLD << XAXIDMA_COALESCE_SHIFT));
+	/* Update the delay timer count */
+	cr = ((cr & ~XAXIDMA_DELAY_MASK) |
+	      (XAXIDMA_DFT_RX_WAITBOUND << XAXIDMA_DELAY_SHIFT));
+	/* Enable coalesce, delay timer and error interrupts */
+	cr |= XAXIDMA_IRQ_ALL_MASK;
+	/* Finally write to the Rx channel control register */
+	nixge_dma_write_reg(lp, XAXIDMA_RX_CR_OFFSET, cr);
+
+	/* Start updating the Tx channel control register */
+	cr = nixge_dma_read_reg(lp, XAXIDMA_TX_CR_OFFSET);
+	/* Update the interrupt coalesce count */
+	cr = (((cr & ~XAXIDMA_COALESCE_MASK)) |
+	      (XAXIDMA_DFT_TX_THRESHOLD << XAXIDMA_COALESCE_SHIFT));
+	/* Update the delay timer count */
+	cr = (((cr & ~XAXIDMA_DELAY_MASK)) |
+	      (XAXIDMA_DFT_TX_WAITBOUND << XAXIDMA_DELAY_SHIFT));
+	/* Enable coalesce, delay timer and error interrupts */
+	cr |= XAXIDMA_IRQ_ALL_MASK;
+	/* Finally write to the Tx channel control register */
+	nixge_dma_write_reg(lp, XAXIDMA_TX_CR_OFFSET, cr);
+
+	/* Populate the tail pointer and bring the Rx Axi DMA engine out of
+	 * halted state. This will make the Rx side ready for reception.
+	 */
+	nixge_dma_write_reg(lp, XAXIDMA_RX_CDESC_OFFSET, lp->rx_bd_p);
+	cr = nixge_dma_read_reg(lp, XAXIDMA_RX_CR_OFFSET);
+	nixge_dma_write_reg(lp, XAXIDMA_RX_CR_OFFSET,
+			    cr | XAXIDMA_CR_RUNSTOP_MASK);
+	nixge_dma_write_reg(lp, XAXIDMA_RX_TDESC_OFFSET, lp->rx_bd_p +
+			    (sizeof(*lp->rx_bd_v) * (RX_BD_NUM - 1)));
+
+	/* Write to the RS (Run-stop) bit in the Tx channel control register.
+	 * Tx channel is now ready to run. But only after we write to the
+	 * tail pointer register that the Tx channel will start transmitting
+	 */
+	nixge_dma_write_reg(lp, XAXIDMA_TX_CDESC_OFFSET, lp->tx_bd_p);
+	cr = nixge_dma_read_reg(lp, XAXIDMA_TX_CR_OFFSET);
+	nixge_dma_write_reg(lp, XAXIDMA_TX_CR_OFFSET,
+			    cr | XAXIDMA_CR_RUNSTOP_MASK);
+}
+
+static int nixge_open(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	struct phy_device *phy;
+	int ret;
+
+	nixge_device_reset(ndev);
+
+	phy = of_phy_connect(ndev, priv->phy_node,
+			     &nixge_handle_link_change, 0, priv->phy_mode);
+	if (!phy)
+		return -ENODEV;
+
+	phy_start(phy);
+
+	/* Enable tasklets for Axi DMA error handling */
+	tasklet_init(&priv->dma_err_tasklet, nixge_dma_err_handler,
+		     (unsigned long)priv);
+
+	napi_enable(&priv->napi);
+
+	/* Enable interrupts for Axi DMA Tx */
+	ret = request_irq(priv->tx_irq, nixge_tx_irq, 0, ndev->name, ndev);
+	if (ret)
+		goto err_tx_irq;
+	/* Enable interrupts for Axi DMA Rx */
+	ret = request_irq(priv->rx_irq, nixge_rx_irq, 0, ndev->name, ndev);
+	if (ret)
+		goto err_rx_irq;
+
+	netif_start_queue(ndev);
+
+	return 0;
+
+err_rx_irq:
+	free_irq(priv->tx_irq, ndev);
+err_tx_irq:
+	phy_stop(phy);
+	phy_disconnect(phy);
+	tasklet_kill(&priv->dma_err_tasklet);
+	netdev_err(ndev, "request_irq() failed\n");
+	return ret;
+}
+
+static int nixge_stop(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	u32 cr;
+
+	netif_stop_queue(ndev);
+	napi_disable(&priv->napi);
+
+	if (ndev->phydev) {
+		phy_stop(ndev->phydev);
+		phy_disconnect(ndev->phydev);
+	}
+
+	cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+	nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET,
+			    cr & (~XAXIDMA_CR_RUNSTOP_MASK));
+	cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+	nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET,
+			    cr & (~XAXIDMA_CR_RUNSTOP_MASK));
+
+	tasklet_kill(&priv->dma_err_tasklet);
+
+	free_irq(priv->tx_irq, ndev);
+	free_irq(priv->rx_irq, ndev);
+
+	nixge_hw_dma_bd_release(ndev);
+
+	return 0;
+}
+
+static int nixge_change_mtu(struct net_device *ndev, int new_mtu)
+{
+	if (netif_running(ndev))
+		return -EBUSY;
+
+	if ((new_mtu + NIXGE_HDR_SIZE + NIXGE_TRL_SIZE) >
+	     NIXGE_MAX_JUMBO_FRAME_SIZE)
+		return -EINVAL;
+
+	ndev->mtu = new_mtu;
+
+	return 0;
+}
+
+static s32 __nixge_hw_set_mac_address(struct net_device *ndev)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	nixge_ctrl_write_reg(priv, NIXGE_REG_MAC_LSB,
+			     (ndev->dev_addr[2]) << 24 |
+			     (ndev->dev_addr[3] << 16) |
+			     (ndev->dev_addr[4] << 8) |
+			     (ndev->dev_addr[5] << 0));
+
+	nixge_ctrl_write_reg(priv, NIXGE_REG_MAC_MSB,
+			     (ndev->dev_addr[1] | (ndev->dev_addr[0] << 8)));
+
+	return 0;
+}
+
+static int nixge_net_set_mac_address(struct net_device *ndev, void *p)
+{
+	int err;
+
+	err = eth_mac_addr(ndev, p);
+	if (!err)
+		__nixge_hw_set_mac_address(ndev);
+
+	return err;
+}
+
+static const struct net_device_ops nixge_netdev_ops = {
+	.ndo_open = nixge_open,
+	.ndo_stop = nixge_stop,
+	.ndo_start_xmit = nixge_start_xmit,
+	.ndo_change_mtu	= nixge_change_mtu,
+	.ndo_set_mac_address = nixge_net_set_mac_address,
+	.ndo_validate_addr = eth_validate_addr,
+};
+
+static void nixge_ethtools_get_drvinfo(struct net_device *ndev,
+				       struct ethtool_drvinfo *ed)
+{
+	strlcpy(ed->driver, "nixge", sizeof(ed->driver));
+	strlcpy(ed->bus_info, "platform", sizeof(ed->bus_info));
+}
+
+static int nixge_ethtools_get_coalesce(struct net_device *ndev,
+				       struct ethtool_coalesce *ecoalesce)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	u32 regval = 0;
+
+	regval = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
+	ecoalesce->rx_max_coalesced_frames = (regval & XAXIDMA_COALESCE_MASK)
+					     >> XAXIDMA_COALESCE_SHIFT;
+	regval = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
+	ecoalesce->tx_max_coalesced_frames = (regval & XAXIDMA_COALESCE_MASK)
+					     >> XAXIDMA_COALESCE_SHIFT;
+	return 0;
+}
+
+static int nixge_ethtools_set_coalesce(struct net_device *ndev,
+				       struct ethtool_coalesce *ecoalesce)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	if (netif_running(ndev)) {
+		netdev_err(ndev,
+			   "Please stop netif before applying configuration\n");
+		return -EBUSY;
+	}
+
+	if (ecoalesce->rx_coalesce_usecs ||
+	    ecoalesce->rx_coalesce_usecs_irq ||
+	    ecoalesce->rx_max_coalesced_frames_irq ||
+	    ecoalesce->tx_coalesce_usecs ||
+	    ecoalesce->tx_coalesce_usecs_irq ||
+	    ecoalesce->tx_max_coalesced_frames_irq ||
+	    ecoalesce->stats_block_coalesce_usecs ||
+	    ecoalesce->use_adaptive_rx_coalesce ||
+	    ecoalesce->use_adaptive_tx_coalesce ||
+	    ecoalesce->pkt_rate_low ||
+	    ecoalesce->rx_coalesce_usecs_low ||
+	    ecoalesce->rx_max_coalesced_frames_low ||
+	    ecoalesce->tx_coalesce_usecs_low ||
+	    ecoalesce->tx_max_coalesced_frames_low ||
+	    ecoalesce->pkt_rate_high ||
+	    ecoalesce->rx_coalesce_usecs_high ||
+	    ecoalesce->rx_max_coalesced_frames_high ||
+	    ecoalesce->tx_coalesce_usecs_high ||
+	    ecoalesce->tx_max_coalesced_frames_high ||
+	    ecoalesce->rate_sample_interval)
+		return -EOPNOTSUPP;
+	if (ecoalesce->rx_max_coalesced_frames)
+		priv->coalesce_count_rx = ecoalesce->rx_max_coalesced_frames;
+	if (ecoalesce->tx_max_coalesced_frames)
+		priv->coalesce_count_tx = ecoalesce->tx_max_coalesced_frames;
+
+	return 0;
+}
+
+static int nixge_ethtools_set_phys_id(struct net_device *ndev,
+				      enum ethtool_phys_id_state state)
+{
+	struct nixge_priv *priv = netdev_priv(ndev);
+	u32 ctrl;
+
+	ctrl = nixge_ctrl_read_reg(priv, NIXGE_REG_LED_CTL);
+	switch (state) {
+	case ETHTOOL_ID_ACTIVE:
+		ctrl |= NIXGE_ID_LED_CTL_EN;
+		/* Enable identification LED override */
+		nixge_ctrl_write_reg(priv, NIXGE_REG_LED_CTL, ctrl);
+		return 2;
+
+	case ETHTOOL_ID_ON:
+		ctrl |= NIXGE_ID_LED_CTL_VAL;
+		nixge_ctrl_write_reg(priv, NIXGE_REG_LED_CTL, ctrl);
+		break;
+
+	case ETHTOOL_ID_OFF:
+		ctrl &= ~NIXGE_ID_LED_CTL_VAL;
+		nixge_ctrl_write_reg(priv, NIXGE_REG_LED_CTL, ctrl);
+		break;
+
+	case ETHTOOL_ID_INACTIVE:
+		/* Restore LED settings */
+		ctrl &= ~NIXGE_ID_LED_CTL_EN;
+		nixge_ctrl_write_reg(priv, NIXGE_REG_LED_CTL, ctrl);
+		break;
+	}
+
+	return 0;
+}
+
+static const struct ethtool_ops nixge_ethtool_ops = {
+	.get_drvinfo    = nixge_ethtools_get_drvinfo,
+	.get_coalesce   = nixge_ethtools_get_coalesce,
+	.set_coalesce   = nixge_ethtools_set_coalesce,
+	.set_phys_id    = nixge_ethtools_set_phys_id,
+	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
+	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
+	.get_link		= ethtool_op_get_link,
+};
+
+static int nixge_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+{
+	struct nixge_priv *priv = bus->priv;
+	u32 status, tmp;
+	int err;
+	u16 device;
+
+	if (reg & MII_ADDR_C45) {
+		device = (reg >> 16) & 0x1f;
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_ADDR, reg & 0xffff);
+
+		tmp = NIXGE_MDIO_CLAUSE45 | NIXGE_MDIO_OP(NIXGE_MDIO_OP_ADDRESS)
+			| NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_OP, tmp);
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_CTRL, 1);
+
+		err = nixge_ctrl_poll_timeout(priv, NIXGE_REG_MDIO_CTRL, status,
+					      !status, 10, 1000);
+		if (err) {
+			dev_err(priv->dev, "timeout setting address\n");
+			return err;
+		}
+
+		tmp = NIXGE_MDIO_CLAUSE45 | NIXGE_MDIO_OP(NIXGE_MDIO_C45_READ) |
+			NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+	} else {
+		device = reg & 0x1f;
+
+		tmp = NIXGE_MDIO_CLAUSE22 | NIXGE_MDIO_OP(NIXGE_MDIO_C22_READ) |
+			NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+	}
+
+	nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_OP, tmp);
+	nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_CTRL, 1);
+
+	err = nixge_ctrl_poll_timeout(priv, NIXGE_REG_MDIO_CTRL, status,
+				      !status, 10, 1000);
+	if (err) {
+		dev_err(priv->dev, "timeout setting read command\n");
+		return err;
+	}
+
+	status = nixge_ctrl_read_reg(priv, NIXGE_REG_MDIO_DATA);
+
+	return status;
+}
+
+static int nixge_mdio_write(struct mii_bus *bus, int phy_id, int reg, u16 val)
+{
+	struct nixge_priv *priv = bus->priv;
+	u32 status, tmp;
+	u16 device;
+	int err;
+
+	if (reg & MII_ADDR_C45) {
+		device = (reg >> 16) & 0x1f;
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_ADDR, reg & 0xffff);
+
+		tmp = NIXGE_MDIO_CLAUSE45 | NIXGE_MDIO_OP(NIXGE_MDIO_OP_ADDRESS)
+			| NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_OP, tmp);
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_CTRL, 1);
+
+		err = nixge_ctrl_poll_timeout(priv, NIXGE_REG_MDIO_CTRL, status,
+					      !status, 10, 1000);
+		if (err) {
+			dev_err(priv->dev, "timeout setting address\n");
+			return err;
+		}
+
+		tmp = NIXGE_MDIO_CLAUSE45 | NIXGE_MDIO_OP(NIXGE_MDIO_C45_WRITE)
+			| NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_DATA, val);
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_OP, tmp);
+		err = nixge_ctrl_poll_timeout(priv, NIXGE_REG_MDIO_CTRL, status,
+					      !status, 10, 1000);
+		if (err)
+			dev_err(priv->dev, "timeout setting write command\n");
+	} else {
+		device = reg & 0x1f;
+
+		tmp = NIXGE_MDIO_CLAUSE22 |
+			NIXGE_MDIO_OP(NIXGE_MDIO_C22_WRITE) |
+			NIXGE_MDIO_ADDR(phy_id) | NIXGE_MDIO_MMD(device);
+
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_DATA, val);
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_OP, tmp);
+		nixge_ctrl_write_reg(priv, NIXGE_REG_MDIO_CTRL, 1);
+
+		err = nixge_ctrl_poll_timeout(priv, NIXGE_REG_MDIO_CTRL, status,
+					      !status, 10, 1000);
+		if (err)
+			dev_err(priv->dev, "timeout setting write command\n");
+	}
+
+	return err;
+}
+
+static int nixge_mdio_setup(struct nixge_priv *priv, struct device_node *np)
+{
+	struct mii_bus *bus;
+
+	bus = devm_mdiobus_alloc(priv->dev);
+	if (!bus)
+		return -ENOMEM;
+
+	snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mii", dev_name(priv->dev));
+	bus->priv = priv;
+	bus->name = "nixge_mii_bus";
+	bus->read = nixge_mdio_read;
+	bus->write = nixge_mdio_write;
+	bus->parent = priv->dev;
+
+	priv->mii_bus = bus;
+
+	return of_mdiobus_register(bus, np);
+}
+
+static void *nixge_get_nvmem_address(struct device *dev)
+{
+	struct nvmem_cell *cell;
+	size_t cell_size;
+	char *mac;
+
+	cell = nvmem_cell_get(dev, "address");
+	if (IS_ERR(cell))
+		return cell;
+
+	mac = nvmem_cell_read(cell, &cell_size);
+	nvmem_cell_put(cell);
+
+	return mac;
+}
+
+static int nixge_probe(struct platform_device *pdev)
+{
+	struct nixge_priv *priv;
+	struct net_device *ndev;
+	struct resource *dmares;
+	const char *mac_addr;
+	int err;
+
+	ndev = alloc_etherdev(sizeof(*priv));
+	if (!ndev)
+		return -ENOMEM;
+
+	platform_set_drvdata(pdev, ndev);
+	SET_NETDEV_DEV(ndev, &pdev->dev);
+
+	ndev->features = NETIF_F_SG;
+	ndev->netdev_ops = &nixge_netdev_ops;
+	ndev->ethtool_ops = &nixge_ethtool_ops;
+
+	/* MTU range: 64 - 9000 */
+	ndev->min_mtu = 64;
+	ndev->max_mtu = NIXGE_JUMBO_MTU;
+
+	mac_addr = nixge_get_nvmem_address(&pdev->dev);
+	if (!IS_ERR_OR_NULL(mac_addr) && is_valid_ether_addr(mac_addr))
+		ether_addr_copy(ndev->dev_addr, mac_addr);
+	else
+		eth_hw_addr_random(ndev);
+
+	priv = netdev_priv(ndev);
+	priv->ndev = ndev;
+	priv->dev = &pdev->dev;
+
+	netif_napi_add(ndev, &priv->napi, nixge_poll, NAPI_POLL_WEIGHT);
+
+	dmares = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	priv->dma_regs = devm_ioremap_resource(&pdev->dev, dmares);
+	if (IS_ERR(priv->dma_regs)) {
+		netdev_err(ndev, "failed to map dma regs\n");
+		return PTR_ERR(priv->dma_regs);
+	}
+	priv->ctrl_regs = priv->dma_regs + NIXGE_REG_CTRL_OFFSET;
+	__nixge_hw_set_mac_address(ndev);
+
+	priv->tx_irq = platform_get_irq_byname(pdev, "tx");
+	if (priv->tx_irq < 0) {
+		netdev_err(ndev, "could not find 'tx' irq\n");
+		return priv->tx_irq;
+	}
+
+	priv->rx_irq = platform_get_irq_byname(pdev, "rx");
+	if (priv->rx_irq < 0) {
+		netdev_err(ndev, "could not find 'rx' irq\n");
+		return priv->rx_irq;
+	}
+
+	priv->coalesce_count_rx = XAXIDMA_DFT_RX_THRESHOLD;
+	priv->coalesce_count_tx = XAXIDMA_DFT_TX_THRESHOLD;
+
+	err = nixge_mdio_setup(priv, pdev->dev.of_node);
+	if (err) {
+		netdev_err(ndev, "error registering mdio bus\n");
+		goto free_netdev;
+	}
+
+	priv->phy_mode = of_get_phy_mode(pdev->dev.of_node);
+	if (priv->phy_mode < 0) {
+		netdev_err(ndev, "could not find \"phy-mode\" property\n");
+		err = -EINVAL;
+		goto unregister_mdio;
+	}
+
+	priv->phy_node = of_parse_phandle(pdev->dev.of_node, "phy-handle", 0);
+	if (!priv->phy_node) {
+		netdev_err(ndev, "could not find \"phy-handle\" property\n");
+		err = -EINVAL;
+		goto unregister_mdio;
+	}
+
+	err = register_netdev(priv->ndev);
+	if (err) {
+		netdev_err(ndev, "register_netdev() error (%i)\n", err);
+		goto unregister_mdio;
+	}
+
+	return 0;
+
+unregister_mdio:
+	mdiobus_unregister(priv->mii_bus);
+
+free_netdev:
+	free_netdev(ndev);
+
+	return err;
+}
+
+static int nixge_remove(struct platform_device *pdev)
+{
+	struct net_device *ndev = platform_get_drvdata(pdev);
+	struct nixge_priv *priv = netdev_priv(ndev);
+
+	unregister_netdev(ndev);
+
+	mdiobus_unregister(priv->mii_bus);
+
+	free_netdev(ndev);
+
+	return 0;
+}
+
+/* Match table for of_platform binding */
+static const struct of_device_id nixge_dt_ids[] = {
+	{ .compatible = "ni,xge-enet-2.00", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, nixge_dt_ids);
+
+static struct platform_driver nixge_driver = {
+	.probe		= nixge_probe,
+	.remove		= nixge_remove,
+	.driver		= {
+		.name		= "nixge",
+		.of_match_table	= of_match_ptr(nixge_dt_ids),
+	},
+};
+module_platform_driver(nixge_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("National Instruments XGE Management MAC");
+MODULE_AUTHOR("Moritz Fischer <mdf@kernel.org>");
-- 
2.16.2


* Re: syzbot rcu/debugobjects warning
From: Thomas Gleixner @ 2018-03-23 20:41 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Paul McKenney, LKML, Todd Poynor, netdev
In-Reply-To: <CAJWu+ooHzLpJZxsFq7hifmaEh7fOdYsExKz6pGLgSPQUxAB4Nw@mail.gmail.com>

On Fri, 23 Mar 2018, Joel Fernandes wrote:
> On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Thu, 22 Mar 2018, Joel Fernandes wrote:
> Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
> (The kernel logs are toward the end with the above).

And that is interesting:

[  150.629667]  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
[  150.637051]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
[  150.643431]  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8
[  150.648416]  [<ffffffff8142fa15>] ? percpu_up_read_preempt_enable.constprop.53+0xd7/0xd7
[  150.656611]  [<ffffffff81430835>] ? load_image_and_restore+0xf9/0xf9
[  150.663070]  [<ffffffff81269efd>] ? vprintk_default+0x1d/0x30
[  150.668925]  [<ffffffff81131879>] ? __warn+0x1a9/0x1e0
[  150.674170]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
[  150.680543]  [<ffffffff81131894>] __warn+0x1c4/0x1e0
[  150.685614]  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40
[  150.691972]  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40
[  150.698174]  [<ffffffff81dfea90>] ? debug_object_fixup+0x30/0x30
[  150.704283]  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
[  150.710917]  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0
[  150.716508]  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
[  150.723054]  [<ffffffff81290150>] ? rcu_pm_notify+0xc0/0xc0
[  150.728735]  [<ffffffff81292bc0>] ? __call_rcu.constprop.72+0x910/0x910
[  150.735459]  [<ffffffff81235221>] ? __lock_is_held+0xa1/0xf0
[  150.741223]  [<ffffffff81290287>] synchronize_rcu+0x27/0x90

So this calls synchronize_rcu() from an RCU callback. That's a no-no. This is
on the back of an interrupt in softirq context and __wait_rcu_gp() can
sleep, which is obviously a bad idea in softirq context....

Cc'ed netdev ....

And that also explains the debug object splat because this is not running
on the task stack. It's running on the softirq stack ....

[  150.746908]  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
[  150.753281]  [<ffffffff8358891f>] ? __l2tp_session_unhash+0x1bf/0x550
[  150.759828]  [<ffffffff8114596a>] ? __local_bh_enable_ip+0x6a/0xd0
[  150.766123]  [<ffffffff8358ddb0>] ? l2tp_udp_encap_recv+0xd90/0xd90
[  150.772497]  [<ffffffff83588e97>] l2tp_tunnel_closeall+0x1e7/0x3a0
[  150.778782]  [<ffffffff835897be>] l2tp_tunnel_destruct+0x30e/0x5a0
[  150.785067]  [<ffffffff8358965a>] ? l2tp_tunnel_destruct+0x1aa/0x5a0
[  150.791537]  [<ffffffff835894b0>] ? l2tp_tunnel_del_work+0x460/0x460
[  150.797997]  [<ffffffff82ee8053>] __sk_destruct+0x53/0x570
[  150.803588]  [<ffffffff81293918>] rcu_process_callbacks+0x898/0x1300
[  150.810048]  [<ffffffff812939f7>] ? rcu_process_callbacks+0x977/0x1300
[  150.816684]  [<ffffffff82ee8000>] ? __sk_dst_check+0x240/0x240
[  150.822625]  [<ffffffff838be5d6>] __do_softirq+0x206/0x951
[  150.828223]  [<ffffffff81147315>] irq_exit+0x165/0x190
[  150.833557]  [<ffffffff838bd1eb>] smp_apic_timer_interrupt+0x7b/0xa0
[  150.840018]  [<ffffffff838b9470>] apic_timer_interrupt+0xa0/0xb0
[  150.846132]  <EOI> [  150.848166]  [<ffffffff838b6756>] ? native_safe_halt+0x6/0x10
[  150.854036]  [<ffffffff8123bf2d>] ? trace_hardirqs_on+0xd/0x10
[  150.859973]  [<ffffffff838b5d85>] default_idle+0x55/0x360
[  150.865478]  [<ffffffff8106be0a>] arch_cpu_idle+0xa/0x10
[  150.870896]  [<ffffffff838b6b96>] default_idle_call+0x36/0x60
[  150.876751]  [<ffffffff81226cb0>] cpu_startup_entry+0x2b0/0x380
[  150.882787]  [<ffffffff81226a00>] ? cpu_in_idle+0x20/0x20
[  150.888291]  [<ffffffff812d2343>] ? clockevents_register_device+0x123/0x200
[  150.895358]  [<ffffffff810b0693>] start_secondary+0x303/0x3e0
[  150.901209]  [<ffffffff810b0390>] ? set_cpu_sibling_map+0x11f0/0x11f0
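
To illustrate the problem above: a call that may sleep must not be reached
from a callback running in softirq context. One common way around this
class of bug is to have the callback only queue the blocking part and run
it later from a context that may sleep (in the kernel, typically a
workqueue). The following is a userspace model only, not kernel code and
not necessarily the fix appropriate for l2tp; every model_* name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
static bool model_in_softirq;		/* true while "in softirq context" */
static int model_sleep_violations;	/* sleeping calls made from softirq */

/* Stand-in for a call that may sleep, such as synchronize_rcu(). */
static void model_blocking_teardown(void)
{
	if (model_in_softirq)
		model_sleep_violations++;
}

struct model_work {
	void (*fn)(void);
	struct model_work *next;
};

static struct model_work *model_pending;

/* Queue work for later; this never blocks, so any context may call it. */
static void model_queue_work(struct model_work *w)
{
	assert(w && w->fn);
	w->next = model_pending;
	model_pending = w;
}

/* Run queued work in "process context", where sleeping is allowed. */
static int model_run_pending(void)
{
	int ran = 0;

	model_in_softirq = false;
	while (model_pending) {
		struct model_work *w = model_pending;

		model_pending = w->next;
		w->fn();
		ran++;
	}
	return ran;
}

static struct model_work model_teardown_work = {
	.fn = model_blocking_teardown,
};

/* The callback as run from softirq: it only queues and never blocks. */
static void model_destruct_callback(void)
{
	model_in_softirq = true;
	model_queue_work(&model_teardown_work);
	model_in_softirq = false;
}
```

Calling model_destruct_callback() and then model_run_pending() runs the
blocking step exactly once, outside the modelled softirq section, so
model_sleep_violations stays at zero.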

Thanks,

	tglx


* [PATCH v5 1/2] dt-bindings: net: Add bindings for National Instruments XGE netdev
From: Moritz Fischer @ 2018-03-23 20:41 UTC (permalink / raw)
  To: davem
  Cc: linux-kernel, devicetree, netdev, robh+dt, andrew, f.fainelli,
	Moritz Fischer

This adds bindings for the NI XGE 1G/10G network device.

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Moritz Fischer <mdf@kernel.org>
---
Changes from v4:
- None

Changes from v3:
- Added Rob's Reviewed-by

Changes from v2:
- Addressed Rob's comments w.r.t. IRQ names and typo

Changes from v1:
- Corrected from nixge -> nixge.txt

---
 Documentation/devicetree/bindings/net/nixge.txt | 32 +++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/nixge.txt

diff --git a/Documentation/devicetree/bindings/net/nixge.txt b/Documentation/devicetree/bindings/net/nixge.txt
new file mode 100644
index 000000000000..e55af7f0881a
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/nixge.txt
@@ -0,0 +1,32 @@
+* NI XGE Ethernet controller
+
+Required properties:
+- compatible: Should be "ni,xge-enet-2.00"
+- reg: Address and length of the register set for the device
+- interrupts: Should contain tx and rx interrupt
+- interrupt-names: Should be "rx" and "tx"
+- phy-mode: See ethernet.txt file in the same directory.
+- phy-handle: See ethernet.txt file in the same directory.
+- nvmem-cells: Phandle of nvmem cell containing the MAC address
+- nvmem-cell-names: Should be "address"
+
+Examples (10G generic PHY):
+	nixge0: ethernet@40000000 {
+		compatible = "ni,xge-enet-2.00";
+		reg = <0x40000000 0x6000>;
+
+		nvmem-cells = <&eth1_addr>;
+		nvmem-cell-names = "address";
+
+		interrupts = <0 29 IRQ_TYPE_LEVEL_HIGH>, <0 30 IRQ_TYPE_LEVEL_HIGH>;
+		interrupt-names = "rx", "tx";
+		interrupt-parent = <&intc>;
+
+		phy-mode = "xgmii";
+		phy-handle = <&ethernet_phy1>;
+
+		ethernet_phy1: ethernet-phy@4 {
+			compatible = "ethernet-phy-ieee802.3-c45";
+			reg = <4>;
+		};
+	};
-- 
2.16.2


* Re: [PATCH net-next 1/8] net: phy: Add initial support for Microsemi Ocelot internal PHYs.
From: Andrew Lunn @ 2018-03-23 20:29 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Florian Fainelli, netdev, devicetree,
	linux-kernel, linux-mips, Raju Lakkaraju
In-Reply-To: <20180323201117.8416-2-alexandre.belloni@bootlin.com>

On Fri, Mar 23, 2018 at 09:11:10PM +0100, Alexandre Belloni wrote:
> Add Microsemi Ocelot internal PHY ids. For now, simply use the genphy
> functions but more features are available.
> 
> Cc: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> ---
>  drivers/net/phy/mscc.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index 650c2667d523..e1ab3acd1cdb 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -91,6 +91,7 @@ enum rgmii_rx_clock_delay {
>  #define SECURE_ON_PASSWD_LEN_4		  0x4000
>  
>  /* Microsemi PHY ID's */
> +#define PHY_ID_OCELOT			  0x00070540
>  #define PHY_ID_VSC8530			  0x00070560
>  #define PHY_ID_VSC8531			  0x00070570
>  #define PHY_ID_VSC8540			  0x00070760
> @@ -658,6 +659,19 @@ static int vsc85xx_probe(struct phy_device *phydev)
>  
>  /* Microsemi VSC85xx PHYs */
>  static struct phy_driver vsc85xx_driver[] = {
> +{
> +	.phy_id		= PHY_ID_OCELOT,
> +	.name		= "Microsemi OCELOT",
> +	.phy_id_mask    = 0xfffffff0,
> +	.features	= PHY_GBIT_FEATURES,

No interrupt support? In fact, the only advantage i see this brings
over the generic driver is that the name Microsemi OCELOT is printed.

     Andrew

> +	.soft_reset	= &genphy_soft_reset,
> +	.config_init	= &genphy_config_init,
> +	.config_aneg	= &genphy_config_aneg,
> +	.aneg_done	= &genphy_aneg_done,
> +	.read_status	= &genphy_read_status,
> +	.suspend	= &genphy_suspend,
> +	.resume		= &genphy_resume,
> +},


* Re: [PATCH V3 net-next 06/14] net/tls: Add generic NIC offload infrastructure
From: Shannon Nelson @ 2018-03-23 20:21 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <20180322223351.31801-7-saeedm@mellanox.com>

On 3/22/2018 3:33 PM, Saeed Mahameed wrote:
> From: Ilya Lesokhin <ilyal@mellanox.com>
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network devices. It enables the kernel TLS socket to skip encryption

s/devices/device/

> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmit the now in-order packet.

s/transmit/transmits/
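
For anyone following the description above, the in-order versus
out-of-order decision can be sketched in userspace roughly as below.
This is only an illustration of the bookkeeping the commit message
describes; struct tx_stream, tx_stream_xmit() and the resyncs counter
are made-up names, not anything from the patch or from a driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-stream driver state; not taken from the patch. */
struct tx_stream {
	uint32_t expected_seq;	/* next TCP sequence number the NIC expects */
	unsigned int resyncs;	/* how many times the context was rebuilt */
};

/* Returns true if the packet was in order, false if a resync was needed.
 * Either way the stream ends up expecting the byte after this packet.
 */
static bool tx_stream_xmit(struct tx_stream *s, uint32_t seq, uint32_t len)
{
	bool in_order = (seq == s->expected_seq);

	if (!in_order)
		s->resyncs++;	/* stand-in for posting the context-rebuild entry */

	/* uint32_t arithmetic wraps modulo 2^32, like TCP sequence numbers */
	s->expected_seq = seq + len;
	return in_order;
}
```

A retransmission of an already-sent segment takes the resync path and
then leaves the expected sequence number just past that segment.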

> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>   include/net/tls.h             |  73 +++-
>   net/tls/Kconfig               |  10 +
>   net/tls/Makefile              |   2 +
>   net/tls/tls_device.c          | 756 ++++++++++++++++++++++++++++++++++++++++++
>   net/tls/tls_device_fallback.c | 412 +++++++++++++++++++++++
>   net/tls/tls_main.c            |  33 +-
>   6 files changed, 1279 insertions(+), 7 deletions(-)
>   create mode 100644 net/tls/tls_device.c
>   create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..4f6a6f98d62b 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>   	struct scatterlist sg_aead_out[2];
>   };
>   
> +struct tls_record_info {
> +	struct list_head list;
> +	u32 end_seq;
> +	int len;
> +	int num_frags;
> +	skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> +	struct crypto_aead *aead_send;
> +	spinlock_t lock;	/* protects records list */
> +	struct list_head records_list;
> +	struct tls_record_info *open_record;
> +	struct tls_record_info *retransmit_hint;
> +	u64 hint_record_sn;
> +	u64 unacked_record_sn;
> +
> +	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> +	void (*sk_destruct)(struct sock *sk);
> +	u8 driver_state[];
> +	/* The TLS layer reserves room for driver specific state
> +	 * Currently the belief is that there is not enough
> +	 * driver specific state to justify another layer of indirection
> +	 */
> +#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
> +};
> +
> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
> +	 TLS_DRIVER_STATE_SIZE)
> +
>   enum {
>   	TLS_PENDING_CLOSED_RECORD
>   };
> @@ -87,6 +118,10 @@ struct tls_context {
>   		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
>   	};
>   
> +	struct list_head list;
> +	struct net_device *netdev;
> +	refcount_t refcount;
> +
>   	void *priv_ctx;
>   
>   	u8 tx_conf:2;
> @@ -131,9 +166,28 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
>   void tls_sw_close(struct sock *sk, long timeout);
>   void tls_sw_free_tx_resources(struct sock *sk);
>   
> -void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
> -void tls_icsk_clean_acked(struct sock *sk);
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags);
> +void tls_device_sk_destruct(struct sock *sk);
> +void tls_device_init(void);
> +void tls_device_cleanup(void);
> +
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn);
> +
> +static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
> +{
> +	return rec->len == 0;
> +}
> +
> +static inline u32 tls_record_start_seq(struct tls_record_info *rec)
> +{
> +	return rec->end_seq - rec->len;
> +}
>   
> +void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
>   int tls_push_sg(struct sock *sk, struct tls_context *ctx,
>   		struct scatterlist *sg, u16 first_offset,
>   		int flags);
> @@ -170,6 +224,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
>   	return tls_ctx->pending_open_record_frags;
>   }
>   
> +static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
> +{
> +	return sk_fullsock(sk) &&
> +	       /* matches smp_store_release in tls_set_device_offload */
> +	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
> +}
> +
>   static inline void tls_err_abort(struct sock *sk)
>   {
>   	sk->sk_err = EBADMSG;
> @@ -257,4 +318,12 @@ static inline struct tls_offload_context *tls_offload_ctx(
>   int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
>   		      unsigned char *record_type);
>   
> +struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> +				      struct net_device *dev,
> +				      struct sk_buff *skb);
> +
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info);
> +
>   #endif /* _TLS_OFFLOAD_H */
> diff --git a/net/tls/Kconfig b/net/tls/Kconfig
> index eb583038c67e..9d3ef820bb16 100644
> --- a/net/tls/Kconfig
> +++ b/net/tls/Kconfig
> @@ -13,3 +13,13 @@ config TLS
>   	encryption handling of the TLS protocol to be done in-kernel.
>   
>   	If unsure, say N.
> +
> +config TLS_DEVICE
> +	bool "Transport Layer Security HW offload"
> +	depends on TLS
> +	select SOCK_VALIDATE_XMIT
> +	default n
> +	---help---
> +	Enable kernel support for HW offload of the TLS protocol.
> +
> +	If unsure, say N.
> diff --git a/net/tls/Makefile b/net/tls/Makefile
> index a930fd1c4f7b..4d6b728a67d0 100644
> --- a/net/tls/Makefile
> +++ b/net/tls/Makefile
> @@ -5,3 +5,5 @@
>   obj-$(CONFIG_TLS) += tls.o
>   
>   tls-y := tls_main.o tls_sw.o
> +
> +tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> new file mode 100644
> index 000000000000..34555ac0b959
> --- /dev/null
> +++ b/net/tls/tls_device.c
> @@ -0,0 +1,756 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.

Maybe add the appropriate SPDX tag to the top of this new file?

> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/module.h>
> +#include <net/tcp.h>
> +#include <net/inet_common.h>
> +#include <linux/highmem.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +
> +/* device_offload_lock is used to synchronize tls_dev_add
> + * against NETDEV_DOWN notifications.
> + */
> +static DECLARE_RWSEM(device_offload_lock);
> +
> +static void tls_device_gc_task(struct work_struct *work);
> +
> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> +static LIST_HEAD(tls_device_gc_list);
> +static LIST_HEAD(tls_device_list);
> +static DEFINE_SPINLOCK(tls_device_lock);
> +
> +static void tls_device_free_ctx(struct tls_context *ctx)
> +{
> +	struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
> +
> +	kfree(offlad_ctx);

Please don't misspell the variable name; either use something like
offload_ctx or shorten it to olc.

> +	kfree(ctx);
> +}
> +
> +static void tls_device_gc_task(struct work_struct *work)
> +{
> +	struct tls_context *ctx, *tmp;
> +	unsigned long flags;
> +	LIST_HEAD(gc_list);
> +
> +

Drop the extra blank line

> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	list_splice_init(&tls_device_gc_list, &gc_list);
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> +		struct net_device *netdev = ctx->netdev;
> +
> +		if (netdev) {
> +			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +							TLS_OFFLOAD_CTX_DIR_TX);

Perhaps this becomes clear in later code, but are you guaranteed valid
ops and function pointers here, or should there be a check, as many
similar API calls have?  Maybe
		if (netdev) {
			if (netdev->tlsdev_ops &&
			    netdev->tlsdev_ops->tls_dev_del)
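
To spell out the kind of guard I mean, here is a small standalone
sketch.  The mock_* types and count_del() are stand-ins I made up so
that it compiles outside the kernel; only the NULL-check pattern is the
point, and whether the check belongs at the call site at all is of
course up to you.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the shape matters. */
struct mock_netdev;

struct mock_tlsdev_ops {
	void (*tls_dev_del)(struct mock_netdev *dev, void *ctx);
};

struct mock_netdev {
	const struct mock_tlsdev_ops *tlsdev_ops;
	int del_calls;
};

/* Calls tls_dev_del only when both the ops table and the callback exist.
 * Returns 1 if the callback ran, 0 if it was skipped.
 */
static int mock_tls_dev_del(struct mock_netdev *dev, void *ctx)
{
	if (!dev || !dev->tlsdev_ops || !dev->tlsdev_ops->tls_dev_del)
		return 0;

	dev->tlsdev_ops->tls_dev_del(dev, ctx);
	return 1;
}

/* Example callback that just counts invocations. */
static void count_del(struct mock_netdev *dev, void *ctx)
{
	(void)ctx;
	dev->del_calls++;
}
```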


> +			dev_put(netdev);
> +		}
> +
> +		list_del(&ctx->list);
> +		tls_device_free_ctx(ctx);
> +	}
> +}
> +
> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	list_move_tail(&ctx->list, &tls_device_gc_list);
> +
> +	/* schedule_work inside the spinlock
> +	 * to make sure tls_device_down waits for that work.
> +	 */
> +	schedule_work(&tls_device_gc_work);
> +
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +}
> +
> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;

This initialization is unnecessary.

> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}
> +
> +static void destroy_record(struct tls_record_info *record)
> +{
> +	int nr_frags = record->num_frags;
> +	skb_frag_t *frag;
> +
> +	while (nr_frags-- > 0) {
> +		frag = &record->frags[nr_frags];
> +		__skb_frag_unref(frag);
> +	}
> +	kfree(record);
> +}
> +
> +static void delete_all_records(struct tls_offload_context *offload_ctx)
> +{
> +	struct tls_record_info *info, *temp;
> +
> +	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
> +		list_del(&info->list);
> +		destroy_record(info);
> +	}
> +
> +	offload_ctx->retransmit_hint = NULL;
> +}
> +
> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_record_info *info, *temp;
> +	struct tls_offload_context *ctx;
> +	u64 deleted_records = 0;
> +	unsigned long flags;
> +
> +	if (!tls_ctx)
> +		return;
> +
> +	ctx = tls_offload_ctx(tls_ctx);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	info = ctx->retransmit_hint;
> +	if (info && !before(acked_seq, info->end_seq)) {
> +		ctx->retransmit_hint = NULL;
> +		list_del(&info->list);
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
> +		if (before(acked_seq, info->end_seq))
> +			break;
> +		list_del(&info->list);
> +
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	ctx->unacked_record_sn += deleted_records;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +}
> +
> +/* At this point, there should be no references on this
> + * socket and no in-flight SKBs associated with this
> + * socket, so it is safe to free all the resources.
> + */
> +void tls_device_sk_destruct(struct sock *sk)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +
> +	if (ctx->open_record)
> +		destroy_record(ctx->open_record);
> +
> +	delete_all_records(ctx);
> +	crypto_free_aead(ctx->aead_send);
> +	ctx->sk_destruct(sk);
> +	static_branch_dec(&clean_acked_data_enabled);
> +
> +	if (refcount_dec_and_test(&tls_ctx->refcount))
> +		tls_device_queue_ctx_destruction(tls_ctx);
> +}
> +EXPORT_SYMBOL(tls_device_sk_destruct);
> +
> +static inline void tls_append_frag(struct tls_record_info *record,

I think Dave has already mentioned this, but you can drop all the 
"inline" tags.

> +				   struct page_frag *pfrag,
> +				   int size)
> +{
> +	skb_frag_t *frag;
> +
> +	frag = &record->frags[record->num_frags - 1];
> +	if (frag->page.p == pfrag->page &&
> +	    frag->page_offset + frag->size == pfrag->offset) {
> +		frag->size += size;
> +	} else {
> +		++frag;

Should this get checked against MAX_SKB_FRAGS to be sure we haven't gone 
off the end of the array?
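
The caller in tls_push_data() does push the record once num_frags
reaches MAX_SKB_FRAGS - 1, which may be what keeps this in bounds, but
a local check would make that explicit.  A rough userspace sketch of
what I have in mind follows; the mock_* names and the value of
MOCK_MAX_FRAGS are assumptions for illustration, not kernel definitions.

```c
#include <stddef.h>

/* Assumed array size for the sketch; not the kernel's MAX_SKB_FRAGS. */
#define MOCK_MAX_FRAGS 17

struct mock_frag {
	const void *page;
	size_t offset;
	size_t size;
};

struct mock_record {
	int num_frags;
	size_t len;
	struct mock_frag frags[MOCK_MAX_FRAGS];
};

/* Appends size bytes located at (page, offset) to the record.
 * Extends the last fragment when the new data is contiguous with it,
 * otherwise takes a new slot.  Returns 0 on success, -1 if the array
 * is already full and a new slot would be needed.
 */
static int mock_append_frag(struct mock_record *rec, const void *page,
			    size_t offset, size_t size)
{
	struct mock_frag *last = NULL;

	if (rec->num_frags > 0)
		last = &rec->frags[rec->num_frags - 1];

	if (last && last->page == page &&
	    last->offset + last->size == offset) {
		last->size += size;
	} else {
		if (rec->num_frags >= MOCK_MAX_FRAGS)
			return -1;	/* would run off the end of frags[] */

		rec->frags[rec->num_frags].page = page;
		rec->frags[rec->num_frags].offset = offset;
		rec->frags[rec->num_frags].size = size;
		rec->num_frags++;
	}

	rec->len += size;
	return 0;
}
```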

> +		frag->page.p = pfrag->page;
> +		frag->page_offset = pfrag->offset;
> +		frag->size = size;
> +		++record->num_frags;
> +		get_page(pfrag->page);
> +	}
> +
> +	pfrag->offset += size;
> +	record->len += size;
> +}
> +
> +static inline int tls_push_record(struct sock *sk,
> +				  struct tls_context *ctx,
> +				  struct tls_offload_context *offload_ctx,
> +				  struct tls_record_info *record,
> +				  struct page_frag *pfrag,
> +				  int flags,
> +				  unsigned char record_type)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct page_frag dummy_tag_frag;
> +	skb_frag_t *frag;
> +	int i;
> +
> +	/* fill prepand */

s/prepand/prepend/

> +	frag = &record->frags[0];
> +	tls_fill_prepend(ctx,
> +			 skb_frag_address(frag),
> +			 record->len - ctx->prepend_size,
> +			 record_type);
> +
> +	/* HW doesn't care about the data in the tag, because it fills it. */
> +	dummy_tag_frag.page = skb_frag_page(frag);
> +	dummy_tag_frag.offset = 0;
> +
> +	tls_append_frag(record, &dummy_tag_frag, ctx->tag_size);
> +	record->end_seq = tp->write_seq + record->len;
> +	spin_lock_irq(&offload_ctx->lock);
> +	list_add_tail(&record->list, &offload_ctx->records_list);
> +	spin_unlock_irq(&offload_ctx->lock);
> +	offload_ctx->open_record = NULL;
> +	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
> +	tls_advance_record_sn(sk, ctx);
> +
> +	for (i = 0; i < record->num_frags; i++) {
> +		frag = &record->frags[i];
> +		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
> +		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
> +			    frag->size, frag->page_offset);
> +		sk_mem_charge(sk, frag->size);
> +		get_page(skb_frag_page(frag));
> +	}
> +	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
> +
> +	/* all ready, send */
> +	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
> +}
> +
> +static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
> +					struct page_frag *pfrag,
> +					size_t prepend_size)
> +{
> +	struct tls_record_info *record;
> +	skb_frag_t *frag;
> +
> +	record = kmalloc(sizeof(*record), GFP_KERNEL);
> +	if (!record)
> +		return -ENOMEM;
> +
> +	frag = &record->frags[0];
> +	__skb_frag_set_page(frag, pfrag->page);
> +	frag->page_offset = pfrag->offset;
> +	skb_frag_size_set(frag, prepend_size);
> +
> +	get_page(pfrag->page);
> +	pfrag->offset += prepend_size;
> +
> +	record->num_frags = 1;
> +	record->len = prepend_size;
> +	offload_ctx->open_record = record;
> +	return 0;
> +}
> +
> +static inline int tls_do_allocation(struct sock *sk,
> +				    struct tls_offload_context *offload_ctx,
> +				    struct page_frag *pfrag,
> +				    size_t prepend_size)
> +{
> +	int ret;
> +
> +	if (!offload_ctx->open_record) {
> +		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
> +						   sk->sk_allocation))) {
> +			sk->sk_prot->enter_memory_pressure(sk);
> +			sk_stream_moderate_sndbuf(sk);
> +			return -ENOMEM;
> +		}
> +
> +		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
> +		if (ret)
> +			return ret;
> +
> +		if (pfrag->size > pfrag->offset)
> +			return 0;
> +	}
> +
> +	if (!sk_page_frag_refill(sk, pfrag))
> +		return -ENOMEM;

If a new record was created and then this fails, do you need to free the 
new record?

> +
> +	return 0;
> +}
> +
> +static int tls_push_data(struct sock *sk,
> +			 struct iov_iter *msg_iter,
> +			 size_t size, int flags,
> +			 unsigned char record_type)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> +	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> +	struct tls_record_info *record = ctx->open_record;
> +	struct page_frag *pfrag;
> +	size_t orig_size = size;
> +	u32 max_open_record_len;
> +	int copy, rc = 0;
> +	bool done = false;
> +	long timeo;
> +
> +	if (flags &
> +	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
> +		return -ENOTSUPP;
> +
> +	if (sk->sk_err)
> +		return -sk->sk_err;
> +
> +	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> +	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
> +	if (rc < 0)
> +		return rc;
> +
> +	pfrag = sk_page_frag(sk);
> +
> +	/* TLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and

s/TLS_TLS_HEADER_SIZE/TLS_HEADER_SIZE/

> +	 * we need to leave room for an authentication tag.
> +	 */
> +	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
> +			      tls_ctx->prepend_size;
> +	do {
> +		if (tls_do_allocation(sk, ctx, pfrag,
> +				      tls_ctx->prepend_size)) {

So you do this block if tls_do_allocation() fails, right?  This is not
clear to the drive-by reader; it looks a bit like the opposite.  I'd
suggest something a little more obvious like

		rc = tls_do_allocation(sk, ctx, pfrag,
				       tls_ctx->prepend_size);
		if (rc) {


> +			rc = sk_stream_wait_memory(sk, &timeo);
> +			if (!rc)
> +				continue;
> +
> +			record = ctx->open_record;
> +			if (!record)
> +				break;
> +handle_error:
> +			if (record_type != TLS_RECORD_TYPE_DATA) {
> +				/* avoid sending partial
> +				 * record with type !=
> +				 * application_data
> +				 */
> +				size = orig_size;
> +				destroy_record(record);
> +				ctx->open_record = NULL;
> +			} else if (record->len > tls_ctx->prepend_size) {
> +				goto last_record;
> +			}
> +
> +			break;
> +		}
> +
> +		record = ctx->open_record;
> +		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
> +		copy = min_t(size_t, copy, (max_open_record_len - record->len));
> +
> +		if (copy_from_iter_nocache(page_address(pfrag->page) +
> +					       pfrag->offset,
> +					   copy, msg_iter) != copy) {
> +			rc = -EFAULT;
> +			goto handle_error;

This jumping around begins to feel a bit convoluted - is there another 
way you can handle this?

> +		}
> +		tls_append_frag(record, pfrag, copy);
> +
> +		size -= copy;
> +		if (!size) {
> +last_record:
> +			tls_push_record_flags = flags;
> +			if (more) {
> +				tls_ctx->pending_open_record_frags =
> +						record->num_frags;
> +				break;
> +			}
> +
> +			done = true;
> +		}
> +
> +		if ((done) || record->len >= max_open_record_len ||

parens around (done) are unnecessary

> +		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
> +			rc = tls_push_record(sk,
> +					     tls_ctx,
> +					     ctx,
> +					     record,
> +					     pfrag,
> +					     tls_push_record_flags,
> +					     record_type);
> +			if (rc < 0)
> +				break;
> +		}
> +	} while (!done);
> +
> +	if (orig_size - size > 0)
> +		rc = orig_size - size;

If there was an error returned from tls_push_record(), will this 
overwrite the error rc code?
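
To make the question concrete, here is the tail of the function reduced
to a standalone sketch, next to one possible alternative.  Both helper
names are mine, not from the patch.  Returning the byte count after a
partial send is common for send-style calls, so the posted behaviour
may well be intentional; if so, a comment saying that would help.

```c
#include <errno.h>
#include <stddef.h>

/* Mirrors the quoted tail: report bytes consumed if any, otherwise rc. */
static long result_as_posted(size_t orig_size, size_t remaining, int rc)
{
	if (orig_size - remaining > 0)
		return (long)(orig_size - remaining);

	return rc;
}

/* Alternative for comparison: an error always takes priority. */
static long result_error_first(size_t orig_size, size_t remaining, int rc)
{
	if (rc < 0)
		return rc;

	return (long)(orig_size - remaining);
}
```

With 60 of 100 bytes consumed and rc == -EIO, the first variant returns
60 and the error is dropped, while the second returns -EIO.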

> +
> +	return rc;
> +}
> +
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> +	unsigned char record_type = TLS_RECORD_TYPE_DATA;
> +	int rc = 0;

rc initialization unnecessary

> +
> +	lock_sock(sk);
> +
> +	if (unlikely(msg->msg_controllen)) {
> +		rc = tls_proccess_cmsg(sk, msg, &record_type);
> +		if (rc)
> +			goto out;
> +	}
> +
> +	rc = tls_push_data(sk, &msg->msg_iter, size,
> +			   msg->msg_flags, record_type);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +	char *kaddr = kmap(page);
> +	struct kvec iov;
> +	int rc = 0;

rc initialization unnecessary

> +
> +	if (flags & MSG_SENDPAGE_NOTLAST)
> +		flags |= MSG_MORE;
> +
> +	lock_sock(sk);
> +
> +	if (flags & MSG_OOB) {
> +		rc = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kaddr + offset;
> +	iov.iov_len = size;
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
> +	rc = tls_push_data(sk, &msg_iter, size,
> +			   flags, TLS_RECORD_TYPE_DATA);
> +	kunmap(page);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn)
> +{
> +	u64 record_sn = context->hint_record_sn;
> +	struct tls_record_info *info;
> +
> +	info = context->retransmit_hint;
> +	if (!info ||
> +	    before(seq, info->end_seq - info->len)) {
> +		/* if retransmit_hint is irrelevant start
> +		 * from the begging of the list

s/begging/beginning/

> +		 */
> +		info = list_first_entry(&context->records_list,
> +					struct tls_record_info, list);
> +		record_sn = context->unacked_record_sn;
> +	}
> +
> +	list_for_each_entry_from(info, &context->records_list, list) {
> +		if (before(seq, info->end_seq)) {
> +			if (!context->retransmit_hint ||
> +			    after(info->end_seq,
> +				  context->retransmit_hint->end_seq)) {
> +				context->hint_record_sn = record_sn;
> +				context->retransmit_hint = info;
> +			}
> +			*p_record_sn = record_sn;
> +			return info;
> +		}
> +		record_sn++;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(tls_get_record);
> +
> +static int tls_device_push_pending_record(struct sock *sk, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
> +	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
> +}
> +
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
> +{
> +	u16 nonece_size, tag_size, iv_size, rec_seq_size;

s/nonece/nonce/

> +	struct tls_record_info *start_marker_record;
> +	struct tls_offload_context *offload_ctx;
> +	struct tls_crypto_info *crypto_info;
> +	struct net_device *netdev;
> +	char *iv, *rec_seq;
> +	struct sk_buff *skb;
> +	int rc = -EINVAL;
> +	__be64 rcd_sn;
> +
> +	if (!ctx)
> +		goto out;
> +
> +	if (ctx->priv_ctx) {
> +		rc = -EEXIST;
> +		goto out;
> +	}
> +
> +	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
> +	if (!start_marker_record) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
> +	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
> +	if (!offload_ctx) {
> +		rc = -ENOMEM;
> +		goto free_marker_record;
> +	}
> +
> +	crypto_info = &ctx->crypto_send;
> +	switch (crypto_info->cipher_type) {
> +	case TLS_CIPHER_AES_GCM_128: {
> +		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
> +		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
> +		rec_seq =
> +		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
> +		break;
> +	}

{}'s are unnecessary here

> +	default:
> +		rc = -EINVAL;
> +		goto free_offload_ctx;
> +	}
> +
> +	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
> +	ctx->tag_size = tag_size;
> +	ctx->iv_size = iv_size;
> +	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
> +			  GFP_KERNEL);
> +	if (!ctx->iv) {
> +		rc = -ENOMEM;
> +		goto free_offload_ctx;
> +	}
> +
> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
> +
> +	ctx->rec_seq_size = rec_seq_size;
> +	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
> +	if (!ctx->rec_seq) {
> +		rc = -ENOMEM;
> +		goto free_iv;
> +	}
> +	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
> +
> +	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
> +	if (rc)
> +		goto free_rec_seq;
> +
> +	/* start at rec_seq - 1 to account for the start marker record */
> +	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
> +	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> +
> +	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
> +	start_marker_record->len = 0;
> +	start_marker_record->num_frags = 0;
> +
> +	INIT_LIST_HEAD(&offload_ctx->records_list);
> +	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
> +	spin_lock_init(&offload_ctx->lock);
> +
> +	static_branch_inc(&clean_acked_data_enabled);
> +	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
> +	ctx->push_pending_record = tls_device_push_pending_record;
> +	offload_ctx->sk_destruct = sk->sk_destruct;
> +
> +	/* TLS offload is greatly simplified if we don't send
> +	 * SKBs where only part of the payload needs to be encrypted.
> +	 * So mark the last skb in the write queue as end of record.
> +	 */
> +	skb = tcp_write_queue_tail(sk);
> +	if (skb)
> +		TCP_SKB_CB(skb)->eor = 1;
> +
> +	refcount_set(&ctx->refcount, 1);
> +
> +	/* We support starting offload on multiple sockets
> +	 * concurrently, so we only need a read lock here.
> +	 * This lock must preceed get_netdev_for_sock to prevent races between
> +	 * NETDEV_DOWN and setsockopt.
> +	 */
> +	down_read(&device_offload_lock);
> +	netdev = get_netdev_for_sock(sk);
> +	if (!netdev) {
> +		pr_err_ratelimited("%s: netdev not found\n", __func__);
> +		rc = -EINVAL;
> +		goto release_lock;
> +	}
> +
> +	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
> +		rc = -ENOTSUPP;
> +		goto release_netdev;
> +	}
> +
> +	/* Avoid offloading if the device is down
> +	 * We don't want to offload new flows after
> +	 * the NETDEV_DOWN event
> +	 */
> +	if (!(netdev->flags & IFF_UP)) {
> +		rc = -EINVAL;
> +		goto release_netdev;
> +	}
> +
> +	ctx->priv_ctx = offload_ctx;
> +	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,

Do you have a check somewhere that guarantees any netdev with 
NETIF_F_HW_TLS_TX set actually has the tlsdev_ops defined so you can 
call this without checking it?

> +					     &ctx->crypto_send,
> +					     tcp_sk(sk)->write_seq);
> +	if (rc)
> +		goto release_netdev;
> +
> +	ctx->netdev = netdev;
> +
> +	spin_lock_irq(&tls_device_lock);
> +	list_add_tail(&ctx->list, &tls_device_list);
> +	spin_unlock_irq(&tls_device_lock);
> +
> +	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
> +	/* following this assignment tls_is_sk_tx_device_offloaded
> +	 * will return true and the context might be accessed
> +	 * by the netdev's xmit function.
> +	 */
> +	smp_store_release(&sk->sk_destruct,
> +			  &tls_device_sk_destruct);
> +	up_read(&device_offload_lock);
> +	goto out;
> +
> +release_netdev:
> +	dev_put(netdev);
> +release_lock:
> +	up_read(&device_offload_lock);
> +	static_branch_dec(&clean_acked_data_enabled);
> +	crypto_free_aead(offload_ctx->aead_send);
> +free_rec_seq:
> +	kfree(ctx->rec_seq);
> +free_iv:
> +	kfree(ctx->iv);
> +free_offload_ctx:
> +	kfree(offload_ctx);
> +	ctx->priv_ctx = NULL;
> +free_marker_record:
> +	kfree(start_marker_record);
> +out:
> +	return rc;
> +}
> +
> +static int tls_device_down(struct net_device *netdev)
> +{
> +	struct tls_context *ctx, *tmp;
> +	unsigned long flags;
> +	LIST_HEAD(list);
> +
> +	/* Request a write lock to block new offload attempts
> +	 */

This can be a single-line comment.

> +	down_write(&device_offload_lock);
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
> +		if (ctx->netdev != netdev ||
> +		    !refcount_inc_not_zero(&ctx->refcount))
> +			continue;
> +
> +		list_move(&ctx->list, &list);
> +	}
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &list, list)	{
> +		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +						TLS_OFFLOAD_CTX_DIR_TX);

Are tlsdev_ops and tls_dev_del defined?

> +		ctx->netdev = NULL;
> +		dev_put(netdev);
> +		list_del_init(&ctx->list);
> +
> +		if (refcount_dec_and_test(&ctx->refcount))
> +			tls_device_free_ctx(ctx);
> +	}
> +
> +	up_write(&device_offload_lock);
> +
> +	flush_work(&tls_device_gc_work);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int tls_dev_event(struct notifier_block *this, unsigned long event,
> +			 void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +
> +	if (!(dev->features & NETIF_F_HW_TLS_TX))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +	case NETDEV_FEAT_CHANGE:
> +		return dev->tlsdev_ops ? NOTIFY_DONE : NOTIFY_BAD;

Okay, you've got a check for tlsdev_ops, but what about the function 
pointers that are assumed to be good?
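
One option would be to validate the callbacks once, at
NETDEV_REGISTER/NETDEV_FEAT_CHANGE time, so the call sites can keep
assuming they are set.  A minimal sketch of such a check, with made-up
names (sketch_tls_ops, sketch_tls_ops_valid() and the stub callbacks
are not from the patch, and the callback signatures are simplified):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table holding the two callbacks the patch calls. */
struct sketch_tls_ops {
	int (*tls_dev_add)(void *dev, void *sk);
	void (*tls_dev_del)(void *dev, void *ctx);
};

/* True only if the table exists and every required callback is set. */
static bool sketch_tls_ops_valid(const struct sketch_tls_ops *ops)
{
	return ops && ops->tls_dev_add && ops->tls_dev_del;
}

/* Stub callbacks so a fully populated table can be built for testing. */
static int stub_add(void *dev, void *sk)
{
	(void)dev;
	(void)sk;
	return 0;
}

static void stub_del(void *dev, void *ctx)
{
	(void)dev;
	(void)ctx;
}
```

The notifier could then return NOTIFY_BAD whenever such a check fails,
instead of testing only the ops pointer.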

> +	case NETDEV_DOWN:
> +		return tls_device_down(dev);
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block tls_dev_notifier = {
> +	.notifier_call	= tls_dev_event,
> +};
> +
> +void __init tls_device_init(void)
> +{
> +	register_netdevice_notifier(&tls_dev_notifier);
> +}
> +
> +void __exit tls_device_cleanup(void)
> +{
> +	unregister_netdevice_notifier(&tls_dev_notifier);
> +	flush_work(&tls_device_gc_work);
> +}
> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> new file mode 100644
> index 000000000000..f1302f479209
> --- /dev/null
> +++ b/net/tls/tls_device_fallback.c
> @@ -0,0 +1,412 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses.  You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +#include <crypto/scatterwalk.h>
> +#include <net/ip6_checksum.h>
> +
> +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
> +{
> +	struct scatterlist *src = walk->sg;
> +	int diff = walk->offset - src->offset;
> +
> +	sg_set_page(sg, sg_page(src),
> +		    src->length - diff, walk->offset);
> +
> +	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
> +}
> +
> +static int tls_enc_record(struct aead_request *aead_req,
> +			  struct crypto_aead *aead, char *aad, char *iv,
> +			  __be64 rcd_sn, struct scatter_walk *in,
> +			  struct scatter_walk *out, int *in_len)
> +{
> +	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
> +	struct scatterlist sg_in[3];
> +	struct scatterlist sg_out[3];
> +	u16 len;
> +	int rc;
> +
> +	len = min_t(int, *in_len, ARRAY_SIZE(buf));
> +
> +	scatterwalk_copychunks(buf, in, len, 0);
> +	scatterwalk_copychunks(buf, out, len, 1);
> +
> +	*in_len -= len;
> +	if (!*in_len)
> +		return 0;
> +
> +	scatterwalk_pagedone(in, 0, 1);
> +	scatterwalk_pagedone(out, 1, 1);
> +
> +	len = buf[4] | (buf[3] << 8);
> +	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +
> +	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
> +		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
> +
> +	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
> +	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
> +
> +	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
> +	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
> +	chain_to_walk(sg_in + 1, in);
> +	chain_to_walk(sg_out + 1, out);
> +
> +	*in_len -= len;
> +	if (*in_len < 0) {
> +		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		if (*in_len < 0)
> +		/* the input buffer doesn't contain the entire record.

s/./, so/

> +		 * trim len accordingly. The resulting authentication tag
> +		 * will contain garbage. but we don't care as we won't

s/garbage./garbage,/

> +		 * include any of it in the output skb
> +		 * Note that we assume the output buffer length
> +		 * is larger than input buffer length + tag size
> +		 */
> +			len += *in_len;

Especially with a comment that large, I think the if (*in_len < 0) 
should come after the comment, right next to the one line it protects.
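
A minimal sketch of that layout, reduced to plain integers so it stands
alone. trim_record_len() and TAG_SIZE are hypothetical stand-ins (TAG_SIZE
mirrors the 16-byte AES-GCM-128 tag); only the control flow is meant to match
the patch.

```c
#define TAG_SIZE 16	/* stand-in for TLS_CIPHER_AES_GCM_128_TAG_SIZE */

/* Hypothetical helper: consume one record of 'len' bytes from '*in_len'
 * and return the (possibly trimmed) length to process.
 */
static int trim_record_len(int *in_len, int len)
{
	*in_len -= len;
	if (*in_len < 0) {
		*in_len += TAG_SIZE;
		/* The input buffer doesn't contain the entire record, so
		 * trim len accordingly. The resulting authentication tag
		 * will contain garbage, but none of it is copied to the
		 * output skb.
		 */
		if (*in_len < 0)
			len += *in_len;

		*in_len = 0;
	}
	return len;
}
```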

> +
> +		*in_len = 0;
> +	}
> +
> +	if (*in_len) {
> +		scatterwalk_copychunks(NULL, in, len, 2);
> +		scatterwalk_pagedone(in, 0, 1);
> +		scatterwalk_copychunks(NULL, out, len, 2);
> +		scatterwalk_pagedone(out, 1, 1);
> +	}
> +
> +	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
> +
> +	rc = crypto_aead_encrypt(aead_req);
> +
> +	return rc;
> +}
> +
> +static void tls_init_aead_request(struct aead_request *aead_req,
> +				  struct crypto_aead *aead)
> +{
> +	aead_request_set_tfm(aead_req, aead);
> +	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
> +}
> +
> +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
> +						   gfp_t flags)
> +{
> +	unsigned int req_size = sizeof(struct aead_request) +
> +		crypto_aead_reqsize(aead);
> +	struct aead_request *aead_req;
> +
> +	aead_req = kzalloc(req_size, flags);
> +	if (!aead_req)
> +		return NULL;
> +
> +	tls_init_aead_request(aead_req, aead);
> +	return aead_req;

This could be turned around and shortened a little

	aead_req = kzalloc(req_size, flags);
	if (aead_req)
		tls_init_aead_request(aead_req, aead);
	return aead_req;


> +}
> +
> +static int tls_enc_records(struct aead_request *aead_req,
> +			   struct crypto_aead *aead, struct scatterlist *sg_in,
> +			   struct scatterlist *sg_out, char *aad, char *iv,
> +			   u64 rcd_sn, int len)
> +{
> +	struct scatter_walk out, in;
> +	int rc;
> +
> +	scatterwalk_start(&in, sg_in);
> +	scatterwalk_start(&out, sg_out);
> +
> +	do {
> +		rc = tls_enc_record(aead_req, aead, aad, iv,
> +				    cpu_to_be64(rcd_sn), &in, &out, &len);
> +		rcd_sn++;
> +
> +	} while (rc == 0 && len);
> +
> +	scatterwalk_done(&in, 0, 0);
> +	scatterwalk_done(&out, 1, 0);
> +
> +	return rc;
> +}
> +
> +/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
> + * might have been changed by NAT.
> + */
> +static inline void update_chksum(struct sk_buff *skb, int headln)
> +{
> +	struct tcphdr *th = tcp_hdr(skb);
> +	int datalen = skb->len - headln;
> +	const struct ipv6hdr *ipv6h;
> +	const struct iphdr *iph;
> +
> +	/* We only changed the payload so if we are using partial we don't
> +	 * need to update anything.
> +	 */
> +	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
> +		return;
> +
> +	skb->ip_summed = CHECKSUM_PARTIAL;
> +	skb->csum_start = skb_transport_header(skb) - skb->head;
> +	skb->csum_offset = offsetof(struct tcphdr, check);
> +
> +	if (skb->sk->sk_family == AF_INET6) {
> +		ipv6h = ipv6_hdr(skb);
> +		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
> +					     datalen, IPPROTO_TCP, 0);
> +	} else {
> +		iph = ip_hdr(skb);
> +		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
> +					       IPPROTO_TCP, 0);
> +	}
> +}
> +
> +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
> +{
> +	skb_copy_header(nskb, skb);
> +
> +	skb_put(nskb, skb->len);
> +	memcpy(nskb->data, skb->data, headln);
> +	update_chksum(nskb, headln);
> +
> +	nskb->destructor = skb->destructor;
> +	nskb->sk = skb->sk;
> +	skb->destructor = NULL;
> +	skb->sk = NULL;
> +	refcount_add(nskb->truesize - skb->truesize,
> +		     &nskb->sk->sk_wmem_alloc);
> +}
> +
> +/* This function may be called after the user socket is already
> + * closed so make sure we don't use anything freed during
> + * tls_sk_proto_close here
> + */
> +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
> +{
> +	int tcp_header_size = tcp_hdrlen(skb);
> +	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
> +	int payload_len = skb->len - tcp_payload_offset;
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	int remaining, buf_len, resync_sgs, rc, i = 0;
> +	void *buf, *dummy_buf, *iv, *aad;
> +	struct scatterlist *sg_in, sg_out[3];
> +	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> +	struct aead_request *aead_req;
> +	struct sk_buff *nskb = NULL;
> +	struct tls_record_info *record;
> +	unsigned long flags;
> +	s32 sync_size;
> +	u64 rcd_sn;
> +
> +	/* worst case is:
> +	 * MAX_SKB_FRAGS in tls_record_info
> +	 * MAX_SKB_FRAGS + 1 in SKB head and frags.
> +	 */
> +	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
> +
> +	if (!payload_len)
> +		return skb;
> +
> +	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
> +	if (!sg_in)
> +		goto free_orig;
> +
> +	sg_init_table(sg_in, sg_in_max_elements);
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
> +	if (!record) {
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		WARN(1, "Record not found for seq %u\n", tcp_seq);
> +		goto free_sg;
> +	}
> +
> +	sync_size = tcp_seq - tls_record_start_seq(record);
> +	if (sync_size < 0) {
> +		int is_start_marker = tls_record_is_start_marker(record);
> +
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		if (!is_start_marker)
> +		/* This should only occur if the relevant record was
> +		 * already acked. In that case it should be ok
> +		 * to drop the packet and avoid retransmission.
> +		 *
> +		 * There is a corner case where the packet contains
> +		 * both an acked and a non-acked record.
> +		 * We currently don't handle that case and rely
> +		 * on TCP to retransmit a packet that doesn't contain
> +		 * already acked payload.
> +		 */
> +			goto free_orig;

Again, let's keep the "if ..." closer to the one line being protected.

> +
> +		if (payload_len > -sync_size) {
> +			WARN(1, "Fallback of partially offloaded packets is not supported\n");
> +			goto free_sg;
> +		} else {
> +			return skb;
> +		}
> +	}
> +
> +	remaining = sync_size;

It would be a bit clearer, and more future safe, to set i=0 here rather 
than rely on the initialization way back at the top of the function.  In 
fact, why not use a normal for-loop?
	for (i = 0; remaining > 0; i++)
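
Spelled out as a standalone sketch, with plain int arrays standing in for
record->frags[] and sg_in[] (map_frags() and both array parameters are
hypothetical, and the page referencing is left out):

```c
/* Hypothetical stand-in for the fragment walk: copy fragment sizes into
 * sg_len[] until 'remaining' bytes are covered, trimming the last entry.
 * Returns the number of entries used (resync_sgs in the patch).
 */
static int map_frags(const int *frag_size, int *sg_len, int remaining)
{
	int i;

	for (i = 0; remaining > 0; i++) {
		sg_len[i] = frag_size[i];
		remaining -= frag_size[i];

		/* The last fragment overshoots: shorten it by the excess. */
		if (remaining < 0)
			sg_len[i] += remaining;
	}
	return i;
}
```

Since i is initialized in the loop header, the result no longer depends on
the declaration at the top of the function.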

> +	while (remaining > 0) {
> +		skb_frag_t *frag = &record->frags[i];
> +
> +		__skb_frag_ref(frag);
> +		sg_set_page(sg_in + i, skb_frag_page(frag),
> +			    skb_frag_size(frag), frag->page_offset);
> +
> +		remaining -= skb_frag_size(frag);
> +
> +		if (remaining < 0)
> +			sg_in[i].length += remaining;
> +
> +		i++;
> +	}
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +	resync_sgs = i;
> +
> +	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
> +	if (!aead_req)
> +		goto put_sg;
> +
> +	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
> +		  TLS_AAD_SPACE_SIZE +
> +		  sync_size +
> +		  tls_ctx->tag_size;
> +	buf = kmalloc(buf_len, GFP_ATOMIC);
> +	if (!buf)
> +		goto free_req;
> +
> +	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
> +	if (!nskb)
> +		goto free_buf;
> +
> +	skb_reserve(nskb, skb_headroom(skb));
> +
> +	iv = buf;
> +
> +	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
> +	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> +	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
> +
> +	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
> +	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
> +		   payload_len);
> +	/* Add room for authentication tag produced by crypto */
> +	dummy_buf += sync_size;
> +	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
> +	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
> +			  payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
> +			     rcd_sn, sync_size + payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	complete_skb(nskb, skb, tcp_payload_offset);
> +
> +	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
> +	 * nskb->prev will point to the skb itself
> +	 */
> +	nskb->prev = nskb;
> +free_buf:
> +	kfree(buf);
> +free_req:
> +	kfree(aead_req);
> +put_sg:
> +	for (i = 0; i < resync_sgs; i++)
> +		put_page(sg_page(&sg_in[i]));
> +free_sg:
> +	kfree(sg_in);
> +free_orig:
> +	kfree_skb(skb);
> +	return nskb;
> +
> +free_nskb:
> +	kfree_skb(nskb);
> +	nskb = NULL;
> +	goto free_buf;
> +}
> +
> +struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> +				      struct net_device *dev,
> +				      struct sk_buff *skb)
> +{
> +	if (dev == tls_get_ctx(sk)->netdev)
> +		return skb;
> +
> +	return tls_sw_fallback(sk, skb);
> +}
> +
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info)
> +{
> +	const u8 *key;
> +	int rc;
> +
> +	offload_ctx->aead_send =
> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> +	if (IS_ERR(offload_ctx->aead_send)) {
> +		rc = PTR_ERR(offload_ctx->aead_send);
> +		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
> +		offload_ctx->aead_send = NULL;
> +		goto err_out;
> +	}
> +
> +	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
> +
> +	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
> +				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
> +				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	return 0;
> +free_aead:
> +	crypto_free_aead(offload_ctx->aead_send);
> +err_out:
> +	return rc;
> +}
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index d824d548447e..e0dface33017 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -54,6 +54,9 @@ enum {
>   enum {
>   	TLS_BASE_TX,
>   	TLS_SW_TX,
> +#ifdef CONFIG_TLS_DEVICE
> +	TLS_HW_TX,
> +#endif
>   	TLS_NUM_CONFIG,
>   };
>   
> @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
>   		goto err_crypto_info;
>   	}
>   
> -	/* currently SW is default, we will have ethtool in future */
> -	rc = tls_set_sw_offload(sk, ctx);
> -	tx_conf = TLS_SW_TX;
> -	if (rc)
> -		goto err_crypto_info;
> +#ifdef CONFIG_TLS_DEVICE
> +	rc = tls_set_device_offload(sk, ctx);
> +	tx_conf = TLS_HW_TX;
> +	if (rc) {
> +#else
> +	{
> +#endif
> +		/* if HW offload fails fallback to SW */
> +		rc = tls_set_sw_offload(sk, ctx);
> +		tx_conf = TLS_SW_TX;
> +		if (rc)
> +			goto err_crypto_info;
> +	}
>   
>   	ctx->tx_conf = tx_conf;
>   	update_sk_prot(sk, ctx);
> @@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
>   	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
>   	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
>   	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
> +
> +#ifdef CONFIG_TLS_DEVICE
> +	prot[TLS_HW_TX] = prot[TLS_SW_TX];
> +	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
> +	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
> +#endif
>   }
>   
>   static int tls_init(struct sock *sk)
> @@ -531,6 +548,9 @@ static int __init tls_register(void)
>   {
>   	build_protos(tls_prots[TLSV4], &tcp_prot);
>   
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_init();
> +#endif
>   	tcp_register_ulp(&tcp_tls_ulp_ops);
>   
>   	return 0;
> @@ -539,6 +559,9 @@ static int __init tls_register(void)
>   static void __exit tls_unregister(void)
>   {
>   	tcp_unregister_ulp(&tcp_tls_ulp_ops);
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_cleanup();
> +#endif
>   }
>   
>   module_init(tls_register);
> 

^ permalink raw reply

* Re: [PATCH v2] KEYS: DNS: limit the length of option strings
From: Eric Biggers @ 2018-03-23 20:21 UTC (permalink / raw)
  To: David Howells; +Cc: keyrings, netdev, Mark Rutland, Eric Biggers
In-Reply-To: <20180312175707.GA230165@gmail.com>

On Mon, Mar 12, 2018 at 10:57:07AM -0700, Eric Biggers wrote:
> On Wed, Mar 07, 2018 at 03:54:37PM +0000, David Howells wrote:
> > Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > > Fix it by limiting option strings (combined name + value) to a much more
> > > reasonable 128 bytes.  The exact limit is arbitrary, but currently the
> > > only recognized option is formatted as "dnserror=%lu" which fits well
> > > within this limit.
> > 
> > There will be more options coming ("ipv4", "ipv6") but they shouldn't overrun
> > this limit and we can always extend the limit if need be.
> > 
> > David
> 
> David (Howells) do you want to take this patch through the keyrings tree or
> should I ask David Miller to take it through net-next?
> 
> Eric

Ping.

^ permalink raw reply

* [iproute PATCH] man: tc-vlan.8: Fix for incorrect example
From: Phil Sutter @ 2018-03-23 20:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

The example needs a second match statement within the same u32 filter, not
a second u32 filter (which tc-filter doesn't support at all).

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 man/man8/tc-vlan.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/tc-vlan.8 b/man/man8/tc-vlan.8
index 59c81e86378d0..f5ffc25f054ed 100644
--- a/man/man8/tc-vlan.8
+++ b/man/man8/tc-vlan.8
@@ -103,7 +103,7 @@ into VLAN ID 123:
 #tc qdisc add dev eth0 handle ffff: ingress
 #tc filter add dev eth0 parent ffff: pref 11 protocol ip \\
 	u32 match ip protocol 1 0xff flowid 1:1 \\
-	u32 match ip src 10.0.0.2 flowid 1:1 \\
+	    match ip src 10.0.0.2 flowid 1:1 \\
 	action vlan push id 123
 .EE
 .RE
-- 
2.16.1

^ permalink raw reply related

* [PATCH net-next 8/8] MAINTAINERS: Add entry for Microsemi Ethernet switches
From: Alexandre Belloni @ 2018-03-23 20:11 UTC (permalink / raw)
  To: David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, Florian Fainelli, netdev, devicetree, linux-kernel,
	linux-mips, Alexandre Belloni
In-Reply-To: <20180323201117.8416-1-alexandre.belloni@bootlin.com>

Add myself as a maintainer for the Microsemi Ethernet switches.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 14ce8b290fea..45be80225130 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9177,6 +9177,12 @@ F:	include/linux/cciss*.h
 F:	include/uapi/linux/cciss*.h
 F:	Documentation/scsi/smartpqi.txt
 
+MICROSEMI ETHERNET SWITCH DRIVER
+M:	Alexandre Belloni <alexandre.belloni@bootlin.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/ethernet/mscc/
+
 MICROSOFT SURFACE PRO 3 BUTTON DRIVER
 M:	Chen Yu <yu.c.chen@intel.com>
 L:	platform-driver-x86@vger.kernel.org
-- 
2.16.2

^ permalink raw reply related

* [PATCH net-next 7/8] MIPS: mscc: connect phys to ports on ocelot_pcb123
From: Alexandre Belloni @ 2018-03-23 20:11 UTC (permalink / raw)
  To: David S . Miller
  Cc: Allan Nielsen, razvan.stefanescu, po.liu, Thomas Petazzoni,
	Andrew Lunn, Florian Fainelli, netdev, devicetree, linux-kernel,
	linux-mips, Alexandre Belloni, James Hogan
In-Reply-To: <20180323201117.8416-1-alexandre.belloni@bootlin.com>

Add phy to switch port connections for PCB123 for internal PHYs.

Cc: James Hogan <jhogan@kernel.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
---
 arch/mips/boot/dts/mscc/ocelot_pcb123.dts | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/mips/boot/dts/mscc/ocelot_pcb123.dts b/arch/mips/boot/dts/mscc/ocelot_pcb123.dts
index 29d6414f8886..66b48f664975 100644
--- a/arch/mips/boot/dts/mscc/ocelot_pcb123.dts
+++ b/arch/mips/boot/dts/mscc/ocelot_pcb123.dts
@@ -25,3 +25,19 @@
 &uart2 {
 	status = "okay";
 };
+
+&port0 {
+	phy-handle = <&phy0>;
+};
+
+&port1 {
+	phy-handle = <&phy1>;
+};
+
+&port2 {
+	phy-handle = <&phy2>;
+};
+
+&port3 {
+	phy-handle = <&phy3>;
+};
-- 
2.16.2

^ permalink raw reply related
