Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH REPOST v4 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 * know there are new descriptors to fetch.
 		 */
 		wmb();
-		writel(ring->next_to_use, ring->tail);
+		writel_relaxed(ring->next_to_use, ring->tail);
 
 		xdp_do_flush_map();
 	}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 	ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	 * are new descriptors to fetch.
 	 */
 	wmb();
-	writel(ring->next_to_use, ring->tail);
+	writel_relaxed(ring->next_to_use, ring->tail);
 
 	return;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..edb1c34 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring,
 		 * such as IA-64).
 		*/
 		wmb();
-		writel(i, adapter->hw.hw_addr + rx_ring->tail);
+		writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
 	}
 }
 
@@ -2297,7 +2297,7 @@ static inline void igbvf_tx_queue_adv(struct igbvf_adapter *adapter,
 
 	tx_ring->buffer_info[first].next_to_watch = tx_desc;
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tail);
+	writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
 	/* we need this if more than one processor can write to our tail
 	 * at a time, it synchronizes IO on IA64/Altix systems
 	 */
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 4/7] igb: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: sulrich, netdev, timur, linux-kernel, Sinan Kaya, intel-wired-lan,
	linux-arm-msm, linux-arm-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
 	igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 5 -----
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..11e893e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,11 +244,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring)
 	return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-	writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)	\
 	(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)	\
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..6bf778a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 		 * such as IA-64).
 		 */
 		wmb();
-		ixgbevf_write_tail(rx_ring, i);
+		writel(i, rx_ring->tail);
 	}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	tx_ring->next_to_use = i;
 
 	/* notify HW of packet */
-	ixgbevf_write_tail(tx_ring, i);
+	writel(i, tx_ring->tail);
 
 	return;
 dma_error:
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 6/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6bf778a..774b2a6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	tx_ring->next_to_use = i;
 
 	/* notify HW of packet */
-	writel(i, tx_ring->tail);
+	writel_relaxed(i, tx_ring->tail);
 
 	return;
 dma_error:
-- 
2.7.4

^ permalink raw reply related

* [PATCH REPOST v4 7/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 18:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Sinan Kaya, intel-wired-lan, linux-kernel
In-Reply-To: <1521658572-26354-1-git-send-email-okaya@codeaurora.org>

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 8e12aae..eebef01 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -179,7 +179,7 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 cleaned_count)
 		wmb();
 
 		/* notify hardware of new descriptors */
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -1054,7 +1054,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
 	/* notify HW of packet */
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next RFC V1 0/5] Peer to Peer One-Step time stamping
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn

This series adds support for PTP (IEEE 1588) P2P one-step time
stamping along with a driver for a hardware device that supports this.

If the hardware supports p2p one-step, it subtracts the ingress time
stamp value from the Pdelay_Request correction field.  The user space
software stack then simply copies the correction field into the
Pdelay_Response, and on transmission the hardware adds the egress time
stamp into the correction field.

- Patch 1 adds the new option.
- Patches 2-4 adds support for MII time stamping in non-PHY devices.
- Patch 5 adds a driver implementing the new option.

Earlier today I posted user space support as an RFC on the
linuxptp-devel list.  Comments and review are most welcome.

Thanks,
Richard

Richard Cochran (5):
  net: Introduce peer to peer one step PTP time stamping.
  net: phy: Move time stamping interface into the generic mdio layer.
  net: Introduce field for the MII time stamper.
  net: Use the generic MII time stamper when available.
  net: mdio: Add a driver for InES time stamping IP core.

 Documentation/devicetree/bindings/net/ines-ptp.txt |  42 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   |   1 +
 drivers/net/phy/Makefile                           |   1 +
 drivers/net/phy/dp83640.c                          |  29 +-
 drivers/net/phy/ines_ptp.c                         | 857 +++++++++++++++++++++
 drivers/net/phy/mdio_bus.c                         |  51 +-
 drivers/net/phy/phy.c                              |   6 +-
 drivers/ptp/Kconfig                                |  10 +
 include/linux/mdio.h                               |  23 +
 include/linux/netdevice.h                          |   1 +
 include/linux/phy.h                                |  23 -
 include/uapi/linux/net_tstamp.h                    |   8 +
 net/Kconfig                                        |   8 +-
 net/core/dev_ioctl.c                               |   1 +
 net/core/ethtool.c                                 |   5 +-
 net/core/timestamping.c                            |  36 +-
 16 files changed, 1034 insertions(+), 68 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/ines-ptp.txt
 create mode 100644 drivers/net/phy/ines_ptp.c

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <cover.1521656774.git.richardcochran@gmail.com>

The 1588 standard defines one step operation for both Sync and
PDelay_Resp messages.  Up until now, hardware with P2P one step has
been rare, and kernel support was lacking.  This patch adds support of
the mode in anticipation of new hardware developments.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
 include/uapi/linux/net_tstamp.h                  | 8 ++++++++
 net/core/dev_ioctl.c                             | 1 +
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af4aadb..c6295e5c16af 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -15379,6 +15379,7 @@ int bnx2x_configure_ptp_filters(struct bnx2x *bp)
 		       NIG_REG_P0_TLLH_PTP_RULE_MASK, 0x3EEE);
 		break;
 	case HWTSTAMP_TX_ONESTEP_SYNC:
+	case HWTSTAMP_TX_ONESTEP_P2P:
 		BNX2X_ERR("One-step timestamping is not supported\n");
 		return -ERANGE;
 	}
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 4fe104b2411f..f89b5a836c2a 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -90,6 +90,14 @@ enum hwtstamp_tx_types {
 	 * queue.
 	 */
 	HWTSTAMP_TX_ONESTEP_SYNC,
+
+	/*
+	 * Same as HWTSTAMP_TX_ONESTEP_SYNC, but also enables time
+	 * stamp insertion directly into PDelay_Resp packets. In this
+	 * case, neither transmitted Sync nor PDelay_Resp packets will
+	 * receive a time stamp via the socket error queue.
+	 */
+	HWTSTAMP_TX_ONESTEP_P2P,
 };
 
 /* possible values for hwtstamp_config->rx_filter */
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 0ab1af04296c..cdda085e4b47 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -187,6 +187,7 @@ static int net_hwtstamp_validate(struct ifreq *ifr)
 	case HWTSTAMP_TX_OFF:
 	case HWTSTAMP_TX_ON:
 	case HWTSTAMP_TX_ONESTEP_SYNC:
+	case HWTSTAMP_TX_ONESTEP_P2P:
 		tx_type_valid = 1;
 		break;
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next RFC V1 2/5] net: phy: Move time stamping interface into the generic mdio layer.
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <cover.1521656774.git.richardcochran@gmail.com>

There are different ways of obtaining hardware time stamps on network
packets.  The ingress and egress times can be measured in the MAC, in
the PHY, or by a device listening on the MII bus.  Up until now, the
kernel has support for MAC and PHY time stamping, but not for other
MII bus devices.

This patch moves the PHY time stamping interface into the generic
mdio device in order to support MII time stamping hardware.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/phy/dp83640.c | 29 ++++++++++++++++++++---------
 drivers/net/phy/phy.c     |  4 ++--
 include/linux/mdio.h      | 23 +++++++++++++++++++++++
 include/linux/phy.h       | 23 -----------------------
 net/core/ethtool.c        |  4 ++--
 net/core/timestamping.c   |  8 ++++----
 6 files changed, 51 insertions(+), 40 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 654f42d00092..79aeb5eb471a 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -215,6 +215,10 @@ static LIST_HEAD(phyter_clocks);
 static DEFINE_MUTEX(phyter_clocks_lock);
 
 static void rx_timestamp_work(struct work_struct *work);
+static  int dp83640_ts_info(struct mdio_device *m, struct ethtool_ts_info *i);
+static  int dp83640_hwtstamp(struct mdio_device *m, struct ifreq *i);
+static bool dp83640_rxtstamp(struct mdio_device *m, struct sk_buff *s, int t);
+static void dp83640_txtstamp(struct mdio_device *m, struct sk_buff *s, int t);
 
 /* extended register access functions */
 
@@ -1162,6 +1166,12 @@ static int dp83640_probe(struct phy_device *phydev)
 		list_add_tail(&dp83640->list, &clock->phylist);
 
 	dp83640_clock_put(clock);
+
+	phydev->mdio.ts_info = dp83640_ts_info;
+	phydev->mdio.hwtstamp = dp83640_hwtstamp;
+	phydev->mdio.rxtstamp = dp83640_rxtstamp;
+	phydev->mdio.txtstamp = dp83640_txtstamp;
+
 	return 0;
 
 no_register:
@@ -1288,8 +1298,9 @@ static int dp83640_config_intr(struct phy_device *phydev)
 	}
 }
 
-static int dp83640_hwtstamp(struct phy_device *phydev, struct ifreq *ifr)
+static int dp83640_hwtstamp(struct mdio_device *mdev, struct ifreq *ifr)
 {
+	struct phy_device *phydev = container_of(mdev, struct phy_device, mdio);
 	struct dp83640_private *dp83640 = phydev->priv;
 	struct hwtstamp_config cfg;
 	u16 txcfg0, rxcfg0;
@@ -1397,9 +1408,10 @@ static void rx_timestamp_work(struct work_struct *work)
 		schedule_delayed_work(&dp83640->ts_work, SKB_TIMESTAMP_TIMEOUT);
 }
 
-static bool dp83640_rxtstamp(struct phy_device *phydev,
+static bool dp83640_rxtstamp(struct mdio_device *mdev,
 			     struct sk_buff *skb, int type)
 {
+	struct phy_device *phydev = container_of(mdev, struct phy_device, mdio);
 	struct dp83640_private *dp83640 = phydev->priv;
 	struct dp83640_skb_info *skb_info = (struct dp83640_skb_info *)skb->cb;
 	struct list_head *this, *next;
@@ -1446,9 +1458,10 @@ static bool dp83640_rxtstamp(struct phy_device *phydev,
 	return true;
 }
 
-static void dp83640_txtstamp(struct phy_device *phydev,
+static void dp83640_txtstamp(struct mdio_device *mdev,
 			     struct sk_buff *skb, int type)
 {
+	struct phy_device *phydev = container_of(mdev, struct phy_device, mdio);
 	struct dp83640_private *dp83640 = phydev->priv;
 
 	switch (dp83640->hwts_tx_en) {
@@ -1471,9 +1484,11 @@ static void dp83640_txtstamp(struct phy_device *phydev,
 	}
 }
 
-static int dp83640_ts_info(struct phy_device *dev, struct ethtool_ts_info *info)
+static int dp83640_ts_info(struct mdio_device *mdev,
+			   struct ethtool_ts_info *info)
 {
-	struct dp83640_private *dp83640 = dev->priv;
+	struct phy_device *phydev = container_of(mdev, struct phy_device, mdio);
+	struct dp83640_private *dp83640 = phydev->priv;
 
 	info->so_timestamping =
 		SOF_TIMESTAMPING_TX_HARDWARE |
@@ -1504,10 +1519,6 @@ static struct phy_driver dp83640_driver = {
 	.config_init	= dp83640_config_init,
 	.ack_interrupt  = dp83640_ack_interrupt,
 	.config_intr    = dp83640_config_intr,
-	.ts_info	= dp83640_ts_info,
-	.hwtstamp	= dp83640_hwtstamp,
-	.rxtstamp	= dp83640_rxtstamp,
-	.txtstamp	= dp83640_txtstamp,
 };
 
 static int __init dp83640_init(void)
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c2d9027be863..466bf88053ce 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -457,8 +457,8 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd)
 		return 0;
 
 	case SIOCSHWTSTAMP:
-		if (phydev->drv && phydev->drv->hwtstamp)
-			return phydev->drv->hwtstamp(phydev, ifr);
+		if (phydev->mdio.hwtstamp)
+			return phydev->mdio.hwtstamp(&phydev->mdio, ifr);
 		/* fall through */
 
 	default:
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index 2cfffe586885..70a82c973aed 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -37,6 +37,29 @@ struct mdio_device {
 	void (*device_free)(struct mdio_device *mdiodev);
 	void (*device_remove)(struct mdio_device *mdiodev);
 
+	/* Handles ethtool queries for hardware time stamping. */
+	int (*ts_info)(struct mdio_device *dev, struct ethtool_ts_info *ti);
+
+	/* Handles SIOCSHWTSTAMP ioctl for hardware time stamping. */
+	int  (*hwtstamp)(struct mdio_device *dev, struct ifreq *ifr);
+
+	/*
+	 * Requests a Rx timestamp for 'skb'. If the skb is accepted,
+	 * the mdio device promises to deliver it using netif_rx() as
+	 * soon as a timestamp becomes available. One of the
+	 * PTP_CLASS_ values is passed in 'type'. The function must
+	 * return true if the skb is accepted for delivery.
+	 */
+	bool (*rxtstamp)(struct mdio_device *dev, struct sk_buff *skb, int type);
+
+	/*
+	 * Requests a Tx timestamp for 'skb'. The mdio device promises
+	 * to deliver it using skb_complete_tx_timestamp() as soon as a
+	 * timestamp becomes available. One of the PTP_CLASS_ values
+	 * is passed in 'type'.
+	 */
+	void (*txtstamp)(struct mdio_device *dev, struct sk_buff *skb, int type);
+
 	/* Bus address of the MDIO device (0-31) */
 	int addr;
 	int flags;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 5a9b1753fdc5..77d3f703b8ce 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -570,29 +570,6 @@ struct phy_driver {
 	 */
 	int (*match_phy_device)(struct phy_device *phydev);
 
-	/* Handles ethtool queries for hardware time stamping. */
-	int (*ts_info)(struct phy_device *phydev, struct ethtool_ts_info *ti);
-
-	/* Handles SIOCSHWTSTAMP ioctl for hardware time stamping. */
-	int  (*hwtstamp)(struct phy_device *phydev, struct ifreq *ifr);
-
-	/*
-	 * Requests a Rx timestamp for 'skb'. If the skb is accepted,
-	 * the phy driver promises to deliver it using netif_rx() as
-	 * soon as a timestamp becomes available. One of the
-	 * PTP_CLASS_ values is passed in 'type'. The function must
-	 * return true if the skb is accepted for delivery.
-	 */
-	bool (*rxtstamp)(struct phy_device *dev, struct sk_buff *skb, int type);
-
-	/*
-	 * Requests a Tx timestamp for 'skb'. The phy driver promises
-	 * to deliver it using skb_complete_tx_timestamp() as soon as a
-	 * timestamp becomes available. One of the PTP_CLASS_ values
-	 * is passed in 'type'.
-	 */
-	void (*txtstamp)(struct phy_device *dev, struct sk_buff *skb, int type);
-
 	/* Some devices (e.g. qnap TS-119P II) require PHY register changes to
 	 * enable Wake on LAN, so set_wol is provided to be called in the
 	 * ethernet driver's set_wol function. */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 157cd9efa4be..b4b81eaa15a9 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -2218,8 +2218,8 @@ static int ethtool_get_ts_info(struct net_device *dev, void __user *useraddr)
 	memset(&info, 0, sizeof(info));
 	info.cmd = ETHTOOL_GET_TS_INFO;
 
-	if (phydev && phydev->drv && phydev->drv->ts_info) {
-		err = phydev->drv->ts_info(phydev, &info);
+	if (phydev && phydev->mdio.ts_info) {
+		err = phydev->mdio.ts_info(&phydev->mdio, &info);
 	} else if (ops->get_ts_info) {
 		err = ops->get_ts_info(dev, &info);
 	} else {
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index 42689d5c468c..b8857028e652 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -46,11 +46,11 @@ void skb_clone_tx_timestamp(struct sk_buff *skb)
 		return;
 
 	phydev = skb->dev->phydev;
-	if (likely(phydev->drv->txtstamp)) {
+	if (likely(phydev->mdio.txtstamp)) {
 		clone = skb_clone_sk(skb);
 		if (!clone)
 			return;
-		phydev->drv->txtstamp(phydev, clone, type);
+		phydev->mdio.txtstamp(&phydev->mdio, clone, type);
 	}
 }
 EXPORT_SYMBOL_GPL(skb_clone_tx_timestamp);
@@ -76,8 +76,8 @@ bool skb_defer_rx_timestamp(struct sk_buff *skb)
 		return false;
 
 	phydev = skb->dev->phydev;
-	if (likely(phydev->drv->rxtstamp))
-		return phydev->drv->rxtstamp(phydev, skb, type);
+	if (likely(phydev->mdio.rxtstamp))
+		return phydev->mdio.rxtstamp(&phydev->mdio, skb, type);
 
 	return false;
 }
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next RFC V1 3/5] net: Introduce field for the MII time stamper.
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <cover.1521656774.git.richardcochran@gmail.com>

This patch adds a new field to the network device structure to reference
a time stamping device on the MII bus.  By decoupling the time stamping
function from the PHY device, we pave the way to allowing a non-PHY
device to take this role.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/phy/mdio_bus.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/netdevice.h  |  1 +
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 24b5511222c8..fdac8c8ac272 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -717,6 +717,47 @@ static int mdio_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
+static bool mdiodev_supports_timestamping(struct mdio_device *mdiodev)
+{
+	if (mdiodev->ts_info  && mdiodev->hwtstamp &&
+	    mdiodev->rxtstamp && mdiodev->txtstamp)
+		return true;
+	else
+		return false;
+}
+
+static int mdiobus_netdev_notification(struct notifier_block *nb,
+				       unsigned long msg, void *ptr)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct phy_device *phydev = netdev->phydev;
+	struct mdio_device *mdev;
+	struct mii_bus *bus;
+	int i;
+
+	if (netdev->mdiots || msg != NETDEV_UP || !phydev)
+		return NOTIFY_DONE;
+
+	/*
+	 * Examine the MII bus associated with the PHY that is
+	 * attached to the MAC.  If there is a time stamping device
+	 * on the bus, then connect it to the network device.
+	 */
+	bus = phydev->mdio.bus;
+
+	for (i = 0; i < PHY_MAX_ADDR; i++) {
+		mdev = bus->mdio_map[i];
+		if (!mdev)
+			continue;
+		if (mdiodev_supports_timestamping(mdev)) {
+			netdev->mdiots = mdev;
+			return NOTIFY_OK;
+		}
+	}
+
+	return NOTIFY_DONE;
+}
+
 #ifdef CONFIG_PM
 static int mdio_bus_suspend(struct device *dev)
 {
@@ -772,6 +813,10 @@ struct bus_type mdio_bus_type = {
 };
 EXPORT_SYMBOL(mdio_bus_type);
 
+static struct notifier_block mdiobus_netdev_notifier __read_mostly = {
+	.notifier_call = mdiobus_netdev_notification,
+};
+
 int __init mdio_bus_init(void)
 {
 	int ret;
@@ -782,14 +827,18 @@ int __init mdio_bus_init(void)
 		if (ret)
 			class_unregister(&mdio_bus_class);
 	}
+	if (ret)
+		return ret;
+
+	return register_netdevice_notifier(&mdiobus_netdev_notifier);
 
-	return ret;
 }
 EXPORT_SYMBOL_GPL(mdio_bus_init);
 
 #if IS_ENABLED(CONFIG_PHYLIB)
 void mdio_bus_exit(void)
 {
+	unregister_netdevice_notifier(&mdiobus_netdev_notifier);
 	class_unregister(&mdio_bus_class);
 	bus_unregister(&mdio_bus_type);
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5fbb9f1da7fd..223d691aa0b0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1943,6 +1943,7 @@ struct net_device {
 	struct netprio_map __rcu *priomap;
 #endif
 	struct phy_device	*phydev;
+	struct mdio_device	*mdiots;
 	struct lock_class_key	*qdisc_tx_busylock;
 	struct lock_class_key	*qdisc_running_key;
 	bool			proto_down;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next RFC V1 4/5] net: Use the generic MII time stamper when available.
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <cover.1521656774.git.richardcochran@gmail.com>

Now that the infrastructure is in place, convert the PHY time stamping
logic to use the generic MII device.  This change has the added benefit
of simplifying the code somewhat.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/phy/phy.c   |  6 ++++--
 net/Kconfig             |  8 ++++----
 net/core/ethtool.c      |  5 ++---
 net/core/timestamping.c | 36 ++++++++++--------------------------
 4 files changed, 20 insertions(+), 35 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 466bf88053ce..df80c6b14478 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -397,6 +397,7 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd)
 	struct mii_ioctl_data *mii_data = if_mii(ifr);
 	u16 val = mii_data->val_in;
 	bool change_autoneg = false;
+	struct net_device *netdev;
 
 	switch (cmd) {
 	case SIOCGMIIPHY:
@@ -457,8 +458,9 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq *ifr, int cmd)
 		return 0;
 
 	case SIOCSHWTSTAMP:
-		if (phydev->mdio.hwtstamp)
-			return phydev->mdio.hwtstamp(&phydev->mdio, ifr);
+		netdev = phydev->attached_dev;
+		if (netdev->mdiots->hwtstamp)
+			return netdev->mdiots->hwtstamp(netdev->mdiots, ifr);
 		/* fall through */
 
 	default:
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..e38403cd010c 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -102,12 +102,12 @@ config NET_PTP_CLASSIFY
 	def_bool n
 
 config NETWORK_PHY_TIMESTAMPING
-	bool "Timestamping in PHY devices"
+	bool "Timestamping in PHY and MII bus devices"
 	select NET_PTP_CLASSIFY
 	help
-	  This allows timestamping of network packets by PHYs with
-	  hardware timestamping capabilities. This option adds some
-	  overhead in the transmit and receive paths.
+	  This allows timestamping of network packets by PHYs or other
+	  MII bus devices with hardware timestamping capabilities. This
+	  option adds some overhead in the transmit and receive paths.
 
 	  If you are unsure how to answer this question, answer N.
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index b4b81eaa15a9..507e56abecb7 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -2213,13 +2213,12 @@ static int ethtool_get_ts_info(struct net_device *dev, void __user *useraddr)
 	int err = 0;
 	struct ethtool_ts_info info;
 	const struct ethtool_ops *ops = dev->ethtool_ops;
-	struct phy_device *phydev = dev->phydev;
 
 	memset(&info, 0, sizeof(info));
 	info.cmd = ETHTOOL_GET_TS_INFO;
 
-	if (phydev && phydev->mdio.ts_info) {
-		err = phydev->mdio.ts_info(&phydev->mdio, &info);
+	if (dev->mdiots && dev->mdiots->ts_info) {
+		err = dev->mdiots->ts_info(dev->mdiots, &info);
 	} else if (ops->get_ts_info) {
 		err = ops->get_ts_info(dev, &info);
 	} else {
diff --git a/net/core/timestamping.c b/net/core/timestamping.c
index b8857028e652..ecb0ecb03740 100644
--- a/net/core/timestamping.c
+++ b/net/core/timestamping.c
@@ -23,44 +23,32 @@
 #include <linux/skbuff.h>
 #include <linux/export.h>
 
-static unsigned int classify(const struct sk_buff *skb)
-{
-	if (likely(skb->dev && skb->dev->phydev &&
-		   skb->dev->phydev->drv))
-		return ptp_classify_raw(skb);
-	else
-		return PTP_CLASS_NONE;
-}
-
 void skb_clone_tx_timestamp(struct sk_buff *skb)
 {
-	struct phy_device *phydev;
 	struct sk_buff *clone;
 	unsigned int type;
 
-	if (!skb->sk)
+	if (!skb->sk || !skb->dev || !skb->dev->mdiots)
 		return;
 
-	type = classify(skb);
+	type = ptp_classify_raw(skb);
+
 	if (type == PTP_CLASS_NONE)
 		return;
 
-	phydev = skb->dev->phydev;
-	if (likely(phydev->mdio.txtstamp)) {
-		clone = skb_clone_sk(skb);
-		if (!clone)
-			return;
-		phydev->mdio.txtstamp(&phydev->mdio, clone, type);
-	}
+	clone = skb_clone_sk(skb);
+	if (!clone)
+		return;
+
+	skb->dev->mdiots->txtstamp(skb->dev->mdiots, clone, type);
 }
 EXPORT_SYMBOL_GPL(skb_clone_tx_timestamp);
 
 bool skb_defer_rx_timestamp(struct sk_buff *skb)
 {
-	struct phy_device *phydev;
 	unsigned int type;
 
-	if (!skb->dev || !skb->dev->phydev || !skb->dev->phydev->drv)
+	if (!skb->dev || !skb->dev->mdiots)
 		return false;
 
 	if (skb_headroom(skb) < ETH_HLEN)
@@ -75,10 +63,6 @@ bool skb_defer_rx_timestamp(struct sk_buff *skb)
 	if (type == PTP_CLASS_NONE)
 		return false;
 
-	phydev = skb->dev->phydev;
-	if (likely(phydev->mdio.rxtstamp))
-		return phydev->mdio.rxtstamp(&phydev->mdio, skb, type);
-
-	return false;
+	return skb->dev->mdiots->rxtstamp(skb->dev->mdiots, skb, type);
 }
 EXPORT_SYMBOL_GPL(skb_defer_rx_timestamp);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
From: Richard Cochran @ 2018-03-21 18:58 UTC (permalink / raw)
  To: netdev
  Cc: devicetree, Andrew Lunn, David Miller, Florian Fainelli,
	Mark Rutland, Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <cover.1521656774.git.richardcochran@gmail.com>

The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
logic recognizes and time stamps PTP frames on the MII bus.  This
patch adds a driver for the core along with a device tree binding to
allow hooking the driver to MAC devices.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
 Documentation/devicetree/bindings/net/ines-ptp.txt |  42 +
 drivers/net/phy/Makefile                           |   1 +
 drivers/net/phy/ines_ptp.c                         | 857 +++++++++++++++++++++
 drivers/ptp/Kconfig                                |  10 +
 4 files changed, 910 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/ines-ptp.txt
 create mode 100644 drivers/net/phy/ines_ptp.c

diff --git a/Documentation/devicetree/bindings/net/ines-ptp.txt b/Documentation/devicetree/bindings/net/ines-ptp.txt
new file mode 100644
index 000000000000..ed7b1d773ded
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/ines-ptp.txt
@@ -0,0 +1,42 @@
+ZHAW InES PTP time stamping IP core
+
+The IP core needs two different kinds of nodes.  The control node
+lives somewhere in the memory map and specifies the address of the
+control registers.  There can be up to three port nodes placed on the
+mdio bus.  They associate a particular MAC with a port index within
+the IP core.
+
+Required properties of the control node:
+
+- compatible:		"ines,ptp-ctrl"
+- reg:			physical address and size of the register bank
+- phandle:		globally unique handle for the ports to point to
+
+Required properties of the port nodes:
+
+- compatible:		"ines,ptp-port"
+- ctrl-handle:		points to the control node
+- port-index:		port channel within the IP core
+- reg:			phy address. This is required even though the
+			device does not respond to mdio operations
+
+Example:
+
+	timestamper@60000000 {
+		compatible = "ines,ptp-ctrl";
+		reg = <0x60000000 0x80>;
+		phandle = <0x10>;
+	};
+
+	ethernet@80000000 {
+		...
+		mdio {
+			...
+			timestamper@1f {
+				compatible = "ines,ptp-port";
+				ctrl-handle = <0x10>;
+				port-index = <0>;
+				reg = <0x1f>;
+			};
+		};
+	};
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 01acbcb2c798..e286bb822295 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_DP83848_PHY)	+= dp83848.o
 obj-$(CONFIG_DP83867_PHY)	+= dp83867.o
 obj-$(CONFIG_FIXED_PHY)		+= fixed_phy.o
 obj-$(CONFIG_ICPLUS_PHY)	+= icplus.o
+obj-$(CONFIG_INES_PTP_TSTAMP)	+= ines_ptp.o
 obj-$(CONFIG_INTEL_XWAY_PHY)	+= intel-xway.o
 obj-$(CONFIG_LSI_ET1011C_PHY)	+= et1011c.o
 obj-$(CONFIG_LXT_PHY)		+= lxt.o
diff --git a/drivers/net/phy/ines_ptp.c b/drivers/net/phy/ines_ptp.c
new file mode 100644
index 000000000000..4f66459d4417
--- /dev/null
+++ b/drivers/net/phy/ines_ptp.c
@@ -0,0 +1,857 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018 MOSER-BAER AG
+ */
+#define pr_fmt(fmt) "InES_PTP: " fmt
+
+#include <linux/ethtool.h>
+#include <linux/export.h>
+#include <linux/if_vlan.h>
+#include <linux/mdio.h>
+#include <linux/module.h>
+#include <linux/net_tstamp.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+#include <linux/platform_device.h>
+#include <linux/ptp_classify.h>
+#include <linux/ptp_clock_kernel.h>
+#include <linux/stddef.h>
+
+MODULE_DESCRIPTION("Driver for the ZHAW InES PTP time stamping IP core");
+MODULE_AUTHOR("Richard Cochran <richardcochran@gmail.com>");
+MODULE_VERSION("1.0");
+MODULE_LICENSE("GPL");
+
+/* GLOBAL register */
+#define MCAST_MAC_SELECT_SHIFT	2
+#define MCAST_MAC_SELECT_MASK	0x3
+#define IO_RESET		BIT(1)
+#define PTP_RESET		BIT(0)
+
+/* VERSION register */
+#define IF_MAJOR_VER_SHIFT	12
+#define IF_MAJOR_VER_MASK	0xf
+#define IF_MINOR_VER_SHIFT	8
+#define IF_MINOR_VER_MASK	0xf
+#define FPGA_MAJOR_VER_SHIFT	4
+#define FPGA_MAJOR_VER_MASK	0xf
+#define FPGA_MINOR_VER_SHIFT	0
+#define FPGA_MINOR_VER_MASK	0xf
+
+/* INT_STAT register */
+#define RX_INTR_STATUS_3	BIT(5)
+#define RX_INTR_STATUS_2	BIT(4)
+#define RX_INTR_STATUS_1	BIT(3)
+#define TX_INTR_STATUS_3	BIT(2)
+#define TX_INTR_STATUS_2	BIT(1)
+#define TX_INTR_STATUS_1	BIT(0)
+
+/* INT_MSK register */
+#define RX_INTR_MASK_3		BIT(5)
+#define RX_INTR_MASK_2		BIT(4)
+#define RX_INTR_MASK_1		BIT(3)
+#define TX_INTR_MASK_3		BIT(2)
+#define TX_INTR_MASK_2		BIT(1)
+#define TX_INTR_MASK_1		BIT(0)
+
+/* BUF_STAT register */
+#define RX_FIFO_NE_3		BIT(5)
+#define RX_FIFO_NE_2		BIT(4)
+#define RX_FIFO_NE_1		BIT(3)
+#define TX_FIFO_NE_3		BIT(2)
+#define TX_FIFO_NE_2		BIT(1)
+#define TX_FIFO_NE_1		BIT(0)
+
+/* PORT_CONF register */
+#define CM_ONE_STEP		BIT(6)
+#define PHY_SPEED_SHIFT		4
+#define PHY_SPEED_MASK		0x3
+#define P2P_DELAY_WR_POS_SHIFT	2
+#define P2P_DELAY_WR_POS_MASK	0x3
+#define PTP_MODE_SHIFT		0
+#define PTP_MODE_MASK		0x3
+
+/* TS_STAT_TX register */
+#define TS_ENABLE		BIT(15)
+#define DATA_READ_POS_SHIFT	8
+#define DATA_READ_POS_MASK	0x1f
+#define DISCARDED_EVENTS_SHIFT	4
+#define DISCARDED_EVENTS_MASK	0xf
+
+#define INES_N_PORTS		3
+#define INES_REGISTER_SIZE	0x80
+#define INES_PORT_OFFSET	0x20
+#define INES_PORT_SIZE		0x20
+#define INES_FIFO_DEPTH		90
+#define INES_MAX_EVENTS		100
+
+#define BC_PTP_V1		0
+#define BC_PTP_V2		1
+#define TC_E2E_PTP_V2		2
+#define TC_P2P_PTP_V2		3
+
+#define OFF_PTP_CLOCK_ID	20
+#define OFF_PTP_PORT_NUM	28
+
+#define PHY_SPEED_10		0
+#define PHY_SPEED_100		1
+#define PHY_SPEED_1000		2
+
+#define PORT_CONF \
+	((PHY_SPEED_1000 << PHY_SPEED_SHIFT) | (BC_PTP_V2 << PTP_MODE_SHIFT))
+
+#define ines_read32(s, r)	__raw_readl(&s->regs->r)
+#define ines_write32(s, v, r)	__raw_writel(v, &s->regs->r)
+
+#define MESSAGE_TYPE_SYNC		1
+#define MESSAGE_TYPE_P_DELAY_REQ	2
+#define MESSAGE_TYPE_P_DELAY_RESP	3
+#define MESSAGE_TYPE_DELAY_REQ		4
+
+#define SYNC				0x0
+#define DELAY_REQ			0x1
+#define PDELAY_REQ			0x2
+#define PDELAY_RESP			0x3
+
+static LIST_HEAD(ines_clocks);
+static DEFINE_MUTEX(ines_clocks_lock);
+
+struct ines_global_registers {
+	u32 id;
+	u32 test;
+	u32 global;
+	u32 version;
+	u32 test2;
+	u32 int_stat;
+	u32 int_msk;
+	u32 buf_stat;
+};
+
+struct ines_port_registers {
+	u32 port_conf;
+	u32 p_delay;
+	u32 ts_stat_tx;
+	u32 ts_stat_rx;
+	u32 ts_tx;
+	u32 ts_rx;
+};
+
+struct ines_timestamp {
+	struct list_head list;
+	unsigned long	tmo;
+	u16		tag;
+	u64		sec;
+	u64		nsec;
+	u64		clkid;
+	u16		portnum;
+	u16		seqid;
+};
+
+struct ines_port {
+	struct ines_port_registers	*regs;
+	struct ines_clock		*clock;
+	bool				rxts_enabled;
+	bool				txts_enabled;
+	unsigned int			index;
+	struct delayed_work		ts_work;
+	/* lock protects event list and tx_skb */
+	spinlock_t			lock;
+	struct sk_buff			*tx_skb;
+	struct list_head		events;
+	struct list_head		pool;
+	struct ines_timestamp		pool_data[INES_MAX_EVENTS];
+};
+
+struct ines_clock {
+	struct ines_port		port[INES_N_PORTS];
+	struct ines_global_registers	*regs;
+	void __iomem			*base;
+	struct device_node		*node;
+	struct list_head		list;
+};
+
+static bool ines_match(struct sk_buff *skb, unsigned int ptp_class,
+		       struct ines_timestamp *ts);
+static int ines_rxfifo_read(struct ines_port *port);
+static u64 ines_rxts64(struct ines_port *port, unsigned int words);
+static bool ines_timestamp_expired(struct ines_timestamp *ts);
+static u64 ines_txts64(struct ines_port *port, unsigned int words);
+static void ines_txtstamp_work(struct work_struct *work);
+static bool is_sync_pdelay_resp(struct sk_buff *skb, int type);
+static u8 tag_to_msgtype(u8 tag);
+
+static void ines_clock_cleanup(struct ines_clock *clock)
+{
+	struct ines_port *port;
+	int i;
+
+	for (i = 0; i < INES_N_PORTS; i++) {
+		port = &clock->port[i];
+		cancel_delayed_work_sync(&port->ts_work);
+	}
+}
+
+static int ines_clock_init(struct ines_clock *clock, struct device_node *node,
+			   void __iomem *addr)
+{
+	unsigned long port_addr;
+	struct ines_port *port;
+	int i, j;
+
+	INIT_LIST_HEAD(&clock->list);
+	clock->node = node;
+	clock->base = addr;
+	clock->regs = clock->base;
+
+	for (i = 0; i < INES_N_PORTS; i++) {
+		port = &clock->port[i];
+		port_addr = (unsigned long) clock->base +
+			INES_PORT_OFFSET + i * INES_PORT_SIZE;
+		port->regs = (struct ines_port_registers *) port_addr;
+		port->clock = clock;
+		port->index = i;
+		INIT_DELAYED_WORK(&port->ts_work, ines_txtstamp_work);
+		spin_lock_init(&port->lock);
+		INIT_LIST_HEAD(&port->events);
+		INIT_LIST_HEAD(&port->pool);
+		for (j = 0; j < INES_MAX_EVENTS; j++)
+			list_add(&port->pool_data[j].list, &port->pool);
+	}
+
+	ines_write32(clock, 0xBEEF, test);
+	ines_write32(clock, 0xBEEF, test2);
+
+	pr_debug("ID      0x%x\n", ines_read32(clock, id));
+	pr_debug("TEST    0x%x\n", ines_read32(clock, test));
+	pr_debug("VERSION 0x%x\n", ines_read32(clock, version));
+	pr_debug("TEST2   0x%x\n", ines_read32(clock, test2));
+
+	for (i = 0; i < INES_N_PORTS; i++) {
+		port = &clock->port[i];
+		ines_write32(port, PORT_CONF, port_conf);
+	}
+
+	return 0;
+}
+
+static void ines_dump_ts(char *label, struct ines_timestamp *ts)
+{
+#ifdef DEBUG
+	pr_err("%s timestamp, tag=0x%04hx t=%llu.%9llu c=0x%llx p=%hu s=%hu\n",
+	       label, ts->tag, ts->sec, ts->nsec,
+	       ts->clkid, ts->portnum, ts->seqid);
+#endif
+}
+
+static struct ines_port *ines_find_port(struct device_node *node, u32 index)
+{
+	struct ines_port *port = NULL;
+	struct ines_clock *clock;
+	struct list_head *this;
+
+	if (index > INES_N_PORTS - 1)
+		return NULL;
+
+	mutex_lock(&ines_clocks_lock);
+	list_for_each(this, &ines_clocks) {
+		clock = list_entry(this, struct ines_clock, list);
+		if (clock->node == node) {
+			port = &clock->port[index];
+			break;
+		}
+	}
+	mutex_unlock(&ines_clocks_lock);
+	return port;
+}
+
+static u64 ines_find_rxts(struct ines_port *port, struct sk_buff *skb, int type)
+{
+	struct list_head *this, *next;
+	struct ines_timestamp *ts;
+	unsigned long flags;
+	u64 ns = 0;
+
+	if (type == PTP_CLASS_NONE)
+		return 0;
+
+	spin_lock_irqsave(&port->lock, flags);
+	ines_rxfifo_read(port);
+	list_for_each_safe(this, next, &port->events) {
+		ts = list_entry(this, struct ines_timestamp, list);
+		if (ines_timestamp_expired(ts)) {
+			list_del_init(&ts->list);
+			list_add(&ts->list, &port->pool);
+			continue;
+		}
+		if (ines_match(skb, type, ts)) {
+			ns = ts->sec * 1000000000ULL + ts->nsec;
+			list_del_init(&ts->list);
+			list_add(&ts->list, &port->pool);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	return ns;
+}
+
+static u64 ines_find_txts(struct ines_port *port, struct sk_buff *skb)
+{
+	unsigned int class = ptp_classify_raw(skb), i;
+	u32 data_rd_pos, buf_stat, mask, ts_stat_tx;
+	struct ines_timestamp ts;
+	unsigned long flags;
+	u64 ns = 0;
+
+	mask = TX_FIFO_NE_1 << port->index;
+
+	spin_lock_irqsave(&port->lock, flags);
+
+	for (i = 0; i < INES_FIFO_DEPTH; i++) {
+
+		buf_stat = ines_read32(port->clock, buf_stat);
+		if (!(buf_stat & mask)) {
+			pr_debug("Tx timestamp FIFO unexpectedly empty\n");
+			break;
+		}
+		ts_stat_tx = ines_read32(port, ts_stat_tx);
+		data_rd_pos = (ts_stat_tx >> DATA_READ_POS_SHIFT) &
+			DATA_READ_POS_MASK;
+		if (data_rd_pos) {
+			pr_err("unexpected Tx read pos %u\n", data_rd_pos);
+			break;
+		}
+
+		ts.tag     = ines_read32(port, ts_tx);
+		ts.sec     = ines_txts64(port, 3);
+		ts.nsec    = ines_txts64(port, 2);
+		ts.clkid   = ines_txts64(port, 4);
+		ts.portnum = ines_read32(port, ts_tx);
+		ts.seqid   = ines_read32(port, ts_tx);
+
+		ines_dump_ts("Tx", &ts);
+
+		if (ines_match(skb, class, &ts)) {
+			ns = ts.sec * 1000000000ULL + ts.nsec;
+			break;
+		}
+	}
+
+	spin_unlock_irqrestore(&port->lock, flags);
+	return ns;
+}
+
+static int ines_hwtstamp(struct mdio_device *m, struct ifreq *ifr)
+{
+	u32 cm_one_step = 0, port_conf, ts_stat_rx, ts_stat_tx;
+	struct ines_port *port = dev_get_drvdata(&m->dev);
+	struct hwtstamp_config cfg;
+	unsigned long flags;
+
+	if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
+		return -EFAULT;
+
+	/* reserved for future extensions */
+	if (cfg.flags)
+		return -EINVAL;
+
+	switch (cfg.tx_type) {
+	case HWTSTAMP_TX_OFF:
+		ts_stat_tx = 0;
+		break;
+	case HWTSTAMP_TX_ON:
+		ts_stat_tx = TS_ENABLE;
+		break;
+	case HWTSTAMP_TX_ONESTEP_P2P:
+		ts_stat_tx = TS_ENABLE;
+		cm_one_step = CM_ONE_STEP;
+		break;
+	default:
+		return -ERANGE;
+	}
+
+	switch (cfg.rx_filter) {
+	case HWTSTAMP_FILTER_NONE:
+		ts_stat_rx = 0;
+		break;
+	case HWTSTAMP_FILTER_ALL:
+	case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+	case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
+	case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
+		return -ERANGE;
+	case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+	case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+	case HWTSTAMP_FILTER_PTP_V2_EVENT:
+	case HWTSTAMP_FILTER_PTP_V2_SYNC:
+	case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+		ts_stat_rx = TS_ENABLE;
+		cfg.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
+		break;
+	default:
+		return -ERANGE;
+	}
+
+	spin_lock_irqsave(&port->lock, flags);
+
+	port_conf = ines_read32(port, port_conf);
+	port_conf &= ~CM_ONE_STEP;
+	port_conf |= cm_one_step;
+
+	ines_write32(port, port_conf, port_conf);
+	ines_write32(port, ts_stat_rx, ts_stat_rx);
+	ines_write32(port, ts_stat_tx, ts_stat_tx);
+
+	port->rxts_enabled = ts_stat_rx == TS_ENABLE ? true : false;
+	port->txts_enabled = ts_stat_tx == TS_ENABLE ? true : false;
+
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
+}
+
+static bool ines_match(struct sk_buff *skb, unsigned int ptp_class,
+		       struct ines_timestamp *ts)
+{
+	u8 *msgtype, *data = skb_mac_header(skb);
+	unsigned int offset = 0;
+	u16 *portn, *seqid;
+	u64 *clkid;
+
+	if (unlikely(ptp_class & PTP_CLASS_V1))
+		return false;
+
+	if (ptp_class & PTP_CLASS_VLAN)
+		offset += VLAN_HLEN;
+
+	switch (ptp_class & PTP_CLASS_PMASK) {
+	case PTP_CLASS_IPV4:
+		offset += ETH_HLEN + IPV4_HLEN(data + offset) + UDP_HLEN;
+		break;
+	case PTP_CLASS_IPV6:
+		offset += ETH_HLEN + IP6_HLEN + UDP_HLEN;
+		break;
+	case PTP_CLASS_L2:
+		offset += ETH_HLEN;
+		break;
+	default:
+		return false;
+	}
+
+	if (skb->len + ETH_HLEN < offset + OFF_PTP_SEQUENCE_ID + sizeof(*seqid))
+		return false;
+
+	msgtype = data + offset;
+	clkid = (u64 *)(data + offset + OFF_PTP_CLOCK_ID);
+	portn = (u16 *)(data + offset + OFF_PTP_PORT_NUM);
+	seqid = (u16 *)(data + offset + OFF_PTP_SEQUENCE_ID);
+
+	if (tag_to_msgtype(ts->tag & 0x7) != (*msgtype & 0xf)) {
+		pr_debug("msgtype mismatch ts %hhu != skb %hhu\n",
+			 tag_to_msgtype(ts->tag & 0x7), *msgtype & 0xf);
+		return false;
+	}
+	if (cpu_to_be64(ts->clkid) != *clkid) {
+		pr_debug("clkid mismatch ts %llx != skb %llx\n",
+			 cpu_to_be64(ts->clkid), *clkid);
+		return false;
+	}
+	if (ts->portnum != ntohs(*portn)) {
+		pr_debug("portn mismatch ts %hu != skb %hu\n",
+			 ts->portnum, ntohs(*portn));
+		return false;
+	}
+	if (ts->seqid != ntohs(*seqid)) {
+		pr_debug("seqid mismatch ts %hu != skb %hu\n",
+			 ts->seqid, ntohs(*seqid));
+		return false;
+	}
+
+	return true;
+}
+
+static bool ines_rxtstamp(struct mdio_device *m, struct sk_buff *skb, int type)
+{
+	struct ines_port *port = dev_get_drvdata(&m->dev);
+	struct skb_shared_hwtstamps *ssh;
+	u64 ns;
+
+	if (!port->rxts_enabled)
+		return false;
+
+	ns = ines_find_rxts(port, skb, type);
+	if (!ns)
+		return false;
+
+	ssh = skb_hwtstamps(skb);
+	ssh->hwtstamp = ns_to_ktime(ns);
+	netif_rx(skb);
+
+	return true;
+}
+
+static int ines_rxfifo_read(struct ines_port *port)
+{
+	u32 data_rd_pos, buf_stat, mask, ts_stat_rx;
+	struct ines_timestamp *ts;
+	unsigned int i;
+
+	mask = RX_FIFO_NE_1 << port->index;
+
+	for (i = 0; i < INES_FIFO_DEPTH; i++) {
+		if (list_empty(&port->pool)) {
+			pr_err("event pool is empty\n");
+			return -1;
+		}
+		buf_stat = ines_read32(port->clock, buf_stat);
+		if (!(buf_stat & mask))
+			break;
+
+		ts_stat_rx = ines_read32(port, ts_stat_rx);
+		data_rd_pos = (ts_stat_rx >> DATA_READ_POS_SHIFT) &
+			DATA_READ_POS_MASK;
+		if (data_rd_pos) {
+			pr_err("unexpected Rx read pos %u\n", data_rd_pos);
+			break;
+		}
+
+		ts = list_first_entry(&port->pool, struct ines_timestamp, list);
+		ts->tmo     = jiffies + HZ;
+		ts->tag     = ines_read32(port, ts_rx);
+		ts->sec     = ines_rxts64(port, 3);
+		ts->nsec    = ines_rxts64(port, 2);
+		ts->clkid   = ines_rxts64(port, 4);
+		ts->portnum = ines_read32(port, ts_rx);
+		ts->seqid   = ines_read32(port, ts_rx);
+
+		ines_dump_ts("Rx", ts);
+
+		list_del_init(&ts->list);
+		list_add_tail(&ts->list, &port->events);
+	}
+
+	return 0;
+}
+
+static u64 ines_rxts64(struct ines_port *port, unsigned int words)
+{
+	unsigned int i;
+	u64 result;
+	u16 word;
+
+	word = ines_read32(port, ts_rx);
+	result = word;
+	words--;
+	for (i = 0; i < words; i++) {
+		word = ines_read32(port, ts_rx);
+		result <<= 16;
+		result |= word;
+	}
+	return result;
+}
+
+static bool ines_timestamp_expired(struct ines_timestamp *ts)
+{
+	return time_after(jiffies, ts->tmo);
+}
+
+static int ines_ts_info(struct mdio_device *m, struct ethtool_ts_info *info)
+{
+	info->so_timestamping =
+		SOF_TIMESTAMPING_TX_HARDWARE |
+		SOF_TIMESTAMPING_TX_SOFTWARE |
+		SOF_TIMESTAMPING_RX_HARDWARE |
+		SOF_TIMESTAMPING_RX_SOFTWARE |
+		SOF_TIMESTAMPING_SOFTWARE |
+		SOF_TIMESTAMPING_RAW_HARDWARE;
+
+	info->phc_index = -1;
+
+	info->tx_types =
+		(1 << HWTSTAMP_TX_OFF) |
+		(1 << HWTSTAMP_TX_ON) |
+		(1 << HWTSTAMP_TX_ONESTEP_P2P);
+
+	info->rx_filters =
+		(1 << HWTSTAMP_FILTER_NONE) |
+		(1 << HWTSTAMP_FILTER_PTP_V2_EVENT);
+
+	return 0;
+}
+
+static u64 ines_txts64(struct ines_port *port, unsigned int words)
+{
+	unsigned int i;
+	u64 result;
+	u16 word;
+
+	word = ines_read32(port, ts_tx);
+	result = word;
+	words--;
+	for (i = 0; i < words; i++) {
+		word = ines_read32(port, ts_tx);
+		result <<= 16;
+		result |= word;
+	}
+	return result;
+}
+
+static bool ines_txts_onestep(struct ines_port *port, struct sk_buff *skb, int type)
+{
+	unsigned long flags;
+	u32 port_conf;
+
+	spin_lock_irqsave(&port->lock, flags);
+	port_conf = ines_read32(port, port_conf);
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	if (port_conf & CM_ONE_STEP)
+		return is_sync_pdelay_resp(skb, type);
+
+	return false;
+}
+
+static void ines_txtstamp(struct mdio_device *m, struct sk_buff *skb, int type)
+{
+	struct ines_port *port = dev_get_drvdata(&m->dev);
+	struct sk_buff *old_skb;
+	unsigned long flags;
+
+	if (!port->txts_enabled || ines_txts_onestep(port, skb, type)) {
+		kfree_skb(skb);
+		return;
+	}
+
+	spin_lock_irqsave(&port->lock, flags);
+
+	if (port->tx_skb)
+		old_skb = port->tx_skb;
+
+	port->tx_skb = skb;
+
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	if (old_skb)
+		kfree_skb(old_skb);
+
+	schedule_delayed_work(&port->ts_work, 1);
+}
+
+static void ines_txtstamp_work(struct work_struct *work)
+{
+	struct ines_port *port =
+		container_of(work, struct ines_port, ts_work.work);
+	struct skb_shared_hwtstamps ssh;
+	struct sk_buff *skb;
+	unsigned long flags;
+	u64 ns;
+
+	spin_lock_irqsave(&port->lock, flags);
+	skb = port->tx_skb;
+	port->tx_skb = NULL;
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	ns = ines_find_txts(port, skb);
+	if (!ns) {
+		kfree_skb(skb);
+		return;
+	}
+	ssh.hwtstamp = ns_to_ktime(ns);
+	skb_complete_tx_timestamp(skb, &ssh);
+}
+
+static bool is_sync_pdelay_resp(struct sk_buff *skb, int type)
+{
+	u8 *data = skb->data, *msgtype;
+	unsigned int offset = 0;
+
+	if (type & PTP_CLASS_VLAN)
+		offset += VLAN_HLEN;
+
+	switch (type & PTP_CLASS_PMASK) {
+	case PTP_CLASS_IPV4:
+		offset += ETH_HLEN + IPV4_HLEN(data + offset) + UDP_HLEN;
+		break;
+	case PTP_CLASS_IPV6:
+		offset += ETH_HLEN + IP6_HLEN + UDP_HLEN;
+		break;
+	case PTP_CLASS_L2:
+		offset += ETH_HLEN;
+		break;
+	default:
+		return 0;
+	}
+
+	if (type & PTP_CLASS_V1)
+		offset += OFF_PTP_CONTROL;
+
+	if (skb->len < offset + 1)
+		return 0;
+
+	msgtype = data + offset;
+
+	switch ((*msgtype & 0xf)) {
+	case SYNC:
+	case PDELAY_RESP:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static u8 tag_to_msgtype(u8 tag)
+{
+	switch (tag) {
+	case MESSAGE_TYPE_SYNC:
+		return SYNC;
+	case MESSAGE_TYPE_P_DELAY_REQ:
+		return PDELAY_REQ;
+	case MESSAGE_TYPE_P_DELAY_RESP:
+		return PDELAY_RESP;
+	case MESSAGE_TYPE_DELAY_REQ:
+		return DELAY_REQ;
+	}
+	return 0xf;
+}
+
+static int ines_ptp_ctrl_probe(struct platform_device *pld)
+{
+	struct ines_clock *clock;
+	struct resource *res;
+	void __iomem *addr;
+	int err = 0;
+
+	res = platform_get_resource(pld, IORESOURCE_MEM, 0);
+	if (!res) {
+		dev_err(&pld->dev, "missing memory resource\n");
+		return -EINVAL;
+	}
+	addr = devm_ioremap_resource(&pld->dev, res);
+	if (IS_ERR(addr)) {
+		err = PTR_ERR(addr);
+		goto out;
+	}
+	clock = kzalloc(sizeof(*clock), GFP_KERNEL);
+	if (!clock) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (ines_clock_init(clock, pld->dev.of_node, addr)) {
+		kfree(clock);
+		err = -ENOMEM;
+		goto out;
+	}
+	mutex_lock(&ines_clocks_lock);
+	list_add_tail(&ines_clocks, &clock->list);
+	mutex_unlock(&ines_clocks_lock);
+
+	dev_set_drvdata(&pld->dev, clock);
+out:
+	return err;
+}
+
+static int ines_ptp_ctrl_remove(struct platform_device *pld)
+{
+	struct ines_clock *clock = dev_get_drvdata(&pld->dev);
+
+	mutex_lock(&ines_clocks_lock);
+	list_del(&clock->list);
+	mutex_unlock(&ines_clocks_lock);
+	ines_clock_cleanup(clock);
+	kfree(clock);
+	return 0;
+}
+
+static int ines_ptp_port_probe(struct mdio_device *mdiodev)
+{
+	struct device_node *node;
+	struct ines_port *port;
+	int err = 0;
+	u32 index;
+
+	if (of_property_read_u32(mdiodev->dev.of_node, "port-index", &index)) {
+		dev_err(&mdiodev->dev, "missing port-index\n");
+		return -EINVAL;
+	}
+	node = of_parse_phandle(mdiodev->dev.of_node, "ctrl-handle", 0);
+	if (IS_ERR(node)) {
+		dev_err(&mdiodev->dev, "missing ctrl-handle\n");
+		return PTR_ERR(node);
+	}
+	port = ines_find_port(node, index);
+	if (!port) {
+		dev_err(&mdiodev->dev, "missing port\n");
+		err = -ENODEV;
+		goto out;
+	}
+	mdiodev->ts_info  = ines_ts_info;
+	mdiodev->hwtstamp = ines_hwtstamp;
+	mdiodev->rxtstamp = ines_rxtstamp;
+	mdiodev->txtstamp = ines_txtstamp;
+	dev_set_drvdata(&mdiodev->dev, port);
+out:
+	of_node_put(node);
+	return err;
+}
+
+static void ines_ptp_port_remove(struct mdio_device *mdiodev)
+{
+}
+
+static const struct of_device_id ines_ptp_ctrl_of_match[] = {
+	{ .compatible = "ines,ptp-ctrl" },
+	{ }
+};
+
+MODULE_DEVICE_TABLE(of, ines_ptp_ctrl_of_match);
+
+static struct platform_driver ines_ptp_ctrl_driver = {
+	.probe  = ines_ptp_ctrl_probe,
+	.remove = ines_ptp_ctrl_remove,
+	.driver = {
+		.name = "ines_ptp_ctrl",
+		.of_match_table = of_match_ptr(ines_ptp_ctrl_of_match),
+	},
+};
+
+static const struct of_device_id ines_ptp_port_of_match[] = {
+	{ .compatible = "ines,ptp-port" },
+	{ }
+};
+
+static struct mdio_driver ines_ptp_port_driver = {
+	.probe	= ines_ptp_port_probe,
+	.remove = ines_ptp_port_remove,
+	.mdiodrv.driver = {
+		.name = "ines_ptp_port",
+		.of_match_table = ines_ptp_port_of_match,
+	},
+};
+
+static int __init ines_ptp_init(void)
+{
+	int err;
+
+	err = platform_driver_register(&ines_ptp_ctrl_driver);
+	if (err)
+		return err;
+
+	err = mdio_driver_register(&ines_ptp_port_driver);
+	if (err)
+		platform_driver_unregister(&ines_ptp_ctrl_driver);
+
+	return err;
+}
+
+static void __exit ines_ptp_cleanup(void)
+{
+	mdio_driver_unregister(&ines_ptp_port_driver);
+	platform_driver_unregister(&ines_ptp_ctrl_driver);
+}
+
+module_init(ines_ptp_init);
+module_exit(ines_ptp_cleanup);
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index a21ad10d613c..10ef161dd2a1 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -88,6 +88,16 @@ config DP83640_PHY
 	  In order for this to work, your MAC driver must also
 	  implement the skb_tx_timestamp() function.
 
+config INES_PTP_TSTAMP
+	tristate "ZHAW InES PTP time stamping IP core"
+	depends on NETWORK_PHY_TIMESTAMPING
+	depends on PHYLIB
+	depends on PTP_1588_CLOCK
+	---help---
+	  This driver adds support for using the ZHAW InES 1588 IP
+	  core.  This clock is only useful if the MII bus of your MAC
+	  is wired up to the core.
+
 config PTP_1588_CLOCK_PCH
 	tristate "Intel PCH EG20T as PTP clock"
 	depends on X86_32 || COMPILE_TEST
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v4 00/17] netdev: Eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-21 19:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, timur, sulrich, linux-arm-kernel, linux-arm-msm
In-Reply-To: <20180321.115655.108053425798020503.davem@davemloft.net>

On 3/21/2018 10:56 AM, David Miller wrote:
> From: Sinan Kaya <okaya@codeaurora.org>
> Date: Mon, 19 Mar 2018 22:42:15 -0400
> 
>> Code includes wmb() followed by writel() in multiple places. writel()
>> already has a barrier on some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> I did a regex search for wmb() followed by writel() in each drivers
>> directory.
>> I scrubbed the ones I care about in this series.
>>
>> I considered "ease of change", "popular usage" and "performance critical
>> path" as the determining criteria for my filtering.
> 
> I agree that for performance sensitive operations, specifically writing
> doorbell registers in the hot paths or RX and TX packet processing, this
> is a good change.
> 
> However, in configuration paths and whatnot, it is much less urgent and
> useful.
> 
> Therefore I think it would work better if you concentrated solely on
> hot code path cases.
> 
> You can, on a driver by driver basis, submit the other transformations
> in the slow paths, and let the driver maintainers decide whether to
> take those on or not.
> 
> Also, please stick exactly to the case where we have:
> 
> 	wmb/mb/etc.
> 	writel()
> 

OK

> Because I see some changes where we have:
> 
> 	writel()
> 
> 	barrier()
> 
> 	writel()
> 

barrier() on ARM is a write barrier. Apparently, it is a compiler barrier
on Intel. I briefly discussed the barrier() behavior in rdma mailing list [1].

Our conclusion is that code should have used wmb() if it really needed
to synchronize memory contents to the device and barrier() is already
wrong. It just guarantees that code doesn't move. writel() already has
a compiler barrier inside. It won't move to begin with.

Like you suggested, we decided to leave these changes alone and even
skip those drivers.

I'll take another look at the patches.

> for exmaple, and you are turning that second writel() into a relaxed
> on as well.  The above is using a compile barrier, not a memory
> barrier, so effectively it is two writel()'s in sequence which is
> not what this patch set is about.
> 
> If anything, that compile barrier() is superfluous and could be
> removed.  But that is also a separate change from what this patch
> series is doing here.
> 

agreed, I'll remove such changes.

> Finally, it makes it that much easier if we can see the preceeding
> memory barrier in the context of the patch that adjusts the writel
> into a writel_relaxed.
> 
> In one case, a macro DOORBELL() is changed to use writel().  This
> makes it so that the patch reviewer has to scan over the entire
> driver in question to see exactly how DOORBELL() is used and whether
> it fits the criteria for the writel_relaxed() transformation.
> 
> I would suggest that you adjust the name of the macro in a situation
> like this, f.e. to DOORBELL_RELAXED().

makes sense.

> 
> Thank you.
> 

[1] https://patchwork.kernel.org/project/LKML/list/?submitter=145491

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: [PATCH net-next RFC V1 2/5] net: phy: Move time stamping interface into the generic mdio layer.
From: Florian Fainelli @ 2018-03-21 19:10 UTC (permalink / raw)
  To: Richard Cochran, netdev
  Cc: devicetree, Andrew Lunn, David Miller, Mark Rutland,
	Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <b032dcdb34684e3831530da1bb198c764e61968b.1521656774.git.richardcochran@gmail.com>

On 03/21/2018 11:58 AM, Richard Cochran wrote:
> There are different ways of obtaining hardware time stamps on network
> packets.  The ingress and egress times can be measured in the MAC, in
> the PHY, or by a device listening on the MII bus.  Up until now, the
> kernel has support for MAC and PHY time stamping, but not for other
> MII bus devices.
> 
> This patch moves the PHY time stamping interface into the generic
> mdio device in order to support MII time stamping hardware.
> 
> Signed-off-by: Richard Cochran <richardcochran@gmail.com>
> ---
>  drivers/net/phy/dp83640.c | 29 ++++++++++++++++++++---------
>  drivers/net/phy/phy.c     |  4 ++--
>  include/linux/mdio.h      | 23 +++++++++++++++++++++++
>  include/linux/phy.h       | 23 -----------------------
>  net/core/ethtool.c        |  4 ++--
>  net/core/timestamping.c   |  8 ++++----
>  6 files changed, 51 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 654f42d00092..79aeb5eb471a 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -215,6 +215,10 @@ static LIST_HEAD(phyter_clocks);
>  static DEFINE_MUTEX(phyter_clocks_lock);
>  
>  static void rx_timestamp_work(struct work_struct *work);
> +static  int dp83640_ts_info(struct mdio_device *m, struct ethtool_ts_info *i);
> +static  int dp83640_hwtstamp(struct mdio_device *m, struct ifreq *i);
> +static bool dp83640_rxtstamp(struct mdio_device *m, struct sk_buff *s, int t);
> +static void dp83640_txtstamp(struct mdio_device *m, struct sk_buff *s, int t);
>  
>  /* extended register access functions */
>  
> @@ -1162,6 +1166,12 @@ static int dp83640_probe(struct phy_device *phydev)
>  		list_add_tail(&dp83640->list, &clock->phylist);
>  
>  	dp83640_clock_put(clock);
> +
> +	phydev->mdio.ts_info = dp83640_ts_info;
> +	phydev->mdio.hwtstamp = dp83640_hwtstamp;
> +	phydev->mdio.rxtstamp = dp83640_rxtstamp;
> +	phydev->mdio.txtstamp = dp83640_txtstamp;

Why is this implemented a the mdio_device level and not at the
mdio_driver level? This looks like the wrong level at which this is done.
--
Florian

^ permalink raw reply

* Re: [PATCH net-next RFC V1 3/5] net: Introduce field for the MII time stamper.
From: Florian Fainelli @ 2018-03-21 19:12 UTC (permalink / raw)
  To: Richard Cochran, netdev
  Cc: devicetree, Andrew Lunn, David Miller, Mark Rutland,
	Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <338f19a006839f1d00e3cfd3521bdd5fd0afc5fe.1521656774.git.richardcochran@gmail.com>

On 03/21/2018 11:58 AM, Richard Cochran wrote:
> This patch adds a new field to the network device structure to reference
> a time stamping device on the MII bus.  By decoupling the time stamping
> function from the PHY device, we pave the way to allowing a non-PHY
> device to take this role.
> 
> Signed-off-by: Richard Cochran <richardcochran@gmail.com>
> ---
>  drivers/net/phy/mdio_bus.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/netdevice.h  |  1 +
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 24b5511222c8..fdac8c8ac272 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -717,6 +717,47 @@ static int mdio_uevent(struct device *dev, struct kobj_uevent_env *env)
>  	return 0;
>  }
>  
> +static bool mdiodev_supports_timestamping(struct mdio_device *mdiodev)
> +{
> +	if (mdiodev->ts_info  && mdiodev->hwtstamp &&
> +	    mdiodev->rxtstamp && mdiodev->txtstamp)
> +		return true;
> +	else
> +		return false;
> +}
> +
> +static int mdiobus_netdev_notification(struct notifier_block *nb,
> +				       unsigned long msg, void *ptr)
> +{
> +	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
> +	struct phy_device *phydev = netdev->phydev;
> +	struct mdio_device *mdev;
> +	struct mii_bus *bus;
> +	int i;
> +
> +	if (netdev->mdiots || msg != NETDEV_UP || !phydev)
> +		return NOTIFY_DONE;

You are still assuming that we have a phy_device somehow, whereas you
parch series wants to solve that for generic MDIO devices, that is a bit
confusing.

> +
> +	/*
> +	 * Examine the MII bus associated with the PHY that is
> +	 * attached to the MAC.  If there is a time stamping device
> +	 * on the bus, then connect it to the network device.
> +	 */
> +	bus = phydev->mdio.bus;
> +
> +	for (i = 0; i < PHY_MAX_ADDR; i++) {
> +		mdev = bus->mdio_map[i];
> +		if (!mdev)
> +			continue;
> +		if (mdiodev_supports_timestamping(mdev)) {
> +			netdev->mdiots = mdev;
> +			return NOTIFY_OK;

What guarantees that netdev->mdiots gets cleared? Also, why is this done
with a notifier instead of through phy_{connect,attach,disconnect}? It
looks like we still have this requirement of the mdio TS device being a
phy_device somehow, I am confused here...

> +		}
> +	}
> +
> +	return NOTIFY_DONE;
> +}
> +
>  #ifdef CONFIG_PM
>  static int mdio_bus_suspend(struct device *dev)
>  {

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5fbb9f1da7fd..223d691aa0b0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1943,6 +1943,7 @@ struct net_device {
>  	struct netprio_map __rcu *priomap;
>  #endif
>  	struct phy_device	*phydev;
> +	struct mdio_device	*mdiots;

phy_device embedds a mdio_device, can you find a way to rework the PHY
PTP code to utilize the phy_device's mdio instance so do not introduce
yet another pointer in that big structure that net_device already is?

>  	struct lock_class_key	*qdisc_tx_busylock;
>  	struct lock_class_key	*qdisc_running_key;
>  	bool			proto_down;
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
From: Andrew Lunn @ 2018-03-21 19:33 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, devicetree, David Miller, Florian Fainelli, Mark Rutland,
	Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <e9def53dd0f751d7f5e3df47e604d7ccb6020159.1521656774.git.richardcochran@gmail.com>

On Wed, Mar 21, 2018 at 11:58:18AM -0700, Richard Cochran wrote:
> The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
> logic recognizes and time stamps PTP frames on the MII bus.  This
> patch adds a driver for the core along with a device tree binding to
> allow hooking the driver to MAC devices.

Hi Richard

Can you point us at some documentation for this.

I think Florian and I want to better understand how this device works,
in order to understand your other changes.

   Andrew

^ permalink raw reply

* Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Florian Fainelli @ 2018-03-21 19:33 UTC (permalink / raw)
  To: Tal Gilboa, David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521657225-65392-1-git-send-email-talgi@mellanox.com>

Hi Tal,

On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.

Thanks a lot for providing this documentation, this is very helpful! A
few things that could be good to be expanded upon a little bit:

- HW must support configuring a timeout per channel
- HW must support configuring a number of packets before getting an
interrupt per channel

Does that sound about right?

[snip]

> +In order to use Net DIM from a networking driver, the driver needs to call the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt.

I would make it a bit clearer that this is on each invocation of the
interrupt service routine function. With NAPI + net DIM working in
concert you would not actually get one packet per interrupt
consistently, it would largely depend on the rate, right?

> Since Net DIM has a built-in moderation and it might decide to skip
> +iterations under certain conditions, there is no need to moderate the net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use it
> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
> +bytes, packets and interrupts count. No need to perform any calculations, just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on
> +the driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This callback
> +will be scheduled and run in a separate thread in order not to add overhead to
> +the data flow. After the work is done, Net DIM algorithm needs to be set to
> +the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +=================
> +
> +The following code demonstrates how to register a driver to Net DIM. The actual
> +usage is not complete but it should make the outline of the usage clear.

It could be worth to touch a word or two about reflecting the use of Net
DIM within the driver into ethtool_coalesce::use_adaptive_rx_coalesce
and ethtool_coalesce::use_adaptive_tx_coalesce?

> +
> +my_driver.c:
> +
> +#include <linux/net_dim.h>
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> +	/* Get struct net_dim from struct work_struct */
> +	struct net_dim *dim = container_of(work, struct net_dim,
> +					   work);
> +	/* Do interrupt moderation related stuff */
> +	...
> +
> +	/* Signal net DIM work is done and it should move to next iteration */
> +	dim->state = NET_DIM_START_MEASURE;
> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* A struct to hold current measured data */
> +	struct net_dim_sample dim_sample;
> +	...
> +	/* Initiate data sample struct with current data */
> +	net_dim_sample(my_entity->events,
> +		       my_entity->packets,
> +		       my_entity->bytes,
> +		       &dim_sample);
> +	/* Call net DIM */
> +	net_dim(&my_entity->dim, dim_sample);
> +	...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* Initiate struct work_struct with my driver's callback function */
> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> +	...
> +}
> 


-- 
Florian

^ permalink raw reply

* [PATCH][next] gre: fix TUNNEL_SEQ bit check on sequence numbering
From: Colin King @ 2018-03-21 19:34 UTC (permalink / raw)
  To: David S . Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The current logic of flags | TUNNEL_SEQ is always non-zero and hence
sequence numbers are always incremented no matter the setting of the
TUNNEL_SEQ bit.  Fix this by using & instead of |.

Detected by CoverityScan, CID#1466039 ("Operands don't affect result")

Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/ipv4/ip_gre.c  | 2 +-
 net/ipv6/ip6_gre.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2fa2ef2e2af9..9ab1aa2f7660 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -550,7 +550,7 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		(TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
 	gre_build_header(skb, tunnel_hlen, flags, proto,
 			 tunnel_id_to_key32(tun_info->key.tun_id),
-			 (flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
+			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 
 	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 0bcefc480aeb..3a98c694da5f 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -725,7 +725,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 		gre_build_header(skb, tunnel->tun_hlen,
 				 flags, protocol,
 				 tunnel_id_to_key32(tun_info->key.tun_id),
-				 (flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
+				 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
 						      : 0);
 
 	} else {
-- 
2.15.1

^ permalink raw reply related

* Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Randy Dunlap @ 2018-03-21 19:37 UTC (permalink / raw)
  To: Tal Gilboa, David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <1521657225-65392-1-git-send-email-talgi@mellanox.com>

On 03/21/2018 11:33 AM, Tal Gilboa wrote:
> Net DIM is a generic algorithm, purposed for dynamically
> optimizing network devices interrupt moderation. This
> document describes how it works and how to use it.
> 
> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
> ---
>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
>  1 file changed, 174 insertions(+)
>  create mode 100644 Documentation/networking/net_dim.txt
> 
> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
> new file mode 100644
> index 0000000..9cb31c5
> --- /dev/null
> +++ b/Documentation/networking/net_dim.txt
> @@ -0,0 +1,174 @@
> +Net DIM - Generic Network Dynamic Interrupt Moderation
> +======================================================
> +
> +Author:
> +	Tal Gilboa <talgi@mellanox.com>
> +
> +
> +Contents
> +=========
> +
> +- Assumptions
> +- Introduction
> +- The Net DIM Algorithm
> +- Registering a Network Device to DIM
> +- Example
> +
> +Part 0: Assumptions
> +======================
> +
> +This document assumes the reader has basic knowledge in network drivers
> +and in general interrupt moderation.
> +
> +
> +Part I: Introduction
> +======================
> +
> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
> +interrupt moderation configuration of a channel in order to optimize packet
> +processing. The mechanism includes an algorithm which decides if and how to
> +change moderation parameters for a channel, usually by performing an analysis on
> +runtime data sampled from the system. Net DIM is such a mechanism. In each
> +iteration of the algorithm, it analyses a given sample of the data, compares it
> +to the previous sample and if required, it can decide to change some of the
> +interrupt moderation configuration fields. The data sample is composed of data
> +bandwidth, the number of packets and the number of events. The time between
> +samples is also measured. Net DIM compares the current and the previous data and
> +returns an adjusted interrupt moderation configuration object. In some cases,
> +the algorithm might decide not to change anything. The configuration fields are
> +the minimum duration (microseconds) allowed between events and the maximum
> +number of wanted packets per event. The Net DIM algorithm ascribes importance to
> +increase bandwidth over reducing interrupt rate.
> +
> +
> +Part II: The Net DIM Algorithm
> +===============================
> +
> +Each iteration of the Net DIM algorithm follows these steps:
> +1. Calculates new data sample.
> +2. Compares it to previous sample.
> +3. Makes a decision - suggests interrupt moderation configuration fields.
> +4. Applies a schedule work function, which applies suggested configuration.
> +
> +The first two steps are straightforward, both the new and the previous data are
> +supplied by the driver registered to Net DIM. The previous data is the new data
> +supplied to the previous iteration. The comparison step checks the difference
> +between the new and previous data and decides on the result of the last step.
> +A step would result as "better" if bandwidth increases and as "worse" if
> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
> +compared in a similar fashion - increase == "better" and decrease == "worse".
> +In case there is no change in the packet rate as well, the interrupt rate is
> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
> +increase in the interrupt rate is considered "worse" and a decrease is
> +considered "better". Step #2 has an optimization for avoiding false results: it
> +only considers a difference between samples as valid if it is greater than a
> +certain percentage. Also, since Net DIM does not measure anything by itself, it
> +assumes the data provided by the driver is valid.
> +
> +Step #3 decides on the suggested configuration based on the result from step #2
> +and the internal state of the algorithm. The states reflect the "direction" of
> +the algorithm: is it going left (reducing moderation), right (increasing
> +moderation) or standing still. Another optimization is that if a decision
> +to stay still is made multiple times, the interval between iterations of the
> +algorithm would increase in order to reduce calculation overhead. Also, after
> +"parking" on one of the most left or most right decisions, the algorithm may
> +decide to verify this decision by taking a step in the other direction. This is
> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
> +decision is made, an interrupt moderation configuration is selected from
> +the predefined profiles.

I think a short description of the predefined profiles could help.

> +
> +The last step is to notify the registered driver that it should apply the
> +suggested configuration. This is done by scheduling a work function, defined by
> +the Net DIM API and provided by the registered driver.
> +
> +As you can see, Net DIM itself does not actively interact with the system. It
> +would have trouble making the correct decisions if the wrong data is supplied to
> +it and it would be useless if the work function would not apply the suggested
> +configuration. This does, however, allow the registered driver some room for
> +manoeuvre as it may provide partial data or ignore the algorithm suggestion
> +under some conditions.
> +
> +
> +Part III: Registering a Network Device to DIM
> +==============================================
> +
> +Net DIM API exposes the main function net_dim(struct net_dim *dim,
> +struct net_dim_sample end_sample). This function is the entry point to the Net
> +DIM algorithm and has to be called every time the driver would like to check if
> +it should change interrupt moderation parameters. The driver should provide two

Is it completely up to the driver to decide when to call net_dim()?
So it could be based on TX traffic, RX traffic, time, queue depths, etc.?

> +data structures: struct net_dim and struct net_dim_sample. Struct net_dim
> +describes the state of DIM for a specific object (RX queue, TX queue,
> +other queues, etc.). This includes the current selected profile, previous data
> +samples, the callback function provided by the driver and more.
> +Struct net_dim_sample describes a data sample, which will be compared to the
> +data sample stored in struct net_dim in order to decide on the algorithm's next
> +step. The sample should include bytes, packets and interrupts, measured by
> +the driver.
> +
> +In order to use Net DIM from a networking driver, the driver needs to call the
> +main net_dim() function. The recommended method is to call net_dim() on each
> +interrupt. Since Net DIM has a built-in moderation and it might decide to skip

(continuing my question from above:)
or on each interrupt.  But the hardware could also be doing interrupt mitigation,
so each interrupt doesn't always correlate to anything specific.

> +iterations under certain conditions, there is no need to moderate the net_dim()
> +calls as well. As mentioned above, the driver needs to provide an object of type
> +struct net_dim to the net_dim() function call. It is advised for each entity
> +using Net DIM to hold a struct net_dim as part of its data structure and use it
> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
> +bytes, packets and interrupts count. No need to perform any calculations, just
> +include the raw data.
> +
> +The net_dim() call itself does not return anything. Instead Net DIM relies on
> +the driver to provide a callback function, which is called when the algorithm
> +decides to make a change in the interrupt moderation parameters. This callback
> +will be scheduled and run in a separate thread in order not to add overhead to
> +the data flow. After the work is done, Net DIM algorithm needs to be set to
> +the proper state in order to move to the next iteration.
> +
> +
> +Part IV: Example
> +=================
> +
> +The following code demonstrates how to register a driver to Net DIM. The actual
> +usage is not complete but it should make the outline of the usage clear.
> +
> +my_driver.c:
> +
> +#include <linux/net_dim.h>
> +
> +/* Callback for net DIM to schedule on a decision to change moderation */
> +void my_driver_do_dim_work(struct work_struct *work)
> +{
> +	/* Get struct net_dim from struct work_struct */
> +	struct net_dim *dim = container_of(work, struct net_dim,
> +					   work);
> +	/* Do interrupt moderation related stuff */
> +	...
> +
> +	/* Signal net DIM work is done and it should move to next iteration */
> +	dim->state = NET_DIM_START_MEASURE;
> +}
> +
> +/* My driver's interrupt handler */
> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* A struct to hold current measured data */
> +	struct net_dim_sample dim_sample;
> +	...
> +	/* Initiate data sample struct with current data */
> +	net_dim_sample(my_entity->events,
> +		       my_entity->packets,
> +		       my_entity->bytes,
> +		       &dim_sample);
> +	/* Call net DIM */
> +	net_dim(&my_entity->dim, dim_sample);
> +	...
> +}
> +
> +/* My entity's initialization function (my_entity was already allocated) */
> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
> +{
> +	...
> +	/* Initiate struct work_struct with my driver's callback function */
> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
> +	...
> +}
> 

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

thanks,
-- 
~Randy

^ permalink raw reply

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: Jakob Unterwurzacher @ 2018-03-21 19:44 UTC (permalink / raw)
  To: John Fastabend, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <4e33aae4-9e87-22b4-7f09-008183ea553a@gmail.com>

On 21.03.18 19:43, John Fastabend wrote:
> Thats my theory at least. Are you able to test a patch if I generate
> one to fix this?

Yes, no problem.

I just tested with the flag change you suggested (see below, I had to 
keep TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.

Thanks,
Jakob


diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f21b20..51b68ef4977b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
         .dump           =       pfifo_fast_dump,
         .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
         .owner          =       THIS_MODULE,
-       .static_flags   =       TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+       .static_flags   =       TCQ_F_CPUSTATS,
  };
  EXPORT_SYMBOL(pfifo_fast_ops);

^ permalink raw reply related

* Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation
From: Florian Fainelli @ 2018-03-21 19:44 UTC (permalink / raw)
  To: Randy Dunlap, Tal Gilboa, David S. Miller
  Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <c5dcab9e-0a76-ab59-e683-367927ab6d6a@infradead.org>

On 03/21/2018 12:37 PM, Randy Dunlap wrote:
> On 03/21/2018 11:33 AM, Tal Gilboa wrote:
>> Net DIM is a generic algorithm, purposed for dynamically
>> optimizing network devices interrupt moderation. This
>> document describes how it works and how to use it.
>>
>> Signed-off-by: Tal Gilboa <talgi@mellanox.com>
>> ---
>>  Documentation/networking/net_dim.txt | 174 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 174 insertions(+)
>>  create mode 100644 Documentation/networking/net_dim.txt
>>
>> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
>> new file mode 100644
>> index 0000000..9cb31c5
>> --- /dev/null
>> +++ b/Documentation/networking/net_dim.txt
>> @@ -0,0 +1,174 @@
>> +Net DIM - Generic Network Dynamic Interrupt Moderation
>> +======================================================
>> +
>> +Author:
>> +	Tal Gilboa <talgi@mellanox.com>
>> +
>> +
>> +Contents
>> +=========
>> +
>> +- Assumptions
>> +- Introduction
>> +- The Net DIM Algorithm
>> +- Registering a Network Device to DIM
>> +- Example
>> +
>> +Part 0: Assumptions
>> +======================
>> +
>> +This document assumes the reader has basic knowledge in network drivers
>> +and in general interrupt moderation.
>> +
>> +
>> +Part I: Introduction
>> +======================
>> +
>> +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
>> +interrupt moderation configuration of a channel in order to optimize packet
>> +processing. The mechanism includes an algorithm which decides if and how to
>> +change moderation parameters for a channel, usually by performing an analysis on
>> +runtime data sampled from the system. Net DIM is such a mechanism. In each
>> +iteration of the algorithm, it analyses a given sample of the data, compares it
>> +to the previous sample and if required, it can decide to change some of the
>> +interrupt moderation configuration fields. The data sample is composed of data
>> +bandwidth, the number of packets and the number of events. The time between
>> +samples is also measured. Net DIM compares the current and the previous data and
>> +returns an adjusted interrupt moderation configuration object. In some cases,
>> +the algorithm might decide not to change anything. The configuration fields are
>> +the minimum duration (microseconds) allowed between events and the maximum
>> +number of wanted packets per event. The Net DIM algorithm ascribes importance to
>> +increase bandwidth over reducing interrupt rate.
>> +
>> +
>> +Part II: The Net DIM Algorithm
>> +===============================
>> +
>> +Each iteration of the Net DIM algorithm follows these steps:
>> +1. Calculates new data sample.
>> +2. Compares it to previous sample.
>> +3. Makes a decision - suggests interrupt moderation configuration fields.
>> +4. Applies a schedule work function, which applies suggested configuration.
>> +
>> +The first two steps are straightforward, both the new and the previous data are
>> +supplied by the driver registered to Net DIM. The previous data is the new data
>> +supplied to the previous iteration. The comparison step checks the difference
>> +between the new and previous data and decides on the result of the last step.
>> +A step would result as "better" if bandwidth increases and as "worse" if
>> +bandwidth reduces. If there is no change in bandwidth, the packet rate is
>> +compared in a similar fashion - increase == "better" and decrease == "worse".
>> +In case there is no change in the packet rate as well, the interrupt rate is
>> +compared. Here the algorithm tries to optimize for lower interrupt rate so an
>> +increase in the interrupt rate is considered "worse" and a decrease is
>> +considered "better". Step #2 has an optimization for avoiding false results: it
>> +only considers a difference between samples as valid if it is greater than a
>> +certain percentage. Also, since Net DIM does not measure anything by itself, it
>> +assumes the data provided by the driver is valid.
>> +
>> +Step #3 decides on the suggested configuration based on the result from step #2
>> +and the internal state of the algorithm. The states reflect the "direction" of
>> +the algorithm: is it going left (reducing moderation), right (increasing
>> +moderation) or standing still. Another optimization is that if a decision
>> +to stay still is made multiple times, the interval between iterations of the
>> +algorithm would increase in order to reduce calculation overhead. Also, after
>> +"parking" on one of the most left or most right decisions, the algorithm may
>> +decide to verify this decision by taking a step in the other direction. This is
>> +done in order to avoid getting stuck in a "deep sleep" scenario. Once a
>> +decision is made, an interrupt moderation configuration is selected from
>> +the predefined profiles.
> 
> I think a short description of the predefined profiles could help.

Agreed it would help if the different modes
(NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE,
NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE) were expanded a bit further. The
whole term QE sounds very much Ethernet converged adapter to me...

> 
>> +
>> +The last step is to notify the registered driver that it should apply the
>> +suggested configuration. This is done by scheduling a work function, defined by
>> +the Net DIM API and provided by the registered driver.
>> +
>> +As you can see, Net DIM itself does not actively interact with the system. It
>> +would have trouble making the correct decisions if the wrong data is supplied to
>> +it and it would be useless if the work function would not apply the suggested
>> +configuration. This does, however, allow the registered driver some room for
>> +manoeuvre as it may provide partial data or ignore the algorithm suggestion
>> +under some conditions.
>> +
>> +
>> +Part III: Registering a Network Device to DIM
>> +==============================================
>> +
>> +Net DIM API exposes the main function net_dim(struct net_dim *dim,
>> +struct net_dim_sample end_sample). This function is the entry point to the Net
>> +DIM algorithm and has to be called every time the driver would like to check if
>> +it should change interrupt moderation parameters. The driver should provide two
> 
> Is it completely up to the driver to decide when to call net_dim()?
> So it could be based on TX traffic, RX traffic, time, queue depths, etc.?
> 
>> +data structures: struct net_dim and struct net_dim_sample. Struct net_dim
>> +describes the state of DIM for a specific object (RX queue, TX queue,
>> +other queues, etc.). This includes the current selected profile, previous data
>> +samples, the callback function provided by the driver and more.
>> +Struct net_dim_sample describes a data sample, which will be compared to the
>> +data sample stored in struct net_dim in order to decide on the algorithm's next
>> +step. The sample should include bytes, packets and interrupts, measured by
>> +the driver.
>> +
>> +In order to use Net DIM from a networking driver, the driver needs to call the
>> +main net_dim() function. The recommended method is to call net_dim() on each
>> +interrupt. Since Net DIM has a built-in moderation and it might decide to skip
> 
> (continuing my question from above:)
> or on each interrupt.  But the hardware could also be doing interrupt mitigation,
> so each interrupt doesn't always correlate to anything specific.
> 
>> +iterations under certain conditions, there is no need to moderate the net_dim()
>> +calls as well. As mentioned above, the driver needs to provide an object of type
>> +struct net_dim to the net_dim() function call. It is advised for each entity
>> +using Net DIM to hold a struct net_dim as part of its data structure and use it
>> +as the main Net DIM API object. The struct net_dim_sample should hold the latest
>> +bytes, packets and interrupts count. No need to perform any calculations, just
>> +include the raw data.
>> +
>> +The net_dim() call itself does not return anything. Instead Net DIM relies on
>> +the driver to provide a callback function, which is called when the algorithm
>> +decides to make a change in the interrupt moderation parameters. This callback
>> +will be scheduled and run in a separate thread in order not to add overhead to
>> +the data flow. After the work is done, Net DIM algorithm needs to be set to
>> +the proper state in order to move to the next iteration.
>> +
>> +
>> +Part IV: Example
>> +=================
>> +
>> +The following code demonstrates how to register a driver to Net DIM. The actual
>> +usage is not complete but it should make the outline of the usage clear.
>> +
>> +my_driver.c:
>> +
>> +#include <linux/net_dim.h>
>> +
>> +/* Callback for net DIM to schedule on a decision to change moderation */
>> +void my_driver_do_dim_work(struct work_struct *work)
>> +{
>> +	/* Get struct net_dim from struct work_struct */
>> +	struct net_dim *dim = container_of(work, struct net_dim,
>> +					   work);
>> +	/* Do interrupt moderation related stuff */
>> +	...
>> +
>> +	/* Signal net DIM work is done and it should move to next iteration */
>> +	dim->state = NET_DIM_START_MEASURE;
>> +}
>> +
>> +/* My driver's interrupt handler */
>> +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
>> +{
>> +	...
>> +	/* A struct to hold current measured data */
>> +	struct net_dim_sample dim_sample;
>> +	...
>> +	/* Initiate data sample struct with current data */
>> +	net_dim_sample(my_entity->events,
>> +		       my_entity->packets,
>> +		       my_entity->bytes,
>> +		       &dim_sample);
>> +	/* Call net DIM */
>> +	net_dim(&my_entity->dim, dim_sample);
>> +	...
>> +}
>> +
>> +/* My entity's initialization function (my_entity was already allocated) */
>> +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
>> +{
>> +	...
>> +	/* Initiate struct work_struct with my driver's callback function */
>> +	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
>> +	...
>> +}
>>
> 
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> 
> thanks,
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH v2 bpf-next 4/8] tracepoint: compute num_args at build time
From: Linus Torvalds @ 2018-03-21 19:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Daniel Borkmann, Peter Zijlstra, Steven Rostedt,
	Network Development, kernel-team, Linux API
In-Reply-To: <20180321185448.2806324-5-ast@fb.com>

On Wed, Mar 21, 2018 at 11:54 AM, Alexei Starovoitov <ast@fb.com> wrote:
>
> add fancy macro to compute number of arguments passed into tracepoint
> at compile time and store it as part of 'struct tracepoint'.

We should probably do this __COUNT() thing in some generic header, we
just talked last week about another use case entirely.

And wouldn't it be nice to just have some generic infrastructure like this:

    /*
     * This counts to ten.
     *
     * Any more than that, and we'd need to take off our shoes
     */
    #define __GET_COUNT(_0,_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_n,...) _n
    #define __COUNT(...) \
        __GET_COUNT(__VA_ARGS__,10,9,8,7,6,5,4,3,2,1,0)
    #define COUNT(...) __COUNT(dummy,##__VA_ARGS__)

    #define __CONCAT(a,b) a##b
    #define __CONCATENATE(a,b) __CONCAT(a,b)

and then you can do things like:

    #define fn(...) __CONCATENATE(fn,COUNT(__VA_ARGS__))(__VA_ARGS__)

which turns "fn(x,y,z..)" into "fn<N>(x,y,z)".

That can be useful for things like "max(a,b,c,d)" expanding to
"max4()", and then you can just have the trivial

  #define max3(a,b,c) max2(a,max2(b.c))

etc (with proper parentheses, of course).

And I'd rather not have that function name concatenation be part of
the counting logic, because we actually may have different ways of
counting, so the concatenation is separate.

In particular, the other situation this came up for, the counting was
in arguments _pairs_, so you'd use a "COUNT_PAIR()" instead of
"COUNT()".

NOTE NOTE NOTE! The above was slightly tested and then cut-and-pasted.
I might have screwed up at any point. Think of it as pseudo-code.

         Linus

^ permalink raw reply

* RE: [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local variables
From: Yan Markman @ 2018-03-21 19:57 UTC (permalink / raw)
  To: Maxime Chevallier, davem@davemloft.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Antoine Tenart, thomas.petazzoni@bootlin.com,
	gregory.clement@bootlin.com, miquel.raynal@bootlin.com,
	Nadav Haklai, Stefan Chulski, mw@semihalf.com
In-Reply-To: <20180321151400.6658-1-maxime.chevallier@bootlin.com>

Hi Maxime

Please check the TWO points:

1). The mvpp2_prs_flow_find() returns TID if found
    The TID=0 is valid FOUND value
    For Not-found use -ENOENT (just like your mvpp2_prs_vlan_find)

2). The original code always uses "mvpp2_prs_entry *pe" storage Zero-Allocated
   Please check the correctnes of new "mvpp2_prs_entry pe" without 
        memset(pe, 0, sizeof(pe));
   in all procedures where pe=kzalloc() has been replaced

Thanks
Yan Markman
Tel. 05-44732819


-----Original Message-----
From: Maxime Chevallier [mailto:maxime.chevallier@bootlin.com] 
Sent: Wednesday, March 21, 2018 5:14 PM
To: davem@davemloft.net
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Antoine Tenart <antoine.tenart@bootlin.com>; thomas.petazzoni@bootlin.com; gregory.clement@bootlin.com; miquel.raynal@bootlin.com; Nadav Haklai <nadavh@marvell.com>; Stefan Chulski <stefanc@marvell.com>; Yan Markman <ymarkman@marvell.com>; mw@semihalf.com
Subject: [PATCH net-next v2] net: mvpp2: Don't use dynamic allocs for local variables

Some helper functions that search for given entries in the TCAM filter on PPv2 controller make use of dynamically alloced temporary variables, allocated with GFP_KERNEL. These functions can be called in atomic context, and dynamic alloc is not really needed in these cases anyways.

This commit gets rid of dynamic allocs and use stack allocation in the following functions, and where they're used :
 - mvpp2_prs_flow_find
 - mvpp2_prs_vlan_find
 - mvpp2_prs_double_vlan_find
 - mvpp2_prs_mac_da_range_find

For all these functions, instead of returning an temporary object representing the TCAM entry, we simply return the TCAM id that matches the requested entry.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
V2: Remove unnecessary brackets, following Antoine Tenart's review.

 drivers/net/ethernet/marvell/mvpp2.c | 289 +++++++++++++++--------------------
 1 file changed, 127 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 9bd35f2291d6..28e33e139178 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -1913,16 +1913,11 @@ static void mvpp2_prs_sram_offset_set(struct mvpp2_prs_entry *pe,  }
 
 /* Find parser flow entry */
-static struct mvpp2_prs_entry *mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
+static int mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_FLOWS);
-
 	/* Go through the all entires with MVPP2_PRS_LU_FLOWS */
 	for (tid = MVPP2_PRS_TCAM_SRAM_SIZE - 1; tid >= 0; tid--) {
 		u8 bits;
@@ -1931,17 +1926,16 @@ static struct mvpp2_prs_entry *mvpp2_prs_flow_find(struct mvpp2 *priv, int flow)
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_FLOWS)
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
-		bits = mvpp2_prs_sram_ai_get(pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
+		bits = mvpp2_prs_sram_ai_get(&pe);
 
 		/* Sram store classification lookup ID in AI bits [5:0] */
 		if ((bits & MVPP2_PRS_FLOW_ID_MASK) == flow)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Return first free tcam index, seeking from start to end */ @@ -2189,16 +2183,12 @@ static void mvpp2_prs_dsa_tag_ethertype_set(struct mvpp2 *priv, int port,  }
 
 /* Search for existing single/triple vlan entry */ -static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
-						   unsigned short tpid, int ai)
+static int mvpp2_prs_vlan_find(struct mvpp2 *priv, unsigned short tpid, 
+int ai)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
+	memset(&pe, 0, sizeof(pe));
 
 	/* Go through the all entries with MVPP2_PRS_LU_VLAN */
 	for (tid = MVPP2_PE_FIRST_FREE_TID;
@@ -2210,19 +2200,19 @@ static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_VLAN)
 			continue;
 
-		pe->index = tid;
+		pe.index = tid;
 
-		mvpp2_prs_hw_read(priv, pe);
-		match = mvpp2_prs_tcam_data_cmp(pe, 0, swab16(tpid));
+		mvpp2_prs_hw_read(priv, &pe);
+		match = mvpp2_prs_tcam_data_cmp(&pe, 0, swab16(tpid));
 		if (!match)
 			continue;
 
 		/* Get vlan type */
-		ri_bits = mvpp2_prs_sram_ri_get(pe);
+		ri_bits = mvpp2_prs_sram_ri_get(&pe);
 		ri_bits &= MVPP2_PRS_RI_VLAN_MASK;
 
 		/* Get current ai value from tcam */
-		ai_bits = mvpp2_prs_tcam_ai_get(pe);
+		ai_bits = mvpp2_prs_tcam_ai_get(&pe);
 		/* Clear double vlan bit */
 		ai_bits &= ~MVPP2_PRS_DBL_VLAN_AI_BIT;
 
@@ -2231,33 +2221,30 @@ static struct mvpp2_prs_entry *mvpp2_prs_vlan_find(struct mvpp2 *priv,
 
 		if (ri_bits == MVPP2_PRS_RI_VLAN_SINGLE ||
 		    ri_bits == MVPP2_PRS_RI_VLAN_TRIPLE)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Add/update single/triple vlan entry */  static int mvpp2_prs_vlan_add(struct mvpp2 *priv, unsigned short tpid, int ai,
 			      unsigned int port_map)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid_aux, tid;
 	int ret = 0;
 
-	pe = mvpp2_prs_vlan_find(priv, tpid, ai);
+	tid = mvpp2_prs_vlan_find(priv, tpid, ai);
 
-	if (!pe) {
+	if (tid < 0) {
 		/* Create new tcam entry */
 		tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_LAST_FREE_TID,
 						MVPP2_PE_FIRST_FREE_TID);
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
+		memset(&pe, 0, sizeof(pe));
 
 		/* Get last double vlan tid */
 		for (tid_aux = MVPP2_PE_LAST_FREE_TID; @@ -2268,49 +2255,47 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, unsigned short tpid, int ai,
 			    priv->prs_shadow[tid_aux].lu != MVPP2_PRS_LU_VLAN)
 				continue;
 
-			pe->index = tid_aux;
-			mvpp2_prs_hw_read(priv, pe);
-			ri_bits = mvpp2_prs_sram_ri_get(pe);
+			pe.index = tid_aux;
+			mvpp2_prs_hw_read(priv, &pe);
+			ri_bits = mvpp2_prs_sram_ri_get(&pe);
 			if ((ri_bits & MVPP2_PRS_RI_VLAN_MASK) ==
 			    MVPP2_PRS_RI_VLAN_DOUBLE)
 				break;
 		}
 
-		if (tid <= tid_aux) {
-			ret = -EINVAL;
-			goto free_pe;
-		}
+		if (tid <= tid_aux)
+			return -EINVAL;
 
-		memset(pe, 0, sizeof(*pe));
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+		mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
+		pe.index = tid;
 
-		mvpp2_prs_match_etype(pe, 0, tpid);
+		mvpp2_prs_match_etype(&pe, 0, tpid);
 
 		/* VLAN tag detected, proceed with VID filtering */
-		mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_VID);
+		mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_VID);
 
 		/* Clear all ai bits for next iteration */
-		mvpp2_prs_sram_ai_update(pe, 0, MVPP2_PRS_SRAM_AI_MASK);
+		mvpp2_prs_sram_ai_update(&pe, 0, MVPP2_PRS_SRAM_AI_MASK);
 
 		if (ai == MVPP2_PRS_SINGLE_VLAN_AI) {
-			mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_SINGLE,
+			mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_SINGLE,
 						 MVPP2_PRS_RI_VLAN_MASK);
 		} else {
 			ai |= MVPP2_PRS_DBL_VLAN_AI_BIT;
-			mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_TRIPLE,
+			mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_TRIPLE,
 						 MVPP2_PRS_RI_VLAN_MASK);
 		}
-		mvpp2_prs_tcam_ai_update(pe, ai, MVPP2_PRS_SRAM_AI_MASK);
+		mvpp2_prs_tcam_ai_update(&pe, ai, MVPP2_PRS_SRAM_AI_MASK);
 
-		mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_VLAN);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 	/* Update ports' mask */
-	mvpp2_prs_tcam_port_map_set(pe, port_map);
+	mvpp2_prs_tcam_port_map_set(&pe, port_map);
 
-	mvpp2_prs_hw_write(priv, pe);
-free_pe:
-	kfree(pe);
+	mvpp2_prs_hw_write(priv, &pe);
 
 	return ret;
 }
@@ -2329,17 +2314,14 @@ static int mvpp2_prs_double_vlan_ai_free_get(struct mvpp2 *priv)  }
 
 /* Search for existing double vlan entry */ -static struct mvpp2_prs_entry *mvpp2_prs_double_vlan_find(struct mvpp2 *priv,
-							  unsigned short tpid1,
-							  unsigned short tpid2)
+static int mvpp2_prs_double_vlan_find(struct mvpp2 *priv, unsigned short tpid1,
+				      unsigned short tpid2)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
+	memset(&pe, 0, sizeof(pe));
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
 
 	/* Go through the all entries with MVPP2_PRS_LU_VLAN */
 	for (tid = MVPP2_PE_FIRST_FREE_TID;
@@ -2351,22 +2333,21 @@ static struct mvpp2_prs_entry *mvpp2_prs_double_vlan_find(struct mvpp2 *priv,
 		    priv->prs_shadow[tid].lu != MVPP2_PRS_LU_VLAN)
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
 
-		match = mvpp2_prs_tcam_data_cmp(pe, 0, swab16(tpid1))
-			&& mvpp2_prs_tcam_data_cmp(pe, 4, swab16(tpid2));
+		match = mvpp2_prs_tcam_data_cmp(&pe, 0, swab16(tpid1)) &&
+			mvpp2_prs_tcam_data_cmp(&pe, 4, swab16(tpid2));
 
 		if (!match)
 			continue;
 
-		ri_mask = mvpp2_prs_sram_ri_get(pe) & MVPP2_PRS_RI_VLAN_MASK;
+		ri_mask = mvpp2_prs_sram_ri_get(&pe) & MVPP2_PRS_RI_VLAN_MASK;
 		if (ri_mask == MVPP2_PRS_RI_VLAN_DOUBLE)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Add or update double vlan entry */
@@ -2374,28 +2355,24 @@ static int mvpp2_prs_double_vlan_add(struct mvpp2 *priv, unsigned short tpid1,
 				     unsigned short tpid2,
 				     unsigned int port_map)
 {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid_aux, tid, ai, ret = 0;
 
-	pe = mvpp2_prs_double_vlan_find(priv, tpid1, tpid2);
+	tid = mvpp2_prs_double_vlan_find(priv, tpid1, tpid2);
 
-	if (!pe) {
+	if (tid < 0) {
 		/* Create new tcam entry */
 		tid = mvpp2_prs_tcam_first_free(priv, MVPP2_PE_FIRST_FREE_TID,
 				MVPP2_PE_LAST_FREE_TID);
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
+		memset(&pe, 0, sizeof(pe));
 
 		/* Set ai value for new double vlan entry */
 		ai = mvpp2_prs_double_vlan_ai_free_get(priv);
-		if (ai < 0) {
-			ret = ai;
-			goto free_pe;
-		}
+		if (ai < 0)
+			return ai;
 
 		/* Get first single/triple vlan tid */
 		for (tid_aux = MVPP2_PE_FIRST_FREE_TID; @@ -2406,46 +2383,45 @@ static int mvpp2_prs_double_vlan_add(struct mvpp2 *priv, unsigned short tpid1,
 			    priv->prs_shadow[tid_aux].lu != MVPP2_PRS_LU_VLAN)
 				continue;
 
-			pe->index = tid_aux;
-			mvpp2_prs_hw_read(priv, pe);
-			ri_bits = mvpp2_prs_sram_ri_get(pe);
+			pe.index = tid_aux;
+			mvpp2_prs_hw_read(priv, &pe);
+			ri_bits = mvpp2_prs_sram_ri_get(&pe);
 			ri_bits &= MVPP2_PRS_RI_VLAN_MASK;
 			if (ri_bits == MVPP2_PRS_RI_VLAN_SINGLE ||
 			    ri_bits == MVPP2_PRS_RI_VLAN_TRIPLE)
 				break;
 		}
 
-		if (tid >= tid_aux) {
-			ret = -ERANGE;
-			goto free_pe;
-		}
+		if (tid >= tid_aux)
+			return -ERANGE;
 
-		memset(pe, 0, sizeof(*pe));
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_VLAN);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+		mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_VLAN);
+		pe.index = tid;
 
 		priv->prs_double_vlans[ai] = true;
 
-		mvpp2_prs_match_etype(pe, 0, tpid1);
-		mvpp2_prs_match_etype(pe, 4, tpid2);
+		mvpp2_prs_match_etype(&pe, 0, tpid1);
+		mvpp2_prs_match_etype(&pe, 4, tpid2);
 
-		mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_VLAN);
 		/* Shift 4 bytes - skip outer vlan tag */
-		mvpp2_prs_sram_shift_set(pe, MVPP2_VLAN_TAG_LEN,
+		mvpp2_prs_sram_shift_set(&pe, MVPP2_VLAN_TAG_LEN,
 					 MVPP2_PRS_SRAM_OP_SEL_SHIFT_ADD);
-		mvpp2_prs_sram_ri_update(pe, MVPP2_PRS_RI_VLAN_DOUBLE,
+		mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_VLAN_DOUBLE,
 					 MVPP2_PRS_RI_VLAN_MASK);
-		mvpp2_prs_sram_ai_update(pe, ai | MVPP2_PRS_DBL_VLAN_AI_BIT,
+		mvpp2_prs_sram_ai_update(&pe, ai | MVPP2_PRS_DBL_VLAN_AI_BIT,
 					 MVPP2_PRS_SRAM_AI_MASK);
 
-		mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_VLAN);
+		mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_VLAN);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 
 	/* Update ports' mask */
-	mvpp2_prs_tcam_port_map_set(pe, port_map);
-	mvpp2_prs_hw_write(priv, pe);
-free_pe:
-	kfree(pe);
+	mvpp2_prs_tcam_port_map_set(&pe, port_map);
+	mvpp2_prs_hw_write(priv, &pe);
+
 	return ret;
 }
 
@@ -3528,7 +3504,7 @@ static int mvpp2_prs_vid_range_find(struct mvpp2 *priv, int pmap, u16 vid,
 		return tid;
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 /* Write parser entry for VID filtering */ @@ -3551,7 +3527,7 @@ static int mvpp2_prs_vid_entry_add(struct mvpp2_port *port, u16 vid)
 		shift = MVPP2_VLAN_TAG_LEN;
 
 	/* No such entry */
-	if (!tid) {
+	if (tid < 0) {
 		memset(&pe, 0, sizeof(pe));
 
 		/* Go through all entries from first to last in vlan range */ @@ -3604,7 +3580,7 @@ static void mvpp2_prs_vid_entry_remove(struct mvpp2_port *port, u16 vid)
 	tid = mvpp2_prs_vid_range_find(priv, (1 << port->id), vid, 0xfff);
 
 	/* No such entry */
-	if (tid)
+	if (tid < 0)
 		return;
 
 	mvpp2_prs_hw_inv(priv, tid);
@@ -3771,18 +3747,13 @@ static bool mvpp2_prs_mac_range_equals(struct mvpp2_prs_entry *pe,  }
 
 /* Find tcam entry with matched pair <MAC DA, port> */ -static struct mvpp2_prs_entry *
+static int
 mvpp2_prs_mac_da_range_find(struct mvpp2 *priv, int pmap, const u8 *da,
 			    unsigned char *mask, int udf_type)  {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = kzalloc(sizeof(*pe), GFP_ATOMIC);
-	if (!pe)
-		return NULL;
-	mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_MAC);
-
 	/* Go through the all entires with MVPP2_PRS_LU_MAC */
 	for (tid = MVPP2_PE_MAC_RANGE_START;
 	     tid <= MVPP2_PE_MAC_RANGE_END; tid++) { @@ -3793,17 +3764,16 @@ mvpp2_prs_mac_da_range_find(struct mvpp2 *priv, int pmap, const u8 *da,
 		    (priv->prs_shadow[tid].udf != udf_type))
 			continue;
 
-		pe->index = tid;
-		mvpp2_prs_hw_read(priv, pe);
-		entry_pmap = mvpp2_prs_tcam_port_map_get(pe);
+		pe.index = tid;
+		mvpp2_prs_hw_read(priv, &pe);
+		entry_pmap = mvpp2_prs_tcam_port_map_get(&pe);
 
-		if (mvpp2_prs_mac_range_equals(pe, da, mask) &&
+		if (mvpp2_prs_mac_range_equals(&pe, da, mask) &&
 		    entry_pmap == pmap)
-			return pe;
+			return tid;
 	}
-	kfree(pe);
 
-	return NULL;
+	return -ENOENT;
 }
 
 /* Update parser's mac da entry */
@@ -3813,15 +3783,15 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 	unsigned char mask[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
 	struct mvpp2 *priv = port->priv;
 	unsigned int pmap, len, ri;
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
 	/* Scan TCAM and see if entry with this <MAC DA, port> already exist */
-	pe = mvpp2_prs_mac_da_range_find(priv, BIT(port->id), da, mask,
-					 MVPP2_PRS_UDF_MAC_DEF);
+	tid = mvpp2_prs_mac_da_range_find(priv, BIT(port->id), da, mask,
+					  MVPP2_PRS_UDF_MAC_DEF);
 
 	/* No such entry */
-	if (!pe) {
+	if (tid < 0) {
 		if (!add)
 			return 0;
 
@@ -3833,39 +3803,39 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_ATOMIC);
-		if (!pe)
-			return -ENOMEM;
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_MAC);
-		pe->index = tid;
+		memset(&pe, 0, sizeof(pe));
+
+		pe.index = tid;
 
 		/* Mask all ports */
-		mvpp2_prs_tcam_port_map_set(pe, 0);
+		mvpp2_prs_tcam_port_map_set(&pe, 0);
+	} else {
+		mvpp2_prs_hw_read(priv, &pe);
 	}
 
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_MAC);
+
 	/* Update port mask */
-	mvpp2_prs_tcam_port_set(pe, port->id, add);
+	mvpp2_prs_tcam_port_set(&pe, port->id, add);
 
 	/* Invalidate the entry if no ports are left enabled */
-	pmap = mvpp2_prs_tcam_port_map_get(pe);
+	pmap = mvpp2_prs_tcam_port_map_get(&pe);
 	if (pmap == 0) {
-		if (add) {
-			kfree(pe);
+		if (add)
 			return -EINVAL;
-		}
-		mvpp2_prs_hw_inv(priv, pe->index);
-		priv->prs_shadow[pe->index].valid = false;
-		kfree(pe);
+
+		mvpp2_prs_hw_inv(priv, pe.index);
+		priv->prs_shadow[pe.index].valid = false;
 		return 0;
 	}
 
 	/* Continue - set next lookup */
-	mvpp2_prs_sram_next_lu_set(pe, MVPP2_PRS_LU_DSA);
+	mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_DSA);
 
 	/* Set match on DA */
 	len = ETH_ALEN;
 	while (len--)
-		mvpp2_prs_tcam_data_byte_set(pe, len, da[len], 0xff);
+		mvpp2_prs_tcam_data_byte_set(&pe, len, da[len], 0xff);
 
 	/* Set result info bits */
 	if (is_broadcast_ether_addr(da)) {
@@ -3879,21 +3849,19 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
 			ri |= MVPP2_PRS_RI_MAC_ME_MASK;
 	}
 
-	mvpp2_prs_sram_ri_update(pe, ri, MVPP2_PRS_RI_L2_CAST_MASK |
+	mvpp2_prs_sram_ri_update(&pe, ri, MVPP2_PRS_RI_L2_CAST_MASK |
 				 MVPP2_PRS_RI_MAC_ME_MASK);
-	mvpp2_prs_shadow_ri_set(priv, pe->index, ri, MVPP2_PRS_RI_L2_CAST_MASK |
+	mvpp2_prs_shadow_ri_set(priv, pe.index, ri, MVPP2_PRS_RI_L2_CAST_MASK 
+|
 				MVPP2_PRS_RI_MAC_ME_MASK);
 
 	/* Shift to ethertype */
-	mvpp2_prs_sram_shift_set(pe, 2 * ETH_ALEN,
+	mvpp2_prs_sram_shift_set(&pe, 2 * ETH_ALEN,
 				 MVPP2_PRS_SRAM_OP_SEL_SHIFT_ADD);
 
 	/* Update shadow table and hw entry */
-	priv->prs_shadow[pe->index].udf = MVPP2_PRS_UDF_MAC_DEF;
-	mvpp2_prs_shadow_set(priv, pe->index, MVPP2_PRS_LU_MAC);
-	mvpp2_prs_hw_write(priv, pe);
-
-	kfree(pe);
+	priv->prs_shadow[pe.index].udf = MVPP2_PRS_UDF_MAC_DEF;
+	mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_MAC);
+	mvpp2_prs_hw_write(priv, &pe);
 
 	return 0;
 }
@@ -4014,13 +3982,13 @@ static int mvpp2_prs_tag_mode_set(struct mvpp2 *priv, int port, int type)
 /* Set prs flow for the port */
 static int mvpp2_prs_def_flow(struct mvpp2_port *port)  {
-	struct mvpp2_prs_entry *pe;
+	struct mvpp2_prs_entry pe;
 	int tid;
 
-	pe = mvpp2_prs_flow_find(port->priv, port->id);
+	tid = mvpp2_prs_flow_find(port->priv, port->id);
 
 	/* Such entry not exist */
-	if (!pe) {
+	if (tid < 0) {
 		/* Go through the all entires from last to first */
 		tid = mvpp2_prs_tcam_first_free(port->priv,
 						MVPP2_PE_LAST_FREE_TID,
@@ -4028,24 +3996,21 @@ static int mvpp2_prs_def_flow(struct mvpp2_port *port)
 		if (tid < 0)
 			return tid;
 
-		pe = kzalloc(sizeof(*pe), GFP_KERNEL);
-		if (!pe)
-			return -ENOMEM;
-
-		mvpp2_prs_tcam_lu_set(pe, MVPP2_PRS_LU_FLOWS);
-		pe->index = tid;
+		pe.index = tid;
 
 		/* Set flow ID*/
-		mvpp2_prs_sram_ai_update(pe, port->id, MVPP2_PRS_FLOW_ID_MASK);
-		mvpp2_prs_sram_bits_set(pe, MVPP2_PRS_SRAM_LU_DONE_BIT, 1);
+		mvpp2_prs_sram_ai_update(&pe, port->id, MVPP2_PRS_FLOW_ID_MASK);
+		mvpp2_prs_sram_bits_set(&pe, MVPP2_PRS_SRAM_LU_DONE_BIT, 1);
 
 		/* Update shadow table */
-		mvpp2_prs_shadow_set(port->priv, pe->index, MVPP2_PRS_LU_FLOWS);
+		mvpp2_prs_shadow_set(port->priv, pe.index, MVPP2_PRS_LU_FLOWS);
+	} else {
+		mvpp2_prs_hw_read(port->priv, &pe);
 	}
 
-	mvpp2_prs_tcam_port_map_set(pe, (1 << port->id));
-	mvpp2_prs_hw_write(port->priv, pe);
-	kfree(pe);
+	mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_FLOWS);
+	mvpp2_prs_tcam_port_map_set(&pe, (1 << port->id));
+	mvpp2_prs_hw_write(port->priv, &pe);
 
 	return 0;
 }
--
2.11.0

^ permalink raw reply related

* RE: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP time stamping.
From: Keller, Jacob E @ 2018-03-21 20:05 UTC (permalink / raw)
  To: Richard Cochran, netdev@vger.kernel.org
  Cc: devicetree@vger.kernel.org, Andrew Lunn, David Miller,
	Florian Fainelli, Mark Rutland, Miroslav Lichvar, Rob Herring,
	Willem de Bruijn
In-Reply-To: <60ae964d6a0da497bcac1d3fdb5b3fe01f5d70f1.1521656774.git.richardcochran@gmail.com>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Richard Cochran
> Sent: Wednesday, March 21, 2018 11:58 AM
> To: netdev@vger.kernel.org
> Cc: devicetree@vger.kernel.org; Andrew Lunn <andrew@lunn.ch>; David Miller
> <davem@davemloft.net>; Florian Fainelli <f.fainelli@gmail.com>; Mark Rutland
> <mark.rutland@arm.com>; Miroslav Lichvar <mlichvar@redhat.com>; Rob
> Herring <robh+dt@kernel.org>; Willem de Bruijn <willemb@google.com>
> Subject: [PATCH net-next RFC V1 1/5] net: Introduce peer to peer one step PTP
> +
> +	/*
> +	 * Same as HWTSTAMP_TX_ONESTEP_SYNC, but also enables time
> +	 * stamp insertion directly into PDelay_Resp packets. In this
> +	 * case, neither transmitted Sync nor PDelay_Resp packets will
> +	 * receive a time stamp via the socket error queue.
> +	 */
> +	HWTSTAMP_TX_ONESTEP_P2P,
>  };
> 

I am guessing that we expect all devices which support onestep P2P messages, will always support onestep SYNC as well?

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 20:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team
In-Reply-To: <CAKgT0UftYHNnmkVX64NfuvHDmdjghVnN7bQCoenAOD3dK61gOg@mail.gmail.com>



On 3/21/18 7:59 AM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 10:02 PM, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>> On 3/20/18 4:50 PM, Alexander Duyck wrote:
>>>
>>> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>>> function skb_segment(), line 3667. The bpf program attaches to
>>>> clsact ingress, calls bpf_skb_change_proto to change protocol
>>>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>>>> to send the changed packet out.
>>>>
>>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>> 3473                             netdev_features_t features)
>>>> 3474 {
>>>> 3475         struct sk_buff *segs = NULL;
>>>> 3476         struct sk_buff *tail = NULL;
>>>> ...
>>>> 3665                 while (pos < offset + len) {
>>>> 3666                         if (i >= nfrags) {
>>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>>> 3668
>>>> 3669                                 i = 0;
>>>> 3670                                 nfrags =
>>>> skb_shinfo(list_skb)->nr_frags;
>>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>>> 3672                                 frag_skb = list_skb;
>>>> ...
>>>>
>>>> call stack:
>>>> ...
>>>>    #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>>    #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>>    #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>>    #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>>    #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>>    #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>>    #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>>    #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>>       [exception RIP: skb_segment+3044]
>>>>       RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>>       RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>>       RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>>       RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>>       R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>>       R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>    #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>>> --- <IRQ stack> ---
>>>> ...
>>>>
>>>> The triggering input skb has the following properties:
>>>>       list_skb = skb->frag_list;
>>>>       skb->nfrags != NULL && skb_headlen(list_skb) != 0
>>>> and skb_segment() is not able to handle a frag_list skb
>>>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>>>
>>>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>>>> case properly if list_skb->head_frag is true, which is expected in
>>>> most cases. The head frag is processed before list_skb->frags
>>>> are processed.
>>>>
>>>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>>> ---
>>>>    net/core/skbuff.c | 51
>>>> +++++++++++++++++++++++++++++++++++++--------------
>>>>    1 file changed, 37 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 715c134..59bbc06 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>           struct sk_buff *segs = NULL;
>>>>           struct sk_buff *tail = NULL;
>>>>           struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>>>> -       skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>>>> +       skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag =
>>>> NULL;
>>>
>>>
>>> I think you misunderstood me. I wasn't saying you allocate head_frag.
>>> I was saying you could move the declaration down.
>>
>>
>> Sorry for my misunderstanding. I did understand your intention of moving
>> the declaration down in order to save stack space. I thought that we cannot
>> really move declaration down (although it works in C, but semantically it is
>> not quite right, more later), so I moved on to
>> use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
>> could live on the stack.
>>
>>>
>>>>           unsigned int mss = skb_shinfo(head_skb)->gso_size;
>>>>           unsigned int doffset = head_skb->data -
>>>> skb_mac_header(head_skb);
>>>>           struct sk_buff *frag_skb = head_skb;
>>>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>
>>>>                   while (pos < offset + len) {
>>>
>>>
>>> So right here in the loop you could add a "skb_frag_t head_frag;" just
>>> so we declare it here and save ourselves the stack space.
>>
>>
>> I actually tried to move "skb_frag_t head_frag". The stack size remains the
>> same, 0xc0. This is related to how C compiler allocates stack space.
>> The declaration place won't decide the stack size as long as the declaration
>> dictates the usage. The stack size is really determined by liveness
>> analysis.
>>
>> Further, we have code like:
>>          do {
>>             ....
>>             while (pos < offset + len) {
>>                 if (i >= nfrags) {
>>                     ...
>>                     head_frag = ...
>>                 }
>>                 ... = head_frag; // head_frag access guaranteed after
>>                                  // above definition, but it may not
>>                                  // be in the same outer do-while loop.
>>             }
>>             ...
>>           } while (((offset += len) < head_skb->len);
>>
>> So the use of head_frag maybe in different outer loop iterations.
>> So I feel the definition of head_frag should be outside the
>> outer do-while loop, which is the main function scope. I will add some
>> comments here.
> 
> So the point I had is that head_frag doesn't need to live that long.
> All you are doing is arranging the data so that you can essentially
> just dump it into *nskb_frag.
> 
> One alternative you could look at would be to rearrange the code so
> that the MAX_SKB_FRAGS check occurs first, then perform the check for
> (i >= nfrags). Doing it that way we end up bypassing the need for
> head_frag entirely as you could just populate *nskb_frag directly. You
> could probably just add an inline structure that would convert the
> head frag into a skb_frag_t structure and return that. Something along
> the lines of:
>          *nskb_frag = (i < 0) ? skb_head_frag_to_page_desc(frag_skb) : *frag;

This mechanism (inline function skb_head_frag_to_page_desc) works great!
I removed the local variable and the total stack size remains unchanged
with the patch.

> 
> Doing it that way you could pull out the bits below that were
> populating head frag and just have it all handled as an inline
> function. An added advantage is that it should make the line wrapping
> easier to deal with. :-)
> 
>>>
>>>>                           if (i >= nfrags) {
>>>> -                               BUG_ON(skb_headlen(list_skb));
>>>> -
>>>>                                   i = 0;
>>>> +                               if (skb_headlen(list_skb)) {
>>>> +                                       struct page *page;
>>>> +
>>>> +                                       BUG_ON(!list_skb->head_frag);
>>>> +
>>>> +                                       page =
>>>> virt_to_head_page(list_skb->head);
>>>> +                                       if (!head_frag) {
>>>> +                                               head_frag =
>>>> kmalloc(sizeof(skb_frag_t),
>>>> +
>>>> GFP_KERNEL);
>>>> +                                               if (!head_frag)
>>>> +                                                       goto err;
>>>> +                                       }
>>>
>>>
>>> Please no memory allocation. I just meant you could allocate it on the
>>> stack later.
>>>
>>>> +                                       head_frag->page.p = page;
>>>> +                                       head_frag->page_offset =
>>>> list_skb->data -
>>>> +                                               (unsigned char
>>>> *)page_address(page);
>>>> +                                       head_frag->size =
>>>> skb_headlen(list_skb);
>>>> +                                       /* set i = -1 so we will pick
>>>> head_frag
>>>> +                                        * instead of
>>>> skb_shinfo(list_skb)->frags
>>>> +                                        * when i == -1.
>>>> +                                        */
>>>> +                                       i = -1;
>>>> +                               }
>>>
>>>
>>> So it took me a bit to pick up on the fact that line below wasn't
>>> removed. So we are basically trying to do this all in one pass now. Do
>>> I have that right?
>>>
>>> One thing you could look at doing to save yourself the extra "if"
>>> later would be to pull frag pointer before you go through skb_headlen
>>> check above. Then if you are going to use a head_frag you could just
>>> do a "i--; frag--;" combination just to rewind and make the room for
>>> the increment to come later. That way you don't have an invalid frag
>>> pointer floating around. That way you only have to do this once
>>> instead of having to do a conditional check per fragment.
>>
>>
>> Right. This indeed make code more cleaner.
>>
>>>
>>>>                                   nfrags = skb_shinfo(list_skb)->nr_frags;
>>>> -                               frag = skb_shinfo(list_skb)->frags;
>>>
>>>
>>> This patch might be more readable if you were to just insert the
>>> skb_headlen() bits down here and left the i=0 through frag = .. in one
>>> piece.
>>
>>
>> Right. Will implement as suggested.
>>
>>>
>>>> -                               frag_skb = list_skb;
>>>> -
>>>> -                               BUG_ON(!nfrags);
>>>> -
>>>> -                               if (skb_orphan_frags(frag_skb,
>>>> GFP_ATOMIC) ||
>>>> -                                   skb_zerocopy_clone(nskb, frag_skb,
>>>> -                                                      GFP_ATOMIC))
>>>> -                                       goto err;
>>>> +                               if (nfrags) {
>>>> +                                       frag =
>>>> skb_shinfo(list_skb)->frags;
>>>> +                                       frag_skb = list_skb;
>>>> +
>>>> +                                       if (skb_orphan_frags(frag_skb,
>>>> GFP_ATOMIC) ||
>>>> +                                           skb_zerocopy_clone(nskb,
>>>> frag_skb,
>>>> +
>>>> GFP_ATOMIC))
>>>> +                                               goto err;
>>>> +                               }
>>>>
>>>>                                   list_skb = list_skb->next;
>>>>                           }
>>>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>                                   goto err;
>>>>                           }
>>>>
>>>> -                       *nskb_frag = *frag;
>>>> +                       *nskb_frag = (i == -1) ? *head_frag : *frag;
>>>
>>>
>>> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
>>
>>
>> Good suggestion. Will implement as suggested.
>>
>>
>>>
>>>>                           __skb_frag_ref(nskb_frag);
>>>>                           size = skb_frag_size(nskb_frag);
>>>>
>>>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>
>>>>                           if (pos + size <= offset + len) {
>>>>                                   i++;
>>>> -                               frag++;
>>>> +                               if (i != 0)
>>>> +                                       frag++;
>>>>                                   pos += size;
>>>>                           } else {
>>>>                                   skb_frag_size_sub(nskb_frag, pos + size
>>>> - (offset + len));
>>>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff
>>>> *head_skb,
>>>>                   swap(tail->destructor, head_skb->destructor);
>>>>                   swap(tail->sk, head_skb->sk);
>>>>           }
>>>> +       kfree(head_frag);
>>>>           return segs;
>>>>
>>>>    err:
>>>>           kfree_skb_list(segs);
>>>> +       kfree(head_frag);
>>>>           return ERR_PTR(err);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(skb_segment);
>>>> --
>>>> 2.9.5
>>>>
>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox