Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next 09/10] ixgbe: Update adaptive ITR algorithm
From: Jeff Kirsher @ 2017-10-09 18:39 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20171009184000.80053-1-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

The following change is meant to update the adaptive ITR algorithm to
better support the needs of the network. Specifically with this change what
I have done is make it so that our ITR algorithm will try to prevent either
starving a socket buffer for memory in the case of Tx, or overrunning an Rx
socket buffer on receive.

In addition a side effect of the calculations used is that we should
function better with new features such as XDP which can handle small
packets at high rates without needing to lock us into NAPI polling mode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   7 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  11 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 215 +++++++++++++++++++-------
 3 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 008d0085e01f..468c3555a629 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -435,8 +435,15 @@ static inline unsigned int ixgbe_rx_pg_order(struct ixgbe_ring *ring)
 }
 #define ixgbe_rx_pg_size(_ring) (PAGE_SIZE << ixgbe_rx_pg_order(_ring))
 
+#define IXGBE_ITR_ADAPTIVE_MIN_INC	2
+#define IXGBE_ITR_ADAPTIVE_MIN_USECS	10
+#define IXGBE_ITR_ADAPTIVE_MAX_USECS	126
+#define IXGBE_ITR_ADAPTIVE_LATENCY	0x80
+#define IXGBE_ITR_ADAPTIVE_BULK		0x00
+
 struct ixgbe_ring_container {
 	struct ixgbe_ring *ring;	/* pointer to linked list of rings */
+	unsigned long next_update;	/* jiffies value of last update */
 	unsigned int total_bytes;	/* total bytes processed this int */
 	unsigned int total_packets;	/* total packets processed this int */
 	u16 work_limit;			/* total work allowed per interrupt */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f1bfae0c41d0..8e2a957aca18 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -806,6 +806,7 @@ static void ixgbe_add_ring(struct ixgbe_ring *ring,
 	ring->next = head->ring;
 	head->ring = ring;
 	head->count++;
+	head->next_update = jiffies + 1;
 }
 
 /**
@@ -879,8 +880,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* initialize work limits */
 	q_vector->tx.work_limit = adapter->tx_work_limit;
 
-	/* initialize pointer to rings */
-	ring = q_vector->ring;
+	/* Initialize setting for adaptive ITR */
+	q_vector->tx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	q_vector->rx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
 
 	/* intialize ITR */
 	if (txr_count && !rxr_count) {
@@ -897,6 +901,9 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 			q_vector->itr = adapter->rx_itr_setting;
 	}
 
+	/* initialize pointer to rings */
+	ring = q_vector->ring;
+
 	while (txr_count) {
 		/* assign generic ring traits */
 		ring->dev = &adapter->pdev->dev;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 211074934d5b..5e2686d106db 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2540,50 +2540,174 @@ enum latency_range {
 static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
 			     struct ixgbe_ring_container *ring_container)
 {
-	int bytes = ring_container->total_bytes;
-	int packets = ring_container->total_packets;
-	u32 timepassed_us;
-	u64 bytes_perint;
-	u8 itr_setting = ring_container->itr;
+	unsigned int itr = IXGBE_ITR_ADAPTIVE_MIN_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	unsigned int avg_wire_size, packets, bytes;
+	unsigned long next_update = jiffies;
 
-	if (packets == 0)
+	/* If we don't have any rings just leave ourselves set for maximum
+	 * possible latency so we take ourselves out of the equation.
+	 */
+	if (!ring_container->ring)
 		return;
 
-	/* simple throttlerate management
-	 *   0-10MB/s   lowest (100000 ints/s)
-	 *  10-20MB/s   low    (20000 ints/s)
-	 *  20-1249MB/s bulk   (12000 ints/s)
+	/* If we didn't update within up to 1 - 2 jiffies we can assume
+	 * that either packets are coming in so slow there hasn't been
+	 * any work, or that there is so much work that NAPI is dealing
+	 * with interrupt moderation and we don't need to do anything.
 	 */
-	/* what was last interrupt timeslice? */
-	timepassed_us = q_vector->itr >> 2;
-	if (timepassed_us == 0)
-		return;
+	if (time_after(next_update, ring_container->next_update))
+		goto clear_counts;
 
-	bytes_perint = bytes / timepassed_us; /* bytes/usec */
+	packets = ring_container->total_packets;
 
-	switch (itr_setting) {
-	case lowest_latency:
-		if (bytes_perint > 10)
-			itr_setting = low_latency;
-		break;
-	case low_latency:
-		if (bytes_perint > 20)
-			itr_setting = bulk_latency;
-		else if (bytes_perint <= 10)
-			itr_setting = lowest_latency;
+	/* We have no packets to actually measure against. This means
+	 * either one of the other queues on this vector is active or
+	 * we are a Tx queue doing TSO with too high of an interrupt rate.
+	 *
+	 * When this occurs just tick up our delay by the minimum value
+	 * and hope that this extra delay will prevent us from being called
+	 * without any work on our queue.
+	 */
+	if (!packets) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		itr += ring_container->itr & IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto clear_counts;
+	}
+
+	bytes = ring_container->total_bytes;
+
+	/* If packets are less than 4 or bytes are less than 9000 assume
+	 * insufficient data to use bulk rate limiting approach. We are
+	 * likely latency driven.
+	 */
+	if (packets < 4 && bytes < 9000) {
+		itr = IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto adjust_by_size;
+	}
+
+	/* Between 4 and 48 we can assume that our current interrupt delay
+	 * is only slightly too low. As such we should increase it by a small
+	 * fixed amount.
+	 */
+	if (packets < 48) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		goto clear_counts;
+	}
+
+	/* Between 48 and 96 is our "goldilocks" zone where we are working
+	 * out "just right". Just report that our current ITR is good for us.
+	 */
+	if (packets < 96) {
+		itr = q_vector->itr >> 2;
+		goto clear_counts;
+	}
+
+	/* If packet count is 96 or greater we are likely looking at a slight
+	 * overrun of the delay we want. Try halving our delay to see if that
+	 * will cut the number of packets in half per interrupt.
+	 */
+	if (packets < 256) {
+		itr = q_vector->itr >> 3;
+		if (itr < IXGBE_ITR_ADAPTIVE_MIN_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MIN_USECS;
+		goto clear_counts;
+	}
+
+	/* The paths below assume we are dealing with a bulk ITR since number
+	 * of packets is 256 or greater. We are just going to have to compute
+	 * a value and try to bring the count under control, though for smaller
+	 * packet sizes there isn't much we can do as NAPI polling will likely
+	 * be kicking in sooner rather than later.
+	 */
+	itr = IXGBE_ITR_ADAPTIVE_BULK;
+
+adjust_by_size:
+	/* If packet counts are 256 or greater we can assume we have a gross
+	 * overestimation of what the rate should be. Instead of trying to fine
+	 * tune it just use the formula below to try and dial in an exact value
+	 * give the current packet size of the frame.
+	 */
+	avg_wire_size = bytes / packets;
+
+	/* The following is a crude approximation of:
+	 *  wmem_default / (size + overhead) = desired_pkts_per_int
+	 *  rate / bits_per_byte / (size + ethernet overhead) = pkt_rate
+	 *  (desired_pkt_rate / pkt_rate) * usecs_per_sec = ITR value
+	 *
+	 * Assuming wmem_default is 212992 and overhead is 640 bytes per
+	 * packet, (256 skb, 64 headroom, 320 shared info), we can reduce the
+	 * formula down to
+	 *
+	 *  (170 * (size + 24)) / (size + 640) = ITR
+	 *
+	 * We first do some math on the packet size and then finally bitshift
+	 * by 8 after rounding up. We also have to account for PCIe link speed
+	 * difference as ITR scales based on this.
+	 */
+	if (avg_wire_size <= 60) {
+		/* Start at 50k ints/sec */
+		avg_wire_size = 5120;
+	} else if (avg_wire_size <= 316) {
+		/* 50K ints/sec to 16K ints/sec */
+		avg_wire_size *= 40;
+		avg_wire_size += 2720;
+	} else if (avg_wire_size <= 1084) {
+		/* 16K ints/sec to 9.2K ints/sec */
+		avg_wire_size *= 15;
+		avg_wire_size += 11452;
+	} else if (avg_wire_size <= 1980) {
+		/* 9.2K ints/sec to 8K ints/sec */
+		avg_wire_size *= 5;
+		avg_wire_size += 22420;
+	} else {
+		/* plateau at a limit of 8K ints/sec */
+		avg_wire_size = 32256;
+	}
+
+	/* If we are in low latency mode half our delay which doubles the rate
+	 * to somewhere between 100K to 16K ints/sec
+	 */
+	if (itr & IXGBE_ITR_ADAPTIVE_LATENCY)
+		avg_wire_size >>= 1;
+
+	/* Resultant value is 256 times larger than it needs to be. This
+	 * gives us room to adjust the value as needed to either increase
+	 * or decrease the value based on link speeds of 10G, 2.5G, 1G, etc.
+	 *
+	 * Use addition as we have already recorded the new latency flag
+	 * for the ITR value.
+	 */
+	switch (q_vector->adapter->link_speed) {
+	case IXGBE_LINK_SPEED_10GB_FULL:
+	case IXGBE_LINK_SPEED_100_FULL:
+	default:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 256) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
-	case bulk_latency:
-		if (bytes_perint <= 20)
-			itr_setting = low_latency;
+	case IXGBE_LINK_SPEED_2_5GB_FULL:
+	case IXGBE_LINK_SPEED_1GB_FULL:
+	case IXGBE_LINK_SPEED_10_FULL:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 64) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
 	}
 
-	/* clear work counters since we have the values we need */
+clear_counts:
+	/* write back value */
+	ring_container->itr = itr;
+
+	/* next update should occur within next jiffy */
+	ring_container->next_update = next_update + 1;
+
 	ring_container->total_bytes = 0;
 	ring_container->total_packets = 0;
-
-	/* write updated itr to ring container */
-	ring_container->itr = itr_setting;
 }
 
 /**
@@ -2625,34 +2749,19 @@ void ixgbe_write_eitr(struct ixgbe_q_vector *q_vector)
 
 static void ixgbe_set_itr(struct ixgbe_q_vector *q_vector)
 {
-	u32 new_itr = q_vector->itr;
-	u8 current_itr;
+	u32 new_itr;
 
 	ixgbe_update_itr(q_vector, &q_vector->tx);
 	ixgbe_update_itr(q_vector, &q_vector->rx);
 
-	current_itr = max(q_vector->rx.itr, q_vector->tx.itr);
+	/* use the smallest value of new ITR delay calculations */
+	new_itr = min(q_vector->rx.itr, q_vector->tx.itr);
 
-	switch (current_itr) {
-	/* counts and packets in update_itr are dependent on these numbers */
-	case lowest_latency:
-		new_itr = IXGBE_100K_ITR;
-		break;
-	case low_latency:
-		new_itr = IXGBE_20K_ITR;
-		break;
-	case bulk_latency:
-		new_itr = IXGBE_12K_ITR;
-		break;
-	default:
-		break;
-	}
+	/* Clear latency flag if set, shift into correct position */
+	new_itr &= ~IXGBE_ITR_ADAPTIVE_LATENCY;
+	new_itr <<= 2;
 
 	if (new_itr != q_vector->itr) {
-		/* do an exponential smoothing */
-		new_itr = (10 * new_itr * q_vector->itr) /
-			  ((9 * new_itr) + q_vector->itr);
-
 		/* save the algorithm value here */
 		q_vector->itr = new_itr;
 
-- 
2.14.2

^ permalink raw reply related

* [net-next 10/10] ixgbe: fix crash when injecting AER after failed reset
From: Jeff Kirsher @ 2017-10-09 18:40 UTC (permalink / raw)
  To: davem; +Cc: Emil Tantilov, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171009184000.80053-1-jeffrey.t.kirsher@intel.com>

From: Emil Tantilov <emil.s.tantilov@intel.com>

In case where AER recovery fails the device is left in a down state.
Consecutive AER error injection can lead to a double IRQ free.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5e2686d106db..c6f9da7990c7 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10861,6 +10861,9 @@ static pci_ers_result_t ixgbe_io_error_detected(struct pci_dev *pdev,
 	if (!test_bit(__IXGBE_SERVICE_INITED, &adapter->state))
 		return PCI_ERS_RESULT_DISCONNECT;
 
+	if (!netif_device_present(netdev))
+		return PCI_ERS_RESULT_DISCONNECT;
+
 	rtnl_lock();
 	netif_device_detach(netdev);
 
-- 
2.14.2

^ permalink raw reply related

* Re: [PATCH net-next 1/1] [net] bonding: Add NUMA notice
From: Patrick Talbert @ 2017-10-09 19:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20171007.232036.1361815658002685993.davem@davemloft.net>

On Sat, Oct 7, 2017 at 6:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Patrick Talbert <ptalbert@redhat.com>
> Date: Thu,  5 Oct 2017 16:23:45 -0400
>
>> Network performance can suffer when a load balancing bond uses slave
>> interfaces which are in different NUMA domains.
>>
>> This compares the NUMA domain of a newly enslaved interface against any
>> existing enslaved interfaces and prints a warning if they do not match.
>>
>> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
>
> This is a bit over the top, and doesn't even handle cases where
> the device has no specific NUMA node (-1).

Hello David,

The motivation was simply to have a notification in place if the
interface to-be-enslaved does not match the existing one(s). We have
seen performance issues with bonding which were eventually tracked
down to this mismatched NUMA domain issue. I thought it was worth
having the *mild* warning because it can have a decent impact yet is
probably not an obvious thing to check for most users.

Though I now see your point. If the bonded interfaces are VLANs, and
their underlying physical interfaces happen to be in different NUMA
domains, then my check will not know as the VLAN interface numa_node
member will be -1 no matter what. That's probably a pretty common
setup but adding the logic to check for it probably isn't worth it.

Patrick

^ permalink raw reply

* Re: [PATCH v4 1/2] net: phy: DP83822 initial driver submission
From: Andrew F. Davis @ 2017-10-09 19:12 UTC (permalink / raw)
  To: Dan Murphy, andrew, f.fainelli; +Cc: netdev, Woojung.Huh
In-Reply-To: <20171009165934.12671-1-dmurphy@ti.com>

On 10/09/2017 11:59 AM, Dan Murphy wrote:
> Add support for the TI  DP83822 10/100Mbit ethernet phy.
> 
> The DP83822 provides flexibility to connect to a MAC through a
> standard MII, RMII or RGMII interface.
> 
> In addition the DP83822 needs to be removed from the DP83848 driver
> as the WoL support is added here for this device.
> 
> Datasheet:
> http://www.ti.com/product/DP83822I/datasheet
> 
> Signed-off-by: Dan Murphy <dmurphy@ti.com>> ---
> 
> v4 - Squash DP83822 removal patch into this patch -
> https://www.mail-archive.com/netdev@vger.kernel.org/msg192424.html
> 
> v3 - Fixed WoL indication bit and removed mutex for suspend/resume - 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg191891.html and
> https://www.mail-archive.com/netdev@vger.kernel.org/msg191665.html
> 
> v2 - Updated per comments.  Removed unnessary parantheis, called genphy_suspend/
> resume routines and then performing WoL changes, reworked sopass storage and reduced
> the number of phy reads, and moved WOL_SECURE_ON - 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg191392.html
> 
>  drivers/net/phy/Kconfig   |   5 +
>  drivers/net/phy/Makefile  |   1 +
>  drivers/net/phy/dp83822.c | 302 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/net/phy/dp83848.c |   3 -
>  4 files changed, 308 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/net/phy/dp83822.c
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index cd931cf9dcc2..8e78a482e09e 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -277,6 +277,11 @@ config DAVICOM_PHY
>  	---help---
>  	  Currently supports dm9161e and dm9131
>  
> +config DP83822_PHY
> +	tristate "Texas Instruments DP83822 PHY"
> +	---help---
> +	  Supports the DP83822 PHY.
> +
>  config DP83848_PHY
>  	tristate "Texas Instruments DP83848 PHY"
>  	---help---
> diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
> index 416df92fbf4f..df3b82ba8550 100644
> --- a/drivers/net/phy/Makefile
> +++ b/drivers/net/phy/Makefile
> @@ -55,6 +55,7 @@ obj-$(CONFIG_CICADA_PHY)	+= cicada.o
>  obj-$(CONFIG_CORTINA_PHY)	+= cortina.o
>  obj-$(CONFIG_DAVICOM_PHY)	+= davicom.o
>  obj-$(CONFIG_DP83640_PHY)	+= dp83640.o
> +obj-$(CONFIG_DP83822_PHY)	+= dp83822.o
>  obj-$(CONFIG_DP83848_PHY)	+= dp83848.o
>  obj-$(CONFIG_DP83867_PHY)	+= dp83867.o
>  obj-$(CONFIG_FIXED_PHY)		+= fixed_phy.o
> diff --git a/drivers/net/phy/dp83822.c b/drivers/net/phy/dp83822.c
> new file mode 100644
> index 000000000000..de196dbc46cd
> --- /dev/null
> +++ b/drivers/net/phy/dp83822.c
> @@ -0,0 +1,302 @@
> +/*
> + * Driver for the Texas Instruments DP83822 PHY
> + *
> + * Copyright (C) 2017 Texas Instruments Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/kernel.h>
> +#include <linux/mii.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/phy.h>
> +#include <linux/netdevice.h>
> +
> +#define DP83822_PHY_ID	        0x2000a240
> +#define DP83822_DEVADDR		0x1f
> +
> +#define MII_DP83822_MISR1	0x12
> +#define MII_DP83822_MISR2	0x13
> +#define MII_DP83822_RESET_CTRL	0x1f
> +
> +#define DP83822_HW_RESET	BIT(15)
> +#define DP83822_SW_RESET	BIT(14)
> +
> +/* MISR1 bits */
> +#define DP83822_RX_ERR_HF_INT_EN	BIT(0)
> +#define DP83822_FALSE_CARRIER_HF_INT_EN	BIT(1)
> +#define DP83822_ANEG_COMPLETE_INT_EN	BIT(2)
> +#define DP83822_DUP_MODE_CHANGE_INT_EN	BIT(3)
> +#define DP83822_SPEED_CHANGED_INT_EN	BIT(4)
> +#define DP83822_LINK_STAT_INT_EN	BIT(5)
> +#define DP83822_ENERGY_DET_INT_EN	BIT(6)
> +#define DP83822_LINK_QUAL_INT_EN	BIT(7)
> +
> +/* MISR2 bits */
> +#define DP83822_JABBER_DET_INT_EN	BIT(0)
> +#define DP83822_WOL_PKT_INT_EN		BIT(1)
> +#define DP83822_SLEEP_MODE_INT_EN	BIT(2)
> +#define DP83822_MDI_XOVER_INT_EN	BIT(3)
> +#define DP83822_LB_FIFO_INT_EN		BIT(4)
> +#define DP83822_PAGE_RX_INT_EN		BIT(5)
> +#define DP83822_ANEG_ERR_INT_EN		BIT(6)
> +#define DP83822_EEE_ERROR_CHANGE_INT_EN	BIT(7)
> +
> +/* INT_STAT1 bits */
> +#define DP83822_WOL_INT_EN	BIT(4)
> +#define DP83822_WOL_INT_STAT	BIT(12)
> +
> +#define MII_DP83822_RXSOP1	0x04a5
> +#define	MII_DP83822_RXSOP2	0x04a6
> +#define	MII_DP83822_RXSOP3	0x04a7
> +
> +/* WoL Registers */
> +#define	MII_DP83822_WOL_CFG	0x04a0
> +#define	MII_DP83822_WOL_STAT	0x04a1
> +#define	MII_DP83822_WOL_DA1	0x04a2
> +#define	MII_DP83822_WOL_DA2	0x04a3
> +#define	MII_DP83822_WOL_DA3	0x04a4
> +
> +/* WoL bits */
> +#define DP83822_WOL_MAGIC_EN	BIT(1)

Datasheet seems to indicate MAGIC_EN is bit 0, not 1.

> +#define DP83822_WOL_SECURE_ON	BIT(5)
> +#define DP83822_WOL_EN		BIT(7)
> +#define DP83822_WOL_INDICATION_SEL BIT(8)
> +#define DP83822_WOL_CLR_INDICATION BIT(11)
> +
> +static int dp83822_ack_interrupt(struct phy_device *phydev)
> +{
> +	int err = phy_read(phydev, MII_DP83822_MISR1);
> +
> +	if (err < 0)
> +		return err;
> +

The above could also be written:

int err;

err = phy_read(phydev, MII_DP83822_MISR1);
if (err < 0)
	return err;

This matches the below better and is more clear to me.

> +	err = phy_read(phydev, MII_DP83822_MISR2);
> +	if (err < 0)
> +		return err;
> +
> +	return 0;
> +}
> +
> +static int dp83822_set_wol(struct phy_device *phydev,
> +			   struct ethtool_wolinfo *wol)
> +{
> +	struct net_device *ndev = phydev->attached_dev;
> +	u16 value;
> +	const u8 *mac;
> +
> +	if (wol->wolopts & (WAKE_MAGIC | WAKE_MAGICSECURE)) {
> +		mac = (const u8 *)ndev->dev_addr;
> +
> +		if (!is_valid_ether_addr(mac))
> +			return -EINVAL;
> +
> +		/* MAC addresses start with byte 5, but stored in mac[0].
> +		 * 822 PHYs store bytes 4|5, 2|3, 0|1
> +		 */
> +		phy_write_mmd(phydev, DP83822_DEVADDR,
> +			      MII_DP83822_WOL_DA1, (mac[1] << 8) | mac[0]);
> +		phy_write_mmd(phydev, DP83822_DEVADDR,
> +			      MII_DP83822_WOL_DA2, (mac[3] << 8) | mac[2]);
> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_DA3,
> +			      (mac[5] << 8) | mac[4]);

This 'phy_write_mmd' doesn't match the others, 'MII_DP83822_WOL_DAx'
should be on the next line, or all others should be on same.

> +
> +		value = phy_read_mmd(phydev, DP83822_DEVADDR,
> +				     MII_DP83822_WOL_CFG);
> +		if (wol->wolopts & WAKE_MAGIC)
> +			value |= DP83822_WOL_MAGIC_EN;
> +		else
> +			value &= ~DP83822_WOL_MAGIC_EN;
> +
> +		if (wol->wolopts & WAKE_MAGICSECURE) {
> +			phy_write_mmd(phydev, DP83822_DEVADDR,
> +				      MII_DP83822_RXSOP1,
> +				      (wol->sopass[1] << 8) | wol->sopass[0]);
> +			phy_write_mmd(phydev, DP83822_DEVADDR,
> +				      MII_DP83822_RXSOP2,
> +				      (wol->sopass[3] << 8) | wol->sopass[2]);
> +			phy_write_mmd(phydev, DP83822_DEVADDR,
> +				      MII_DP83822_RXSOP3,
> +				      (wol->sopass[5] << 8) | wol->sopass[4]);
> +			value |= DP83822_WOL_SECURE_ON;
> +		} else {
> +			value &= ~DP83822_WOL_SECURE_ON;
> +		}
> +
> +		value |= (DP83822_WOL_EN | DP83822_WOL_INDICATION_SEL |
> +			  DP83822_WOL_CLR_INDICATION);
> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG,
> +			      value);
> +	} else {
> +		value = phy_read_mmd(phydev, DP83822_DEVADDR,
> +				     MII_DP83822_WOL_CFG);
> +		value &= ~DP83822_WOL_EN;
> +		phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG,
> +			      value);
> +	}
> +
> +	return 0;
> +}
> +
> +static void dp83822_get_wol(struct phy_device *phydev,
> +			    struct ethtool_wolinfo *wol)
> +{
> +	int value;
> +	u16 sopass_val;
> +
> +	wol->supported = (WAKE_MAGIC | WAKE_MAGICSECURE);
> +	wol->wolopts = 0;
> +
> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
> +	if (value & DP83822_WOL_MAGIC_EN)
> +		wol->wolopts |= WAKE_MAGIC;
> +
> +	if (~value & DP83822_WOL_CLR_INDICATION)
> +		wol->wolopts = 0;

I'm not sure I understand the logic here, why do we clear all other
wolopts if this is not set?

> +
> +	if (value & DP83822_WOL_SECURE_ON) {
> +		wol->wolopts |= WAKE_MAGICSECURE;
> +	} else {
> +		wol->wolopts &= ~WAKE_MAGICSECURE;

wol->wolopts is set to 0 at the start, and nothing else sets it, why
clear it here?

> +		return;
> +	}
> +
> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP1);
> +	wol->sopass[0] = (sopass_val & 0xff);
> +	wol->sopass[1] = (sopass_val >> 8);
> +
> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP2);
> +	wol->sopass[2] = (sopass_val & 0xff);
> +	wol->sopass[3] = (sopass_val >> 8);
> +
> +	sopass_val = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_RXSOP3);
> +	wol->sopass[4] = (sopass_val & 0xff);
> +	wol->sopass[5] = (sopass_val >> 8);

Why not encase the above password lines in the 'if (value &
DP83822_WOL_SECURE_ON)' block above, then you can drop the whole else block.

> +}
> +
> +static int dp83822_config_intr(struct phy_device *phydev)
> +{
> +	int misr_status;
> +	int err;
> +
> +	if (phydev->interrupts == PHY_INTERRUPT_ENABLED) {
> +		misr_status = phy_read(phydev, MII_DP83822_MISR1);
> +		if (misr_status < 0)
> +			return misr_status;
> +
> +		misr_status |= (DP83822_RX_ERR_HF_INT_EN |
> +				DP83822_FALSE_CARRIER_HF_INT_EN |
> +				DP83822_ANEG_COMPLETE_INT_EN |
> +				DP83822_DUP_MODE_CHANGE_INT_EN |
> +				DP83822_SPEED_CHANGED_INT_EN |
> +				DP83822_LINK_STAT_INT_EN |
> +				DP83822_ENERGY_DET_INT_EN |
> +				DP83822_LINK_QUAL_INT_EN);
> +
> +		err = phy_write(phydev, MII_DP83822_MISR1, misr_status);
> +		if (err < 0)
> +			return err;
> +
> +		misr_status = phy_read(phydev, MII_DP83822_MISR2);
> +		if (misr_status < 0)
> +			return misr_status;
> +
> +		misr_status |= (DP83822_JABBER_DET_INT_EN |
> +				DP83822_WOL_PKT_INT_EN |
> +				DP83822_SLEEP_MODE_INT_EN |
> +				DP83822_MDI_XOVER_INT_EN |
> +				DP83822_LB_FIFO_INT_EN |
> +				DP83822_PAGE_RX_INT_EN |
> +				DP83822_ANEG_ERR_INT_EN |
> +				DP83822_EEE_ERROR_CHANGE_INT_EN);
> +
> +		err = phy_write(phydev, MII_DP83822_MISR2, misr_status);
> +	} else {
> +		err = phy_write(phydev, MII_DP83822_MISR1, 0);

You should only clear the ones you set, I know it is all of them plus
the other registers are read-only, but for clarity you could have a
define with the mask you are using for each register and then ~MASK when
clearing, like the dp83848.c driver.

> +		if (err < 0)
> +			return err;
> +
> +		err = phy_write(phydev, MII_DP83822_MISR1, 0);
> +	}
> +
> +	return err;
> +}
> +
> +static int dp83822_phy_reset(struct phy_device *phydev)
> +{
> +	int err;
> +
> +	err = phy_write(phydev, MII_DP83822_RESET_CTRL, DP83822_HW_RESET);
> +	if (err < 0)
> +		return err;
> +
> +	return 0;
> +}
> +
> +static int dp83822_suspend(struct phy_device *phydev)
> +{
> +	int value;
> +
> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
> +
> +	if (!(value & DP83822_WOL_EN))
> +		genphy_suspend(phydev);
> +
> +	return 0;
> +}
> +
> +static int dp83822_resume(struct phy_device *phydev)
> +{
> +	int value;
> +
> +	genphy_resume(phydev);
> +
> +	value = phy_read_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG);
> +
> +	phy_write_mmd(phydev, DP83822_DEVADDR, MII_DP83822_WOL_CFG, value |
> +		      DP83822_WOL_CLR_INDICATION);
> +
> +

Extra newline.

> +	return 0;
> +}
> +
> +static struct phy_driver dp83822_driver[] = {
> +	{
> +	 .phy_id = DP83822_PHY_ID,
> +	 .phy_id_mask = 0xfffffff0,
> +	 .name = "TI DP83822",
> +	 .features = PHY_BASIC_FEATURES,
> +	 .flags = PHY_HAS_INTERRUPT,
> +	 .config_init = genphy_config_init,
> +	 .soft_reset = dp83822_phy_reset,
> +	 .get_wol = dp83822_get_wol,
> +	 .set_wol = dp83822_set_wol,
> +	 .ack_interrupt = dp83822_ack_interrupt,
> +	 .config_intr = dp83822_config_intr,
> +	 .config_aneg = genphy_config_aneg,
> +	 .read_status = genphy_read_status,
> +	 .suspend = dp83822_suspend,
> +	 .resume = dp83822_resume,
> +	 },

Something is not right about the indenting here, tab then space?

> +};
> +module_phy_driver(dp83822_driver);
> +
> +static struct mdio_device_id __maybe_unused dp83822_tbl[] = {
> +	{ DP83822_PHY_ID, 0xfffffff0 },
> +	{ },
> +};
> +MODULE_DEVICE_TABLE(mdio, dp83822_tbl);
> +
> +MODULE_DESCRIPTION("Texas Instruments DP83822 PHY driver");
> +MODULE_AUTHOR("Dan Murphy <dmurphy@ti.com");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/net/phy/dp83848.c b/drivers/net/phy/dp83848.c
> index 3de4fe4dda77..3966d43c5146 100644
> --- a/drivers/net/phy/dp83848.c
> +++ b/drivers/net/phy/dp83848.c
> @@ -20,7 +20,6 @@
>  #define TI_DP83620_PHY_ID		0x20005ce0
>  #define NS_DP83848C_PHY_ID		0x20005c90
>  #define TLK10X_PHY_ID			0x2000a210
> -#define TI_DP83822_PHY_ID		0x2000a240
>  
>  /* Registers */
>  #define DP83848_MICR			0x11 /* MII Interrupt Control Register */
> @@ -80,7 +79,6 @@ static struct mdio_device_id __maybe_unused dp83848_tbl[] = {
>  	{ NS_DP83848C_PHY_ID, 0xfffffff0 },
>  	{ TI_DP83620_PHY_ID, 0xfffffff0 },
>  	{ TLK10X_PHY_ID, 0xfffffff0 },
> -	{ TI_DP83822_PHY_ID, 0xfffffff0 },
>  	{ }
>  };
>  MODULE_DEVICE_TABLE(mdio, dp83848_tbl);
> @@ -110,7 +108,6 @@ static struct phy_driver dp83848_driver[] = {
>  	DP83848_PHY_DRIVER(NS_DP83848C_PHY_ID, "NS DP83848C 10/100 Mbps PHY"),
>  	DP83848_PHY_DRIVER(TI_DP83620_PHY_ID, "TI DP83620 10/100 Mbps PHY"),
>  	DP83848_PHY_DRIVER(TLK10X_PHY_ID, "TI TLK10X 10/100 Mbps PHY"),
> -	DP83848_PHY_DRIVER(TI_DP83822_PHY_ID, "TI DP83822 10/100 Mbps PHY"),
>  };
>  module_phy_driver(dp83848_driver);
>  
> 

^ permalink raw reply

* Re: [PATCH v1 RFC 1/7] Replace license with GPL
From: Florian Fainelli @ 2017-10-09 19:40 UTC (permalink / raw)
  To: Tristram.Ha, David.Laight
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel, andrew, pavel, ruediger.schmitt
In-Reply-To: <93AF473E2DA327428DE3D46B72B1E9FD4112EE49@CHN-SV-EXMX02.mchp-main.com>

On 10/09/2017 11:40 AM, Tristram.Ha@microchip.com wrote:
>> From: Tristram.Ha@microchip.com
>>> Sent: 06 October 2017 21:33
>>> Replace license with GPL.
>>
>> Don't you need permission from all the people who have updated
>> the files in order to make this change?
>>
>> 	David
> 
> I am a little confused by your comment.  The 4 original KSZ9477 DSA
> driver files were written by Woojung at Microchip Technology Inc.
> There was a complaint the "AS IS" license is not exactly GPL.
> 
> It should be submitted formally to net-next instead of a RFC, but it
> is probably pointless to do that when there is no code change.
> 
> I am hoping these drastic changes of KSZ9477 driver can be accepted
> so that the patches can be submitted formally to net-next.

What David is saying is that you need to take into account every one who
has been making changes to the Microchip driver, because anytime
someones submits a patch that gets accepted, their "voice" now counts:

Fortunately, the list seems to be fairly limited:

git shortlog -sne drivers/net/dsa/microchip
     5  Arkadi Sharshevsky <arkadis@mellanox.com>
     1  Woojung Huh <Woojung.Huh@microchip.com>

So seeking contributor approval for a license change should be fairly
simple.
-- 
Florian

^ permalink raw reply

* [PATCH] wimax/i2400m: Remove VLAIS
From: Matthias Kaehlcke @ 2017-10-09 19:41 UTC (permalink / raw)
  To: Inaky Perez-Gonzalez
  Cc: linux-wimax, netdev, linux-kernel, Behan Webster, Mark Charlebois,
	Arnd Bergmann, Guenter Roeck, Matthias Kaehlcke

From: Behan Webster <behanw@converseincode.com>

Convert Variable Length Array in Struct (VLAIS) to valid C by converting
local struct definition to use a flexible array. The structure is only
used to define a cast of a buffer so the size of the struct is not used
to allocate storage.

Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Mark Charebois <charlebm@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
---
 drivers/net/wimax/i2400m/fw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wimax/i2400m/fw.c b/drivers/net/wimax/i2400m/fw.c
index c9c711dcd0e6..a89b5685e68b 100644
--- a/drivers/net/wimax/i2400m/fw.c
+++ b/drivers/net/wimax/i2400m/fw.c
@@ -652,7 +652,7 @@ static int i2400m_download_chunk(struct i2400m *i2400m, const void *chunk,
 	struct device *dev = i2400m_dev(i2400m);
 	struct {
 		struct i2400m_bootrom_header cmd;
-		u8 cmd_payload[chunk_len];
+		u8 cmd_payload[];
 	} __packed *buf;
 	struct i2400m_bootrom_header ack;
 
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related

* Re: linux-next: manual merge of the drivers-x86 tree with the net-next tree
From: Mika Westerberg @ 2017-10-09 19:43 UTC (permalink / raw)
  To: Mark Brown
  Cc: Darren Hart, Mario Limonciello, Yehezkel Bernat, Andy Shevchenko,
	Amir Levy, Michael Jamet, David S. Miller, netdev,
	Linux-Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20171009175633.dm7hjmgoxusspue5@sirena.co.uk>

On Mon, Oct 09, 2017 at 06:56:33PM +0100, Mark Brown wrote:
>  +Networking over Thunderbolt cable
>  +---------------------------------
>  +Thunderbolt technology allows software communication across two hosts
>  +connected by a Thunderbolt cable.
>  +
>  +It is possible to tunnel any kind of traffic over Thunderbolt link but
>  +currently we only support Apple ThunderboltIP protocol.
>  +
>  +If the other host is running Windows or macOS only thing you need to
>  +do is to connect Thunderbolt cable between the two hosts, the
>  +``thunderbolt-net`` is loaded automatically. If the other host is also
>  +Linux you should load ``thunderbolt-net`` manually on one host (it does
>  +not matter which one)::
>  +
>  +  # modprobe thunderbolt-net
>  +
>  +This triggers module load on the other host automatically. If the driver
>  +is built-in to the kernel image, there is no need to do anything.
>  +
>  +The driver will create one virtual ethernet interface per Thunderbolt
>  +port which are named like ``thunderbolt0`` and so on. From this point
>  +you can either use standard userspace tools like ``ifconfig`` to
>  +configure the interface or let your GUI to handle it automatically.
> ++
> + Forcing power
> + -------------
> + Many OEMs include a method that can be used to force the power of a
> + thunderbolt controller to an "On" state even if nothing is connected.
> + If supported by your machine this will be exposed by the WMI bus with
> + a sysfs attribute called "force_power".
> + 
> + For example the intel-wmi-thunderbolt driver exposes this attribute in:
> +   /sys/devices/platform/PNP0C14:00/wmi_bus/wmi_bus-PNP0C14:00/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
> + 
> +   To force the power to on, write 1 to this attribute file.
> +   To disable force power, write 0 to this attribute file.
> + 
> + Note: it's currently not possible to query the force power state of a platform.

If possible, I would rather move this chapter to be before "Networking
over Thunderbolt cable". Reason is that it then follows NVM flashing
chapter which is typically where you need to force power in the first
place.

^ permalink raw reply

* Re: [PATCH v1 RFC 1/7] Replace license with GPL
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-2-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 390 bytes --]

On Fri 2017-10-06 13:32:59, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Replace license with GPL.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Acked-by: Pavel Machek <pavel@ucw.cz>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 2/7] Clean up code according to patch check suggestions
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-3-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 421 bytes --]

On Fri 2017-10-06 13:33:00, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Clean up code according to patch check suggestions.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>


Reviewed-by: Pavel Machek <pavel@ucw.cz>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 3/7] Rename some functions with ksz9477 prefix
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-4-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 461 bytes --]

On Fri 2017-10-06 13:33:01, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Rename some functions with ksz9477 prefix to separate chip specific code
> from common code.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Pavel Machek <pavel@ucw.cz>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 4/7] Rename ksz_spi.c to ksz9477_spi.c
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-5-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 550 bytes --]

On Fri 2017-10-06 13:33:02, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to add
> more KSZ switch drivers.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

This will ask people question they already answered, but I guess
that's ok.

Reviewed-by: Pavel Machek <pavel@ucw.cz>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 5/7] Break KSZ9477 DSA driver into two files
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-6-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 634 bytes --]

On Fri 2017-10-06 13:33:03, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Break KSZ9477 DSA driver into two files in preparation to add more KSZ
> switch drivers.
> Add common functions in ksz_common.h so that other KSZ switch drivers
> can access code in ksz_common.c.
> Add ksz_spi.h for common functions used by KSZ switch SPI drivers.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Pavel Machek <pavel@ucw.cz>

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 6/7] Add MIB counter reading support
From: Pavel Machek @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Andrew Lunn, Florian Fainelli, Ruediger Schmitt, muvarov,
	nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver, netdev,
	linux-kernel
In-Reply-To: <1507321985-15097-7-git-send-email-Tristram.Ha@microchip.com>

[-- Attachment #1: Type: text/plain, Size: 1998 bytes --]

Hi!

> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Add MIB counter reading support.
> Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
> name is always KSZ####.
> Header file ksz_priv.h no longer contains any chip specific data.

Nothing obviously wrong here, but if you are doing another iteration,
it would be nice to separate "Rename ksz_9477_reg.h to ksz9477_reg.h"
from the code changes.

Best regards,

> +	timeout = 1000;
> +	do {
> +		ksz_pread32(dev, port, REG_PORT_MIB_CTRL_STAT__4,
> +			    &data);
> +		usleep_range(1, 10);
> +		if (!(data & MIB_COUNTER_READ))
> +			break;
> +	} while (timeout-- > 0);

1000 iterations, 1usec each, so 1msec, but you allow up to 10 msec. Interesting.

> +	/* failed to read MIB. get out of loop */
> +	if (!timeout) {
> +		dev_dbg(dev->dev, "Failed to get MIB\n");
> +		return;
> +	}

Are you sure this works? AFAICT "timeout" will underflow, so !timeout
will not trigger.


> -static void ksz_get_ethtool_stats(struct dsa_switch *ds, int port,
> -				  uint64_t *buf)
> -{
> -	struct ksz_device *dev = ds->priv;
> -	int i;
> -	u32 data;
> -	int timeout;
> -
> -	mutex_lock(&dev->stats_mutex);
> -
> -	for (i = 0; i < TOTAL_SWITCH_COUNTER_NUM; i++) {
> -		data = MIB_COUNTER_READ;
> -		data |= ((mib_names[i].index & 0xFF) << MIB_COUNTER_INDEX_S);
> -		ksz_pwrite32(dev, port, REG_PORT_MIB_CTRL_STAT__4, data);
> -
> -		timeout = 1000;
> -		do {
> -			ksz_pread32(dev, port, REG_PORT_MIB_CTRL_STAT__4,
> -				    &data);
> -			usleep_range(1, 10);
> -			if (!(data & MIB_COUNTER_READ))
> -				break;
> -		} while (timeout-- > 0);
> -
> -		/* failed to read MIB. get out of loop */
> -		if (!timeout) {
> -			dev_dbg(dev->dev, "Failed to get MIB\n");
> -			break;
> -		}

Hmm. Bug was there already...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 2/7] Clean up code according to patch check suggestions
From: Florian Fainelli @ 2017-10-09 19:54 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-3-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:33 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Clean up code according to patch check suggestions.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH v1 RFC 3/7] Rename some functions with ksz9477 prefix
From: Florian Fainelli @ 2017-10-09 19:56 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-4-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:33 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Rename some functions with ksz9477 prefix to separate chip specific code
> from common code.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: linux-next: manual merge of the drivers-x86 tree with the net-next tree
From: Mark Brown @ 2017-10-09 19:56 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Darren Hart, Mario Limonciello, Yehezkel Bernat, Andy Shevchenko,
	Amir Levy, Michael Jamet, David S. Miller, netdev,
	Linux-Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20171009194301.GP2761@lahna.fi.intel.com>

[-- Attachment #1: Type: text/plain, Size: 390 bytes --]

On Mon, Oct 09, 2017 at 10:43:01PM +0300, Mika Westerberg wrote:

> If possible, I would rather move this chapter to be before "Networking
> over Thunderbolt cable". Reason is that it then follows NVM flashing
> chapter which is typically where you need to force power in the first
> place.

I guess that's something best sorted out either in the relevant trees or
during the merge window?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v1 RFC 4/7] Rename ksz_spi.c to ksz9477_spi.c
From: Florian Fainelli @ 2017-10-09 19:56 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-5-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:33 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to add
> more KSZ switch drivers.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH v1 RFC 0/7] Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
From: Florian Fainelli @ 2017-10-09 19:58 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-1-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:32 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
> 
> There are several steps to accomplish this achievement.  First is to
> rename some function names with a prefix to indicate chip specific
> function.  Second is to move common code into header that can be shared.
> Last is to modify tag_ksz.c so that it can handle many tail tag formats
> used by different KSZ switch drivers.
> 
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Most of this looks fine, and it is probably time to move away from the
RFC state and do a formal patch submission with the Reviewed-by and
Acked-by tags you just collected.

Can you also make sure you prefix your patches with: net: dsa: to be
consistent with other submissions done to that subsystem, e.g:

net: dsa: microchip: Replace license with GPL

Thank you!

> 
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
> - Simplify MIB counter reading code
> - Switch driver code is not accessed from tag_ksz.c
> 
> Tristram Ha (7):
>   Replace license with GPL.
>   Clean up code according to patch check suggestions.
>   Rename some functions with ksz9477 prefix to separate chip specific
>     code from common code.
>   Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to
>     add more KSZ switch drivers.
>   Break KSZ9477 DSA driver into two files in preparation to add more KSZ
>     switch drivers.  Add common functions in ksz_common.h so that other
>     KSZ switch drivers can access code in ksz_common.c.  Add ksz_spi.h
>     for common functions used by KSZ switch SPI drivers.
>   Add MIB counter reading support.  Rename ksz_9477_reg.h to
>     ksz9477_reg.h for consistency as the product name is always KSZ####.
>     Header file ksz_priv.h no longer contains any chip specific data.
>   Modify tag_ksz.c so that tail tag code can be used by other KSZ switch
>     drivers.
> 
>  drivers/net/dsa/microchip/Kconfig                  |   14 +-
>  drivers/net/dsa/microchip/Makefile                 |    4 +-
>  drivers/net/dsa/microchip/ksz9477.c                | 1376 ++++++++++++++++++++
>  .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}    |   23 +-
>  drivers/net/dsa/microchip/ksz9477_spi.c            |  188 +++
>  drivers/net/dsa/microchip/ksz_common.c             | 1210 ++++-------------
>  drivers/net/dsa/microchip/ksz_common.h             |  234 ++++
>  drivers/net/dsa/microchip/ksz_priv.h               |  258 ++--
>  drivers/net/dsa/microchip/ksz_spi.c                |  216 ---
>  drivers/net/dsa/microchip/ksz_spi.h                |   82 ++
>  include/net/dsa.h                                  |    2 +-
>  net/dsa/Kconfig                                    |    4 +
>  net/dsa/dsa.c                                      |    4 +-
>  net/dsa/dsa_priv.h                                 |    2 +-
>  net/dsa/tag_ksz.c                                  |  107 +-
>  15 files changed, 2331 insertions(+), 1393 deletions(-)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477.c
>  rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_common.h
>  delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_spi.h
> 


-- 
Florian

^ permalink raw reply

* RE: [PATCH v1 RFC 1/7] Replace license with GPL
From: Woojung.Huh @ 2017-10-09 19:58 UTC (permalink / raw)
  To: Tristram.Ha, andrew, f.fainelli, pavel, ruediger.schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-2-git-send-email-Tristram.Ha@microchip.com>

> Subject: [PATCH v1 RFC 1/7] Replace license with GPL
> 
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Replace license with GPL.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Woojung Huh <Woojung.Huh@microchip.com>

^ permalink raw reply

* Re: [PATCH v1 RFC 5/7] Break KSZ9477 DSA driver into two files
From: Florian Fainelli @ 2017-10-09 20:01 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-6-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:33 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Break KSZ9477 DSA driver into two files in preparation to add more KSZ
> switch drivers.
> Add common functions in ksz_common.h so that other KSZ switch drivers
> can access code in ksz_common.c.
> Add ksz_spi.h for common functions used by KSZ switch SPI drivers.
> 
> Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH v1 RFC 6/7] Add MIB counter reading support
From: Florian Fainelli @ 2017-10-09 20:07 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: muvarov, nathan.leigh.conrad, vivien.didelot, UNGLinuxDriver,
	netdev, linux-kernel
In-Reply-To: <1507321985-15097-7-git-send-email-Tristram.Ha@microchip.com>

On 10/06/2017 01:33 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> Add MIB counter reading support.
> Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
> name is always KSZ####.
> Header file ksz_priv.h no longer contains any chip specific data.

You should probably explain in the commit message that you are adding a
timer that reads counters every 30 seconds to avoid overflows, as this
is a pretty important piece of information.


> +static void ksz9477_phy_setup(struct ksz_device *dev, int port,
> +			      struct phy_device *phy)
> +{
> +	if (port < dev->phy_port_cnt) {
> +		/* SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
> +		 * disable flow control when rate limiting is used.
> +		 */
> +		phy->advertising = phy->supported;
> +	}
> +}

This appears to be an unrelated change here that you would want to put
in a separate patch.

> +
>  static void ksz9477_port_setup(struct ksz_device *dev, int port, bool cpu_port)
>  {
>  	u8 data8;
> @@ -1159,6 +1198,8 @@ static int ksz9477_setup(struct dsa_switch *ds)
>  	/* start switch */
>  	ksz_cfg(dev, REG_SW_OPERATION, SW_START, true);
>  
> +	ksz_init_mib_timer(dev);
> +
>  	return 0;
>  }
>  
> @@ -1168,6 +1209,7 @@ static int ksz9477_setup(struct dsa_switch *ds)
>  	.set_addr		= ksz9477_set_addr,
>  	.phy_read		= ksz9477_phy_read16,
>  	.phy_write		= ksz9477_phy_write16,
> +	.adjust_link		= ksz_adjust_link,
>  	.port_enable		= ksz_enable_port,
>  	.port_disable		= ksz_disable_port,
>  	.get_strings		= ksz9477_get_strings,
> @@ -1289,6 +1331,7 @@ static int ksz9477_switch_init(struct ksz_device *dev)
>  	if (!dev->ports)
>  		return -ENOMEM;
>  	for (i = 0; i < dev->mib_port_cnt; i++) {
> +		mutex_init(&dev->ports[i].mib.cnt_mutex);
>  		dev->ports[i].mib.counters =
>  			devm_kzalloc(dev->dev,
>  				     sizeof(u64) *
> @@ -1310,7 +1353,12 @@ static void ksz9477_switch_exit(struct ksz_device *dev)
>  	.get_port_addr = ksz9477_get_port_addr,
>  	.cfg_port_member = ksz9477_cfg_port_member,
>  	.flush_dyn_mac_table = ksz9477_flush_dyn_mac_table,
> +	.phy_setup = ksz9477_phy_setup,
>  	.port_setup = ksz9477_port_setup,
> +	.r_mib_cnt = ksz9477_r_mib_cnt,
> +	.r_mib_pkt = ksz9477_r_mib_pkt,
> +	.freeze_mib = ksz9477_freeze_mib,
> +	.port_init_cnt = ksz9477_port_init_cnt,
>  	.shutdown = ksz9477_reset_switch,
>  	.detect = ksz9477_switch_detect,
>  	.init = ksz9477_switch_init,
> diff --git a/drivers/net/dsa/microchip/ksz_9477_reg.h b/drivers/net/dsa/microchip/ksz9477_reg.h
> similarity index 100%
> rename from drivers/net/dsa/microchip/ksz_9477_reg.h
> rename to drivers/net/dsa/microchip/ksz9477_reg.h
> diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
> index 1c9c4c5..885b3e3 100644
> --- a/drivers/net/dsa/microchip/ksz_common.c
> +++ b/drivers/net/dsa/microchip/ksz_common.c
> @@ -50,6 +50,81 @@ void ksz_update_port_member(struct ksz_device *dev, int port)
>  	}
>  }
>  
> +static void port_r_cnt(struct ksz_device *dev, int port)
> +{
> +	struct ksz_port_mib *mib = &dev->ports[port].mib;
> +	u64 *dropped;
> +
> +	/* Some ports may not have MIB counters before SWITCH_COUNTER_NUM. */
> +	while (mib->cnt_ptr < dev->reg_mib_cnt) {
> +		dev->dev_ops->r_mib_cnt(dev, port, mib->cnt_ptr,
> +					&mib->counters[mib->cnt_ptr]);
> +		++mib->cnt_ptr;
> +	}
> +
> +	/* last one in storage */
> +	dropped = &mib->counters[dev->mib_cnt];
> +
> +	/* Some ports may not have MIB counters after SWITCH_COUNTER_NUM. */
> +	while (mib->cnt_ptr < dev->mib_cnt) {
> +		dev->dev_ops->r_mib_pkt(dev, port, mib->cnt_ptr,
> +					dropped, &mib->counters[mib->cnt_ptr]);
> +		++mib->cnt_ptr;
> +	}
> +	mib->cnt_ptr = 0;
> +}
> +
> +static void ksz_mib_read_work(struct work_struct *work)
> +{
> +	struct ksz_device *dev =
> +		container_of(work, struct ksz_device, mib_read);
> +	struct ksz_port *p;
> +	struct ksz_port_mib *mib;
> +	int i;
> +
> +	for (i = 0; i < dev->mib_port_cnt; i++) {
> +		p = &dev->ports[i];
> +		if (!p->on)
> +			continue;
> +		mib = &p->mib;
> +		mutex_lock(&mib->cnt_mutex);
> +
> +		/* read only dropped counters when link is not up */
> +		if (p->link_down)
> +			p->link_down = 0;
> +		else if (!p->link_up)
> +			mib->cnt_ptr = dev->reg_mib_cnt;

Can't you just check the ports' PHY device here instead of caching that
(which can get out of sync)?


> +void ksz_adjust_link(struct dsa_switch *ds, int port,
> +		     struct phy_device *phydev)
> +{
> +	struct ksz_device *dev = ds->priv;
> +	struct ksz_port *p = &dev->ports[port];
> +
> +	if (phydev->link) {
> +		p->speed = phydev->speed;
> +		p->duplex = phydev->duplex;
> +		p->flow_ctrl = phydev->pause;
> +		p->link_up = 1;
> +		dev->live_ports |= (1 << port) & dev->on_ports;
> +	} else if (p->link_up) {
> +		p->link_up = 0;
> +		p->link_down = 1;
> +		dev->live_ports &= ~(1 << port);
> +	}
> +}

It would have been nice to explain why this is needed in the commit
message. dev->live_ports is read-only it seems?
-- 
Florian

^ permalink raw reply

* RE: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set RS bit
From: Keller, Jacob E @ 2017-10-09 20:10 UTC (permalink / raw)
  To: Alexander Duyck, David Laight
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, jogreene@redhat.com
In-Reply-To: <CAKgT0UfZEU++hZkpp1NixYxj=2STfXX-5opz2BoEhG1SfUA5pQ@mail.gmail.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Alexander Duyck
> Sent: Monday, October 09, 2017 9:28 AM
> To: David Laight <David.Laight@aculab.com>
> Cc: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net;
> Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com
> Subject: Re: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set
> RS bit
> 
> On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@aculab.com> wrote:
> > From: Jeff Kirsher
> >> Sent: 06 October 2017 18:57
> >> From: Jacob Keller <jacob.e.keller@intel.com>
> >>
> >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
> >> disable force WB workaround") we've tried to "optimize" setting the
> >> RS bit based around skb->xmit_more. This same logic was refactored
> >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
> >> but ultimately was not functionally changed.
> >
> > I've tried to understand the above, but without definitions of WB
> > and RS and the '4-ness' of something it is quite difficult.
> >
> > I THINK this is all about telling the hardware there is a packet
> > in the ring to transmit?
> >
> > I don't understand the 4-ness. Linux requires that the hardware
> > be notified of a single packet transmit, and that the 'transmit
> > complete' also be notified in 'a timely manner' for a single packet.
> 
> So to clarify some of this. The RS is short for Report Status. It
> tells the hardware that when it has completed the descriptor it should
> trigger a write back, aka WB.
> 
> The 4-ness is because each descriptor is 16 bytes, so if we write back
> once every 4 descriptors that is one 64B cache line which is much
> better for most platforms performance wise.
> 
> > skb->xmit_more ought to be usable to optimise both of these
> > (assuming it isn't a lie).
> 
> Actually it leads to issues if we hold off for too long. If we are
> busy dequeueing a long list, and the Tx queue has gone empty then we
> are stalling Tx without need to do so. We need to be regularly
> notifying the device that it has buffers available to transmit which
> is what xmit_more is good for. However in addition the hardware needs
> to notify us regularly that there are buffers ready to clean which it
> isn't necessarily so useful for, especially if  are filling the entire
> ring and only providing one notification to the driver that there are
> buffers ready since we can defer the Tx interrupt for too long.
> 
> What Jake is trying to fix here is to prevent us from generating what
> looks like a square wave in terms of throughput. Essentially the
> current behavior that is being seen is that all the buffers in the
> ring either belong to software, or belong to hardware. It is best for
> us to have this split half and half so that the hardware has buffers
> to transmit and software has buffers to clean and enqueue new buffers
> onto.
> 

Yes this sums it up pretty well. The test case which found this bug is outlined in the description. Essentially we could see up to dozens or even almost a hundred packets in a row bunched up if we had a bunch of virtual devices layered on top of the network driver. This led to us not indicating to get status from the hardware about finished packets, which results in an increased delay to finish Tx.

> > The driver would need to ensure a ring full of messages isn't
> > generated without either wakeup - but that might be 128 frames.
> 
> That is what is happening here. Basically we are guaranteeing
> writebacks are occurring on a regular basis instead of in large
> bursts.
> 
> > FWIW on the systems I have PCIe writes are relatively cheap
> > (reads are slow). So different counts would be appropriate
> > for delaying doorbell writes and requesting completion interrupts.
> >
> >         David
> 
> The doorbell writes are still being delayed the same amount with this
> patch. The only thing that was split out was the device write backs
> are now occurring on a more regular basis instead of being held off
> until a much larger set is handled.
> 

Exactly. The tail bumps are still functioning the same, but it's the hardware report status requests which are not delayed. Note that there's some further quirks in how our hardware performs write backs which exacerbate the problem because of how the cachelines are setup so that the delay is even worse than we would expect.

Thanks,
Jake

> - Alex

^ permalink raw reply

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
From: Phil Sutter @ 2017-10-09 20:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Hangbin Liu, netdev, Michal Kubecek, Hangbin Liu
In-Reply-To: <20171002103708.0572704b@xeon-e3>

Hi Stephen,

On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:
> On Thu, 28 Sep 2017 21:33:46 +0800
> Hangbin Liu <haliu@redhat.com> wrote:
> 
> > From: Hangbin Liu <liuhangbin@gmail.com>
> > 
> > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > iplink_get()"). After update, we will not need to double the buffer size
> > every time when VFs number increased.
> > 
> > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > length parameter.
> > 
> > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > answer to avoid overwrite data in nlh, because it may has more info after
> > nlh. also this will avoid nlh buffer not enough issue.
> > 
> > We need to free answer after using.
> > 
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > ---
> 
> Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> Can only those places that need that be targeted?

We could probably do that, by having a buffer on stack in __rtnl_talk()
which will be used instead of the allocated one if 'answer' is NULL. Or
maybe even introduce a dedicated API call for the dynamically allocated
receive buffer. But I really doubt that's feasible: AFAICT, that stack
buffer still needs to be reasonably sized since the reply might be
larger than the request (reusing the request buffer would be the most
simple way to tackle this), also there is support for extack which may
bloat the response to arbitrary size. Hangbin has shown in his benchmark
that the overhead of the second syscall is negligible, so why care about
that and increase code complexity even further?

Not saying it's not possible, but I just doubt it's worth the effort.

Cheers, Phil

^ permalink raw reply

* Re: [PATCH v2] net/core: Fix BUG to BUG_ON conditionals.
From: Levin, Alexander (Sasha Levin) @ 2017-10-09 20:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tim Hansen, davem@davemloft.net, willemb@google.com,
	edumazet@google.com, soheil@google.com, pabeni@redhat.com,
	elena.reshetova@intel.com, tom@quantonium.net, Jason@zx2c4.com,
	fw@strlen.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20171009171540.lgmk2slq3bdrptxk@ast-mbp.dhcp.thefacebook.com>

On Mon, Oct 09, 2017 at 10:15:42AM -0700, Alexei Starovoitov wrote:
>On Mon, Oct 09, 2017 at 11:37:59AM -0400, Tim Hansen wrote:
>> Fix BUG() calls to use BUG_ON(conditional) macros.
>>
>> This was found using make coccicheck M=net/core on linux next
>> tag next-2017092
>>
>> Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
>> ---
>>  net/core/skbuff.c | 15 ++++++---------
>>  1 file changed, 6 insertions(+), 9 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index d98c2e3ce2bf..34ce4c1a0f3c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1350,8 +1350,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
>>  	/* Set the tail pointer and length */
>>  	skb_put(n, skb->len);
>>
>> -	if (skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len))
>> -		BUG();
>> +	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
>
>I'm concerned with this change.
>1. Calling non-trivial bit of code inside the macro is a poor coding style (imo)
>2. BUG_ON != BUG. Some archs like mips and ppc have HAVE_ARCH_BUG_ON and implementation
>of BUG and BUG_ON look quite different.

For these archs, wouldn't it then be more efficient to use BUG_ON rather than BUG()?

-- 

Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] once: switch to new jump label API
From: Eric Biggers @ 2017-10-09 20:27 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, netdev, linux-kernel, Jason Baron,
	Peter Zijlstra, Eric Biggers
In-Reply-To: <20170916040751.GA14238@zzz.localdomain>

On Fri, Sep 15, 2017 at 09:07:51PM -0700, Eric Biggers wrote:
> On Tue, Aug 22, 2017 at 02:44:41PM -0400, Hannes Frederic Sowa wrote:
> > Eric Biggers <ebiggers3@gmail.com> writes:
> > 
> > > From: Eric Biggers <ebiggers@google.com>
> > >
> > > Switch the DO_ONCE() macro from the deprecated jump label API to the new
> > > one.  The new one is more readable, and for DO_ONCE() it also makes the
> > > generated code more icache-friendly: now the one-time initialization
> > > code is placed out-of-line at the jump target, rather than at the inline
> > > fallthrough case.
> > >
> > > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > 
> > Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org.
> > 
> > Thanks!
> 
> Great!  Who though is the maintainer for this code?  It seems it was originally
> taken by David Miller through the networking tree.  David, are you taking
> further patches to the "once" functions, or should I be trying to get this into
> -mm, or somewhere else?
> 
> Eric

Ping.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox