Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
From: Jesper Dangaard Brouer @ 2016-11-14 11:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: jasowang, netdev, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Mathieu Desnoyers
In-Reply-To: <20161111044408.1547.92737.stgit@john-Precision-Tower-5810>


On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> ---
>  include/linux/ptr_ring_ll.h |  136 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/skb_array.h   |   25 ++++++++
>  2 files changed, 161 insertions(+)
>  create mode 100644 include/linux/ptr_ring_ll.h
> 
> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
> new file mode 100644
> index 0000000..bcb11f3
> --- /dev/null
> +++ b/include/linux/ptr_ring_ll.h
> @@ -0,0 +1,136 @@
> +/*
> + *	Definitions for the 'struct ptr_ring_ll' datastructure.
> + *
> + *	Author:
> + *		John Fastabend <john.r.fastabend@intel.com>
[...]
> + *
> + *	This is a limited-size FIFO maintaining pointers in FIFO order, with
> + *	one CPU producing entries and another consuming entries from a FIFO.
> + *	extended from ptr_ring_ll to use cmpxchg over spin lock.

It sounds like this is Single Producer Single Consumer (SPSC)
implementation, but your implementation actually is Multi Producer
Multi Consumer (MPMC) capable.

The implementation looks a lot like my alf_queue[1] implementation:
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

If the primary use-case is one CPU producing and another consuming,
then the normal ptr_ring (skb_array) will actually be faster!

The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
every ring access.  This is achieved by having the checks for full
(__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
array (NULL value).

I actually implemented two micro-benchmarks to measure the difference
between skb_array[2] and alf_queue[3]:
 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
 [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c


> + */
> +
> +#ifndef _LINUX_PTR_RING_LL_H
> +#define _LINUX_PTR_RING_LL_H 1
> +
[...]
> +
> +struct ptr_ring_ll {
> +	u32 prod_size;
> +	u32 prod_mask;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_size;
> +	u32 cons_mask;
> +	u32 cons_head;
> +	u32 cons_tail;
> +
> +	void **queue;
> +};

Your implementation doesn't even split the consumer and producer into
different cachelines (which in practice doesn't help much due to how
the empty/full checks are performed).

> +
> +/* Note: callers invoking this in a loop must use a compiler barrier,
> + * for example cpu_relax(). Callers must hold producer_lock.
> + */
> +static inline int __ptr_ring_ll_produce(struct ptr_ring_ll *r, void *ptr)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +
> +	do {
> +		head = READ_ONCE(r->prod_head);
> +		mask = READ_ONCE(r->prod_mask);
> +		tail = READ_ONCE(r->cons_tail);

Problem occur here, as the producer need to access/read the consumers
tail, to determine if the queue is not already full (slots avail).
Thus, the next "consumer-CPU" will see the cacheline in wrong state
(Modified/Invalid or Shared).

> +
> +		slots = mask + tail - head;
> +		if (slots < 1)
> +			return -ENOMEM;
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->prod_head, head, next);
> +	} while (ret != head);
> +
> +	r->queue[head & mask] = ptr;
> +	smp_wmb();
> +
> +	while (r->prod_tail != head)
> +		cpu_relax();
> +
> +	r->prod_tail = next;
> +	return 0;
> +}
> +
> +static inline void *__ptr_ring_ll_consume(struct ptr_ring_ll *r)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +	void *ptr;
> +
> +	do {
> +		head = READ_ONCE(r->cons_head);
> +		mask = READ_ONCE(r->cons_mask);
> +		tail = READ_ONCE(r->prod_tail);

Like wise the consumer is reading the producer tail (for the empty check).

> +
> +		slots = tail - head;
> +		if (slots < 1)
> +			return ERR_PTR(-ENOMEM);
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->cons_head, head, next);
> +	} while (ret != head);
> +
> +	ptr = r->queue[head & mask];
> +	smp_rmb();
> +
> +	while (r->cons_tail != head)
> +		cpu_relax();
> +
> +	r->cons_tail = next;
> +	return ptr;
> +}
> +
> +static inline void **__ptr_ring_ll_init_queue_alloc(int size, gfp_t gfp)
> +{
> +	return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
> +}
> +
> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
> +{
> +	r->queue = __ptr_ring_init_queue_alloc(size, gfp);
> +	if (!r->queue)
> +		return -ENOMEM;
> +
> +	r->prod_size = r->cons_size = size;
> +	r->prod_mask = r->cons_mask = size - 1;

Shouldn't we have some check like is_power_of_2(size), as this code
looks like it depend on this.

> +	r->prod_tail = r->prod_head = 0;
> +	r->cons_tail = r->prod_tail = 0;
> +
> +	return 0;
> +}
> +
[...]
> +#endif /* _LINUX_PTR_RING_LL_H  */
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> index f4dfade..9b43dfd 100644
> --- a/include/linux/skb_array.h
> +++ b/include/linux/skb_array.h
[...]
>  
> +static inline int skb_array_ll_produce(struct skb_array_ll *a, struct sk_buff *skb)
> +{
> +	return __ptr_ring_ll_produce(&a->ring, skb);
> +}
> +
[...]
>  
> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
> +{
> +	return __ptr_ring_ll_consume(&a->ring);
> +}
> +

Note in the Multi Producer Multi Consumer (MPMC) use-case this type of
queue can be faster than normal ptr_ring.  And in patch2 you implement
bulking, which is where the real benefit shows (in the MPMC case) for
this kind of queue.

What I would really like to see is a lock-free (locked cmpxchg) queue
implementation, what like ptr_ring use the array as empty/full check,
and still (somehow) support bulking.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH] net/phy: add trace events for mdio accesses
From: Uwe Kleine-König @ 2016-11-14 11:03 UTC (permalink / raw)
  To: Florian Fainelli, Steven Rostedt, Ingo Molnar; +Cc: netdev

Make it possible to generate trace events for mdio read and write accesses.

Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
---
 drivers/net/phy/mdio_bus.c  | 15 +++++++++++++++
 include/trace/events/mdio.h | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)
 create mode 100644 include/trace/events/mdio.h

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 09deef4bed09..0f3f207419f6 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -38,6 +38,9 @@
 
 #include <asm/irq.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/mdio.h>
+
 int mdiobus_register_device(struct mdio_device *mdiodev)
 {
 	if (mdiodev->bus->mdio_map[mdiodev->addr])
@@ -461,6 +464,9 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum)
 	retval = bus->read(bus, addr, regnum);
 	mutex_unlock(&bus->mdio_lock);
 
+	if (retval >= 0)
+		trace_mdio_access(bus, 1, addr, regnum, retval);
+
 	return retval;
 }
 EXPORT_SYMBOL(mdiobus_read_nested);
@@ -485,6 +491,9 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
 	retval = bus->read(bus, addr, regnum);
 	mutex_unlock(&bus->mdio_lock);
 
+	if (retval >= 0)
+		trace_mdio_access(bus, 1, addr, regnum, retval);
+
 	return retval;
 }
 EXPORT_SYMBOL(mdiobus_read);
@@ -513,6 +522,9 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
 	err = bus->write(bus, addr, regnum, val);
 	mutex_unlock(&bus->mdio_lock);
 
+	if (!err)
+		trace_mdio_access(bus, 0, addr, regnum, val);
+
 	return err;
 }
 EXPORT_SYMBOL(mdiobus_write_nested);
@@ -538,6 +550,9 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
 	err = bus->write(bus, addr, regnum, val);
 	mutex_unlock(&bus->mdio_lock);
 
+	if (!err)
+		trace_mdio_access(bus, 0, addr, regnum, val);
+
 	return err;
 }
 EXPORT_SYMBOL(mdiobus_write);
diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
new file mode 100644
index 000000000000..dcb2d456a346
--- /dev/null
+++ b/include/trace/events/mdio.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mdio
+
+#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MDIO_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mdio_access,
+
+	TP_PROTO(struct mii_bus *bus, int read,
+		 unsigned addr, unsigned regnum, u16 val),
+
+	TP_ARGS(bus, read, addr, regnum, val),
+
+	TP_STRUCT__entry(
+		__array(char, busid, MII_BUS_ID_SIZE)
+		__field(int, read)
+		__field(unsigned, addr)
+		__field(unsigned, regnum)
+		__field(u16, val)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
+		__entry->read = read;
+		__entry->addr = addr;
+		__entry->regnum = regnum;
+		__entry->val = val;
+	),
+
+	TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
+		  __entry->busid, __entry->read ? "read" : "write",
+		  __entry->addr, __entry->regnum, __entry->val)
+);
+
+#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.10.2

^ permalink raw reply related

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-14 11:02 UTC (permalink / raw)
  To: Giuseppe CAVALLARO, Martin Blumenstingl
  Cc: André Roth, Alexandre Torgue, Johnson Leung, linux-amlogic,
	netdev, afaerber
In-Reply-To: <68696e92-d056-5f29-9e71-72066f0c7673@st.com>

On Mon, 2016-11-14 at 08:47 +0100, Giuseppe CAVALLARO wrote:
> Hello Martin
> 
> On 11/7/2016 6:37 PM, Martin Blumenstingl wrote:
> > 
> > Hi Peppe,
> > 
> > On Mon, Nov 7, 2016 at 11:59 AM, Giuseppe CAVALLARO
> > <peppe.cavallaro@st.com> wrote:
> > > 
> > > In the meantime, I will read again the thread just to see if
> > > there is something I am missing.
> > if you are re-reading this thread: please note that there are two
> > devices in discussion here!
> 
> many thx for the sum :-)
> 
> > 
> > Both are using the Amlogic S905 (GXBB) SoC and both are
> > experiencing
> > the same issue (Gbit TX issues, RX with Gbit speeds and RX/TX with
> > 100Mbit speed are NOT affected):
> > - Odroid-C2 (used by Jerome and André Roth)
> > - Tronsmart Vega S95 Meta (my device)
> > 
> > The (Gbit TX) problem seems to be gone on the Odroid-C2 with
> > Jerome's
> > patch which disables EEE in drivers/net/phy/realtek.c (at least in
> > his
> > tests, I don't have that device so I can't verify).
> > The same problem still appears on my Tronsmart Vega S95 Meta even
> > with
> > the patched PHY driver.
> 
> just an doubt, maybe useful, in the past, on GiGa setup I saw similar
> problems and it was due to retiming so maybe 2ns could be necessary
> (or better granularity via PAD logic if available).
> 
> Regards
> Peppe

Peppe, Martin,

With Andre's feedback, I think we can confirm that disabling EEE solve
the problem for the OdroidC2 design.

We do have the same results as Martin on MXQ-Pro based designs. For
these particular boards, disabling EEE does not seems to enough to get
a stable Tx path in 1000Base-T.

I will submit the patch for the Odroidc2 later today.
For the Vega, you should probably check the Tx delay as Peppe suggests.
To do these tests, It would probably be better to disable EEE as well.
Do you want me to include this change for the vega in the patch ?

Cheers
Jerome 

> 
> > 
> > Unfortunately I don't have a second device to rule out that my
> > Tronsmart Vega S95 Meta could be broken (not unlikely, I get DDR
> > errors from time to time in u-boot). Maybe Andreas Faerber can test
> > ethernet with and without Jerome's patch on one of his Tronsmart
> > devices.
> > 
> > 
> > Regards,
> > Martin
> > 
> 

^ permalink raw reply

* [PATCH 1/5] net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, linux-arm-kernel, Radha Mohan Chintakuntla,
	Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>

From: Radha Mohan Chintakuntla <rchintakuntla@cavium.com>

This patch fixes the 'bgx_id' determination on 83xx where there are
4 BGX blocks instead of 2 on other platforms.

Signed-off-by: Radha Mohan Chintakuntla <rchintakuntla@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++--
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 8bbaedb..050e21f 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -1242,8 +1242,8 @@ static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	pci_read_config_word(pdev, PCI_DEVICE_ID, &sdevid);
 	if (sdevid != PCI_DEVICE_ID_THUNDER_RGX) {
-		bgx->bgx_id =
-		    (pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM) >> 24) & 1;
+		bgx->bgx_id = (pci_resource_start(pdev,
+			PCI_CFG_REG_BAR_NUM) >> 24) & BGX_ID_MASK;
 		bgx->bgx_id += nic_get_node_id(pdev) * MAX_BGX_PER_NODE;
 		bgx->max_lmac = MAX_LMAC_PER_BGX;
 		bgx_vnic[bgx->bgx_id] = bgx;
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index d59c71e..01cc7c8 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -28,6 +28,8 @@
 #define    MAX_DMAC_PER_LMAC			8
 #define    MAX_FRAME_SIZE			9216
 
+#define	   BGX_ID_MASK				0x3
+
 #define    MAX_DMAC_PER_LMAC_TNS_BYPASS_MODE	2
 
 /* Registers */
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/5] net: thunderx: Miscellaneous fixes
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham

From: Sunil Goutham <sgoutham@cavium.com>

This patchset includes fixes for incorrect LMAC credits,
unreliable driver statistics, memory leak upon interface
down e.t.c

Radha Mohan Chintakuntla (1):
  net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id

Sunil Goutham (4):
  net: thunderx: Program LMAC credits based on MTU
  net: thunderx: Fix configuration of L3/L4 length checking
  net: thunderx: Fix VF driver's interface statistics
  net: thunderx: Fix memory leak and other issues upon interface toggle

 drivers/net/ethernet/cavium/thunder/nic.h          |  64 +++++----
 drivers/net/ethernet/cavium/thunder/nic_main.c     |  37 +++--
 drivers/net/ethernet/cavium/thunder/nic_reg.h      |   1 +
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c    | 105 +++++++-------
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 153 +++++++++++----------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 118 +++++++++-------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |  24 +---
 drivers/net/ethernet/cavium/thunder/thunder_bgx.c  |   4 +-
 drivers/net/ethernet/cavium/thunder/thunder_bgx.h  |   2 +
 9 files changed, 274 insertions(+), 234 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH 5/5] net: thunderx: Fix memory leak and other issues upon interface toggle
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev; +Cc: Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>

From: Sunil Goutham <sgoutham@cavium.com>

This patch fixes the following
1. When interface is being teardown and queues are being cleaned up,
   free pending SKBs that are in SQ which are either not transmitted
   or freed as NAPI is disabled by that time.
2. While interface initialization, delay CFG_DONE notification till
   the end to avoid corner cases where TXQs are enabled but CQ
   interrupts are not which results blocking transmission and kicking
   off watchdog.
3. Check for IFF_UP while re-enabling RBDR interrupts from tasklet.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 11 +++++------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 14 +++++++++++++-
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 9dc79c05..8a37012 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -473,9 +473,6 @@ int nicvf_set_real_num_queues(struct net_device *netdev,
 static int nicvf_init_resources(struct nicvf *nic)
 {
 	int err;
-	union nic_mbx mbx = {};
-
-	mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
 
 	/* Enable Qset */
 	nicvf_qset_config(nic, true);
@@ -488,9 +485,6 @@ static int nicvf_init_resources(struct nicvf *nic)
 		return err;
 	}
 
-	/* Send VF config done msg to PF */
-	nicvf_write_to_mbx(nic, &mbx);
-
 	return 0;
 }
 
@@ -1184,6 +1178,7 @@ int nicvf_open(struct net_device *netdev)
 	struct nicvf *nic = netdev_priv(netdev);
 	struct queue_set *qs = nic->qs;
 	struct nicvf_cq_poll *cq_poll = NULL;
+	union nic_mbx mbx = {};
 
 	netif_carrier_off(netdev);
 
@@ -1271,6 +1266,10 @@ int nicvf_open(struct net_device *netdev)
 	for (qidx = 0; qidx < qs->rbdr_cnt; qidx++)
 		nicvf_enable_intr(nic, NICVF_INTR_RBDR, qidx);
 
+	/* Send VF config done msg to PF */
+	mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
+	nicvf_write_to_mbx(nic, &mbx);
+
 	return 0;
 cleanup:
 	nicvf_disable_intr(nic, NICVF_INTR_MBOX, 0);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 86726ab..43ccc9d 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -271,7 +271,8 @@ static void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp)
 			      rbdr_idx, new_rb);
 next_rbdr:
 	/* Re-enable RBDR interrupts only if buffer allocation is success */
-	if (!nic->rb_alloc_fail && rbdr->enable)
+	if (!nic->rb_alloc_fail && rbdr->enable &&
+	    netif_running(nic->pnicvf->netdev))
 		nicvf_enable_intr(nic, NICVF_INTR_RBDR, rbdr_idx);
 
 	if (rbdr_idx)
@@ -362,6 +363,8 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
 
 static void nicvf_free_snd_queue(struct nicvf *nic, struct snd_queue *sq)
 {
+	struct sk_buff *skb;
+
 	if (!sq)
 		return;
 	if (!sq->dmem.base)
@@ -372,6 +375,15 @@ static void nicvf_free_snd_queue(struct nicvf *nic, struct snd_queue *sq)
 				  sq->dmem.q_len * TSO_HEADER_SIZE,
 				  sq->tso_hdrs, sq->tso_hdrs_phys);
 
+	/* Free pending skbs in the queue */
+	smp_rmb();
+	while (sq->head != sq->tail) {
+		skb = (struct sk_buff *)sq->skbuff[sq->head];
+		if (skb)
+			dev_kfree_skb_any(skb);
+		sq->head++;
+		sq->head &= (sq->dmem.q_len - 1);
+	}
 	kfree(sq->skbuff);
 	nicvf_free_q_desc_mem(nic, &sq->dmem);
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH 4/5] net: thunderx: Fix VF driver's interface statistics
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>

From: Sunil Goutham <sgoutham@cavium.com>

This patch fixes multiple issues
1. Convert all driver statistics to percpu counters for accuracy.
2. To avoid multiple CQEs posted by a TSO packet appended to HW,
   TSO pkt's SQE has 'post_cqe' not set but a dummy SQE is added
   for getting HW transmit completion notification. This dummy
   SQE has 'dont_send' set and HW drops the pkt pointed to in this
   thus Tx drop counter increases. This patch fixes this by subtracting
   SW tx tso counter from HW Tx drop counter for actual packet drop counter.
3. Reset all individual queue's and VNIC HW stats when interface is going down.
4. Getrid off unnecessary counters in hot path.
5. Bringout all CQE error stats i.e both Rx and Tx.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nic.h          |  61 +++++++-----
 drivers/net/ethernet/cavium/thunder/nic_main.c     |   1 +
 .../net/ethernet/cavium/thunder/nicvf_ethtool.c    | 105 +++++++++++---------
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 106 +++++++++++----------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  96 +++++++++----------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |  24 +----
 6 files changed, 197 insertions(+), 196 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index cd2d379..86bd93c 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -178,11 +178,11 @@ enum tx_stats_reg_offset {
 
 struct nicvf_hw_stats {
 	u64 rx_bytes;
+	u64 rx_frames;
 	u64 rx_ucast_frames;
 	u64 rx_bcast_frames;
 	u64 rx_mcast_frames;
-	u64 rx_fcs_errors;
-	u64 rx_l2_errors;
+	u64 rx_drops;
 	u64 rx_drop_red;
 	u64 rx_drop_red_bytes;
 	u64 rx_drop_overrun;
@@ -191,6 +191,19 @@ struct nicvf_hw_stats {
 	u64 rx_drop_mcast;
 	u64 rx_drop_l3_bcast;
 	u64 rx_drop_l3_mcast;
+	u64 rx_fcs_errors;
+	u64 rx_l2_errors;
+
+	u64 tx_bytes;
+	u64 tx_frames;
+	u64 tx_ucast_frames;
+	u64 tx_bcast_frames;
+	u64 tx_mcast_frames;
+	u64 tx_drops;
+};
+
+struct nicvf_drv_stats {
+	/* CQE Rx errs */
 	u64 rx_bgx_truncated_pkts;
 	u64 rx_jabber_errs;
 	u64 rx_fcs_errs;
@@ -216,34 +229,30 @@ struct nicvf_hw_stats {
 	u64 rx_l4_pclp;
 	u64 rx_truncated_pkts;
 
-	u64 tx_bytes_ok;
-	u64 tx_ucast_frames_ok;
-	u64 tx_bcast_frames_ok;
-	u64 tx_mcast_frames_ok;
-	u64 tx_drops;
-};
-
-struct nicvf_drv_stats {
-	/* Rx */
-	u64 rx_frames_ok;
-	u64 rx_frames_64;
-	u64 rx_frames_127;
-	u64 rx_frames_255;
-	u64 rx_frames_511;
-	u64 rx_frames_1023;
-	u64 rx_frames_1518;
-	u64 rx_frames_jumbo;
-	u64 rx_drops;
-
+	/* CQE Tx errs */
+	u64 tx_desc_fault;
+	u64 tx_hdr_cons_err;
+	u64 tx_subdesc_err;
+	u64 tx_max_size_exceeded;
+	u64 tx_imm_size_oflow;
+	u64 tx_data_seq_err;
+	u64 tx_mem_seq_err;
+	u64 tx_lock_viol;
+	u64 tx_data_fault;
+	u64 tx_tstmp_conflict;
+	u64 tx_tstmp_timeout;
+	u64 tx_mem_fault;
+	u64 tx_csum_overlap;
+	u64 tx_csum_overflow;
+
+	/* driver debug stats */
 	u64 rcv_buffer_alloc_failures;
-
-	/* Tx */
-	u64 tx_frames_ok;
-	u64 tx_drops;
 	u64 tx_tso;
 	u64 tx_timeout;
 	u64 txq_stop;
 	u64 txq_wake;
+
+	struct u64_stats_sync   syncp;
 };
 
 struct nicvf {
@@ -297,7 +306,7 @@ struct nicvf {
 
 	/* Stats */
 	struct nicvf_hw_stats   hw_stats;
-	struct nicvf_drv_stats  drv_stats;
+	struct nicvf_drv_stats  __percpu *drv_stats;
 	struct bgx_stats	bgx_stats;
 
 	/* MSI-X  */
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 85c9e62..6677b96 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -851,6 +851,7 @@ static int nic_reset_stat_counters(struct nicpf *nic,
 			nic_reg_write(nic, reg_addr, 0);
 		}
 	}
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index ad4fddb..432bf6b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -36,11 +36,11 @@ struct nicvf_stat {
 
 static const struct nicvf_stat nicvf_hw_stats[] = {
 	NICVF_HW_STAT(rx_bytes),
+	NICVF_HW_STAT(rx_frames),
 	NICVF_HW_STAT(rx_ucast_frames),
 	NICVF_HW_STAT(rx_bcast_frames),
 	NICVF_HW_STAT(rx_mcast_frames),
-	NICVF_HW_STAT(rx_fcs_errors),
-	NICVF_HW_STAT(rx_l2_errors),
+	NICVF_HW_STAT(rx_drops),
 	NICVF_HW_STAT(rx_drop_red),
 	NICVF_HW_STAT(rx_drop_red_bytes),
 	NICVF_HW_STAT(rx_drop_overrun),
@@ -49,50 +49,59 @@ static const struct nicvf_stat nicvf_hw_stats[] = {
 	NICVF_HW_STAT(rx_drop_mcast),
 	NICVF_HW_STAT(rx_drop_l3_bcast),
 	NICVF_HW_STAT(rx_drop_l3_mcast),
-	NICVF_HW_STAT(rx_bgx_truncated_pkts),
-	NICVF_HW_STAT(rx_jabber_errs),
-	NICVF_HW_STAT(rx_fcs_errs),
-	NICVF_HW_STAT(rx_bgx_errs),
-	NICVF_HW_STAT(rx_prel2_errs),
-	NICVF_HW_STAT(rx_l2_hdr_malformed),
-	NICVF_HW_STAT(rx_oversize),
-	NICVF_HW_STAT(rx_undersize),
-	NICVF_HW_STAT(rx_l2_len_mismatch),
-	NICVF_HW_STAT(rx_l2_pclp),
-	NICVF_HW_STAT(rx_ip_ver_errs),
-	NICVF_HW_STAT(rx_ip_csum_errs),
-	NICVF_HW_STAT(rx_ip_hdr_malformed),
-	NICVF_HW_STAT(rx_ip_payload_malformed),
-	NICVF_HW_STAT(rx_ip_ttl_errs),
-	NICVF_HW_STAT(rx_l3_pclp),
-	NICVF_HW_STAT(rx_l4_malformed),
-	NICVF_HW_STAT(rx_l4_csum_errs),
-	NICVF_HW_STAT(rx_udp_len_errs),
-	NICVF_HW_STAT(rx_l4_port_errs),
-	NICVF_HW_STAT(rx_tcp_flag_errs),
-	NICVF_HW_STAT(rx_tcp_offset_errs),
-	NICVF_HW_STAT(rx_l4_pclp),
-	NICVF_HW_STAT(rx_truncated_pkts),
-	NICVF_HW_STAT(tx_bytes_ok),
-	NICVF_HW_STAT(tx_ucast_frames_ok),
-	NICVF_HW_STAT(tx_bcast_frames_ok),
-	NICVF_HW_STAT(tx_mcast_frames_ok),
+	NICVF_HW_STAT(rx_fcs_errors),
+	NICVF_HW_STAT(rx_l2_errors),
+	NICVF_HW_STAT(tx_bytes),
+	NICVF_HW_STAT(tx_frames),
+	NICVF_HW_STAT(tx_ucast_frames),
+	NICVF_HW_STAT(tx_bcast_frames),
+	NICVF_HW_STAT(tx_mcast_frames),
+	NICVF_HW_STAT(tx_drops),
 };
 
 static const struct nicvf_stat nicvf_drv_stats[] = {
-	NICVF_DRV_STAT(rx_frames_ok),
-	NICVF_DRV_STAT(rx_frames_64),
-	NICVF_DRV_STAT(rx_frames_127),
-	NICVF_DRV_STAT(rx_frames_255),
-	NICVF_DRV_STAT(rx_frames_511),
-	NICVF_DRV_STAT(rx_frames_1023),
-	NICVF_DRV_STAT(rx_frames_1518),
-	NICVF_DRV_STAT(rx_frames_jumbo),
-	NICVF_DRV_STAT(rx_drops),
+	NICVF_DRV_STAT(rx_bgx_truncated_pkts),
+	NICVF_DRV_STAT(rx_jabber_errs),
+	NICVF_DRV_STAT(rx_fcs_errs),
+	NICVF_DRV_STAT(rx_bgx_errs),
+	NICVF_DRV_STAT(rx_prel2_errs),
+	NICVF_DRV_STAT(rx_l2_hdr_malformed),
+	NICVF_DRV_STAT(rx_oversize),
+	NICVF_DRV_STAT(rx_undersize),
+	NICVF_DRV_STAT(rx_l2_len_mismatch),
+	NICVF_DRV_STAT(rx_l2_pclp),
+	NICVF_DRV_STAT(rx_ip_ver_errs),
+	NICVF_DRV_STAT(rx_ip_csum_errs),
+	NICVF_DRV_STAT(rx_ip_hdr_malformed),
+	NICVF_DRV_STAT(rx_ip_payload_malformed),
+	NICVF_DRV_STAT(rx_ip_ttl_errs),
+	NICVF_DRV_STAT(rx_l3_pclp),
+	NICVF_DRV_STAT(rx_l4_malformed),
+	NICVF_DRV_STAT(rx_l4_csum_errs),
+	NICVF_DRV_STAT(rx_udp_len_errs),
+	NICVF_DRV_STAT(rx_l4_port_errs),
+	NICVF_DRV_STAT(rx_tcp_flag_errs),
+	NICVF_DRV_STAT(rx_tcp_offset_errs),
+	NICVF_DRV_STAT(rx_l4_pclp),
+	NICVF_DRV_STAT(rx_truncated_pkts),
+
+	NICVF_DRV_STAT(tx_desc_fault),
+	NICVF_DRV_STAT(tx_hdr_cons_err),
+	NICVF_DRV_STAT(tx_subdesc_err),
+	NICVF_DRV_STAT(tx_max_size_exceeded),
+	NICVF_DRV_STAT(tx_imm_size_oflow),
+	NICVF_DRV_STAT(tx_data_seq_err),
+	NICVF_DRV_STAT(tx_mem_seq_err),
+	NICVF_DRV_STAT(tx_lock_viol),
+	NICVF_DRV_STAT(tx_data_fault),
+	NICVF_DRV_STAT(tx_tstmp_conflict),
+	NICVF_DRV_STAT(tx_tstmp_timeout),
+	NICVF_DRV_STAT(tx_mem_fault),
+	NICVF_DRV_STAT(tx_csum_overlap),
+	NICVF_DRV_STAT(tx_csum_overflow),
+
 	NICVF_DRV_STAT(rcv_buffer_alloc_failures),
-	NICVF_DRV_STAT(tx_frames_ok),
 	NICVF_DRV_STAT(tx_tso),
-	NICVF_DRV_STAT(tx_drops),
 	NICVF_DRV_STAT(tx_timeout),
 	NICVF_DRV_STAT(txq_stop),
 	NICVF_DRV_STAT(txq_wake),
@@ -278,8 +287,8 @@ static void nicvf_get_ethtool_stats(struct net_device *netdev,
 				    struct ethtool_stats *stats, u64 *data)
 {
 	struct nicvf *nic = netdev_priv(netdev);
-	int stat;
-	int sqs;
+	int stat, tmp_stats;
+	int sqs, cpu;
 
 	nicvf_update_stats(nic);
 
@@ -289,9 +298,13 @@ static void nicvf_get_ethtool_stats(struct net_device *netdev,
 	for (stat = 0; stat < nicvf_n_hw_stats; stat++)
 		*(data++) = ((u64 *)&nic->hw_stats)
 				[nicvf_hw_stats[stat].index];
-	for (stat = 0; stat < nicvf_n_drv_stats; stat++)
-		*(data++) = ((u64 *)&nic->drv_stats)
-				[nicvf_drv_stats[stat].index];
+	for (stat = 0; stat < nicvf_n_drv_stats; stat++) {
+		tmp_stats = 0;
+		for_each_possible_cpu(cpu)
+			tmp_stats += ((u64 *)per_cpu_ptr(nic->drv_stats, cpu))
+				     [nicvf_drv_stats[stat].index];
+		*(data++) = tmp_stats;
+	}
 
 	nicvf_get_qset_stats(nic, stats, &data);
 
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 8f83361..9dc79c05 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -69,25 +69,6 @@ static inline u8 nicvf_netdev_qidx(struct nicvf *nic, u8 qidx)
 		return qidx;
 }
 
-static inline void nicvf_set_rx_frame_cnt(struct nicvf *nic,
-					  struct sk_buff *skb)
-{
-	if (skb->len <= 64)
-		nic->drv_stats.rx_frames_64++;
-	else if (skb->len <= 127)
-		nic->drv_stats.rx_frames_127++;
-	else if (skb->len <= 255)
-		nic->drv_stats.rx_frames_255++;
-	else if (skb->len <= 511)
-		nic->drv_stats.rx_frames_511++;
-	else if (skb->len <= 1023)
-		nic->drv_stats.rx_frames_1023++;
-	else if (skb->len <= 1518)
-		nic->drv_stats.rx_frames_1518++;
-	else
-		nic->drv_stats.rx_frames_jumbo++;
-}
-
 /* The Cavium ThunderX network controller can *only* be found in SoCs
  * containing the ThunderX ARM64 CPU implementation.  All accesses to the device
  * registers on this platform are implicitly strongly ordered with respect
@@ -514,7 +495,6 @@ static int nicvf_init_resources(struct nicvf *nic)
 }
 
 static void nicvf_snd_pkt_handler(struct net_device *netdev,
-				  struct cmp_queue *cq,
 				  struct cqe_send_t *cqe_tx,
 				  int cqe_type, int budget,
 				  unsigned int *tx_pkts, unsigned int *tx_bytes)
@@ -536,7 +516,7 @@ static void nicvf_snd_pkt_handler(struct net_device *netdev,
 		   __func__, cqe_tx->sq_qs, cqe_tx->sq_idx,
 		   cqe_tx->sqe_ptr, hdr->subdesc_cnt);
 
-	nicvf_check_cqe_tx_errs(nic, cq, cqe_tx);
+	nicvf_check_cqe_tx_errs(nic, cqe_tx);
 	skb = (struct sk_buff *)sq->skbuff[cqe_tx->sqe_ptr];
 	if (skb) {
 		/* Check for dummy descriptor used for HW TSO offload on 88xx */
@@ -630,8 +610,6 @@ static void nicvf_rcv_pkt_handler(struct net_device *netdev,
 		return;
 	}
 
-	nicvf_set_rx_frame_cnt(nic, skb);
-
 	nicvf_set_rxhash(netdev, cqe_rx, skb);
 
 	skb_record_rx_queue(skb, rq_idx);
@@ -703,7 +681,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
 			work_done++;
 		break;
 		case CQE_TYPE_SEND:
-			nicvf_snd_pkt_handler(netdev, cq,
+			nicvf_snd_pkt_handler(netdev,
 					      (void *)cq_desc, CQE_TYPE_SEND,
 					      budget, &tx_pkts, &tx_bytes);
 			tx_done++;
@@ -740,7 +718,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
 		nic = nic->pnicvf;
 		if (netif_tx_queue_stopped(txq) && netif_carrier_ok(netdev)) {
 			netif_tx_start_queue(txq);
-			nic->drv_stats.txq_wake++;
+			this_cpu_inc(nic->drv_stats->txq_wake);
 			if (netif_msg_tx_err(nic))
 				netdev_warn(netdev,
 					    "%s: Transmit queue wakeup SQ%d\n",
@@ -1084,7 +1062,7 @@ static netdev_tx_t nicvf_xmit(struct sk_buff *skb, struct net_device *netdev)
 
 	if (!netif_tx_queue_stopped(txq) && !nicvf_sq_append_skb(nic, skb)) {
 		netif_tx_stop_queue(txq);
-		nic->drv_stats.txq_stop++;
+		this_cpu_inc(nic->drv_stats->txq_stop);
 		if (netif_msg_tx_err(nic))
 			netdev_warn(netdev,
 				    "%s: Transmit ring full, stopping SQ%d\n",
@@ -1202,7 +1180,7 @@ static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
 
 int nicvf_open(struct net_device *netdev)
 {
-	int err, qidx;
+	int cpu, err, qidx;
 	struct nicvf *nic = netdev_priv(netdev);
 	struct queue_set *qs = nic->qs;
 	struct nicvf_cq_poll *cq_poll = NULL;
@@ -1262,6 +1240,11 @@ int nicvf_open(struct net_device *netdev)
 		nicvf_rss_init(nic);
 		if (nicvf_update_hw_max_frs(nic, netdev->mtu))
 			goto cleanup;
+
+		/* Clear percpu stats */
+		for_each_possible_cpu(cpu)
+			memset(per_cpu_ptr(nic->drv_stats, cpu), 0,
+			       sizeof(struct nicvf_drv_stats));
 	}
 
 	err = nicvf_register_interrupts(nic);
@@ -1288,9 +1271,6 @@ int nicvf_open(struct net_device *netdev)
 	for (qidx = 0; qidx < qs->rbdr_cnt; qidx++)
 		nicvf_enable_intr(nic, NICVF_INTR_RBDR, qidx);
 
-	nic->drv_stats.txq_stop = 0;
-	nic->drv_stats.txq_wake = 0;
-
 	return 0;
 cleanup:
 	nicvf_disable_intr(nic, NICVF_INTR_MBOX, 0);
@@ -1383,9 +1363,10 @@ void nicvf_update_lmac_stats(struct nicvf *nic)
 
 void nicvf_update_stats(struct nicvf *nic)
 {
-	int qidx;
+	int qidx, cpu;
+	u64 tmp_stats = 0;
 	struct nicvf_hw_stats *stats = &nic->hw_stats;
-	struct nicvf_drv_stats *drv_stats = &nic->drv_stats;
+	struct nicvf_drv_stats *drv_stats;
 	struct queue_set *qs = nic->qs;
 
 #define GET_RX_STATS(reg) \
@@ -1408,21 +1389,33 @@ void nicvf_update_stats(struct nicvf *nic)
 	stats->rx_drop_l3_bcast = GET_RX_STATS(RX_DRP_L3BCAST);
 	stats->rx_drop_l3_mcast = GET_RX_STATS(RX_DRP_L3MCAST);
 
-	stats->tx_bytes_ok = GET_TX_STATS(TX_OCTS);
-	stats->tx_ucast_frames_ok = GET_TX_STATS(TX_UCAST);
-	stats->tx_bcast_frames_ok = GET_TX_STATS(TX_BCAST);
-	stats->tx_mcast_frames_ok = GET_TX_STATS(TX_MCAST);
+	stats->tx_bytes = GET_TX_STATS(TX_OCTS);
+	stats->tx_ucast_frames = GET_TX_STATS(TX_UCAST);
+	stats->tx_bcast_frames = GET_TX_STATS(TX_BCAST);
+	stats->tx_mcast_frames = GET_TX_STATS(TX_MCAST);
 	stats->tx_drops = GET_TX_STATS(TX_DROP);
 
-	drv_stats->tx_frames_ok = stats->tx_ucast_frames_ok +
-				  stats->tx_bcast_frames_ok +
-				  stats->tx_mcast_frames_ok;
-	drv_stats->rx_frames_ok = stats->rx_ucast_frames +
-				  stats->rx_bcast_frames +
-				  stats->rx_mcast_frames;
-	drv_stats->rx_drops = stats->rx_drop_red +
-			      stats->rx_drop_overrun;
-	drv_stats->tx_drops = stats->tx_drops;
+	/* On T88 pass 2.0, the dummy SQE added for TSO notification
+	 * via CQE has 'dont_send' set. Hence HW drops the pkt pointed
+	 * pointed by dummy SQE and results in tx_drops counter being
+	 * incremented. Subtracting it from tx_tso counter will give
+	 * exact tx_drops counter.
+	 */
+	if (nic->t88 && nic->hw_tso) {
+		for_each_possible_cpu(cpu) {
+			drv_stats = per_cpu_ptr(nic->drv_stats, cpu);
+			tmp_stats += drv_stats->tx_tso;
+		}
+		stats->tx_drops = tmp_stats - stats->tx_drops;
+	}
+	stats->tx_frames = stats->tx_ucast_frames +
+			   stats->tx_bcast_frames +
+			   stats->tx_mcast_frames;
+	stats->rx_frames = stats->rx_ucast_frames +
+			   stats->rx_bcast_frames +
+			   stats->rx_mcast_frames;
+	stats->rx_drops = stats->rx_drop_red +
+			  stats->rx_drop_overrun;
 
 	/* Update RQ and SQ stats */
 	for (qidx = 0; qidx < qs->rq_cnt; qidx++)
@@ -1436,18 +1429,17 @@ static struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev,
 {
 	struct nicvf *nic = netdev_priv(netdev);
 	struct nicvf_hw_stats *hw_stats = &nic->hw_stats;
-	struct nicvf_drv_stats *drv_stats = &nic->drv_stats;
 
 	nicvf_update_stats(nic);
 
 	stats->rx_bytes = hw_stats->rx_bytes;
-	stats->rx_packets = drv_stats->rx_frames_ok;
-	stats->rx_dropped = drv_stats->rx_drops;
+	stats->rx_packets = hw_stats->rx_frames;
+	stats->rx_dropped = hw_stats->rx_drops;
 	stats->multicast = hw_stats->rx_mcast_frames;
 
-	stats->tx_bytes = hw_stats->tx_bytes_ok;
-	stats->tx_packets = drv_stats->tx_frames_ok;
-	stats->tx_dropped = drv_stats->tx_drops;
+	stats->tx_bytes = hw_stats->tx_bytes;
+	stats->tx_packets = hw_stats->tx_frames;
+	stats->tx_dropped = hw_stats->tx_drops;
 
 	return stats;
 }
@@ -1460,7 +1452,7 @@ static void nicvf_tx_timeout(struct net_device *dev)
 		netdev_warn(dev, "%s: Transmit timed out, resetting\n",
 			    dev->name);
 
-	nic->drv_stats.tx_timeout++;
+	this_cpu_inc(nic->drv_stats->tx_timeout);
 	schedule_work(&nic->reset_task);
 }
 
@@ -1594,6 +1586,12 @@ static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_free_netdev;
 	}
 
+	nic->drv_stats = netdev_alloc_pcpu_stats(struct nicvf_drv_stats);
+	if (!nic->drv_stats) {
+		err = -ENOMEM;
+		goto err_free_netdev;
+	}
+
 	err = nicvf_set_qset_resources(nic);
 	if (err)
 		goto err_free_netdev;
@@ -1652,6 +1650,8 @@ static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	nicvf_unregister_interrupts(nic);
 err_free_netdev:
 	pci_set_drvdata(pdev, NULL);
+	if (nic->drv_stats)
+		free_percpu(nic->drv_stats);
 	free_netdev(netdev);
 err_release_regions:
 	pci_release_regions(pdev);
@@ -1679,6 +1679,8 @@ static void nicvf_remove(struct pci_dev *pdev)
 		unregister_netdev(pnetdev);
 	nicvf_unregister_interrupts(nic);
 	pci_set_drvdata(pdev, NULL);
+	if (nic->drv_stats)
+		free_percpu(nic->drv_stats);
 	free_netdev(netdev);
 	pci_release_regions(pdev);
 	pci_disable_device(pdev);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 3050177..86726ab 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -104,7 +104,8 @@ static inline int nicvf_alloc_rcv_buffer(struct nicvf *nic, gfp_t gfp,
 		nic->rb_page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN,
 					   order);
 		if (!nic->rb_page) {
-			nic->drv_stats.rcv_buffer_alloc_failures++;
+			this_cpu_inc(nic->pnicvf->drv_stats->
+				     rcv_buffer_alloc_failures);
 			return -ENOMEM;
 		}
 		nic->rb_page_offset = 0;
@@ -483,9 +484,12 @@ static void nicvf_reset_rcv_queue_stats(struct nicvf *nic)
 {
 	union nic_mbx mbx = {};
 
-	/* Reset all RXQ's stats */
+	/* Reset all RQ/SQ and VF stats */
 	mbx.reset_stat.msg = NIC_MBOX_MSG_RESET_STAT_COUNTER;
+	mbx.reset_stat.rx_stat_mask = 0x3FFF;
+	mbx.reset_stat.tx_stat_mask = 0x1F;
 	mbx.reset_stat.rq_stat_mask = 0xFFFF;
+	mbx.reset_stat.sq_stat_mask = 0xFFFF;
 	nicvf_send_msg_to_pf(nic, &mbx);
 }
 
@@ -1032,7 +1036,7 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic, struct snd_queue *sq, int qentry,
 		hdr->tso_max_paysize = skb_shinfo(skb)->gso_size;
 		/* For non-tunneled pkts, point this to L2 ethertype */
 		hdr->inner_l3_offset = skb_network_offset(skb) - 2;
-		nic->drv_stats.tx_tso++;
+		this_cpu_inc(nic->pnicvf->drv_stats->tx_tso);
 	}
 }
 
@@ -1164,7 +1168,7 @@ static int nicvf_sq_append_tso(struct nicvf *nic, struct snd_queue *sq,
 
 	nicvf_sq_doorbell(nic, skb, sq_num, desc_cnt);
 
-	nic->drv_stats.tx_tso++;
+	this_cpu_inc(nic->pnicvf->drv_stats->tx_tso);
 	return 1;
 }
 
@@ -1425,8 +1429,6 @@ void nicvf_update_sq_stats(struct nicvf *nic, int sq_idx)
 /* Check for errors in the receive cmp.queue entry */
 int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
 {
-	struct nicvf_hw_stats *stats = &nic->hw_stats;
-
 	if (!cqe_rx->err_level && !cqe_rx->err_opcode)
 		return 0;
 
@@ -1438,76 +1440,76 @@ int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
 
 	switch (cqe_rx->err_opcode) {
 	case CQ_RX_ERROP_RE_PARTIAL:
-		stats->rx_bgx_truncated_pkts++;
+		this_cpu_inc(nic->drv_stats->rx_bgx_truncated_pkts);
 		break;
 	case CQ_RX_ERROP_RE_JABBER:
-		stats->rx_jabber_errs++;
+		this_cpu_inc(nic->drv_stats->rx_jabber_errs);
 		break;
 	case CQ_RX_ERROP_RE_FCS:
-		stats->rx_fcs_errs++;
+		this_cpu_inc(nic->drv_stats->rx_fcs_errs);
 		break;
 	case CQ_RX_ERROP_RE_RX_CTL:
-		stats->rx_bgx_errs++;
+		this_cpu_inc(nic->drv_stats->rx_bgx_errs);
 		break;
 	case CQ_RX_ERROP_PREL2_ERR:
-		stats->rx_prel2_errs++;
+		this_cpu_inc(nic->drv_stats->rx_prel2_errs);
 		break;
 	case CQ_RX_ERROP_L2_MAL:
-		stats->rx_l2_hdr_malformed++;
+		this_cpu_inc(nic->drv_stats->rx_l2_hdr_malformed);
 		break;
 	case CQ_RX_ERROP_L2_OVERSIZE:
-		stats->rx_oversize++;
+		this_cpu_inc(nic->drv_stats->rx_oversize);
 		break;
 	case CQ_RX_ERROP_L2_UNDERSIZE:
-		stats->rx_undersize++;
+		this_cpu_inc(nic->drv_stats->rx_undersize);
 		break;
 	case CQ_RX_ERROP_L2_LENMISM:
-		stats->rx_l2_len_mismatch++;
+		this_cpu_inc(nic->drv_stats->rx_l2_len_mismatch);
 		break;
 	case CQ_RX_ERROP_L2_PCLP:
-		stats->rx_l2_pclp++;
+		this_cpu_inc(nic->drv_stats->rx_l2_pclp);
 		break;
 	case CQ_RX_ERROP_IP_NOT:
-		stats->rx_ip_ver_errs++;
+		this_cpu_inc(nic->drv_stats->rx_ip_ver_errs);
 		break;
 	case CQ_RX_ERROP_IP_CSUM_ERR:
-		stats->rx_ip_csum_errs++;
+		this_cpu_inc(nic->drv_stats->rx_ip_csum_errs);
 		break;
 	case CQ_RX_ERROP_IP_MAL:
-		stats->rx_ip_hdr_malformed++;
+		this_cpu_inc(nic->drv_stats->rx_ip_hdr_malformed);
 		break;
 	case CQ_RX_ERROP_IP_MALD:
-		stats->rx_ip_payload_malformed++;
+		this_cpu_inc(nic->drv_stats->rx_ip_payload_malformed);
 		break;
 	case CQ_RX_ERROP_IP_HOP:
-		stats->rx_ip_ttl_errs++;
+		this_cpu_inc(nic->drv_stats->rx_ip_ttl_errs);
 		break;
 	case CQ_RX_ERROP_L3_PCLP:
-		stats->rx_l3_pclp++;
+		this_cpu_inc(nic->drv_stats->rx_l3_pclp);
 		break;
 	case CQ_RX_ERROP_L4_MAL:
-		stats->rx_l4_malformed++;
+		this_cpu_inc(nic->drv_stats->rx_l4_malformed);
 		break;
 	case CQ_RX_ERROP_L4_CHK:
-		stats->rx_l4_csum_errs++;
+		this_cpu_inc(nic->drv_stats->rx_l4_csum_errs);
 		break;
 	case CQ_RX_ERROP_UDP_LEN:
-		stats->rx_udp_len_errs++;
+		this_cpu_inc(nic->drv_stats->rx_udp_len_errs);
 		break;
 	case CQ_RX_ERROP_L4_PORT:
-		stats->rx_l4_port_errs++;
+		this_cpu_inc(nic->drv_stats->rx_l4_port_errs);
 		break;
 	case CQ_RX_ERROP_TCP_FLAG:
-		stats->rx_tcp_flag_errs++;
+		this_cpu_inc(nic->drv_stats->rx_tcp_flag_errs);
 		break;
 	case CQ_RX_ERROP_TCP_OFFSET:
-		stats->rx_tcp_offset_errs++;
+		this_cpu_inc(nic->drv_stats->rx_tcp_offset_errs);
 		break;
 	case CQ_RX_ERROP_L4_PCLP:
-		stats->rx_l4_pclp++;
+		this_cpu_inc(nic->drv_stats->rx_l4_pclp);
 		break;
 	case CQ_RX_ERROP_RBDR_TRUNC:
-		stats->rx_truncated_pkts++;
+		this_cpu_inc(nic->drv_stats->rx_truncated_pkts);
 		break;
 	}
 
@@ -1515,56 +1517,52 @@ int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
 }
 
 /* Check for errors in the send cmp.queue entry */
-int nicvf_check_cqe_tx_errs(struct nicvf *nic,
-			    struct cmp_queue *cq, struct cqe_send_t *cqe_tx)
+int nicvf_check_cqe_tx_errs(struct nicvf *nic, struct cqe_send_t *cqe_tx)
 {
-	struct cmp_queue_stats *stats = &cq->stats;
-
 	switch (cqe_tx->send_status) {
 	case CQ_TX_ERROP_GOOD:
-		stats->tx.good++;
 		return 0;
 	case CQ_TX_ERROP_DESC_FAULT:
-		stats->tx.desc_fault++;
+		this_cpu_inc(nic->drv_stats->tx_desc_fault);
 		break;
 	case CQ_TX_ERROP_HDR_CONS_ERR:
-		stats->tx.hdr_cons_err++;
+		this_cpu_inc(nic->drv_stats->tx_hdr_cons_err);
 		break;
 	case CQ_TX_ERROP_SUBDC_ERR:
-		stats->tx.subdesc_err++;
+		this_cpu_inc(nic->drv_stats->tx_subdesc_err);
 		break;
 	case CQ_TX_ERROP_MAX_SIZE_VIOL:
-		stats->tx.max_size_exceeded++;
+		this_cpu_inc(nic->drv_stats->tx_max_size_exceeded);
 		break;
 	case CQ_TX_ERROP_IMM_SIZE_OFLOW:
-		stats->tx.imm_size_oflow++;
+		this_cpu_inc(nic->drv_stats->tx_imm_size_oflow);
 		break;
 	case CQ_TX_ERROP_DATA_SEQUENCE_ERR:
-		stats->tx.data_seq_err++;
+		this_cpu_inc(nic->drv_stats->tx_data_seq_err);
 		break;
 	case CQ_TX_ERROP_MEM_SEQUENCE_ERR:
-		stats->tx.mem_seq_err++;
+		this_cpu_inc(nic->drv_stats->tx_mem_seq_err);
 		break;
 	case CQ_TX_ERROP_LOCK_VIOL:
-		stats->tx.lock_viol++;
+		this_cpu_inc(nic->drv_stats->tx_lock_viol);
 		break;
 	case CQ_TX_ERROP_DATA_FAULT:
-		stats->tx.data_fault++;
+		this_cpu_inc(nic->drv_stats->tx_data_fault);
 		break;
 	case CQ_TX_ERROP_TSTMP_CONFLICT:
-		stats->tx.tstmp_conflict++;
+		this_cpu_inc(nic->drv_stats->tx_tstmp_conflict);
 		break;
 	case CQ_TX_ERROP_TSTMP_TIMEOUT:
-		stats->tx.tstmp_timeout++;
+		this_cpu_inc(nic->drv_stats->tx_tstmp_timeout);
 		break;
 	case CQ_TX_ERROP_MEM_FAULT:
-		stats->tx.mem_fault++;
+		this_cpu_inc(nic->drv_stats->tx_mem_fault);
 		break;
 	case CQ_TX_ERROP_CK_OVERLAP:
-		stats->tx.csum_overlap++;
+		this_cpu_inc(nic->drv_stats->tx_csum_overlap);
 		break;
 	case CQ_TX_ERROP_CK_OFLOW:
-		stats->tx.csum_overflow++;
+		this_cpu_inc(nic->drv_stats->tx_csum_overflow);
 		break;
 	}
 
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 8f4718e..2e3c940 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -172,26 +172,6 @@ enum CQ_TX_ERROP_E {
 	CQ_TX_ERROP_ENUM_LAST = 0x8a,
 };
 
-struct cmp_queue_stats {
-	struct tx_stats {
-		u64 good;
-		u64 desc_fault;
-		u64 hdr_cons_err;
-		u64 subdesc_err;
-		u64 max_size_exceeded;
-		u64 imm_size_oflow;
-		u64 data_seq_err;
-		u64 mem_seq_err;
-		u64 lock_viol;
-		u64 data_fault;
-		u64 tstmp_conflict;
-		u64 tstmp_timeout;
-		u64 mem_fault;
-		u64 csum_overlap;
-		u64 csum_overflow;
-	} tx;
-} ____cacheline_aligned_in_smp;
-
 enum RQ_SQ_STATS {
 	RQ_SQ_STATS_OCTS,
 	RQ_SQ_STATS_PKTS,
@@ -243,7 +223,6 @@ struct cmp_queue {
 	spinlock_t	lock;  /* lock to serialize processing CQEs */
 	void		*desc;
 	struct q_desc_mem   dmem;
-	struct cmp_queue_stats	stats;
 	int		irq;
 } ____cacheline_aligned_in_smp;
 
@@ -338,6 +317,5 @@ u64  nicvf_queue_reg_read(struct nicvf *nic,
 void nicvf_update_rq_stats(struct nicvf *nic, int rq_idx);
 void nicvf_update_sq_stats(struct nicvf *nic, int sq_idx);
 int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx);
-int nicvf_check_cqe_tx_errs(struct nicvf *nic,
-			    struct cmp_queue *cq, struct cqe_send_t *cqe_tx);
+int nicvf_check_cqe_tx_errs(struct nicvf *nic, struct cqe_send_t *cqe_tx);
 #endif /* NICVF_QUEUES_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH 3/5] net: thunderx: Fix configuration of L3/L4 length checking
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev; +Cc: Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>

From: Sunil Goutham <sgoutham@cavium.com>

This patch fixes enabling of HW verification of L3/L4 length and
TCP/UDP checksum which is currently being cleared. Also fixed VLAN
stripping config which is being cleared when multiqset is enabled.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index f0e0ca6..3050177 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -538,9 +538,12 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
 	mbx.rq.cfg = (1ULL << 62) | (RQ_CQ_DROP << 8);
 	nicvf_send_msg_to_pf(nic, &mbx);
 
-	nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0, 0x00);
-	if (!nic->sqs_mode)
+	if (!nic->sqs_mode && (qidx == 0)) {
+		/* Enable checking L3/L4 length and TCP/UDP checksums */
+		nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0,
+				      ((1 << 24) | (1 << 23) | (1 << 21)));
 		nicvf_config_vlan_stripping(nic, nic->netdev->features);
+	}
 
 	/* Enable Receive queue */
 	memset(&rq_cfg, 0, sizeof(struct rq_cfg));
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/5] net: thunderx: Program LMAC credits based on MTU
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>

From: Sunil Goutham <sgoutham@cavium.com>

Programming LMAC credits taking 9K frame size by default is incorrect
as for an interface which is one of the many on the same BGX/QLM
no of credits available will be less as Tx FIFO will be divided
across all interfaces. So let's say a BGX with 40G interface and another
BGX with multiple 10G, bandwidth of 10G interfaces will be effected when
traffic is running on both 40G and 10G interfaces simultaneously.

This patch fixes this issue by programming credits based on netdev's MTU.
Also fixed configuring MTU to HW and added CQE counter for pkts which
exceed this value.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
 drivers/net/ethernet/cavium/thunder/nic.h          |  3 +-
 drivers/net/ethernet/cavium/thunder/nic_main.c     | 36 +++++++++++++-------
 drivers/net/ethernet/cavium/thunder/nic_reg.h      |  1 +
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 38 ++++++++++++----------
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  3 ++
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |  2 ++
 6 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 3042610..cd2d379 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -47,7 +47,7 @@
 
 /* Min/Max packet size */
 #define	NIC_HW_MIN_FRS			64
-#define	NIC_HW_MAX_FRS			9200 /* 9216 max packet including FCS */
+#define	NIC_HW_MAX_FRS			9190 /* Excluding L2 header and FCS */
 
 /* Max pkinds */
 #define	NIC_MAX_PKIND			16
@@ -282,7 +282,6 @@ struct nicvf {
 
 	u8			node;
 	u8			cpi_alg;
-	u16			mtu;
 	bool			link_up;
 	u8			duplex;
 	u32			speed;
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 2bbf4cb..85c9e62 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -11,6 +11,7 @@
 #include <linux/pci.h>
 #include <linux/etherdevice.h>
 #include <linux/of.h>
+#include <linux/if_vlan.h>
 
 #include "nic_reg.h"
 #include "nic.h"
@@ -260,18 +261,31 @@ static void nic_get_bgx_stats(struct nicpf *nic, struct bgx_stats_msg *bgx)
 /* Update hardware min/max frame size */
 static int nic_update_hw_frs(struct nicpf *nic, int new_frs, int vf)
 {
-	if ((new_frs > NIC_HW_MAX_FRS) || (new_frs < NIC_HW_MIN_FRS)) {
-		dev_err(&nic->pdev->dev,
-			"Invalid MTU setting from VF%d rejected, should be between %d and %d\n",
-			   vf, NIC_HW_MIN_FRS, NIC_HW_MAX_FRS);
+	int bgx, lmac, lmac_cnt;
+	u64 lmac_credits;
+
+	if ((new_frs > NIC_HW_MAX_FRS) || (new_frs < NIC_HW_MIN_FRS))
 		return 1;
-	}
-	new_frs += ETH_HLEN;
-	if (new_frs <= nic->pkind.maxlen)
-		return 0;
 
-	nic->pkind.maxlen = new_frs;
-	nic_reg_write(nic, NIC_PF_PKIND_0_15_CFG, *(u64 *)&nic->pkind);
+	bgx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
+	lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
+	lmac += bgx * MAX_LMAC_PER_BGX;
+
+	new_frs += VLAN_ETH_HLEN + ETH_FCS_LEN + 4;
+
+	/* Update corresponding LMAC credits */
+	lmac_cnt = bgx_get_lmac_count(nic->node, bgx);
+	lmac_credits = nic_reg_read(nic, NIC_PF_LMAC_0_7_CREDIT + (lmac * 8));
+	lmac_credits &= ~(0xFFFFFULL << 12);
+	lmac_credits |= (((((48 * 1024) / lmac_cnt) - new_frs) / 16) << 12);
+	nic_reg_write(nic, NIC_PF_LMAC_0_7_CREDIT + (lmac * 8), lmac_credits);
+
+	/* Enforce MTU in HW
+	 * This config is supported only from 88xx pass 2.0 onwards.
+	 */
+	if (!pass1_silicon(nic->pdev))
+		nic_reg_write(nic,
+			      NIC_PF_LMAC_0_7_CFG2 + (lmac * 8), new_frs);
 	return 0;
 }
 
@@ -464,7 +478,7 @@ static int nic_init_hw(struct nicpf *nic)
 
 	/* PKIND configuration */
 	nic->pkind.minlen = 0;
-	nic->pkind.maxlen = NIC_HW_MAX_FRS + ETH_HLEN;
+	nic->pkind.maxlen = NIC_HW_MAX_FRS + VLAN_ETH_HLEN + ETH_FCS_LEN + 4;
 	nic->pkind.lenerr_en = 1;
 	nic->pkind.rx_hdr = 0;
 	nic->pkind.hdr_sl = 0;
diff --git a/drivers/net/ethernet/cavium/thunder/nic_reg.h b/drivers/net/ethernet/cavium/thunder/nic_reg.h
index edf779f..80d4633 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_reg.h
+++ b/drivers/net/ethernet/cavium/thunder/nic_reg.h
@@ -106,6 +106,7 @@
 #define   NIC_PF_MPI_0_2047_CFG			(0x210000)
 #define   NIC_PF_RSSI_0_4097_RQ			(0x220000)
 #define   NIC_PF_LMAC_0_7_CFG			(0x240000)
+#define   NIC_PF_LMAC_0_7_CFG2			(0x240100)
 #define   NIC_PF_LMAC_0_7_SW_XOFF		(0x242000)
 #define   NIC_PF_LMAC_0_7_CREDIT		(0x244000)
 #define   NIC_PF_CHAN_0_255_TX_CFG		(0x400000)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 45a13f7..8f83361 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1189,6 +1189,17 @@ int nicvf_stop(struct net_device *netdev)
 	return 0;
 }
 
+static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
+{
+	union nic_mbx mbx = {};
+
+	mbx.frs.msg = NIC_MBOX_MSG_SET_MAX_FRS;
+	mbx.frs.max_frs = mtu;
+	mbx.frs.vf_id = nic->vf_id;
+
+	return nicvf_send_msg_to_pf(nic, &mbx);
+}
+
 int nicvf_open(struct net_device *netdev)
 {
 	int err, qidx;
@@ -1196,8 +1207,6 @@ int nicvf_open(struct net_device *netdev)
 	struct queue_set *qs = nic->qs;
 	struct nicvf_cq_poll *cq_poll = NULL;
 
-	nic->mtu = netdev->mtu;
-
 	netif_carrier_off(netdev);
 
 	err = nicvf_register_misc_interrupt(nic);
@@ -1248,9 +1257,12 @@ int nicvf_open(struct net_device *netdev)
 	if (nic->sqs_mode)
 		nicvf_get_primary_vf_struct(nic);
 
-	/* Configure receive side scaling */
-	if (!nic->sqs_mode)
+	/* Configure receive side scaling and MTU */
+	if (!nic->sqs_mode) {
 		nicvf_rss_init(nic);
+		if (nicvf_update_hw_max_frs(nic, netdev->mtu))
+			goto cleanup;
+	}
 
 	err = nicvf_register_interrupts(nic);
 	if (err)
@@ -1297,17 +1309,6 @@ int nicvf_open(struct net_device *netdev)
 	return err;
 }
 
-static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
-{
-	union nic_mbx mbx = {};
-
-	mbx.frs.msg = NIC_MBOX_MSG_SET_MAX_FRS;
-	mbx.frs.max_frs = mtu;
-	mbx.frs.vf_id = nic->vf_id;
-
-	return nicvf_send_msg_to_pf(nic, &mbx);
-}
-
 static int nicvf_change_mtu(struct net_device *netdev, int new_mtu)
 {
 	struct nicvf *nic = netdev_priv(netdev);
@@ -1318,10 +1319,13 @@ static int nicvf_change_mtu(struct net_device *netdev, int new_mtu)
 	if (new_mtu < NIC_HW_MIN_FRS)
 		return -EINVAL;
 
+	netdev->mtu = new_mtu;
+
+	if (!netif_running(netdev))
+		return 0;
+
 	if (nicvf_update_hw_max_frs(nic, new_mtu))
 		return -EINVAL;
-	netdev->mtu = new_mtu;
-	nic->mtu = new_mtu;
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index a4fc501..f0e0ca6 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -1530,6 +1530,9 @@ int nicvf_check_cqe_tx_errs(struct nicvf *nic,
 	case CQ_TX_ERROP_SUBDC_ERR:
 		stats->tx.subdesc_err++;
 		break;
+	case CQ_TX_ERROP_MAX_SIZE_VIOL:
+		stats->tx.max_size_exceeded++;
+		break;
 	case CQ_TX_ERROP_IMM_SIZE_OFLOW:
 		stats->tx.imm_size_oflow++;
 		break;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 869f338..8f4718e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -158,6 +158,7 @@ enum CQ_TX_ERROP_E {
 	CQ_TX_ERROP_DESC_FAULT = 0x10,
 	CQ_TX_ERROP_HDR_CONS_ERR = 0x11,
 	CQ_TX_ERROP_SUBDC_ERR = 0x12,
+	CQ_TX_ERROP_MAX_SIZE_VIOL = 0x13,
 	CQ_TX_ERROP_IMM_SIZE_OFLOW = 0x80,
 	CQ_TX_ERROP_DATA_SEQUENCE_ERR = 0x81,
 	CQ_TX_ERROP_MEM_SEQUENCE_ERR = 0x82,
@@ -177,6 +178,7 @@ struct cmp_queue_stats {
 		u64 desc_fault;
 		u64 hdr_cons_err;
 		u64 subdesc_err;
+		u64 max_size_exceeded;
 		u64 imm_size_oflow;
 		u64 data_seq_err;
 		u64 mem_seq_err;
-- 
2.7.4

^ permalink raw reply related

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-14 10:49 UTC (permalink / raw)
  To: André Roth
  Cc: Martin Blumenstingl, Johnson Leung, Giuseppe CAVALLARO,
	linux-amlogic, Alexandre Torgue, netdev
In-Reply-To: <20161113201339.667ac1f7@gmail.com>

On Sun, 2016-11-13 at 20:13 +0100, André Roth wrote:
> > 
> > Andre, the 3.14 kernel you are talking, is it this one ? : 
> > https://github.com/hardkernel/linux/tree/odroidc2-3.14.y
> 
> yes
>  
> > 
> > Because in drivers/net/phy/realtek.c, they disable EEE, but
> > also 1000Base-T Full Duplex advertisement ?
> > 
> > +	/* disable 1000m adv*/
> > +	val = phy_read(phydev, 0x9);
> > +	phy_write(phydev, 0x9, val&(~(1<<9)));
> > 
> > If this is the kernel you are running, you should not be able to
> > have
> > ethernet at 1000MB/s ? Or is it in half duplex mode ?
> 
> ethtool shows 1000Mb/s Full-Duplex and the bandwith is around 300Mb/s
> (as measured by scp). kernel version: 3.14.65-73

Andre,

I checked again the kernel at https://github.com/hardkernel/linux/tree/
odroidc2-3.14.y. The version you mention (3.14.65-73) seems to be:
sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036
tag: jenkins-deb s905 kernel-73

In this particular version, both realtek drivers:
- drivers/net/phy/realtek.c
- drivers/amlogic/ethernet/phy/am_realtek.c

have the hack to disable 1000M advertisement. I don't understand how it
possible for you to have 1000Base-T Full Duplex with this, maybe I'm
missing something here ?

If you did compile the kernel yourself, could you check the 2 file
mentioned above ? Just to be sure there was no patch applied at the
last minute, which would not show up in the git history of hardkernel ?

Thx

Jerome

> 
> 

^ permalink raw reply

* Re: [RFC v4 00/18] Landlock LSM: Unprivileged sandboxing
From: Sargun Dhillon @ 2016-11-14 10:35 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: LKML, Alexei Starovoitov, Andy Lutomirski, Daniel Borkmann,
	Daniel Mack, David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Kees Cook, Paul Moore, Serge E . Hallyn,
	Tejun Heo, Thomas Graf, Will Drewry,
	kernel-hardening-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Linux API, LSM,
	netdev, "open list:CONTROL GROUP (CGRO
In-Reply-To: <5828776A.1010104-WFhQfpSGs3bR7s880joybQ@public.gmane.org>

On Sun, Nov 13, 2016 at 6:23 AM, Mickaël Salaün <mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org> wrote:
> Hi,
>
> After the BoF at LPC last week, we came to a multi-step roadmap to
> upstream Landlock.
>
> A first patch series containing the basic properties needed for a
> "minimum viable product", which means being able to test it, without
> full features. The idea is to set in place the main components which
> include the LSM part (some hooks with the manager logic) and the new
> eBPF type. To have a minimum amount of code, the first userland entry
> point will be the seccomp syscall. This doesn't imply non-upstream
> patches and should be more simple. For the sake of simplicity and to
> ease the review, this first series will only be dedicated to privileged
> processes (i.e. with CAP_SYS_ADMIN). We may want to only allow one level
> of rules at first, instead of dealing with more complex rule inheritance
> (like seccomp-bpf can do).
>
> The second series will focus on the cgroup manager. It will follow the
> same rules of inheritance as the Daniel Mack's patches does.
>
> The third series will try to bring a BPF map of handles for Landlock and
> the dedicated BPF helpers.
>
> Finally, the fourth series will bring back the unprivileged mode (with
> no_new_privs), at least for process hierarchies (via seccomp). This also
> imply to handle multi-level of rules.
>
> Right now, an important point of attention is the userland ABI. We don't
> want LSM hooks to be exposed "as is" to userland. This may have some
> future implications if their semantic and/or enforcement point(s)
> change. In the next series, I will propose a new abstraction over the
> currently used LSM hooks. I'll also propose a new way to deal with
> resource accountability. Finally, I plan to create a minimal (kernel)
> developer documentation and a test suite.
>
> Regards,
>  Mickaël
>
>
> On 26/10/2016 08:56, Mickaël Salaün wrote:
>> Hi,
>>
>> This fourth RFC brings some improvements over the previous one [1]. An important
>> new point is the abstraction from the raw types of LSM hook arguments. It is
>> now possible to call a Landlock function the same way for LSM hooks with
>> different internal argument types. Some parts of the code are revamped with RCU
>> to properly deal with concurrency. From a userland point of view, the only
>> remaining link with seccomp-bpf is the ability to use the seccomp(2) syscall to
>> load and enforce a Landlock rule. Seccomp filters cannot trigger Landlock rules
>> anymore. For now, it is no more possible for an unprivileged user to enforce a
>> Landlock rule on a cgroup through delegation.
>>
>> As suggested, I plan to write documentation for userland and kernel developers
>> with some kind of guiding principles. A remaining question is how to enforce
>> limitations for the rule creation?
>>
>>
>> # Landlock LSM
>>
>> The goal of this new stackable Linux Security Module (LSM) called Landlock is
>> to allow any process, including unprivileged ones, to create powerful security
>> sandboxes comparable to the Seatbelt/XNU Sandbox or the OpenBSD Pledge. This
>> kind of sandbox is expected to help mitigate the security impact of bugs or
>> unexpected/malicious behaviors in userland applications.
>>
>> eBPF programs are used to create a security rule. They are very limited (i.e.
>> can only call a whitelist of functions) and cannot do a denial of service (i.e.
>> no loop). A new dedicated eBPF map allows to collect and compare Landlock
>> handles with system resources (e.g. files or network connections).
>>
>> The approach taken is to add the minimum amount of code while still allowing
>> the userland to create quite complex access rules. A dedicated security policy
>> language as the one used by SELinux, AppArmor and other major LSMs involves a
>> lot of code and is usually dedicated to a trusted user (i.e. root).
>>
>>
>> # eBPF
>>
>> To get an expressive language while still being safe and small, Landlock is
>> based on eBPF. Landlock should be usable by untrusted processes and must then
>> expose a minimal attack surface. The eBPF bytecode is minimal while powerful,
>> widely used and designed to be used by not so trusted application. Reusing this
>> code allows to not reproduce the same mistakes and minimize new code  while
>> still taking a generic approach. Only a few additional features are added like
>> a new kind of arraymap and some dedicated eBPF functions.
>>
>> An eBPF program has access to an eBPF context which contains the LSM hook
>> arguments (as does seccomp-bpf with syscall arguments). They can be used
>> directly or passed to helper functions according to their types. It is then
>> possible to do complex access checks without race conditions nor inconsistent
>> evaluation (i.e. incorrect mirroring of the OS code and state [2]).
>>
>> There is one eBPF program subtype per LSM hook. This allows to statically check
>> which context access is performed by an eBPF program. This is needed to deny
>> kernel address leak and ensure the right use of LSM hook arguments with eBPF
>> functions. Moreover, this safe pointer handling removes the need for runtime
>> check or abstract data, which improves performances. Any user can add multiple
>> Landlock eBPF programs per LSM hook. They are stacked and evaluated one after
>> the other (cf. seccomp-bpf).
>>
>>
>> # LSM hooks
>>
>> Unlike syscalls, LSM hooks are security checkpoints and are not architecture
>> dependent. They are designed to match a security need associated with a
>> security policy (e.g. access to a file). Exposing parts of some LSM hooks
>> instead of using the syscall API for sandboxing should help to avoid bugs and
>> hacks as encountered by the first RFC. Instead of redoing the work of the LSM
>> hooks through syscalls, we should use and expose them as does policies of
>> access control LSM.
>>
>> Only a subset of the hooks are meaningful for an unprivileged sandbox mechanism
>> (e.g. file system or network access control). Landlock uses an abstraction of
>> raw LSM hooks, which allow to deal with possible future API changes of the LSM
>> hook API. Moreover, thanks to the ePBF program typing (per LSM hook) used by
>> Landlock, it should not be hard to make such evolutions backward compatible.
>>
>>
>> # Use case scenario
>>
>> First, a process needs to create a new dedicated eBPF map containing handles.
>> This handles are references to system resources (e.g. file or directory) and
>> grouped in one or multiple maps to be efficiently managed and checked in
>> batches. This kind of map can be passed to Landlock eBPF functions to compare,
>> for example, with a file access request. The handles are only accessible from
>> the eBPF programs created by the same thread.
>>
>> The loaded Landlock eBPF programs can be triggered by a seccomp filter
>> returning RET_LANDLOCK. In addition, a cookie (16-bit value) can be passed from
>> a seccomp filter to eBPF programs. This allow flexible security policies
>> between seccomp and Landlock.
>>
>> Another way to enforce a Landlock security policy is to attach Landlock
>> programs to a dedicated cgroup. All the processes in this cgroup will then be
>> subject to this policy. For unprivileged processes, this can be done thanks to
>> cgroup delegation.
>>
>> A triggered Landlock eBPF program can allow or deny an access, according to
>> its subtype (i.e. LSM hook), thanks to errno return values.
>>
>>
>> # Sandbox example with process hierarchy sandboxing (seccomp)
>>
>>   $ ls /home
>>   user1
>>   $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>>       ./samples/landlock/sandbox /bin/sh -i
>>   Launching a new sandboxed process.
>>   $ ls /home
>>   ls: cannot access '/home': No such file or directory
>>
>>
>> # Sandbox example with conditional access control depending on a cgroup
>>
>>   $ mkdir /sys/fs/cgroup/sandboxed
>>   $ ls /home
>>   user1
>>   $ LANDLOCK_CGROUPS='/sys/fs/cgroup/sandboxed' \
>>       LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>>       ./samples/landlock/sandbox
>>   Ready to sandbox with cgroups.
>>   $ ls /home
>>   user1
>>   $ echo $$ > /sys/fs/cgroup/sandboxed/cgroup.procs
>>   $ ls /home
>>   ls: cannot access '/home': No such file or directory
>>
>>
>> # Current limitations and possible improvements
>>
>> For now, eBPF programs can only return an errno code. It may be interesting to
>> be able to do other actions like seccomp-bpf does (e.g. kill process). Such
>> features can easily be implemented but the main advantage of the current
>> approach is to be able to only execute eBPF programs until one returns an errno
>> code instead of executing all programs like seccomp-bpf does.
>>
>> It is quite easy to add new eBPF functions to extend Landlock. The main concern
>> should be about the possibility to leak information from current process to
>> another one (e.g. through maps) to not reproduce the same security sensitive
>> behavior as ptrace.
>>
>> This design does not seem too intrusive but is flexible enough to allow a
>> powerful sandbox mechanism accessible by any process on Linux. The use of
>> seccomp and Landlock is more suitable with the help of a userland library (e.g.
>> libseccomp) that could help to specify a high-level language to express a
>> security policy instead of raw eBPF programs. Moreover, thanks to LLVM, it is
>> possible to express an eBPF program with a subset of C.
>>
>>
>> # FAQ
>>
>> ## Why does seccomp-bpf is not enough?
>>
>> A seccomp filter can access to raw syscall arguments which means that it is not
>> possible to filter according to pointed such as a file path. As the first
>> version of this patch series demonstrated, filtering at the syscall level is
>> complicated (e.g. need to take care of race conditions). This is mainly because
>> the access control checkpoints of the kernel are not at this high-level but
>> more underneath, at LSM hooks level. The LSM hooks are designed to handle this
>> kind of checks. This series use this approach to leverage the ability of
>> unprivileged users to limit themselves.
>>
>> Cf. "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>>
>>
>> ## Why using the seccomp(2) syscall?
>>
>> Landlock use the same semantic as seccomp to apply access rule restrictions. It
>> add a new layer of security for the current process which is inherited by its
>> childs. It makes sense to use an unique access-restricting syscall (that should
>> be allowed by seccomp-bpf rules) which can only drop privileges. Moreover, a
>> Landlock eBPF program could come from outside a process (e.g. passed through a
>> UNIX socket). It is then useful to differentiate the creation/load of Landlock
>> eBPF programs via bpf(2), from rule enforcing via seccomp(2).
>>
>>
>> ## Why using cgroups?
>>
>> cgroups are designed to handle groups of processes. One use case is to manage
>> containers. Sandboxing based on process hierarchy (seccomp) is design to handle
>> immutable security policies, which is a good security property but does not
>> match all use cases. A user can attach Landlock rules to a cgroup. Doing so,
>> all the processes in that cgroup will be subject to the security policy.
>> However, if the user is allowed to manage this cgroup, it could dynamically
>> move this group of processes to a cgroup with another security policy (or
>> none). Landlock rules can be applied either on a process hierarchy (e.g.
>> application with built-in sandboxing) or a group of processes (e.g. container
>> sandboxing). Both approaches can be combined for the same process.
>>
>>
>> ## Does Landlock can limit network access or other resources?
>>
>> Limiting network access is obviously in the scope of Landlock but it is not yet
>> implemented. The main goal now is to get feedback about the whole concept, the
>> API and the file access control part. More access control types could be
>> implemented in the future.
>>
>> Sargun Dhillon sent a RFC (Checmate) [4] to deal with network manipulation.
>> This could be implemented on top of the Landlock framework.
>>
>>
>> ## Why a new LSM? Are SELinux, AppArmor, Smack or Tomoyo not good enough?
>>
>> The current access control LSMs are fine for their purpose which is to give the
>> *root* the ability to enforce a security policy for the *system*. What is
>> missing is a way to enforce a security policy for any applications by its
>> developer and *unprivileged user* as seccomp can do for raw syscall filtering.
>> Moreover, Landlock handles stacked hook programs from different users. It must
>> then ensure there is no possible malicious interactions between these programs.
>>
>> Differences with other (access control) LSMs:
>> * not only dedicated to administrators (i.e. no_new_priv);
>> * limited kernel attack surface (e.g. policy parsing);
>> * helpers to compare complex objects (path/FD), no access to internal kernel
>>   data (do not leak addresses);
>> * constrained policy rules/programs (no DoS: deterministic execution time);
>> * do not leak more information than the loader process can legitimately have
>>   access to (minimize metadata inference): must compare from an already allowed
>>   file (through a handle).
>>
>>
>> ## Why not use a policy language like used by SElinux or AppArmor?
>>
>> This kind of LSMs are dedicated to administrators. They already manage the
>> system and are not a threat to the system security. However, seccomp, and
>> Landlock too, should be available to anyone, which potentially include
>> untrusted users and processes. To reduce the attack surface, Landlock should
>> expose the minimum amount of code, hence minimal complexity. Moreover, another
>> threat is to make accessible to a malicious code a new way to gain more
>> information. For example, Landlock features should not allow a program to get
>> the file owner if the directory containing this file is not readable. This data
>> could then be exfiltrated thanks to the access result. Thus, we should limit
>> the expressiveness of the available checks. The current approach is to do the
>> checks in such a way that only a comparison with an already accessed resource
>> (e.g. file descriptor) is possible. This allow to have a reference to compare
>> with, without exposing much information.
>>
>>
>> ## As a developer, why do I need this feature?
>>
>> Landlock's goal is to help userland to limit its attack surface.
>> Security-conscious developers would like to protect users from a security bug
>> in their applications and the third-party dependencies they are using. Such a
>> bug can compromise all the user data and help an attacker to perform a
>> privilege escalation. Using an *unprivileged sandbox* feature such as Landlock
>> empowers the developer with the ability to properly compartmentalize its
>> software and limit the impact of vulnerabilities.
>>
>>
>> ## As a user, why do I need a this feature?
>>
>> Any user can already use seccomp-bpf to whitelist a set of syscalls to
>> reduce the kernel attack surface for a predefined set of processes. However an
>> unprivileged user can't create a security policy like the root user can thanks to
>> SELinux and other access control LSMs. Landlock allows any unprivileged user to
>> protect their data from being accessed by any process they run but only an
>> identified subset. User tools can be created to help create such a high-level
>> access control policy. This policy may not be powerful enough to express the
>> same policies as the current access control LSMs, because of the threat an
>> unprivileged user can be to the system, but it should be enough for most
>> use-cases (e.g. blacklist or whitelist a set of file hierarchies).
>>
>>
>> # Changes since RFC v3
>>
>> * use abstract LSM hook arguments with custom types (e.g. *_LANDLOCK_ARG_FS for
>>   struct file, struct inode and struct path)
>> * add more LSM hooks to support full file system access control
>> * improve the sandbox example
>> * fix races and RCU issues:
>>   * eBPF program execution and eBPF helpers
>>   * revamp the arraymap of handles to cleanly deal with update/delete
>> * eBPF program subtype for Landlock:
>>   * remove the "origin" field
>>   * add an "option" field
>> * rebase onto Daniel Mack's patches v7 [3]
>> * remove merged commit 1955351da41c ("bpf: Set register type according to
>>   is_valid_access()")
>> * fix spelling mistakes
>> * cleanup some type and variable names
>> * split patches
>> * for now, remove cgroup delegation handling for unprivileged user
>> * remove extra access check for cgroup_get_from_fd()
>> * remove unused example code dealing with skb
>> * remove seccomp-bpf link:
>>   * no more seccomp cookie
>>   * for now, it is no more possible to check the current syscall properties
>>
>>
>> # Changes since RFC v2
>>
>> * revamp cgroup handling:
>>   * use Daniel Mack's patches "Add eBPF hooks for cgroups" v5
>>   * remove bpf_landlock_cmp_cgroup_beneath()
>>   * make BPF_PROG_ATTACH usable with delegated cgroups
>>   * add a new CGRP_NO_NEW_PRIVS flag for safe cgroups
>>   * handle Landlock sandboxing for cgroups hierarchy
>>   * allow unprivileged processes to attach Landlock eBPF program to cgroups
>> * add subtype to eBPF programs:
>>   * replace Landlock hook identification by custom eBPF program types with a
>>     dedicated subtype field
>>   * manage fine-grained privileged Landlock programs
>>   * register Landlock programs for dedicated trigger origins (e.g. syscall,
>>     return from seccomp filter and/or interruption)
>> * performance and memory optimizations: use an array to access Landlock hooks
>>   directly but do not duplicated it for each thread (seccomp-based)
>> * allow running Landlock programs without seccomp filter
>> * fix seccomp-related issues
>> * remove extra errno bounding check for Landlock programs
>> * add some examples for optional eBPF functions or context access (network
>>   related) according to security checks to allow more features for privileged
>>   programs (e.g. Checmate)
>>
>>
>> # Changes since RFC v1
>>
>> * focus on the LSM hooks, not the syscalls:
>>   * much more simple implementation
>>   * does not need audit cache tricks to avoid race conditions
>>   * more simple to use and more generic because using the LSM hook abstraction
>>     directly
>>   * more efficient because only checking in LSM hooks
>>   * architecture agnostic
>> * switch from cBPF to eBPF:
>>   * new eBPF program types dedicated to Landlock
>>   * custom functions used by the eBPF program
>>   * gain some new features (e.g. 10 registers, can load values of different
>>       size, LLVM translator) but only a few functions allowed and a dedicated map
>>     type
>>   * new context: LSM hook ID, cookie and LSM hook arguments
>>   * need to set the sysctl kernel.unprivileged_bpf_disable to 0 (default value)
>>     to be able to load hook filters as unprivileged users
>> * smaller and simpler:
>>   * no more checker groups but dedicated arraymap of handles
>>   * simpler userland structs thanks to eBPF functions
>> * distinctive name: Landlock
>>
>>
>> This series can be applied on top of Daniel Mack's patches for BPF_PROG_ATTACH
>> v7 [3] on Linux v4.9-rc2. This can be tested with CONFIG_SECURITY_LANDLOCK,
>> CONFIG_SECCOMP_FILTER and CONFIG_CGROUP_BPF. I would really appreciate
>> constructive comments on the usability, architecture, code and userland API of
>> Landlock LSM.
>>
>> [1] https://lkml.kernel.org/r/20160914072415.26021-1-mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org
>> [2] https://crypto.stanford.edu/cs155/papers/traps.pdf
>> [3] https://lkml.kernel.org/r/1477390454-12553-1-git-send-email-daniel@zonque.org
>> [4] https://lkml.kernel.org/r/20160829114542.GA20836-I4sfFR6g6EicJoAdRrHjTitQHAD/DGy2@public.gmane.orgbus-611.internal
>>
>> Regards,
>>
>> Mickaël Salaün (18):
>>   landlock: Add Kconfig
>>   bpf: Move u64_to_ptr() to BPF headers and inline it
>>   bpf,landlock: Add a new arraymap type to deal with (Landlock) handles
>>   bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier
>>   bpf,landlock: Define an eBPF program type for Landlock
>>   fs: Constify path_is_under()'s arguments
>>   landlock: Add LSM hooks
>>   landlock: Handle file comparisons
>>   landlock: Add manager functions
>>   seccomp: Split put_seccomp_filter() with put_seccomp()
>>   seccomp,landlock: Handle Landlock hooks per process hierarchy
>>   bpf: Cosmetic change for bpf_prog_attach()
>>   bpf/cgroup: Replace struct bpf_prog with struct bpf_object
>>   bpf/cgroup: Make cgroup_bpf_update() return an error code
>>   bpf/cgroup: Move capability check
>>   bpf/cgroup,landlock: Handle Landlock hooks per cgroup
>>   landlock: Add update and debug access flags
>>   samples/landlock: Add sandbox example
>>
>>  fs/namespace.c                 |   2 +-
>>  include/linux/bpf-cgroup.h     |  19 +-
>>  include/linux/bpf.h            |  44 +++-
>>  include/linux/cgroup-defs.h    |   2 +
>>  include/linux/filter.h         |   1 +
>>  include/linux/fs.h             |   2 +-
>>  include/linux/landlock.h       |  95 +++++++++
>>  include/linux/lsm_hooks.h      |   5 +
>>  include/linux/seccomp.h        |  12 +-
>>  include/uapi/linux/bpf.h       | 105 ++++++++++
>>  include/uapi/linux/seccomp.h   |   1 +
>>  kernel/bpf/arraymap.c          | 270 +++++++++++++++++++++++++
>>  kernel/bpf/cgroup.c            | 139 ++++++++++---
>>  kernel/bpf/syscall.c           |  71 ++++---
>>  kernel/bpf/verifier.c          |  35 +++-
>>  kernel/cgroup.c                |   6 +-
>>  kernel/fork.c                  |  15 +-
>>  kernel/seccomp.c               |  26 ++-
>>  kernel/trace/bpf_trace.c       |  12 +-
>>  net/core/filter.c              |  26 ++-
>>  samples/Makefile               |   2 +-
>>  samples/bpf/bpf_helpers.h      |   5 +
>>  samples/landlock/.gitignore    |   1 +
>>  samples/landlock/Makefile      |  16 ++
>>  samples/landlock/sandbox.c     | 405 +++++++++++++++++++++++++++++++++++++
>>  security/Kconfig               |   1 +
>>  security/Makefile              |   2 +
>>  security/landlock/Kconfig      |  23 +++
>>  security/landlock/Makefile     |   3 +
>>  security/landlock/checker_fs.c | 152 ++++++++++++++
>>  security/landlock/checker_fs.h |  20 ++
>>  security/landlock/common.h     |  58 ++++++
>>  security/landlock/lsm.c        | 449 +++++++++++++++++++++++++++++++++++++++++
>>  security/landlock/manager.c    | 379 ++++++++++++++++++++++++++++++++++
>>  security/security.c            |   1 +
>>  35 files changed, 2309 insertions(+), 96 deletions(-)
>>  create mode 100644 include/linux/landlock.h
>>  create mode 100644 samples/landlock/.gitignore
>>  create mode 100644 samples/landlock/Makefile
>>  create mode 100644 samples/landlock/sandbox.c
>>  create mode 100644 security/landlock/Kconfig
>>  create mode 100644 security/landlock/Makefile
>>  create mode 100644 security/landlock/checker_fs.c
>>  create mode 100644 security/landlock/checker_fs.h
>>  create mode 100644 security/landlock/common.h
>>  create mode 100644 security/landlock/lsm.c
>>  create mode 100644 security/landlock/manager.c
>>
>

Was there a plan around getting Daniel's patches in as well? Also,
rather than making these handles landlock-specific, can they be
implemented in such a way where we can keep track of (some) of these
in other types of programs?

^ permalink raw reply

* Re: [PATCH net] sctp: change sk state only when it has assocs in sctp_shutdown
From: Marcelo Ricardo Leitner @ 2016-11-14 10:32 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, davem, Neil Horman, Vlad Yasevich,
	andreyknvl
In-Reply-To: <d260f8b59f52d7ef00b83a554fc19d4aa91766a2.1479044677.git.lucien.xin@gmail.com>

On Sun, Nov 13, 2016 at 09:44:37PM +0800, Xin Long wrote:
> Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
> this sock has no connection (assoc), sk state would be changed to
> SCTP_SS_CLOSING, which is not as we expect.
> 
> Besides, after that if users try to listen on this sock, kernel
> could even panic when it dereference sctp_sk(sk)->bind_hash in
> sctp_inet_listen, as bind_hash is null when sock has no assoc.
> 
> This patch is to move sk state change after checking sk assocs
> is not empty, and also merge these two if() conditions and reduce
> indent level.
> 
> Fixes: d46e416c11c8 ("sctp: sctp should change socket state when shutdown is received")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  net/sctp/socket.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index faa48ff..f23ad91 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4285,19 +4285,18 @@ static void sctp_shutdown(struct sock *sk, int how)
>  {
>  	struct net *net = sock_net(sk);
>  	struct sctp_endpoint *ep;
> -	struct sctp_association *asoc;
>  
>  	if (!sctp_style(sk, TCP))
>  		return;
>  
> -	if (how & SEND_SHUTDOWN) {
> +	ep = sctp_sk(sk)->ep;
> +	if (how & SEND_SHUTDOWN && !list_empty(&ep->asocs)) {
> +		struct sctp_association *asoc;
> +
>  		sk->sk_state = SCTP_SS_CLOSING;
> -		ep = sctp_sk(sk)->ep;
> -		if (!list_empty(&ep->asocs)) {
> -			asoc = list_entry(ep->asocs.next,
> -					  struct sctp_association, asocs);
> -			sctp_primitive_SHUTDOWN(net, asoc, NULL);
> -		}
> +		asoc = list_entry(ep->asocs.next,
> +				  struct sctp_association, asocs);
> +		sctp_primitive_SHUTDOWN(net, asoc, NULL);
>  	}
>  }
>  
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* [PATCH net-next v2 4/5] ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115760-1182-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

Adding validation support for the ETHTOOL_PHY_DOWNSHIFT. Functional
implementation needs to be done in the individual PHY drivers.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
 net/core/ethtool.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 61aebdf..e9b45567 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -122,6 +122,7 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
 static const char
 phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
 	[ETHTOOL_ID_UNSPEC]     = "Unspec",
+	[ETHTOOL_PHY_DOWNSHIFT]	= "phy-downshift",
 };
 
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
@@ -2435,6 +2436,11 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
 static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
 {
 	switch (tuna->id) {
+	case ETHTOOL_PHY_DOWNSHIFT:
+		if (tuna->len != sizeof(u8) ||
+		    tuna->type_id != ETHTOOL_TUNABLE_U8)
+			return -EINVAL;
+		break;
 	default:
 		return -EINVAL;
 	}
-- 
2.7.3

^ permalink raw reply related

* [patch net] mlxsw: spectrum_router: Flush FIB tables during fini
From: Jiri Pirko @ 2016-11-14 10:26 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz

From: Ido Schimmel <idosch@mellanox.com>

Since commit b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications
instead of switchdev calls") we reflect to the device the entire FIB
table and not only FIBs that point to netdevs created by the driver.

During module removal, FIBs of the second type are removed following
NETDEV_UNREGISTER events sent. The other FIBs are still present in both
the driver's cache and the device's table.

Fix this by iterating over all the FIB tables in the device and flush
them. There's no need to take locks, as we're the only writer.

Fixes: b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index cbeeddd..e83072d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -594,8 +594,11 @@ static int mlxsw_sp_vrs_init(struct mlxsw_sp *mlxsw_sp)
 	return 0;
 }
 
+static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
+
 static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
 {
+	mlxsw_sp_router_fib_flush(mlxsw_sp);
 	kfree(mlxsw_sp->router.vrs);
 }
 
@@ -1867,18 +1870,18 @@ static int mlxsw_sp_router_set_abort_trap(struct mlxsw_sp *mlxsw_sp)
 	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ralue), ralue_pl);
 }
 
-static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
+static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp)
 {
 	struct mlxsw_resources *resources;
 	struct mlxsw_sp_fib_entry *fib_entry;
 	struct mlxsw_sp_fib_entry *tmp;
 	struct mlxsw_sp_vr *vr;
 	int i;
-	int err;
 
 	resources = mlxsw_core_resources_get(mlxsw_sp->core);
 	for (i = 0; i < resources->max_virtual_routers; i++) {
 		vr = &mlxsw_sp->router.vrs[i];
+
 		if (!vr->used)
 			continue;
 
@@ -1894,6 +1897,13 @@ static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
 				break;
 		}
 	}
+}
+
+static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
+{
+	int err;
+
+	mlxsw_sp_router_fib_flush(mlxsw_sp);
 	mlxsw_sp->router.aborted = true;
 	err = mlxsw_sp_router_set_abort_trap(mlxsw_sp);
 	if (err)
-- 
2.7.4

^ permalink raw reply related

* RE: [PATCH net-next v7 03/10] dpaa_eth: add option to use one buffer pool set
From: Madalin-Cristian Bucur @ 2016-11-14 10:25 UTC (permalink / raw)
  To: David Miller
  Cc: pebolle@tiscali.nl, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, ppc@mindchasers.com,
	oss@buserror.net, joe@perches.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20161113.124617.2176429700446266337.davem@davemloft.net>

> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, November 13, 2016 7:46 PM
> 
> From: Madalin Bucur <madalin.bucur@nxp.com>
> Date: Fri, 11 Nov 2016 10:20:00 +0200
> 
> > @@ -8,3 +8,12 @@ menuconfig FSL_DPAA_ETH
> >  	  supporting the Freescale QorIQ chips.
> >  	  Depends on Freescale Buffer Manager and Queue Manager
> >  	  driver and Frame Manager Driver.
> > +
> > +if FSL_DPAA_ETH
> > +config FSL_DPAA_ETH_COMMON_BPOOL
> > +	bool "Use a common buffer pool set for all the interfaces"
> > +	---help---
> > +	  The DPAA Ethernet netdevices require buffer pools for storing the
> buffers
> > +	  used by the FMan hardware for reception. One can use a single
> buffer pool
> > +	  set for all interfaces or a dedicated buffer pool set for each
> interface.
> > +endif # FSL_DPAA_ETH
> 
> This in no way belongs in Kconfig.  If you want to support this,
> support it wit a run time configuration choice via ethtool flags
> or similar.  Do not use debugfs, do not use sysfs, do not use
> module options.
> 
> If you put it in Kconfig, distributions will have to pick one way or
> another which means that users who want the other choice lose.  This
> never works.

I've introduced this Kconfig option as a backwards compatible option, to
be able to run comparative tests between the independent buffer pool setup
and the previous common buffer pool setup. There are not so many reasons
to use the same buffer pool besides "having the old setup", the memory
saving is marginal, in all other aspects the separate buffer pools setup
fares better.

I'll remove this patch from the next submission. Should anyone care for
this I can add an entry to the feature backlog to add runtime support but
it will be quite low in priority.

Thank you for your review. 

Madalin

^ permalink raw reply

* Re: [PATCH nf-next,RFC] netfilter: nft_meta: add cgroup version 2 support
From: Daniel Mack @ 2016-11-14 10:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel-u79uwXL29TY76Z2rM5mHXA
  Cc: htejun-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479114761-19534-1-git-send-email-pablo-Cap9r6Oaw4JrovVCs/uTlw@public.gmane.org>

Hi Pablo,

On 11/14/2016 10:12 AM, Pablo Neira Ayuso wrote:
> Add cgroup version 2 support to nf_tables.
> 
> This extension allows us to fetch the cgroup i-node number from the
> cgroup socket data, place it in a register, then match it against any
> value specified by user. This approach scales up nicely since it
> integrates well in the existing nf_tables map infrastructure.
> 
> Contrary to what iptables cgroup v2 match does, this patch doesn't use
> cgroup_is_descendant() because this call cannot guarantee that the cgroup
> hierarchy is honored in anyway given that the cgroup v2 field becomes yet
> another packet selector that you can use to build your filtering policy.
> 
> Actually, using the i-node approach, it should be easy to build a policy
> that honors the hierarchy if you need this, eg.
> 
> 	meta cgroup2 vmap { "/A/B" : jump b-cgroup-chain,
> 			    "/A/C" : jump c-cgroup-chain,
> 			    "/A" : jump a-cgroup-chain }
> 
> then, the b-cgroup-chain looks like:
> 
> 	jump a-cgroup-chain
> 	... # specific policy b-cgroup-chain goes here
> 
> similarly, the c-cgroup-chain looks like:
> 
> 	jump a-cgroup-chain
> 	... # specific policy c-cgroup-chain goes here
> 
> So both B and C would evaluate A's ruleset. Note that cgroup A would
> also jump to the root cgroup chain policy.
> 
> Anyway, this cgroup i-node approach provides way more flexibility since
> it is up to the sysadmin to decide if he wants to honor the hierarchy or
> simply define a fast path to skip any further classification.

I don't think this can work. The problem is that inodes in cgroupfs are
dynamically allocated when a cgroup is created, so the sysadmin cannot
install the jump rules before that. Worse yet, inode numbers in pseudo
filesystems are recycled, so if a cgroup goes away and a new one is
created, the latter may well end up having the same inode than the old
one. As cgroupfs decoupled from netfilter tables, this will lead to
major chaos in the field.

Note that this was different with the netclass controller in v1 that
would assign a user-controlled numeric value to each cgroup, so both
sides were in the control of the sysadmin. It is also different with the
path matching logic for v2 which does a full path string comparison.
That's potentially expensive, it does lead to predictable runtime behavior.

One way forward here would be to assign a atomically increasing 64-bit
sequence number to each cgroup and expose that. I've recently talked to
Tejun about that. While that won't solve the predictability issue, it
would at least make it practically impossible to have re-used IDs.

Anyway - I think it would be great to have an alternative to the v2 path
matching here, but of course this patch does not solve the ingress issue
we've been discussing. It is still impossible to reliably determine the
cgroup of a local receiver at the time when the netfilter rules are
processed, even for unicast packets.



Thanks,
Daniel


> 
> Signed-off-by: Pablo Neira Ayuso <pablo-Cap9r6Oaw4JrovVCs/uTlw@public.gmane.org>
> ---
>  include/uapi/linux/netfilter/nf_tables.h |  2 ++
>  net/netfilter/nft_meta.c                 | 15 +++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> index 0da7ccf65511..5d4d08367a87 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -729,6 +729,7 @@ enum nft_exthdr_attributes {
>   * @NFT_META_OIFGROUP: packet output interface group
>   * @NFT_META_CGROUP: socket control group (skb->sk->sk_classid)
>   * @NFT_META_PRANDOM: a 32bit pseudo-random number
> + * @NFT_META_CGROUP2: socket control group v2 (skb->sk->sk_cgrp_data)
>   */
>  enum nft_meta_keys {
>  	NFT_META_LEN,
> @@ -756,6 +757,7 @@ enum nft_meta_keys {
>  	NFT_META_OIFGROUP,
>  	NFT_META_CGROUP,
>  	NFT_META_PRANDOM,
> +	NFT_META_CGROUP2,
>  };
>  
>  /**
> diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
> index 6c1e0246706e..1e793e133903 100644
> --- a/net/netfilter/nft_meta.c
> +++ b/net/netfilter/nft_meta.c
> @@ -190,6 +190,18 @@ void nft_meta_get_eval(const struct nft_expr *expr,
>  		*dest = prandom_u32_state(state);
>  		break;
>  	}
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> +	case NFT_META_CGROUP2: {
> +		struct cgroup *cgrp;
> +
> +		if (!skb->sk || !sk_fullsock(skb->sk))
> +			goto err;
> +
> +		cgrp = sock_cgroup_ptr(&skb->sk->sk_cgrp_data);
> +		*dest = cgrp->kn->ino;
> +		break;
> +	}
> +#endif
>  	default:
>  		WARN_ON(1);
>  		goto err;
> @@ -273,6 +285,9 @@ int nft_meta_get_init(const struct nft_ctx *ctx,
>  #ifdef CONFIG_CGROUP_NET_CLASSID
>  	case NFT_META_CGROUP:
>  #endif
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> +	case NFT_META_CGROUP2:
> +#endif
>  		len = sizeof(u32);
>  		break;
>  	case NFT_META_IIFNAME:
> 

^ permalink raw reply

* RE: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures
From: ASIX_Allan [Office] @ 2016-11-14  9:45 UTC (permalink / raw)
  To: 'Jon Hunter', robert.foss-ZGY8ohtN/8qB+jHODAdFcQ,
	freddy-knRN6Y/kmf1NUHwG+Fw1Kw,
	Dean_Jenkins-nmGgyN9QBj3QT0dZR+AlfA,
	Mark_Craske-nmGgyN9QBj3QT0dZR+AlfA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A,
	vpalatin-F7+t8E8rja9g9hUCZPvPmw,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	grundler-F7+t8E8rja9g9hUCZPvPmw,
	changchias-Re5JQEeQqe8AvxtiuMwx3w, andrew-g2DYL2Zd6BY,
	tremyfr-Re5JQEeQqe8AvxtiuMwx3w, colin.king-Z7WLFzj8eWMS+FvcfC7Uqw,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	vpalatin-hpIqsD4AKlfQT0dZR+AlfA
In-Reply-To: <f0d21edc-060b-bd9b-fced-faf1f51e8467-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>

Hi Jon,

Please help to double check if the USB host controller of your Terga
platform had been powered OFF while running the ax88772_suspend() routine or
not? 

---
Best regards,
Allan Chou


-----Original Message-----
From: Jon Hunter [mailto:jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org] 
Sent: Monday, November 14, 2016 5:34 PM
To: allan-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org; robert.foss-ZGY8ohtN/8qB+jHODAdFcQ@public.gmane.org; freddy-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org;
Dean_Jenkins-nmGgyN9QBj3QT0dZR+AlfA@public.gmane.org; Mark_Craske-nmGgyN9QBj3QT0dZR+AlfA@public.gmane.org; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org;
ivecera-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org; vpalatin-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org;
stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org; grundler-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org; changchias-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org;
andrew-g2DYL2Zd6BY@public.gmane.org; tremyfr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org;
linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; vpalatin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
Subject: Re: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures

Hi Allan,

On 14/11/16 08:50, ASIX_Allan [Home] wrote:
> It seems the AX88772x dongle had been unexpectedly removed while 
> running the
> ax88772_suspend() routine. If yes, you might see these error messages
> because the hardware had been absent.    

In my case the hardware was never removed. The boards are in a test fixture
that are not touched. This is seen on more than one board. By reverting this
change I no longer see the error messages and appears to be 100%
reproducible.

Jon

--
nvpublic

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH ethtool v2 1/2] ethtool-copy.h:sync with net
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115787-1265-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

This covers kernel changes upto:

commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Date:   Wed Nov 9 16:33:09 2016 +0530

    ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE

    For operation in cabling environments that are incompatible with 1000BAST-T,
    PHY device may provide an automatic link speed downshift operation. When
    enabled, the device automatically changes its 1000BAST-T auto-negotiation
    to the next slower speed after a configured number of failed attempts at
    1000BAST-T.
    This feature is useful in setting up in networks using older cable
    installations that include only pairs A and B, and not pairs C and D.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
 ethtool-copy.h | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
 	void	*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
+#define DOWNSHIFT_DEV_DISABLE		0
+
+enum phy_tunable_id {
+	ETHTOOL_PHY_ID_UNSPEC,
+	ETHTOOL_PHY_DOWNSHIFT,
+	/*
+	 * Add your fresh new phy tunable attribute above and remember to update
+	 * phy_tunable_strings[] in net/core/ethtool.c
+	 */
+	__ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
 /**
  * struct ethtool_regs - hardware register dump
  * @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
  */
 enum ethtool_stringset {
 	ETH_SS_TEST		= 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
 	ETH_SS_RSS_HASH_FUNCS,
 	ETH_SS_TUNABLES,
 	ETH_SS_PHY_STATS,
+	ETH_SS_PHY_TUNABLES,
 };
 
 /**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
 
 #define ETHTOOL_GLINKSETTINGS	0x0000004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS	0x0000004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE	0x0000004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE	0x0000004f /* Set PHY tunable configuration */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
-- 
2.7.3

^ permalink raw reply related

* Re: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures
From: Jon Hunter @ 2016-11-14  9:34 UTC (permalink / raw)
  To: allan, robert.foss, freddy, Dean_Jenkins, Mark_Craske, davem,
	ivecera, john.stultz, vpalatin, stephen, grundler, changchias,
	andrew, tremyfr, colin.king, linux-usb, netdev, linux-kernel,
	vpalatin
In-Reply-To: <00e501d23e54$337d4b70$9a77e250$@asix.com.tw>

Hi Allan,

On 14/11/16 08:50, ASIX_Allan [Home] wrote:
> It seems the AX88772x dongle had been unexpectedly removed while running the
> ax88772_suspend() routine. If yes, you might see these error messages
> because the hardware had been absent.    

In my case the hardware was never removed. The boards are in a test
fixture that are not touched. This is seen on more than one board. By
reverting this change I no longer see the error messages and appears to
be 100% reproducible.

Jon

-- 
nvpublic

^ permalink raw reply

* RE: [PATCH net-next v5] cadence: Add LSO support.
From: Rafal Ozieblo @ 2016-11-14  9:32 UTC (permalink / raw)
  To: David Miller
  Cc: nicolas.ferre@atmel.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

From: David Miller [mailto:davem@davemloft.net] 
Sent: 10 listopada 2016 18:01
To: Rafal Ozieblo
Cc: nicolas.ferre@atmel.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next v5]] cadence: Add LSO support.

From: Rafal Ozieblo <rafalo@cadence.com>
Date: Wed, 9 Nov 2016 13:41:02 +0000

> First, please remove the spurious closing bracket in your Subject line in future submittions.
>
>> +	if (is_udp) /* is_udp is only set when (is_lso) is checked */
>> +		/* zero UDP checksum, not calculated by h/w for UFO */
>> +		udp_hdr(skb)->check = 0;
>
> This is really not ok.
> 
> If UFO is in use it should not silently disable UDP checksums.
> 
> If you cannot support UFO with proper checksumming, then you cannot enable support for that feature.

According Cadence Gigabit Ethernet MAC documentation:

"Hardware will not calculate the UDP checksum or modify the UDP checksum field. Therefore software
must set a value of zero in the checksum field in the UDP header (in the first payload buffer)
to indicate to the receiver that the UDP datagram does not include a checksum."

It is hardware requirement.

^ permalink raw reply

* [PATCH ethtool v2 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115787-1265-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

Add ethtool get and set tunable to access PHY drivers.

Ethtool Help: ethtool -h for PHY tunables
    ethtool --set-phy-tunable DEVNAME      Set PHY tunable
                [ downshift on|off [count N] ]
    ethtool --get-phy-tunable DEVNAME      Get PHY tunable
                [ downshift ]

Ethtool ex:
  ethtool --set-phy-tuanble eth0 downshift on
  ethtool --set-phy-tuanble eth0 downshift off
  ethtool --set-phy-tuanble eth0 downshift on count 2

  ethtool --get-phy-tunable eth0 downshift

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
 ethtool.8.in |  39 ++++++++++++++++
 ethtool.c    | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 183 insertions(+)

diff --git a/ethtool.8.in b/ethtool.8.in
index 9631847..2242e90 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -340,6 +340,18 @@ ethtool \- query or control network driver and hardware settings
 .B2 tx-lpi on off
 .BN tx-timer
 .BN advertise
+.HP
+.B ethtool \-\-set\-phy\-tunable
+.I devname
+.RB [
+.B downshift
+.A1 on off
+.BN count
+.RB ]
+.HP
+.B ethtool \-\-get\-phy\-tunable
+.I devname
+.RB [ downshift ]
 .
 .\" Adjust lines (i.e. full justification) and hyphenate.
 .ad
@@ -947,6 +959,33 @@ Values are as for
 Sets the amount of time the device should stay in idle mode prior to asserting
 its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled.
 .RE
+.TP
+.B \-\-set\-phy\-tunable
+Sets the PHY tunable parameters.
+.RS 4
+.TP
+.A2 downshift on off
+Specifies whether downshift should be enabled
+.TS
+nokeep;
+lB	l.
+.BI count \ N
+Sets the PHY downshift re-tries count.
+.TE
+.PD
+.RE
+.TP
+.B \-\-get\-phy\-tunable
+Gets the PHY tunable parameters.
+.RS 4
+.TP
+.B downshift
+For operation in cabling environments that are incompatible with 1000BAST-T,
+PHY device provides an automatic link speed downshift operation.
+Link speed downshift after N failed 1000BASE-T auto-negotiation attempts.
+
+Gets the PHY downshift count/status.
+.RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
 .SH AUTHOR
diff --git a/ethtool.c b/ethtool.c
index 49ac94e..3b7ea4e 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx)
 	return 0;
 }
 
+static int do_get_phy_tunable(struct cmd_context *ctx)
+{
+	int argc = ctx->argc;
+	char **argp = ctx->argp;
+	int err, i;
+	u8 downshift_changed = 0;
+
+	if (argc < 1)
+		exit_bad_args();
+	for (i = 0; i < argc; i++) {
+		if (!strcmp(argp[i], "downshift")) {
+			downshift_changed = 1;
+			i += 1;
+		} else  {
+			exit_bad_args();
+		}
+	}
+
+	if (downshift_changed) {
+		struct ethtool_tunable ds;
+		u8 count = 0;
+
+		ds.cmd = ETHTOOL_PHY_GTUNABLE;
+		ds.id = ETHTOOL_PHY_DOWNSHIFT;
+		ds.type_id = ETHTOOL_TUNABLE_U8;
+		ds.len = 1;
+		ds.data[0] = &count;
+		err = send_ioctl(ctx, &ds);
+		if (err < 0) {
+			perror("Cannot Get PHY downshift count");
+			err = 87;
+		}
+		count = *((u8 *)&ds.data[0]);
+		if (count)
+			fprintf(stdout, "  Downshift count: %d\n",
+				count);
+		else
+			fprintf(stdout, " Downshift disabled\n");
+	}
+
+	return err;
+}
+
+static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on)
+{
+	if (ctx->argc < 2)
+		return 0;
+
+	if (strcmp(*ctx->argp, name))
+		return 0;
+
+	if (!strcmp(*(ctx->argp + 1), "on")) {
+		*on = 1;
+	} else if (!strcmp(*(ctx->argp + 1), "off")) {
+		*on = 0;
+	} else {
+		fprintf(stderr, "Invalid boolean\n");
+		exit_bad_args();
+	}
+
+	ctx->argc -= 2;
+	ctx->argp += 2;
+
+	return 1;
+}
+
+static int parse_named_u8(struct cmd_context *ctx, const char *name, u8 *val)
+{
+	if (ctx->argc < 2)
+		return 0;
+
+	if (strcmp(*ctx->argp, name))
+		return 0;
+
+	*val = get_uint_range(*(ctx->argp + 1), 0, 0xff);
+
+	ctx->argc -= 2;
+	ctx->argp += 2;
+
+	return 1;
+}
+
+static int do_set_phy_tunable(struct cmd_context *ctx)
+{
+	int err = 0;
+	u8 ds_cnt = DOWNSHIFT_DEV_DEFAULT_COUNT;
+	u8 ds_changed = 0, ds_has_cnt = 0, ds_enable = 0;
+
+	if (ctx->argc == 0)
+		exit_bad_args();
+
+	/* Parse arguments */
+	while (ctx->argc) {
+		if (parse_named_bool(ctx, "downshift", &ds_enable)) {
+			ds_changed = 1;
+			ds_has_cnt = parse_named_u8(ctx, "count", &ds_cnt);
+		} else {
+			exit_bad_args();
+		}
+	}
+
+	/* Validate parameters */
+	if (ds_changed) {
+		if (!ds_enable && ds_has_cnt) {
+			fprintf(stderr, "'count' may not be set when downshift "
+				        "is off.\n");
+			exit_bad_args();
+		}
+
+		if (ds_enable && ds_has_cnt && ds_cnt == 0) {
+			fprintf(stderr, "'count' may not be zero.\n");
+			exit_bad_args();
+		}
+
+		if (!ds_enable) {
+		    ds_cnt = DOWNSHIFT_DEV_DISABLE;
+		}
+	}
+
+	/* Do it */
+	if (ds_changed) {
+		struct ethtool_tunable ds;
+		u8 count;
+
+		ds.cmd = ETHTOOL_PHY_STUNABLE;
+		ds.id = ETHTOOL_PHY_DOWNSHIFT;
+		ds.type_id = ETHTOOL_TUNABLE_U8;
+		ds.len = 1;
+		ds.data[0] = &count;
+		*((u8 *)&ds.data[0]) = ds_cnt;
+		err = send_ioctl(ctx, &ds);
+		if (err < 0) {
+			perror("Cannot Set PHY downshift count");
+			err = 87;
+		}
+	}
+
+	return err;
+}
+
 #ifndef TEST_ETHTOOL
 int send_ioctl(struct cmd_context *ctx, void *cmd)
 {
@@ -4681,6 +4821,10 @@ static const struct option {
 	  "		[ advertise %x ]\n"
 	  "		[ tx-lpi on|off ]\n"
 	  "		[ tx-timer %d ]\n"},
+	{ "--set-phy-tunable", 1, do_set_phy_tunable, "Set PHY tunable",
+	  "		[ downshift on|off [count N] ]\n"},
+	{ "--get-phy-tunable", 1, do_get_phy_tunable, "Get PHY tunable",
+	  "		[ downshift ]\n"},
 	{ "-h|--help", 0, show_usage, "Show this help" },
 	{ "--version", 0, do_version, "Show version number" },
 	{}
-- 
2.7.3

^ permalink raw reply related

* [PATCH ethtool v2 0/2] Adding downshift support to ethtool
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen

This is a follow-up on the patch series posted at Fri, 4 Nov 2016 11:27:31 +0100
(old cover letter included below).


The following is changed/added/addressed:
- Syntax is changed from "--set-phy-tunable downshift on|off|%d" to
  "--set-phy-tunable [downshift on|off [count N]]" - as requested by Andrew.
- Short description of downshifting is added to the man page.


Please review

Best regards
Allan and Raju

Raju Lakkaraju (2):
  ethtool-copy.h:sync with net
  Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY
    downshift

 ethtool-copy.h |  18 +++++++-
 ethtool.8.in   |  39 ++++++++++++++++
 ethtool.c      | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 200 insertions(+), 1 deletion(-)

Old cover letter:
> From 5d926ca56f13e283aaa98e820d4720305be4da8f Mon Sep 17 00:00:00 2001
> From: "Allan W. Nielsen" <allan.nielsen@microsemi.com>
> Date: Fri, 4 Nov 2016 11:27:31 +0100
> Subject: [PATCH ethtool 0/2] Adding downshift support to ethtool
> 
> Hi All,
> 
> This patch implements for set/get downshifting.
> 
> Downshifting can either be turned on/off, or it can be configured to a
> specifc count.
> 
> If no "count" are provided, then it is up to the individual PHY driver to
> configure a default count.

-- 
2.7.3

^ permalink raw reply

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Dr. David Alan Gilbert @ 2016-11-14  9:29 UTC (permalink / raw)
  To: Russell King - ARM Linux, jasowang, vyasevic, stefanha
  Cc: David Woodhouse, netdev, qemu-devel
In-Reply-To: <20161111224427.GG1041@n2100.armlinux.org.uk>

* Russell King - ARM Linux (linux@armlinux.org.uk) wrote:
> On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> > It's also *fairly* unlikely that the kernel in the guest has developed
> > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > that qemu isn't properly emulating those bits. But at first glance at
> > the code, it looks like *that's* been there for the last decade too...
> 
> I take issue with that, having looked at the qemu rtl8139 code:
> 
>                 if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
>                 {
>                     int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> 
>                     DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
>                         "frame data %d specified MSS=%d\n", ETH_MTU,
>                         ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> 
> That's the only reference to "large_send_mss" there, other than that,
> the MSS value that gets stuck into the field by 8139cp.c is completely
> unused.  Instead, qemu does this:
> 
>                 eth_payload_data = saved_buffer + ETH_HLEN;
>                 eth_payload_len  = saved_size   - ETH_HLEN;
> 
>                 ip = (ip_header*)eth_payload_data;
> 
>                     hlen = IP_HEADER_LENGTH(ip);
>                     ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> 
>                     tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
>                     int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> 
>                     /* ETH_MTU = ip header len + tcp header len + payload */
>                     int tcp_data_len = ip_data_len - tcp_hlen;
>                     int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> 
>                     for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
>                     {
> 
> It uses a fixed value of ETH_MTU to calculate the size of the TCP
> data chunks, and this is not surprisingly the well known:
> 
> #define ETH_MTU     1500
> 
> Qemu seems to be buggy - it ignores the MSS value, and always tries to
> send 1500 byte frames.

cc'ing in Stefan who last touched that code and Jason and Vlad who
know the net code.

Dave

> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
> 

^ permalink raw reply

* [PATCH net-next v2 5/5] net: phy: Add downshift get/set support in Microsemi PHYs driver
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115760-1182-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

Implements the phy tunable function pointers and implement downshift
functionality for MSCC PHYs.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
 drivers/net/phy/mscc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index d0026ab..92018ba 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -46,8 +46,15 @@ enum rgmii_rx_clock_delay {
 
 #define MSCC_EXT_PAGE_ACCESS		  31
 #define MSCC_PHY_PAGE_STANDARD		  0x0000 /* Standard registers */
+#define MSCC_PHY_PAGE_EXTENDED		  0x0001 /* Extended registers */
 #define MSCC_PHY_PAGE_EXTENDED_2	  0x0002 /* Extended reg - page 2 */
 
+/* Extended Page 1 Registers */
+#define MSCC_PHY_ACTIPHY_CNTL		  20
+#define DOWNSHIFT_CNTL_MASK		  0x001C
+#define DOWNSHIFT_EN			  0x0010
+#define DOWNSHIFT_CNTL_POS		  2
+
 /* Extended Page 2 Registers */
 #define MSCC_PHY_RGMII_CNTL		  20
 #define RGMII_RX_CLK_DELAY_MASK		  0x0070
@@ -75,6 +82,8 @@ enum rgmii_rx_clock_delay {
 #define MSCC_VDDMAC_2500		  2500
 #define MSCC_VDDMAC_3300		  3300
 
+#define DOWNSHIFT_COUNT_MAX		  5
+
 struct vsc8531_private {
 	int rate_magic;
 };
@@ -101,6 +110,66 @@ static int vsc85xx_phy_page_set(struct phy_device *phydev, u8 page)
 	return rc;
 }
 
+static int vsc85xx_downshift_get(struct phy_device *phydev, u8 *count)
+{
+	int rc;
+	u16 reg_val;
+
+	mutex_lock(&phydev->lock);
+	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+	if (rc != 0)
+		goto out_unlock;
+
+	reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+	reg_val &= DOWNSHIFT_CNTL_MASK;
+	if (!(reg_val & DOWNSHIFT_EN))
+		*count = DOWNSHIFT_DEV_DISABLE;
+	else
+		*count = ((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2;
+	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+	mutex_unlock(&phydev->lock);
+
+	return rc;
+}
+
+static int vsc85xx_downshift_set(struct phy_device *phydev, u8 count)
+{
+	int rc;
+	u16 reg_val;
+
+	if (count == DOWNSHIFT_DEV_DEFAULT_COUNT) {
+		/* Default downshift count 3 (i.e. Bit3:2 = 0b01) */
+		count = ((1 << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+	} else if (count > DOWNSHIFT_COUNT_MAX || count == 1) {
+		phydev_err(phydev, "Downshift count should be 2,3,4 or 5\n");
+		return -ERANGE;
+	} else if (count) {
+		/* Downshift count is either 2,3,4 or 5 */
+		count = (((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+	}
+
+	mutex_lock(&phydev->lock);
+	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+	if (rc != 0)
+		goto out_unlock;
+
+	reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+	reg_val &= ~(DOWNSHIFT_CNTL_MASK);
+	reg_val |= count;
+	rc = phy_write(phydev, MSCC_PHY_ACTIPHY_CNTL, reg_val);
+	if (rc != 0)
+		goto out_unlock;
+
+	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+	mutex_unlock(&phydev->lock);
+
+	return rc;
+}
+
 static int vsc85xx_wol_set(struct phy_device *phydev,
 			   struct ethtool_wolinfo *wol)
 {
@@ -329,6 +398,29 @@ static int vsc85xx_default_config(struct phy_device *phydev)
 	return rc;
 }
 
+static int vsc85xx_get_tunable(struct phy_device *phydev,
+			       struct ethtool_tunable *tuna, void *data)
+{
+	switch (tuna->id) {
+	case ETHTOOL_PHY_DOWNSHIFT:
+		return vsc85xx_downshift_get(phydev, (u8 *)data);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int vsc85xx_set_tunable(struct phy_device *phydev,
+			       struct ethtool_tunable *tuna,
+			       const void *data)
+{
+	switch (tuna->id) {
+	case ETHTOOL_PHY_DOWNSHIFT:
+		return vsc85xx_downshift_set(phydev, *(u8 *)data);
+	default:
+		return -EINVAL;
+	}
+}
+
 static int vsc85xx_config_init(struct phy_device *phydev)
 {
 	int rc;
@@ -418,6 +510,8 @@ static struct phy_driver vsc85xx_driver[] = {
 	.probe		= &vsc85xx_probe,
 	.set_wol	= &vsc85xx_wol_set,
 	.get_wol	= &vsc85xx_wol_get,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
 },
 {
 	.phy_id		= PHY_ID_VSC8531,
@@ -437,6 +531,8 @@ static struct phy_driver vsc85xx_driver[] = {
 	.probe		= &vsc85xx_probe,
 	.set_wol	= &vsc85xx_wol_set,
 	.get_wol	= &vsc85xx_wol_get,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
 },
 {
 	.phy_id		= PHY_ID_VSC8540,
@@ -456,6 +552,8 @@ static struct phy_driver vsc85xx_driver[] = {
 	.probe		= &vsc85xx_probe,
 	.set_wol	= &vsc85xx_wol_set,
 	.get_wol	= &vsc85xx_wol_get,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
 },
 {
 	.phy_id		= PHY_ID_VSC8541,
@@ -475,6 +573,8 @@ static struct phy_driver vsc85xx_driver[] = {
 	.probe		= &vsc85xx_probe,
 	.set_wol	= &vsc85xx_wol_set,
 	.get_wol	= &vsc85xx_wol_get,
+	.get_tunable	= &vsc85xx_get_tunable,
+	.set_tunable	= &vsc85xx_set_tunable,
 }
 
 };
-- 
2.7.3

^ permalink raw reply related

* [PATCH net-next v2 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
From: Allan W. Nielsen @ 2016-11-14  9:29 UTC (permalink / raw)
  To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115760-1182-1-git-send-email-allan.nielsen@microsemi.com>

From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>

For operation in cabling environments that are incompatible with
1000BAST-T, PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BAST-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BAST-T.  This feature is useful in setting up in
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
 include/uapi/linux/ethtool.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 42f696f..f0db778 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,9 +248,12 @@ struct ethtool_tunable {
 	void	*data[0];
 };

+#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
+#define DOWNSHIFT_DEV_DISABLE		0
+
 enum phy_tunable_id {
 	ETHTOOL_PHY_ID_UNSPEC,
-
+	ETHTOOL_PHY_DOWNSHIFT,
 	/*
 	 * Add your fresh new phy tunable attribute above and remember to update
 	 * phy_tunable_strings[] in net/core/ethtool.c
-- 
2.7.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox