Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 1/2] Revert "net: use lib/percpu_counter API for fragmentation mem accounting"
From: Jesper Dangaard Brouer @ 2017-09-01  9:26 UTC (permalink / raw)
  To: netdev; +Cc: mkubecek, Florian Westphal, liujian56, Jesper Dangaard Brouer
In-Reply-To: <150425790711.22227.12264977619066874632.stgit@firesoul>

This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.

There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.

We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:

1) On systems with CPUs > 24, the heavier fully locked
   __percpu_counter_sum() is always invoked, which will be more
   expensive than the atomic_t that is reverted to.

Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option.  To mitigate this, the batch size could be
decreased and thresh be increased.

2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
   CPU, before SKBs are pushed into sockets on remote CPUs.  Given
   NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
   likely be limited.  Thus, a fair chance that atomic add+dec happen
   on the same CPU.

Revert note that commit 1d6119baf061 ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.

Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/inet_frag.h  |   30 +++++++++---------------------
 net/ipv4/inet_fragment.c |    4 +---
 2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6fdcd2427776..fa635aa6d0b9 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,14 +1,9 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__

-#include <linux/percpu_counter.h>
-
 struct netns_frags {
-	/* The percpu_counter "mem" need to be cacheline aligned.
-	 *  mem.count must not share cacheline with other writers
-	 */
-	struct percpu_counter   mem ____cacheline_aligned_in_smp;
-
+	/* Keep atomic mem on separate cachelines in structs that include it */
+	atomic_t		mem ____cacheline_aligned_in_smp;
 	/* sysctls */
 	int			timeout;
 	int			high_thresh;
@@ -110,11 +105,11 @@ void inet_frags_fini(struct inet_frags *);

 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
-	return percpu_counter_init(&nf->mem, 0, GFP_KERNEL);
+	atomic_set(&nf->mem, 0);
+	return 0;
 }
 static inline void inet_frags_uninit_net(struct netns_frags *nf)
 {
-	percpu_counter_destroy(&nf->mem);
 }

 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
@@ -140,31 +135,24 @@ static inline bool inet_frag_evicting(struct inet_frag_queue *q)

 /* Memory Tracking Functions. */

-/* The default percpu_counter batch size is not big enough to scale to
- * fragmentation mem acct sizes.
- * The mem size of a 64K fragment is approx:
- *  (44 fragments * 2944 truesize) + frag_queue struct(200) = 129736 bytes
- */
-static unsigned int frag_percpu_counter_batch = 130000;
-
 static inline int frag_mem_limit(struct netns_frags *nf)
 {
-	return percpu_counter_read(&nf->mem);
+	return atomic_read(&nf->mem);
 }

 static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	percpu_counter_add_batch(&nf->mem, -i, frag_percpu_counter_batch);
+	atomic_sub(i, &nf->mem);
 }

 static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
 {
-	percpu_counter_add_batch(&nf->mem, i, frag_percpu_counter_batch);
+	atomic_add(i, &nf->mem);
 }

-static inline unsigned int sum_frag_mem_limit(struct netns_frags *nf)
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
-	return percpu_counter_sum_positive(&nf->mem);
+	return atomic_read(&nf->mem);
 }

 /* RFC 3168 support :
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 96e95e83cc61..af74d0433453 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -234,10 +234,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 	cond_resched();

 	if (read_seqretry(&f->rnd_seqlock, seq) ||
-	    percpu_counter_sum(&nf->mem))
+	    sum_frag_mem_limit(nf))
 		goto evict_again;
-
-	percpu_counter_destroy(&nf->mem);
 }
 EXPORT_SYMBOL(inet_frags_exit_net);

^ permalink raw reply related

* [PATCH net 2/2] Revert "net: fix percpu memory leaks"
From: Jesper Dangaard Brouer @ 2017-09-01  9:26 UTC (permalink / raw)
  To: netdev; +Cc: mkubecek, Florian Westphal, liujian56, Jesper Dangaard Brouer
In-Reply-To: <150425790711.22227.12264977619066874632.stgit@firesoul>

This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.

After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
for fragmentation mem accounting") then here is no need for this
fix-up patch.  As percpu_counter is no longer used, it cannot
memory leak it any-longer.

Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/inet_frag.h                 |    7 +------
 net/ieee802154/6lowpan/reassembly.c     |   11 +++--------
 net/ipv4/ip_fragment.c                  |   12 +++---------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   12 +++---------
 net/ipv6/reassembly.c                   |   12 +++---------
 5 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index fa635aa6d0b9..fc59e0775e00 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -103,15 +103,10 @@ struct inet_frags {
 int inet_frags_init(struct inet_frags *);
 void inet_frags_fini(struct inet_frags *);
 
-static inline int inet_frags_init_net(struct netns_frags *nf)
+static inline void inet_frags_init_net(struct netns_frags *nf)
 {
 	atomic_set(&nf->mem, 0);
-	return 0;
 }
-static inline void inet_frags_uninit_net(struct netns_frags *nf)
-{
-}
-
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
 
 void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 30d875dff6b5..f85b08baff16 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -580,19 +580,14 @@ static int __net_init lowpan_frags_init_net(struct net *net)
 {
 	struct netns_ieee802154_lowpan *ieee802154_lowpan =
 		net_ieee802154_lowpan(net);
-	int res;
 
 	ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
 
-	res = inet_frags_init_net(&ieee802154_lowpan->frags);
-	if (res)
-		return res;
-	res = lowpan_frags_ns_sysctl_register(net);
-	if (res)
-		inet_frags_uninit_net(&ieee802154_lowpan->frags);
-	return res;
+	inet_frags_init_net(&ieee802154_lowpan->frags);
+
+	return lowpan_frags_ns_sysctl_register(net);
 }
 
 static void __net_exit lowpan_frags_exit_net(struct net *net)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9a8cfac503dc..46408c220d9d 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -844,8 +844,6 @@ static void __init ip4_frags_ctl_register(void)
 
 static int __net_init ipv4_frags_init_net(struct net *net)
 {
-	int res;
-
 	/* Fragment cache limits.
 	 *
 	 * The fragment memory accounting code, (tries to) account for
@@ -871,13 +869,9 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 
 	net->ipv4.frags.max_dist = 64;
 
-	res = inet_frags_init_net(&net->ipv4.frags);
-	if (res)
-		return res;
-	res = ip4_frags_ns_ctl_register(net);
-	if (res)
-		inet_frags_uninit_net(&net->ipv4.frags);
-	return res;
+	inet_frags_init_net(&net->ipv4.frags);
+
+	return ip4_frags_ns_ctl_register(net);
 }
 
 static void __net_exit ipv4_frags_exit_net(struct net *net)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 986d4ca38832..b263bf3a19f7 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -622,18 +622,12 @@ EXPORT_SYMBOL_GPL(nf_ct_frag6_gather);
 
 static int nf_ct_net_init(struct net *net)
 {
-	int res;
-
 	net->nf_frag.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->nf_frag.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->nf_frag.frags.timeout = IPV6_FRAG_TIMEOUT;
-	res = inet_frags_init_net(&net->nf_frag.frags);
-	if (res)
-		return res;
-	res = nf_ct_frag6_sysctl_register(net);
-	if (res)
-		inet_frags_uninit_net(&net->nf_frag.frags);
-	return res;
+	inet_frags_init_net(&net->nf_frag.frags);
+
+	return nf_ct_frag6_sysctl_register(net);
 }
 
 static void nf_ct_net_exit(struct net *net)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e1da5b888cc4..846012eae526 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -714,19 +714,13 @@ static void ip6_frags_sysctl_unregister(void)
 
 static int __net_init ipv6_frags_init_net(struct net *net)
 {
-	int res;
-
 	net->ipv6.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->ipv6.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
 
-	res = inet_frags_init_net(&net->ipv6.frags);
-	if (res)
-		return res;
-	res = ip6_frags_ns_sysctl_register(net);
-	if (res)
-		inet_frags_uninit_net(&net->ipv6.frags);
-	return res;
+	inet_frags_init_net(&net->ipv6.frags);
+
+	return ip6_frags_ns_sysctl_register(net);
 }
 
 static void __net_exit ipv6_frags_exit_net(struct net *net)

^ permalink raw reply related

* Re: [PATCH net-next 3/3] bpf: Only set node->ref = 1 if it has not been set
From: Daniel Borkmann @ 2017-09-01  9:28 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team
In-Reply-To: <20170901062713.1842249-4-kafai@fb.com>

On 09/01/2017 08:27 AM, Martin KaFai Lau wrote:
> This patch writes 'node->ref = 1' only if node->ref is 0.
> The number of lookups/s for a ~1M entries LRU map increased by
> ~30% (260097 to 343313).
>
> Other writes on 'node->ref = 0' is not changed.  In those cases, the
> same cache line has to be changed anyway.
>
> First column: Size of the LRU hash
> Second column: Number of lookups/s
>
> Before:
>> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
> 1048577: 260097
>
> After:
>> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
> 1048577: 343313
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* Re: [PATCH net 1/2] Revert "net: use lib/percpu_counter API for fragmentation mem accounting"
From: Florian Westphal @ 2017-09-01  9:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev
In-Reply-To: <150425796850.22227.6149901788992481211.stgit@firesoul>

Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.
> 
> There is a bug in fragmentation codes use of the percpu_counter API,
> that can cause issues on systems with many CPUs.

Acked-by: Florian Westphal <fw@strlen.de>

Thanks Jesper.

^ permalink raw reply

* [net 0/3] gianfar: Tx flow control fix (adjust_link)
From: Claudiu Manoil @ 2017-09-01  9:41 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller

Fix a small blunder in the Tx pause frame settings, that
went unnoticed in the tangled code of adjust_link().
I followed up with a couple of simple refactoring patches,
aiming to make adjust_link() more manageable.

(The last 2 patches may be postponed if they are too much
for the current stage of net.)

Claudiu Manoil (3):
  gianfar: Fix Tx flow control deactivation
  gianfar: Refactor link speed update (adjust_link)
  gianfar: Refactor link flow control update (adjust_link)

 drivers/net/ethernet/freescale/gianfar.c | 134 ++++++++++++++++---------------
 1 file changed, 70 insertions(+), 64 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [net 1/3] gianfar: Fix Tx flow control deactivation
From: Claudiu Manoil @ 2017-09-01  9:41 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, stable
In-Reply-To: <1504258863-2058-1-git-send-email-claudiu.manoil@nxp.com>

Indeed, the wrong register is checked for the Tx flow control bit,
it should have been maccfg1 not maccfg2.
This went unnoticed for so long probably because the impact is
hardly visible, not to mention the tangled code from adjust_link().
First, link flow control (i.e. handling of Rx/Tx link level pause frames)
is disabled by default (needs to be enabled via 'ethtool -A').
Secondly, maccfg2 always returns 0 for tx_flow_oldval (except for a few
old boards), which results in Tx flow control remaining always on
once activated.

Fixes: 45b679c9a3ccd9e34f28e6ec677b812a860eb8eb -
	"gianfar: Implement PAUSE frame generation support"

Cc: stable@kernel.org

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index c4b4b0a..5be52d8 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -3687,7 +3687,7 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 		u32 tempval1 = gfar_read(&regs->maccfg1);
 		u32 tempval = gfar_read(&regs->maccfg2);
 		u32 ecntrl = gfar_read(&regs->ecntrl);
-		u32 tx_flow_oldval = (tempval & MACCFG1_TX_FLOW);
+		u32 tx_flow_oldval = (tempval1 & MACCFG1_TX_FLOW);

 		if (phydev->duplex != priv->oldduplex) {
 			if (!(phydev->duplex))
-- 
2.7.4

^ permalink raw reply related

* [net 2/3] gianfar: Refactor link speed update (adjust_link)
From: Claudiu Manoil @ 2017-09-01  9:41 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller
In-Reply-To: <1504258863-2058-1-git-send-email-claudiu.manoil@nxp.com>

Encapsulate link speed update logic. These settings affect only
the maccfg2 and ecntrl regs.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 92 +++++++++++++++++---------------
 1 file changed, 49 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 5be52d8..46880a9 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -3672,6 +3672,54 @@ static u32 gfar_get_flowctrl_cfg(struct gfar_private *priv)
 	return val;
 }
 
+static void gfar_update_link_speed(struct gfar_private *priv)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	struct phy_device *phydev = priv->ndev->phydev;
+	u32 maccfg2, ecntrl;
+
+	maccfg2 = gfar_read(&regs->maccfg2);
+	ecntrl = gfar_read(&regs->ecntrl);
+
+	if (phydev->duplex != priv->oldduplex) {
+		if (!(phydev->duplex))
+			maccfg2 &= ~(MACCFG2_FULL_DUPLEX);
+		else
+			maccfg2 |= MACCFG2_FULL_DUPLEX;
+
+		priv->oldduplex = phydev->duplex;
+	}
+
+	if (phydev->speed != priv->oldspeed) {
+		switch (phydev->speed) {
+		case 1000:
+			maccfg2 = ((maccfg2 & ~(MACCFG2_IF)) | MACCFG2_GMII);
+
+			ecntrl &= ~(ECNTRL_R100);
+			break;
+		case 100:
+		case 10:
+			maccfg2 = ((maccfg2 & ~(MACCFG2_IF)) | MACCFG2_MII);
+
+			/* Reduced mode distinguishes between 10 and 100 */
+			if (phydev->speed == SPEED_100)
+				ecntrl |= ECNTRL_R100;
+			else
+				ecntrl &= ~(ECNTRL_R100);
+			break;
+		default:
+			netif_warn(priv, link, priv->ndev,
+				   "Invalid link speed!\n");
+			break;
+		}
+
+		priv->oldspeed = phydev->speed;
+	}
+
+	gfar_write(&regs->maccfg2, maccfg2);
+	gfar_write(&regs->ecntrl, ecntrl);
+}
+
 static noinline void gfar_update_link_state(struct gfar_private *priv)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
@@ -3685,49 +3733,9 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 
 	if (phydev->link) {
 		u32 tempval1 = gfar_read(&regs->maccfg1);
-		u32 tempval = gfar_read(&regs->maccfg2);
-		u32 ecntrl = gfar_read(&regs->ecntrl);
 		u32 tx_flow_oldval = (tempval1 & MACCFG1_TX_FLOW);
 
-		if (phydev->duplex != priv->oldduplex) {
-			if (!(phydev->duplex))
-				tempval &= ~(MACCFG2_FULL_DUPLEX);
-			else
-				tempval |= MACCFG2_FULL_DUPLEX;
-
-			priv->oldduplex = phydev->duplex;
-		}
-
-		if (phydev->speed != priv->oldspeed) {
-			switch (phydev->speed) {
-			case 1000:
-				tempval =
-				    ((tempval & ~(MACCFG2_IF)) | MACCFG2_GMII);
-
-				ecntrl &= ~(ECNTRL_R100);
-				break;
-			case 100:
-			case 10:
-				tempval =
-				    ((tempval & ~(MACCFG2_IF)) | MACCFG2_MII);
-
-				/* Reduced mode distinguishes
-				 * between 10 and 100
-				 */
-				if (phydev->speed == SPEED_100)
-					ecntrl |= ECNTRL_R100;
-				else
-					ecntrl &= ~(ECNTRL_R100);
-				break;
-			default:
-				netif_warn(priv, link, priv->ndev,
-					   "Ack!  Speed (%d) is not 10/100/1000!\n",
-					   phydev->speed);
-				break;
-			}
-
-			priv->oldspeed = phydev->speed;
-		}
+		gfar_update_link_speed(priv);
 
 		tempval1 &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
 		tempval1 |= gfar_get_flowctrl_cfg(priv);
@@ -3749,8 +3757,6 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 			priv->tx_actual_en = 0;
 
 		gfar_write(&regs->maccfg1, tempval1);
-		gfar_write(&regs->maccfg2, tempval);
-		gfar_write(&regs->ecntrl, ecntrl);
 
 		if (!priv->oldlink)
 			priv->oldlink = 1;
-- 
2.7.4

^ permalink raw reply related

* [net 3/3] gianfar: Refactor link flow control update (adjust_link)
From: Claudiu Manoil @ 2017-09-01  9:41 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller
In-Reply-To: <1504258863-2058-1-git-send-email-claudiu.manoil@nxp.com>

Encapsulate link layer flow control logic. These settings
are touching maccfg1 reg exclusively.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 68 ++++++++++++++++----------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 46880a9..1648173 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -3672,6 +3672,36 @@ static u32 gfar_get_flowctrl_cfg(struct gfar_private *priv)
 	return val;
 }
 
+static void gfar_update_link_flowctrl(struct gfar_private *priv)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 maccfg1, tx_flow_oldval;
+	int i;
+
+	maccfg1 = gfar_read(&regs->maccfg1);
+	tx_flow_oldval = (maccfg1 & MACCFG1_TX_FLOW);
+
+	maccfg1 &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
+	maccfg1 |= gfar_get_flowctrl_cfg(priv);
+
+	/* Turn last free buffer recording on */
+	if ((maccfg1 & MACCFG1_TX_FLOW) && !tx_flow_oldval) {
+		for (i = 0; i < priv->num_rx_queues; i++) {
+			u32 bdp_dma;
+
+			bdp_dma = gfar_rxbd_dma_lastfree(priv->rx_queue[i]);
+			gfar_write(priv->rx_queue[i]->rfbptr, bdp_dma);
+		}
+
+		priv->tx_actual_en = 1;
+	}
+
+	if (unlikely(!(maccfg1 & MACCFG1_TX_FLOW) && tx_flow_oldval))
+		priv->tx_actual_en = 0;
+
+	gfar_write(&regs->maccfg1, maccfg1);
+}
+
 static void gfar_update_link_speed(struct gfar_private *priv)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
@@ -3722,44 +3752,14 @@ static void gfar_update_link_speed(struct gfar_private *priv)
 
 static noinline void gfar_update_link_state(struct gfar_private *priv)
 {
-	struct gfar __iomem *regs = priv->gfargrp[0].regs;
-	struct net_device *ndev = priv->ndev;
-	struct phy_device *phydev = ndev->phydev;
-	struct gfar_priv_rx_q *rx_queue = NULL;
-	int i;
-
 	if (unlikely(test_bit(GFAR_RESETTING, &priv->state)))
 		return;
 
-	if (phydev->link) {
-		u32 tempval1 = gfar_read(&regs->maccfg1);
-		u32 tx_flow_oldval = (tempval1 & MACCFG1_TX_FLOW);
-
+	if (priv->ndev->phydev->link) {
+		gfar_update_link_flowctrl(priv);
 		gfar_update_link_speed(priv);
 
-		tempval1 &= ~(MACCFG1_TX_FLOW | MACCFG1_RX_FLOW);
-		tempval1 |= gfar_get_flowctrl_cfg(priv);
-
-		/* Turn last free buffer recording on */
-		if ((tempval1 & MACCFG1_TX_FLOW) && !tx_flow_oldval) {
-			for (i = 0; i < priv->num_rx_queues; i++) {
-				u32 bdp_dma;
-
-				rx_queue = priv->rx_queue[i];
-				bdp_dma = gfar_rxbd_dma_lastfree(rx_queue);
-				gfar_write(rx_queue->rfbptr, bdp_dma);
-			}
-
-			priv->tx_actual_en = 1;
-		}
-
-		if (unlikely(!(tempval1 & MACCFG1_TX_FLOW) && tx_flow_oldval))
-			priv->tx_actual_en = 0;
-
-		gfar_write(&regs->maccfg1, tempval1);
-
-		if (!priv->oldlink)
-			priv->oldlink = 1;
+		priv->oldlink = 1;
 
 	} else if (priv->oldlink) {
 		priv->oldlink = 0;
@@ -3768,7 +3768,7 @@ static noinline void gfar_update_link_state(struct gfar_private *priv)
 	}
 
 	if (netif_msg_link(priv))
-		phy_print_status(phydev);
+		phy_print_status(priv->ndev->phydev);
 }
 
 static const struct of_device_id gfar_match[] =
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next v7 08/10] bpf: Add a Landlock sandbox example
From: Alban Crequy @ 2017-09-01 10:25 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel@vger.kernel.org, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun 
In-Reply-To: <20170821000933.13024-9-mic@digikod.net>

Hi Mickaël,

On 21 August 2017 at 02:09, Mickaël Salaün <mic@digikod.net> wrote:
> Add a basic sandbox tool to create a process isolated from some part of
> the system. This sandbox create a read-only environment. It is only
> allowed to write to a character device such as a TTY:
...
> +       /*
> +        * This check allows the action on the file if it is a directory or a
> +        * pipe. Otherwise, a message is printed to the eBPF log.
> +        */
> +       if (S_ISCHR(ret) || S_ISFIFO(ret))
> +               return 0;


The comment says "directory", but the code checks for "character device".

Thanks!
Alban

^ permalink raw reply

* [net-next PATCH] ixgbe: add counter for times rx pages gets allocated, not recycled
From: Jesper Dangaard Brouer @ 2017-09-01 10:54 UTC (permalink / raw)
  To: netdev; +Cc: Jeff Kirsher, Alexander Duyck, Jesper Dangaard Brouer

The ixgbe driver have page recycle scheme based around the RX-ring
queue, where a RX page is shared between two packets. Based on the
refcnt, the driver can determine if the RX-page is currently only used
by a single packet, if so it can then directly refill/recycle the
RX-slot by with the opposite "side" of the page.

While this is a clever trick, it is hard to determine when this
recycling is successful and when it fails.  Adding a counter, which is
available via ethtool --statistics as 'alloc_rx_page'.  Which counts
the number of times the recycle fails and the real page allocator is
invoked.  When interpreting the stats, do remember that every alloc
will serve two packets.

The counter is collected per rx_ring, but is summed and ethtool
exported as 'alloc_rx_page'.  It would be relevant to know what
rx_ring that cannot keep up, but that can be exported later if
someone experience a need for this.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |    4 ++++
 3 files changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index dd5578756ae0..008d0085e01f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -275,6 +275,7 @@ struct ixgbe_rx_queue_stats {
 	u64 rsc_count;
 	u64 rsc_flush;
 	u64 non_eop_descs;
+	u64 alloc_rx_page;
 	u64 alloc_rx_page_failed;
 	u64 alloc_rx_buff_failed;
 	u64 csum_err;
@@ -655,6 +656,7 @@ struct ixgbe_adapter {
 	u64 rsc_total_count;
 	u64 rsc_total_flush;
 	u64 non_eop_descs;
+	u32 alloc_rx_page;
 	u32 alloc_rx_page_failed;
 	u32 alloc_rx_buff_failed;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 72c565712a5f..d96d9d6c3492 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -104,6 +104,7 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
 	{"tx_flow_control_xoff", IXGBE_STAT(stats.lxofftxc)},
 	{"rx_flow_control_xoff", IXGBE_STAT(stats.lxoffrxc)},
 	{"rx_csum_offload_errors", IXGBE_STAT(hw_csum_rx_error)},
+	{"alloc_rx_page", IXGBE_STAT(alloc_rx_page)},
 	{"alloc_rx_page_failed", IXGBE_STAT(alloc_rx_page_failed)},
 	{"alloc_rx_buff_failed", IXGBE_STAT(alloc_rx_buff_failed)},
 	{"rx_no_dma_resources", IXGBE_STAT(hw_rx_no_dma_resources)},
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d962368d08d0..7d2e4b08cdf4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1598,6 +1598,7 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,
 		rx_ring->rx_stats.alloc_rx_page_failed++;
 		return false;
 	}
+	rx_ring->rx_stats.alloc_rx_page++;
 
 	/* map page for use */
 	dma = dma_map_page_attrs(rx_ring->dev, page, 0,
@@ -6771,6 +6772,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
 	u32 i, missed_rx = 0, mpc, bprc, lxon, lxoff, xon_off_tot;
 	u64 non_eop_descs = 0, restart_queue = 0, tx_busy = 0;
 	u64 alloc_rx_page_failed = 0, alloc_rx_buff_failed = 0;
+	u64 alloc_rx_page = 0;
 	u64 bytes = 0, packets = 0, hw_csum_rx_error = 0;
 
 	if (test_bit(__IXGBE_DOWN, &adapter->state) ||
@@ -6791,6 +6793,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
 	for (i = 0; i < adapter->num_rx_queues; i++) {
 		struct ixgbe_ring *rx_ring = adapter->rx_ring[i];
 		non_eop_descs += rx_ring->rx_stats.non_eop_descs;
+		alloc_rx_page += rx_ring->rx_stats.alloc_rx_page;
 		alloc_rx_page_failed += rx_ring->rx_stats.alloc_rx_page_failed;
 		alloc_rx_buff_failed += rx_ring->rx_stats.alloc_rx_buff_failed;
 		hw_csum_rx_error += rx_ring->rx_stats.csum_err;
@@ -6798,6 +6801,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
 		packets += rx_ring->stats.packets;
 	}
 	adapter->non_eop_descs = non_eop_descs;
+	adapter->alloc_rx_page = alloc_rx_page;
 	adapter->alloc_rx_page_failed = alloc_rx_page_failed;
 	adapter->alloc_rx_buff_failed = alloc_rx_buff_failed;
 	adapter->hw_csum_rx_error = hw_csum_rx_error;

^ permalink raw reply related

* Re: [PATCH net] bridge: switchdev: Clear forward mark when transmitting packet
From: Nikolay Aleksandrov @ 2017-09-01 11:45 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: davem, stephen, jiri, yotamg, mlxsw, bridge
In-Reply-To: <20170901092225.31597-1-idosch@mellanox.com>

On 01/09/17 12:22, Ido Schimmel wrote:
> Commit 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for
> stacked devices") added the 'offload_fwd_mark' bit to the skb in order
> to allow drivers to indicate to the bridge driver that they already
> forwarded the packet in L2.
> 
> In case the bit is set, before transmitting the packet from each port,
> the port's mark is compared with the mark stored in the skb's control
> block. If both marks are equal, we know the packet arrived from a switch
> device that already forwarded the packet and it's not re-transmitted.
> 
> However, if the packet is transmitted from the bridge device itself
> (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
> stored in the skb's control block isn't valid.
> 
> This scenario can happen in rare cases where a packet was trapped during
> L3 forwarding and forwarded by the kernel to a bridge device.
> 
> Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices")
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Reported-by: Yotam Gigi <yotamg@mellanox.com>
> Tested-by: Yotam Gigi <yotamg@mellanox.com>
> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  net/bridge/br_device.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index 861ae2a165f4..5a7be3bddfa9 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -53,6 +53,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>  	brstats->tx_bytes += skb->len;
>  	u64_stats_update_end(&brstats->syncp);
>  
> +#ifdef CONFIG_NET_SWITCHDEV
> +	skb->offload_fwd_mark = 0;
> +#endif
>  	BR_INPUT_SKB_CB(skb)->brdev = dev;
>  
>  	skb_reset_mac_header(skb);
> 

Good catch, just one minor nit since there is already an ifdef
switchdev/else in br_private.h, why not make this a helper and avoid the
ifdef/endif in here ? Currently there is no ifdef switchdev anywhere else.

Thanks,
 Nik

^ permalink raw reply

* [PATCH v2 0/5] net: mdio-mux: Misc fix
From: Corentin Labbe @ 2017-09-01 11:55 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe

Hello

This patch series fix minor problems found when working on the
dwmac-sun8i syscon mdio-mux.

Regards

Changes since v1:
- Removed obsolete comment about of_mdio_find_bus/put_device
- removed more DRV_VERSION

Corentin Labbe (5):
  net: mdio-mux: Fix NULL Comparison style
  net: mdio-mux: Remove unnecessary 'out of memory' message
  net: mdio-mux: printing driver version is useless
  net: mdio-mux-mmioreg: Can handle 8/16/32 bits registers
  net: mdio-mux: fix unbalanced put_device

 drivers/net/phy/Kconfig    |  2 +-
 drivers/net/phy/mdio-mux.c | 19 ++++---------------
 2 files changed, 5 insertions(+), 16 deletions(-)

-- 
2.13.5

^ permalink raw reply

* [PATCH v2 1/5] net: mdio-mux: Fix NULL Comparison style
From: Corentin Labbe @ 2017-09-01 11:56 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe
In-Reply-To: <20170901115604.27513-1-clabbe.montjoie@gmail.com>

This patch fix checkpatch warning about NULL Comparison style.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/mdio-mux.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c
index 942ceaf3fd3f..b18ad7082b88 100644
--- a/drivers/net/phy/mdio-mux.c
+++ b/drivers/net/phy/mdio-mux.c
@@ -120,7 +120,7 @@ int mdio_mux_init(struct device *dev,
 	}
 
 	pb = devm_kzalloc(dev, sizeof(*pb), GFP_KERNEL);
-	if (pb == NULL) {
+	if (!pb) {
 		ret_val = -ENOMEM;
 		goto err_pb_kz;
 	}
@@ -144,7 +144,7 @@ int mdio_mux_init(struct device *dev,
 		}
 
 		cb = devm_kzalloc(dev, sizeof(*cb), GFP_KERNEL);
-		if (cb == NULL) {
+		if (!cb) {
 			dev_err(dev,
 				"Error: Failed to allocate memory for child %pOF\n",
 				child_bus_node);
-- 
2.13.5

^ permalink raw reply related

* [PATCH v2 3/5] net: mdio-mux: printing driver version is useless
From: Corentin Labbe @ 2017-09-01 11:56 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe
In-Reply-To: <20170901115604.27513-1-clabbe.montjoie@gmail.com>

Remove the driver version information because this information
is not useful in an upstream kernel driver.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
---
 drivers/net/phy/mdio-mux.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c
index 5e08e89465c5..282828551bdd 100644
--- a/drivers/net/phy/mdio-mux.c
+++ b/drivers/net/phy/mdio-mux.c
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/phy.h>
 
-#define DRV_VERSION "1.0"
 #define DRV_DESCRIPTION "MDIO bus multiplexer driver"
 
 struct mdio_mux_child_bus;
@@ -179,7 +178,6 @@ int mdio_mux_init(struct device *dev,
 	}
 	if (pb->children) {
 		*mux_handle = pb;
-		dev_info(dev, "Version " DRV_VERSION "\n");
 		return 0;
 	}
 
@@ -212,6 +210,5 @@ void mdio_mux_uninit(void *mux_handle)
 EXPORT_SYMBOL_GPL(mdio_mux_uninit);
 
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
-MODULE_VERSION(DRV_VERSION);
 MODULE_AUTHOR("David Daney");
 MODULE_LICENSE("GPL");
-- 
2.13.5

^ permalink raw reply related

* [PATCH v2 4/5] net: mdio-mux-mmioreg: Can handle 8/16/32 bits registers
From: Corentin Labbe @ 2017-09-01 11:56 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe
In-Reply-To: <20170901115604.27513-1-clabbe.montjoie@gmail.com>

This patch fix an old information that mdio-mux-mmioreg can only handle
8bit registers.
This is not true anymore.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 5afe6fdcc968..a9d16a3af514 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -85,7 +85,7 @@ config MDIO_BUS_MUX_MMIOREG
 	  parent bus.  Child bus selection is under the control of one of
 	  the FPGA's registers.
 
-	  Currently, only 8-bit registers are supported.
+	  Currently, only 8/16/32 bits registers are supported.
 
 config MDIO_CAVIUM
 	tristate
-- 
2.13.5

^ permalink raw reply related

* [PATCH v2 5/5] net: mdio-mux: fix unbalanced put_device
From: Corentin Labbe @ 2017-09-01 11:56 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe
In-Reply-To: <20170901115604.27513-1-clabbe.montjoie@gmail.com>

mdio_mux_uninit() call put_device (unconditionally) because of
of_mdio_find_bus() in mdio_mux_init.
But of_mdio_find_bus is only called if mux_bus is empty.
If mux_bus is set, mdio_mux_uninit will print a "refcount_t: underflow"
trace.

This patch add a get_device in the other branch of "if (mux_bus)".

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
---
 drivers/net/phy/mdio-mux.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c
index 282828551bdd..6f75e9f27fed 100644
--- a/drivers/net/phy/mdio-mux.c
+++ b/drivers/net/phy/mdio-mux.c
@@ -116,6 +116,7 @@ int mdio_mux_init(struct device *dev,
 	} else {
 		parent_bus_node = NULL;
 		parent_bus = mux_bus;
+		get_device(&parent_bus->dev);
 	}
 
 	pb = devm_kzalloc(dev, sizeof(*pb), GFP_KERNEL);
@@ -184,9 +185,7 @@ int mdio_mux_init(struct device *dev,
 	dev_err(dev, "Error: No acceptable child buses found\n");
 	devm_kfree(dev, pb);
 err_pb_kz:
-	/* balance the reference of_mdio_find_bus() took */
-	if (!mux_bus)
-		put_device(&parent_bus->dev);
+	put_device(&parent_bus->dev);
 err_parent_bus:
 	of_node_put(parent_bus_node);
 	return ret_val;
@@ -204,7 +203,6 @@ void mdio_mux_uninit(void *mux_handle)
 		cb = cb->next;
 	}
 
-	/* balance the reference of_mdio_find_bus() in mdio_mux_init() took */
 	put_device(&pb->mii_bus->dev);
 }
 EXPORT_SYMBOL_GPL(mdio_mux_uninit);
-- 
2.13.5

^ permalink raw reply related

* [PATCH v2 2/5] net: mdio-mux: Remove unnecessary 'out of memory' message
From: Corentin Labbe @ 2017-09-01 11:56 UTC (permalink / raw)
  To: andrew, f.fainelli; +Cc: netdev, linux-kernel, Corentin Labbe
In-Reply-To: <20170901115604.27513-1-clabbe.montjoie@gmail.com>

This patch fix checkpatch warning about unnecessary 'out of memory'
message.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/mdio-mux.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/phy/mdio-mux.c b/drivers/net/phy/mdio-mux.c
index b18ad7082b88..5e08e89465c5 100644
--- a/drivers/net/phy/mdio-mux.c
+++ b/drivers/net/phy/mdio-mux.c
@@ -145,9 +145,6 @@ int mdio_mux_init(struct device *dev,
 
 		cb = devm_kzalloc(dev, sizeof(*cb), GFP_KERNEL);
 		if (!cb) {
-			dev_err(dev,
-				"Error: Failed to allocate memory for child %pOF\n",
-				child_bus_node);
 			ret_val = -ENOMEM;
 			continue;
 		}
@@ -156,9 +153,6 @@ int mdio_mux_init(struct device *dev,
 
 		cb->mii_bus = mdiobus_alloc();
 		if (!cb->mii_bus) {
-			dev_err(dev,
-				"Error: Failed to allocate MDIO bus for child %pOF\n",
-				child_bus_node);
 			ret_val = -ENOMEM;
 			devm_kfree(dev, cb);
 			continue;
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net] bridge: switchdev: Clear forward mark when transmitting packet
From: Jiri Pirko @ 2017-09-01 12:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Ido Schimmel, netdev, davem, stephen, jiri, yotamg, mlxsw, bridge
In-Reply-To: <0384baf6-0786-2d07-8919-fdd1ca5d88eb@cumulusnetworks.com>

Fri, Sep 01, 2017 at 01:45:15PM CEST, nikolay@cumulusnetworks.com wrote:
>On 01/09/17 12:22, Ido Schimmel wrote:
>> Commit 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for
>> stacked devices") added the 'offload_fwd_mark' bit to the skb in order
>> to allow drivers to indicate to the bridge driver that they already
>> forwarded the packet in L2.
>> 
>> In case the bit is set, before transmitting the packet from each port,
>> the port's mark is compared with the mark stored in the skb's control
>> block. If both marks are equal, we know the packet arrived from a switch
>> device that already forwarded the packet and it's not re-transmitted.
>> 
>> However, if the packet is transmitted from the bridge device itself
>> (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
>> stored in the skb's control block isn't valid.
>> 
>> This scenario can happen in rare cases where a packet was trapped during
>> L3 forwarding and forwarded by the kernel to a bridge device.
>> 
>> Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices")
>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>> Reported-by: Yotam Gigi <yotamg@mellanox.com>
>> Tested-by: Yotam Gigi <yotamg@mellanox.com>
>> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  net/bridge/br_device.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
>> index 861ae2a165f4..5a7be3bddfa9 100644
>> --- a/net/bridge/br_device.c
>> +++ b/net/bridge/br_device.c
>> @@ -53,6 +53,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	brstats->tx_bytes += skb->len;
>>  	u64_stats_update_end(&brstats->syncp);
>>  
>> +#ifdef CONFIG_NET_SWITCHDEV
>> +	skb->offload_fwd_mark = 0;
>> +#endif
>>  	BR_INPUT_SKB_CB(skb)->brdev = dev;
>>  
>>  	skb_reset_mac_header(skb);
>> 
>
>Good catch, just one minor nit since there is already an ifdef
>switchdev/else in br_private.h, why not make this a helper and avoid the
>ifdef/endif in here ? Currently there is no ifdef switchdev anywhere else.

I think it would be better to convert this to a helper in -net-next and
take the patch as it is for -net

^ permalink raw reply

* Re: virtio_net: ethtool supported link modes
From: Radu Rendec @ 2017-09-01 12:01 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, virtualization, netdev, linux-kernel
In-Reply-To: <38286395-683f-f285-aaa5-68f0e8b68675@redhat.com>

On Fri, 2017-09-01 at 11:36 +0800, Jason Wang wrote:
> 
> On 2017年09月01日 01:04, Radu Rendec wrote:
> > Hello,
> > 
> > Looking at the code in virtnet_set_link_ksettings, it seems the speed
> > and duplex can be set to any valid value. The driver will "remember"
> > them and report them back in virtnet_get_link_ksettings.
> > 
> > However, the supported link modes (link_modes.supported in struct
> > ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> > setting is supported.
> > 
> > Does it make more sense to set (at least a few of) the supported link
> > modes, such as 10baseT_Half ... 10000baseT_Full?
> > 
> > I would expect to see consistency between what is reported in
> > link_modes.supported and what can actually be set. Could you please
> > share your opinion on this?
> 
> I think the may make sense only if there's a hardware implementation for 
> virtio. And we probably need to extend virtio spec for adding new commands.

So you're saying that the "hardware" should provide the supported link
modes (e.g. by using feature bits at the virtio layer) and the
virtio_net driver should just translate them and expose them as
link_modes.supported?

Then for consistency, I assume setting speed/duplex via ethtool should
also go into the virtio layer (currently virtio_net seems to just store
them for future retrieval via ethtool).

Thanks,
Radu

^ permalink raw reply

* Re: [PATCH net] bridge: switchdev: Clear forward mark when transmitting packet
From: Nikolay Aleksandrov @ 2017-09-01 12:03 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: mlxsw, yotamg, netdev, bridge, Ido Schimmel, jiri, davem
In-Reply-To: <20170901120120.GA1962@nanopsycho>

On 01/09/17 15:01, Jiri Pirko wrote:
> Fri, Sep 01, 2017 at 01:45:15PM CEST, nikolay@cumulusnetworks.com wrote:
>> On 01/09/17 12:22, Ido Schimmel wrote:
>>> Commit 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for
>>> stacked devices") added the 'offload_fwd_mark' bit to the skb in order
>>> to allow drivers to indicate to the bridge driver that they already
>>> forwarded the packet in L2.
>>>
>>> In case the bit is set, before transmitting the packet from each port,
>>> the port's mark is compared with the mark stored in the skb's control
>>> block. If both marks are equal, we know the packet arrived from a switch
>>> device that already forwarded the packet and it's not re-transmitted.
>>>
>>> However, if the packet is transmitted from the bridge device itself
>>> (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
>>> stored in the skb's control block isn't valid.
>>>
>>> This scenario can happen in rare cases where a packet was trapped during
>>> L3 forwarding and forwarded by the kernel to a bridge device.
>>>
>>> Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices")
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Reported-by: Yotam Gigi <yotamg@mellanox.com>
>>> Tested-by: Yotam Gigi <yotamg@mellanox.com>
>>> Reviewed-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  net/bridge/br_device.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
>>> index 861ae2a165f4..5a7be3bddfa9 100644
>>> --- a/net/bridge/br_device.c
>>> +++ b/net/bridge/br_device.c
>>> @@ -53,6 +53,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>>>  	brstats->tx_bytes += skb->len;
>>>  	u64_stats_update_end(&brstats->syncp);
>>>  
>>> +#ifdef CONFIG_NET_SWITCHDEV
>>> +	skb->offload_fwd_mark = 0;
>>> +#endif
>>>  	BR_INPUT_SKB_CB(skb)->brdev = dev;
>>>  
>>>  	skb_reset_mac_header(skb);
>>>
>>
>> Good catch, just one minor nit since there is already an ifdef
>> switchdev/else in br_private.h, why not make this a helper and avoid the
>> ifdef/endif in here ? Currently there is no ifdef switchdev anywhere else.
> 
> I think it would be better to convert this to a helper in -net-next and
> take the patch as it is for -net
> 

Either way is fine I guess, it's just more work for something as simple. :-)

Whichever way you choose,

Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH net-next v6 3/3] openvswitch: enable NSH support
From: Jan Scheurich @ 2017-09-01 12:11 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Mooney, Sean K
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, e@erig.me,
	jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
In-Reply-To: <87inh56q8u.fsf-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>

> > [Mooney, Sean K]
> > Having the nsh context headers in the flow is quite useful It would
> > allow loadblancing on values stored in the context headers Or other
> > use. I belive odl previously used context header 4 to store a Flow id
> > so this could potentialy be used with the multipath action to have ovs
> > Choose between several possible next hops in the chain.
> 
> In OVS, masks are a list(!) for matching. How can this work for different
> paths that might require different masks? If they can't be unified you even
> get exact matches. Thus, for OVS the context should not be part of the
> flow.

The NSH support in OVS 2.8 (for the user-space datapath only, so far) supports matching on and manipulating the fixed size MD1 context headers C1-C4. They are part of the flow and there are corresponding OXM fields defined. It is up to the SDN controller to program pipelines that match on or set these fields. 

The goal was to support all relevant NSH use cases for MD1: Classifier, SFF, and (with certain limitations) NSH proxy, and SF.

We also support MD2 TLV context headers but not yet for matching and setting, so MD2 TLVs are not part of the flow. OVS 2.8 can add MD2 context TLVs with the encap(nsh) action (classifier use case), can transparently forward MD2 headers (SFF use case) and pop an NSH header with MD2 context (final SFF use case). Support for matching and setting MD2 TLVs is FFS and can be added in a later release.

BR, Jan

^ permalink raw reply

* Re: [PATCH] DSA support for Micrel KSZ8895
From: Pavel Machek @ 2017-09-01 12:15 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: Woojung.Huh, nathan.leigh.conrad, vivien.didelot, f.fainelli,
	netdev, linux-kernel, andrew
In-Reply-To: <93AF473E2DA327428DE3D46B72B1E9FD4111FA48@CHN-SV-EXMX02.mchp-main.com>

[-- Attachment #1: Type: text/plain, Size: 2295 bytes --]

Hi!

On Wed 2017-08-30 21:32:07, Tristram.Ha@microchip.com wrote:
> > On Mon 2017-08-28 16:09:27, Andrew Lunn wrote:
> > > > I may be confused here, but AFAICT:
> > > >
> > > > 1) Yes, it has standard layout when accessed over MDIO.
> > >
> > >
> > > Section 4.8 of the datasheet says:
> > >
> > > 	All the registers defined in this section can be also accessed
> > > 	via the SPI interface.
> > >
> > > Meaning all PHY registers can be access via the SPI interface. So you
> > > should be able to make a standard Linux MDIO bus driver which performs
> > > SPI reads.
> > 
> > As far as I can tell (and their driver confirms) -- yes, all those registers can be
> > accessed over the SPI, they are just shuffled around... hence MDIO
> > emulation code. I copied it from their code (see the copyrights) so no, I don't
> > believe there's nicer solution.
> > 
> > Best regards,
> 
> Can you hold on your developing work on KSZ8895 driver?  I am afraid your effort may be in vain.  We at Microchip are planning to release DSA drivers for all KSZ switches, starting at KSZ8795, then KSZ8895, and KSZ8863.
>

Well, thanks for heads up... but its too late to stop now. I already
have working code, without the advanced features.

I don't know how far away you are with the development. You may want
to start from my driver (but its probably too late now).

> The driver files all follow the structures of the current KSZ9477 DSA driver, and the file tag_ksz.c will be updated to handle the tail tag of different chips, which requires including the ksz_priv.h header.  That is required nevertheless to support using the offload_fwd_mark indication.
> 
> The KSZ8795 driver will be submitted after Labor Day (9/4) if
> testing reveals no problem.  The KSZ8895 driver will be submitted
> right after that.  You should have no problem using the driver right
> away.

Well, normally world can help with the testing, too. It would be nice
to see the code, as [RFC]. There's great chance that you'll have to
modify the code to adress review feedback, which is why seeing code
soon is useful.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: Add limits for encapsulation and EH
From: Simon Horman @ 2017-09-01 12:22 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, alex.popov, hannes
In-Reply-To: <20170831222239.21509-3-tom@quantonium.net>

On Thu, Aug 31, 2017 at 03:22:39PM -0700, Tom Herbert wrote:
> In flow dissector there are no limits to the number of nested
> encapsulations that might be dissected which makes for a nice DOS
> attack. This patch limits for dissecting nested encapsulations
> as well as for dissecting over extension headers.
> 
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Tom Herbert <tom@quantonium.net>
> ---
>  net/core/flow_dissector.c | 48 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 5110180a3e96..1bca748de27d 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -396,6 +396,35 @@ __skb_flow_dissect_ipv6(const struct sk_buff *skb,
>  	key_ip->ttl = iph->hop_limit;
>  }
>  
> +/* Maximum number of nested encapsulations that can be processed in
> + * __skb_flow_dissect
> + */
> +#define MAX_FLOW_DISSECT_ENCAPS	5
> +
> +static bool skb_flow_dissect_encap_allowed(int *num_encaps, unsigned int *flags)
> +{
> +	++*num_encaps;
> +
> +	if (*num_encaps >= MAX_FLOW_DISSECT_ENCAPS) {
> +		if (*num_encaps == MAX_FLOW_DISSECT_ENCAPS) {
> +			/* Allow one more pass but ignore disregard

It seems that 'ignore' or 'disregard' should be dropped from the text above.

> +			 * further encapsulations
> +			 */
> +			*flags |= FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> +		} else {
> +			/* Max encaps reached */
> +			return  false;

There are two spaces between 'return' and 'false'.

...

^ permalink raw reply

* Re: [PATCH net-next 1/2] flow_dissector: Cleanup control flow
From: Simon Horman @ 2017-09-01 12:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, alex.popov, hannes
In-Reply-To: <20170831222239.21509-2-tom@quantonium.net>

Hi Tom,

On Thu, Aug 31, 2017 at 03:22:38PM -0700, Tom Herbert wrote:
> __skb_flow_dissect is riddled with gotos that make discerning the flow,
> debugging, and extending the capability difficult. This patch
> reorganizes things so that we only perform goto's after the two main
> switch statements (no gotos within the cases now). It also eliminates
> several goto labels so that there are only two labels that can be target
> for goto.

I agree that the flow of __skb_flow_dissect() is difficult to follow
but its not obvious that this significant change in terms of loc
takes us to a better place.

Maybe it makes follow-up work easier. If so perhaps it should be motivated
along those lines.

In any case I won't stand in the way of this change but I did want to throw
my 2c worth in.

> 
> Reported-by: Alexander Popov <alex.popov@linux.com>
> Signed-off-by: Tom Herbert <tom@quantonium.net>
> ---
>  include/net/flow_dissector.h |   9 ++
>  net/core/flow_dissector.c    | 225 ++++++++++++++++++++++++++++---------------
>  2 files changed, 156 insertions(+), 78 deletions(-)
> 
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index e2663e900b0a..c358c3ff6acc 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -19,6 +19,15 @@ struct flow_dissector_key_control {
>  #define FLOW_DIS_FIRST_FRAG	BIT(1)
>  #define FLOW_DIS_ENCAPSULATION	BIT(2)
>  
> +enum flow_dissect_ret {
> +	FLOW_DISSECT_RET_OUT_GOOD,
> +	FLOW_DISSECT_RET_OUT_BAD,
> +	FLOW_DISSECT_RET_PROTO_AGAIN,
> +	FLOW_DISSECT_RET_IPPROTO_AGAIN,
> +	FLOW_DISSECT_RET_IPPROTO_AGAIN_EH,
> +	FLOW_DISSECT_RET_CONTINUE,
> +};

Minor nit:

My reading is that this patch does not seem to differentiate between the
handling of FLOW_DISSECT_RET_IPPROTO_AGAIN and
FLOW_DISSECT_RET_IPPROTO_AGAIN_EH.  Perhaps it would be better to add
FLOW_DISSECT_RET_IPPROTO_AGAIN_EH in the following patch where it is used.

> +
>  /**
>   * struct flow_dissector_key_basic:
>   * @thoff: Transport header offset

...

^ permalink raw reply

* Re: [PATCH net-next 1/2] flow_dissector: Cleanup control flow
From: Hannes Frederic Sowa @ 2017-09-01 12:35 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, alex.popov
In-Reply-To: <20170831222239.21509-2-tom@quantonium.net>

Tom Herbert <tom@quantonium.net> writes:

> __skb_flow_dissect is riddled with gotos that make discerning the flow,
> debugging, and extending the capability difficult. This patch
> reorganizes things so that we only perform goto's after the two main
> switch statements (no gotos within the cases now). It also eliminates
> several goto labels so that there are only two labels that can be target
> for goto.

The problem with the

fdret = ... ;
break;

is that we now have to count curly braces to look what break does
actually break and where fdret is being processed. With the number of
context lines you send for the patch this is hard to review.

I tend to like the gotos a bit more for now.

[...]

Bye,
Hannes

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox