Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Daniel Borkmann @ 2016-11-29 18:42 UTC (permalink / raw)
  To: Mintz, Yuval, Jakub Kicinski
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	alexei.starovoitov@gmail.com
In-Reply-To: <BL2PR07MB23061A22119C747AA7E9A2FA8D8D0@BL2PR07MB2306.namprd07.prod.outlook.com>

On 11/29/2016 06:51 PM, Mintz, Yuval wrote:
>>> You also need to wrap this under rcu_read_lock() (at least I haven't seen
>>> it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
>>> protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
>>> disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
>>> doesn't.
>
>> My understanding was that Yuval is always doing full stop()/start() so
>> there should be no RX packets in flight while the XDP prog is being
>> changed.  But thinking about it again, perhaps is worth adding the
>> optimization to forego the full qede_reload() in qede_xdp_set() if there
>> is a program already loaded and just do the xchg()+put() (and add RCU
>> protection on the fast path)?
>
> Yeps. That the current state of the code.
> I'll surely pursue this later, but I don't think all this added complexity
> is required for the initial submission.
>
> BTW, one of the problems [or perhaps 'lack of motivation' is a better term]
> I had with the program switching scenario was that there was no sample
> application that did it.
> If it's really an interesting [basic] scenario, perhaps it's worthy to add
> a sample user app. that will repeatedly switch the attached eBPF?

Fwiw, I'm still waiting for Stephen to process his queue, and then I have
a patch for iproute2 to add a minimal initial front-end that can be useful
for experimenting/testing. The atomic switching scenario w/o stop()/start()
would definitely be useful when you need to fix an issue or modify behavior
in your currently loaded program on the fly when you cannot afford small
downtime.

^ permalink raw reply

* Re: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: David Miller @ 2016-11-29 18:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt
In-Reply-To: <1480088780.8455.543.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Nov 2016 07:46:20 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> mlx4 stats are chaotic because a deferred work queue is responsible
> to update them every 250 ms.
> 
> Even sampling stats every one second with "sar -n DEV 1" gives
> variations like the following :
 ...
> This patch allows rx/tx bytes/packets counters being folded at the
> time we need stats.
> 
> We now can fetch stats every 1 ms if we want to check NIC behavior
> on a small time window. It is also easier to detect anomalies.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>

Applied to net-next, thanks Eric.

^ permalink raw reply

* [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Robert Shearman @ 2016-11-29 17:46 UTC (permalink / raw)
  To: davem; +Cc: netdev, Alexander Duyck, Robert Shearman

With certain distributions of routes it can take a long time to add
and delete further routes at scale. For example, with a route for
10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
is update_suffix:

      40.39%  [kernel]                      [k] update_suffix
       8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup

With these changes, the time is reduced to 4s for the same scale and
distribution of routes.

The issue is that update_suffix does an O(n) walk on the children of a
node and the with a dense distribtion of routes the number of children
in a node tends towards the number of nodes in the tree.

In the add case it isn't necessary to walk all the other children to
update the largest suffix length of the node (which is what
update_suffix is doing) since we already know what the largest suffix
length of all the other children is and we only need to update it if
the new node/leaf has a larger suffix. However, it is currently called
from resize which is called on any update to rebalance the trie.

Therefore improve the scaling by moving the responsibility of
recalculating the node's largest suffix length out of the resize
function into its callers. For fib_insert_node this can be taken care
of by extending put_child to not only update the largest suffix length
in the parent node, but to propagate it all the way up the trie as
required, using a new function, node_push_suffix, based on
leaf_push_suffix, but renamed to reflect its purpose and made safe if
the node has no parent.

No changes are required to inflate/halve since the maximum suffix
length of the sub-trie starting from the node's parent isn't changed,
and the suffix lengths of the halved/inflated nodes are updated
through the creation of the new nodes and put_child called to add the
children to them.

In both fib_table_flush and fib_table_flush_external, given that a
walk of the entire trie is being done there's a choice of optimising
either for the case of a small number of leafs being flushed by
updating the suffix on a change and propagating it up, or optimising
for large numbers of leafs being flushed by deferring the updating of
the largest suffix length to the walk up to the parent node. I opted
for the latter to keep the algorithm linear in complexity in the case
of flushing all leafs and because it is close to the status quo.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/ipv4/fib_trie.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 62 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 026f309c51e9..701cae8af44a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n)
 	return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
 }
 
+static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
+{
+	while (tn->slen < l->slen) {
+		tn->slen = l->slen;
+		tn = node_parent(tn);
+		if (!tn)
+			break;
+	}
+}
+
 /* Add a child at position i overwriting the old value.
  * Update the value of full_children and empty_children.
+ *
+ * The suffix length of the parent node and the rest of the tree is
+ * updated (if required) when adding/replacing a node, but is caller's
+ * responsibility on removal.
  */
 static void put_child(struct key_vector *tn, unsigned long i,
 		      struct key_vector *n)
@@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i,
 	else if (!wasfull && isfull)
 		tn_info(tn)->full_children++;
 
-	if (n && (tn->slen < n->slen))
-		tn->slen = n->slen;
+	if (n)
+		node_push_suffix(tn, n);
 
 	rcu_assign_pointer(tn->tnode[i], n);
 }
@@ -919,34 +933,35 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	if (max_work != MAX_WORK)
 		return tp;
 
-	/* push the suffix length to the parent node */
-	if (tn->slen > tn->pos) {
-		unsigned char slen = update_suffix(tn);
-
-		if (slen > tp->slen)
-			tp->slen = slen;
-	}
-
 	return tp;
 }
 
-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
+static void node_set_suffix(struct key_vector *tp, unsigned char slen)
 {
-	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-		if (update_suffix(tp) > l->slen)
+	if (slen > tp->slen)
+		tp->slen = slen;
+}
+
+static void node_pull_suffix(struct key_vector *tn)
+{
+	struct key_vector *tp;
+	unsigned char slen;
+
+	slen = update_suffix(tn);
+	tp = node_parent(tn);
+	while ((tp->slen > tp->pos) && (tp->slen > slen)) {
+		if (update_suffix(tp) > slen)
 			break;
 		tp = node_parent(tp);
 	}
 }
 
-static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
+static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
 {
-	/* if this is a new leaf then tn will be NULL and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
-	while (tn->slen < l->slen) {
-		tn->slen = l->slen;
-		tn = node_parent(tn);
+	while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
+		if (update_suffix(tp) > l->slen)
+			break;
+		tp = node_parent(tp);
 	}
 }
 
@@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
 	/* if we added to the tail node then we need to update slen */
 	if (l->slen < new->fa_slen) {
 		l->slen = new->fa_slen;
-		leaf_push_suffix(tp, l);
+		node_push_suffix(tp, l);
 	}
 
 	return 0;
@@ -1495,12 +1510,15 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
 	/* remove the fib_alias from the list */
 	hlist_del_rcu(&old->fa_list);
 
-	/* if we emptied the list this leaf will be freed and we can sort
-	 * out parent suffix lengths as a part of trie_rebalance
-	 */
+	/* if we emptied the list this leaf will be freed */
 	if (hlist_empty(&l->leaf)) {
 		put_child_root(tp, l->key, NULL);
 		node_free(l);
+		/* only need to update suffixes if this alias was
+		 * possibly the one with the largest suffix in the parent
+		 */
+		if (tp->slen == old->fa_slen)
+			node_pull_suffix(tp);
 		trie_rebalance(t, tp);
 		return;
 	}
@@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* push the suffix length to the parent node,
+			 * since a previous leaf removal may have
+			 * caused it to change
+			 */
+			if (pn->slen > pn->pos) {
+				unsigned char slen = update_suffix(pn);
+
+				node_set_suffix(node_parent(pn), slen);
+			}
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
@@ -1849,6 +1877,16 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 			if (IS_TRIE(pn))
 				break;
 
+			/* push the suffix length to the parent node,
+			 * since a previous leaf removal may have
+			 * caused it to change
+			 */
+			if (pn->slen > pn->pos) {
+				unsigned char slen = update_suffix(pn);
+
+				node_set_suffix(node_parent(pn), slen);
+			}
+
 			/* resize completed node */
 			pn = resize(t, pn);
 			cindex = get_index(pkey, pn);
-- 
2.1.4

^ permalink raw reply related

* [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
From: Nikita Yushchenko @ 2016-11-29 18:35 UTC (permalink / raw)
  To: David S. Miller, Fugang Duan, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg, netdev
  Cc: Chris Healy, Fabio Estevam, linux-kernel, Nikita Yushchenko

Fec driver uses Rx buffers of 2k, but programs hardware to limit
incoming frames to 1522 bytes. This raises issues when FEC device
is used with DSA (since DSA tag can make frame larger), and also
disallows manual sending and receiving larger frames.

This patch removes the limitation, allowing Rx size up to entire buffer.
At the same time possible Tx size is increased as well, because hardware
uses the same register field to limit Rx and Tx.

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 33 +++++++++++--------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 73ac35780611..c09789a71024 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -171,30 +171,12 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 #endif
 #endif /* CONFIG_M5272 */
 
-/* The FEC stores dest/src/type/vlan, data, and checksum for receive packets.
- */
-#define PKT_MAXBUF_SIZE		1522
-#define PKT_MINBUF_SIZE		64
-#define PKT_MAXBLR_SIZE		1536
-
 /* FEC receive acceleration */
 #define FEC_RACC_IPDIS		(1 << 1)
 #define FEC_RACC_PRODIS		(1 << 2)
 #define FEC_RACC_SHIFT16	BIT(7)
 #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 
-/*
- * The 5270/5271/5280/5282/532x RX control register also contains maximum frame
- * size bits. Other FEC hardware does not, so we need to take that into
- * account when setting it.
- */
-#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
-    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
-#define	OPT_FRAME_SIZE	(PKT_MAXBUF_SIZE << 16)
-#else
-#define	OPT_FRAME_SIZE	0
-#endif
-
 /* FEC MII MMFR bits definition */
 #define FEC_MMFR_ST		(1 << 30)
 #define FEC_MMFR_OP_READ	(2 << 28)
@@ -847,7 +829,8 @@ static void fec_enet_enable_ring(struct net_device *ndev)
 	for (i = 0; i < fep->num_rx_queues; i++) {
 		rxq = fep->rx_queue[i];
 		writel(rxq->bd.dma, fep->hwp + FEC_R_DES_START(i));
-		writel(PKT_MAXBLR_SIZE, fep->hwp + FEC_R_BUFF_SIZE(i));
+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align,
+		       fep->hwp + FEC_R_BUFF_SIZE(i));
 
 		/* enable DMA1/2 */
 		if (i)
@@ -895,9 +878,17 @@ fec_restart(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	u32 val;
 	u32 temp_mac[2];
-	u32 rcntl = OPT_FRAME_SIZE | 0x04;
+	u32 rcntl = 0x04;
 	u32 ecntl = 0x2; /* ETHEREN */
 
+	/* The 5270/5271/5280/5282/532x RX control register also contains
+	 * maximum frame * size bits. Other FEC hardware does not.
+	 */
+#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
+    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
+	rcntl |= (FEC_ENET_RX_FRSIZE - fep->rx_align) << 16;
+#endif
+
 	/* Whack a reset.  We should wait for this.
 	 * For i.MX6SX SOC, enet use AXI bus, we use disable MAC
 	 * instead of reset MAC itself.
@@ -953,7 +944,7 @@ fec_restart(struct net_device *ndev)
 		else
 			val &= ~FEC_RACC_OPTIONS;
 		writel(val, fep->hwp + FEC_RACC);
-		writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align, fep->hwp + FEC_FTRL);
 	}
 #endif
 
-- 
2.1.4

^ permalink raw reply related

* [mm PATCH 3/3] mm: Add documentation for page fragment APIs
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This is a first pass at trying to add documentation for the page_frag APIs.
They may still change over time but for now I thought I would try to get
these documented so that as more network drivers and stack calls make use
of them we have one central spot to document how they are meant to be used.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 Documentation/vm/page_frags |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/vm/page_frags

diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
new file mode 100644
index 000000000000..a6714565dbf9
--- /dev/null
+++ b/Documentation/vm/page_frags
@@ -0,0 +1,42 @@
+Page fragments
+--------------
+
+A page fragment is an arbitrary-length arbitrary-offset area of memory
+which resides within a 0 or higher order compound page.  Multiple
+fragments within that page are individually refcounted, in the page's
+reference counter.
+
+The page_frag functions, page_frag_alloc and page_frag_free, provide a
+simple allocation framework for page fragments.  This is used by the
+network stack and network device drivers to provide a backing region of
+memory for use as either an sk_buff->head, or to be used in the "frags"
+portion of skb_shared_info.
+
+In order to make use of the page fragment APIs a backing page fragment
+cache is needed.  This provides a central point for the fragment allocation
+and tracks allows multiple calls to make use of a cached page.  The
+advantage to doing this is that multiple calls to get_page can be avoided
+which can be expensive at allocation time.  However due to the nature of
+this caching it is required that any calls to the cache be protected by
+either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
+to be disabled when executing the fragment allocation.
+
+The network stack uses two separate caches per CPU to handle fragment
+allocation.  The netdev_alloc_cache is used by callers making use of the
+__netdev_alloc_frag and __netdev_alloc_skb calls.  The napi_alloc_cache is
+used by callers of the __napi_alloc_frag and __napi_alloc_skb calls.  The
+main difference between these two calls is the context in which they may be
+called.  The "netdev" prefixed functions are usable in any context as these
+functions will disable interrupts, while the "napi" prefixed functions are
+only usable within the softirq context.
+
+Many network device drivers use a similar methodology for allocating page
+fragments, but the page fragments are cached at the ring or descriptor
+level.  In order to enable these cases it is necessary to provide a generic
+way of tearing down a page cache.  For this reason __page_frag_cache_drain
+was implemented.  It allows for freeing multiple references from a single
+page via a single call.  The advantage to doing this is that it allows for
+cleaning up the multiple references that were added to a page in order to
+avoid calling get_page per allocation.
+
+Alexander Duyck, Nov 29, 2016.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 2/3] mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or refilling
the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |    6 +++---
 include/linux/gfp.h                       |    3 +--
 mm/page_alloc.c                           |   13 +++++++------
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 5e66cdeb7ee3..7363503eab80 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,8 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 				     PAGE_SIZE,
 				     DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(buffer_info->page, 0,
-				  buffer_info->pagecnt_bias);
+		__page_frag_cache_drain(buffer_info->page,
+					buffer_info->pagecnt_bias);
 
 		buffer_info->page = NULL;
 	}
@@ -6987,7 +6987,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
 		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 				     PAGE_SIZE, DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
+		__page_frag_cache_drain(page, rx_buffer->pagecnt_bias);
 	}
 
 	/* clear contents of rx_buffer */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6238c74e0a01..884080404e24 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,8 +506,7 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
-extern void __page_frag_drain(struct page *page, unsigned int order,
-			      unsigned int count);
+extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 extern void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask);
 extern void page_frag_free(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4218795a4694..9559f52e740d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3896,8 +3896,8 @@ void free_pages(unsigned long addr, unsigned int order)
  * drivers to provide a backing region of memory for use as either an
  * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
  */
-static struct page *__page_frag_refill(struct page_frag_cache *nc,
-				       gfp_t gfp_mask)
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+					     gfp_t gfp_mask)
 {
 	struct page *page = NULL;
 	gfp_t gfp = gfp_mask;
@@ -3917,19 +3917,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
 	return page;
 }
 
-void __page_frag_drain(struct page *page, unsigned int order,
-		       unsigned int count)
+void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
 
 	if (page_ref_sub_and_test(page, count)) {
+		unsigned int order = compound_order(page);
+
 		if (order == 0)
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
 }
-EXPORT_SYMBOL(__page_frag_drain);
+EXPORT_SYMBOL(__page_frag_cache_drain);
 
 void *page_frag_alloc(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask)
@@ -3940,7 +3941,7 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 
 	if (unlikely(!nc->va)) {
 refill:
-		page = __page_frag_refill(nc, gfp_mask);
+		page = __page_frag_cache_refill(nc, gfp_mask);
 		if (!page)
 			return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 1/3] mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel
In-Reply-To: <20161129182010.13445.31256.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch renames the page frag functions to be more consistent with other
APIs.  Specifically we place the name page_frag first in the name and then
have either an alloc or free call name that we append as the suffix.  This
makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are technically
no longer a backing interface and instead the front end that is called from
the networking APIs.

The last bit I changed is I rebased page_frag_free to actually function
very similar to the function free_pages, the only real difference now is
the fact that we have to get the page order by calling compound page
instead of having it passed as a part of the function call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/gfp.h    |    6 +++---
 include/linux/skbuff.h |    2 +-
 mm/page_alloc.c        |   20 ++++++++++++--------
 net/core/skbuff.c      |    8 ++++----
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..6238c74e0a01 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,9 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 struct page_frag_cache;
 extern void __page_frag_drain(struct page *page, unsigned int order,
 			      unsigned int count);
-extern void *__alloc_page_frag(struct page_frag_cache *nc,
-			       unsigned int fragsz, gfp_t gfp_mask);
-extern void __free_page_frag(void *addr);
+extern void *page_frag_alloc(struct page_frag_cache *nc,
+			     unsigned int fragsz, gfp_t gfp_mask);
+extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c..95799826a1e7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2471,7 +2471,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 
 static inline void skb_free_frag(void *addr)
 {
-	__free_page_frag(addr);
+	page_frag_free(addr);
 }
 
 void *napi_alloc_frag(unsigned int fragsz);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb668eab5ee4..4218795a4694 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3931,8 +3931,8 @@ void __page_frag_drain(struct page *page, unsigned int order,
 }
 EXPORT_SYMBOL(__page_frag_drain);
 
-void *__alloc_page_frag(struct page_frag_cache *nc,
-			unsigned int fragsz, gfp_t gfp_mask)
+void *page_frag_alloc(struct page_frag_cache *nc,
+		      unsigned int fragsz, gfp_t gfp_mask)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
@@ -3983,19 +3983,23 @@ void *__alloc_page_frag(struct page_frag_cache *nc,
 
 	return nc->va + offset;
 }
-EXPORT_SYMBOL(__alloc_page_frag);
+EXPORT_SYMBOL(page_frag_alloc);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
  */
-void __free_page_frag(void *addr)
+void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
+	struct page *page;
+
+	if (addr != 0) {
+		VM_BUG_ON(!virt_addr_valid(addr));
+		page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+		__free_pages(page, compound_order(page));
+	}
 }
-EXPORT_SYMBOL(__free_page_frag);
+EXPORT_SYMBOL(page_frag_free);
 
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4c96cb18c214..6cf779a9ad4c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -369,7 +369,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 
 	local_irq_save(flags);
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, fragsz, gfp_mask);
+	data = page_frag_alloc(nc, fragsz, gfp_mask);
 	local_irq_restore(flags);
 	return data;
 }
@@ -391,7 +391,7 @@ static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
+	return page_frag_alloc(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -441,7 +441,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	local_irq_save(flags);
 
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = page_frag_alloc(nc, len, gfp_mask);
 	pfmemalloc = nc->pfmemalloc;
 
 	local_irq_restore(flags);
@@ -505,7 +505,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(&nc->page, len, gfp_mask);
+	data = page_frag_alloc(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [mm PATCH 0/3] Page fragment updates
From: Alexander Duyck @ 2016-11-29 18:23 UTC (permalink / raw)
  To: linux-mm, akpm; +Cc: netdev, edumazet, davem, jeffrey.t.kirsher, linux-kernel

This patch series takes care of a few cleanups for the page fragments API.

First we do some renames so that things are much more consistent.  First we
move the page_frag_ portion of the name to the front of the functions
names.  Secondly we split out the cache specific functions from the other
page fragment functions by adding the word "cache" to the name.

Second I did some minor clean-up on the function calls so that they are
more inline with the existing __free_pages calls in terms of how they
operate.

Finally I added a bit of documentation that will hopefully help to explain
some of this.  I plan to revisit this later as we get things more ironed
out in the near future with the changes planned for the DMA setup to
support eXpress Data Path.

---

Alexander Duyck (3):
      mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
      mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
      mm: Add documentation for page fragment APIs


 Documentation/vm/page_frags               |   42 +++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igb/igb_main.c |    6 ++--
 include/linux/gfp.h                       |    9 +++---
 include/linux/skbuff.h                    |    2 +
 mm/page_alloc.c                           |   33 +++++++++++++----------
 net/core/skbuff.c                         |    8 +++---
 6 files changed, 73 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/vm/page_frags

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Andrew Lunn @ 2016-11-29 18:20 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Uwe Kleine-König, Rob Herring, Frank Rowand,
	Andreas Färber, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Michal Hrusecki, Tomas Hlavacek, Bed??icha Ko??atu,
	Florian Fainelli, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <87oa0yb29b.fsf-BOtmAI95c/+pXNIQCVAXCG0Lkn3mC4nZ0tOlhedn3YvkypF1WZHjJXhe7Zk3YmMvjmZSf7Nhrd8@public.gmane.org>

On Tue, Nov 29, 2016 at 12:54:24PM -0500, Vivien Didelot wrote:
> Hi,
> 
> Uwe Kleine-König <uwe-rXY34ruvC2xidJT2blvkqNi2O/JbrIOy@public.gmane.org> writes:
> 
> > Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
> > hardware is really a "marvell,mv88e6176".
> 
> I agree. It might be complex for a user to dig into the driver in order
> to figure out how the switch ID is read and which compatible to choose.
> 
> I've sent a patch to change this https://lkml.org/lkml/2016/6/8/1198 but
> Andrew had a stronger opinion on compatible strings, which makes sense.
> 
> >> Linus has said he does not like ARM devices because of all the busses
> >> which are not enumerable. Here we have a device which with a little
> >> bit of help we can enumerate. So we should. 
> >
> > If you write
> >
> > 	compatible = "marvell,mv88e6176", "marvell,mv88e6085";
> >
> > you can still enumerate in the same way as before.
> 
> So we don't break the existing DTS files, I like this.
> 
> The driver already prints info about the detected switch. Instead of
> failing at probe, which seems against the notion of compatible and
> breaks the existing behavior, it could report the eventual mismatch?

I'm still against it.

Say we let the driver probe and warn when the compatible string is
wrong. Is this likely to get fixed? Probably not, it is just a
warning, people ignore them.

A few years later, we have accumulated a number of broken device
trees. And suddenly we really do have a Marvell device with a broken
ID in its port register, or more likely, the revision number was not
incremented but there was incompatible register changes. We suddenly
do have to look at the compatible string. But we know many are actually
wrong. How do we know which to trust? We basically have to say, if the
compatible is "marvell,mv88e6042" we trust it, all the others we don't
trust and ignore.

Isn't that madness?

What we have today is verified correct. If this hypothetical
"marvell,mv88e6042" does happen, then no problems, we add a compatible
string for it, and it works.

We should probably add a comment to the mv88e6xxx_of_match array,
explaining how it currently works.

      Andrew
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 2/2] tcp: allow to turn tcp timestamp randomization off
From: Eric Dumazet @ 2016-11-29 18:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480434342-12367-2-git-send-email-fw@strlen.de>

On Tue, 2016-11-29 at 16:45 +0100, Florian Westphal wrote:
> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduct the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
> 
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
> 
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Excellent, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net-next 1/2] tcp: randomize tcp timestamp offsets for each connection
From: Eric Dumazet @ 2016-11-29 18:06 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Mirja Kühlewind
In-Reply-To: <1480434342-12367-1-git-send-email-fw@strlen.de>

On Tue, 2016-11-29 at 16:45 +0100, Florian Westphal wrote:
> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
> 
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
> 
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>    in addition to the ISN.
> 
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
> 
> Includes fixes from Eric Dumazet.
> 
> Cc: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Very nice work, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net-next v3 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli
In-Reply-To: <20161129175718.20213-1-f.fainelli@gmail.com>

Broadcom BCM7xxx internal PHYs can leverage the Broadcom PHY library
module PHY error counters helper functions, just implement the
appropriate PHY driver function calls to do so. We need to allocate some
storage space for our PHY statistics, and provide it to the Broadcom PHY
library, so do this in a specific probe function, and slightly wrap the
get_stats function call.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/bcm7xxx.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 5b3be4c67be8..aae00bde5980 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -45,6 +45,10 @@
 #define AFE_VDAC_OTHERS_0		MISC_ADDR(0x39, 3)
 #define AFE_HPF_TRIM_OTHERS		MISC_ADDR(0x3a, 0)
 
+struct bcm7xxx_phy_priv {
+	u64	*stats;
+};
+
 static void r_rc_cal_reset(struct phy_device *phydev)
 {
 	/* Reset R_CAL/RC_CAL Engine */
@@ -350,6 +354,33 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	return genphy_restart_aneg(phydev);
 }
 
+static void bcm7xxx_28nm_get_phy_stats(struct phy_device *phydev,
+				       struct ethtool_stats *stats, u64 *data)
+{
+	struct bcm7xxx_phy_priv *priv = phydev->priv;
+
+	bcm_phy_get_stats(phydev, priv->stats, stats, data);
+}
+
+static int bcm7xxx_28nm_probe(struct phy_device *phydev)
+{
+	struct bcm7xxx_phy_priv *priv;
+
+	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	phydev->priv = priv;
+
+	priv->stats = devm_kcalloc(&phydev->mdio.dev,
+				   bcm_phy_get_sset_count(phydev), sizeof(u64),
+				   GFP_KERNEL);
+	if (!priv->stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
 #define BCM7XXX_28NM_GPHY(_oui, _name)					\
 {									\
 	.phy_id		= (_oui),					\
@@ -364,6 +395,10 @@ static int bcm7xxx_28nm_set_tunable(struct phy_device *phydev,
 	.resume		= bcm7xxx_28nm_resume,				\
 	.get_tunable	= bcm7xxx_28nm_get_tunable,			\
 	.set_tunable	= bcm7xxx_28nm_set_tunable,			\
+	.get_sset_count	= bcm_phy_get_sset_count,			\
+	.get_strings	= bcm_phy_get_strings,				\
+	.get_stats	= bcm7xxx_28nm_get_phy_stats,			\
+	.probe		= bcm7xxx_28nm_probe,				\
 }
 
 #define BCM7XXX_40NM_EPHY(_oui, _name)					\
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 1/2] net: phy: broadcom: Add support code for reading PHY counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli
In-Reply-To: <20161129175718.20213-1-f.fainelli@gmail.com>

Broadcom PHYs expose a number of PHY error counters: receive errors,
false carrier sense, SerDes BER count, local and remote receive errors.
Add support code to allow retrieving these error counters. Since the
Broadcom PHY library code is used by several drivers, make it possible
for them to specify the storage for the software copy of the statistics.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/bcm-phy-lib.c | 70 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/phy/bcm-phy-lib.h |  5 ++++
 include/linux/brcmphy.h       |  3 ++
 3 files changed, 78 insertions(+)

diff --git a/drivers/net/phy/bcm-phy-lib.c b/drivers/net/phy/bcm-phy-lib.c
index 3156ce6d5861..ab9ad689617c 100644
--- a/drivers/net/phy/bcm-phy-lib.c
+++ b/drivers/net/phy/bcm-phy-lib.c
@@ -17,6 +17,7 @@
 #include <linux/mdio.h>
 #include <linux/module.h>
 #include <linux/phy.h>
+#include <linux/ethtool.h>
 
 #define MII_BCM_CHANNEL_WIDTH     0x2000
 #define BCM_CL45VEN_EEE_ADV       0x3c
@@ -317,6 +318,75 @@ int bcm_phy_downshift_set(struct phy_device *phydev, u8 count)
 }
 EXPORT_SYMBOL_GPL(bcm_phy_downshift_set);
 
+struct bcm_phy_hw_stat {
+	const char *string;
+	u8 reg;
+	u8 shift;
+	u8 bits;
+};
+
+/* Counters freeze at either 0xffff or 0xff, better than nothing */
+static const struct bcm_phy_hw_stat bcm_phy_hw_stats[] = {
+	{ "phy_receive_errors", MII_BRCM_CORE_BASE12, 0, 16 },
+	{ "phy_serdes_ber_errors", MII_BRCM_CORE_BASE13, 8, 8 },
+	{ "phy_false_carrier_sense_errors", MII_BRCM_CORE_BASE13, 0, 8 },
+	{ "phy_local_rcvr_nok", MII_BRCM_CORE_BASE14, 8, 8 },
+	{ "phy_remote_rcv_nok", MII_BRCM_CORE_BASE14, 0, 8 },
+};
+
+int bcm_phy_get_sset_count(struct phy_device *phydev)
+{
+	return ARRAY_SIZE(bcm_phy_hw_stats);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_sset_count);
+
+void bcm_phy_get_strings(struct phy_device *phydev, u8 *data)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
+		memcpy(data + i * ETH_GSTRING_LEN,
+		       bcm_phy_hw_stats[i].string, ETH_GSTRING_LEN);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_strings);
+
+#ifndef UINT64_MAX
+#define UINT64_MAX              (u64)(~((u64)0))
+#endif
+
+/* Caller is supposed to provide appropriate storage for the library code to
+ * access the shadow copy
+ */
+static u64 bcm_phy_get_stat(struct phy_device *phydev, u64 *shadow,
+			    unsigned int i)
+{
+	struct bcm_phy_hw_stat stat = bcm_phy_hw_stats[i];
+	int val;
+	u64 ret;
+
+	val = phy_read(phydev, stat.reg);
+	if (val < 0) {
+		ret = UINT64_MAX;
+	} else {
+		val >>= stat.shift;
+		val = val & ((1 << stat.bits) - 1);
+		shadow[i] += val;
+		ret = shadow[i];
+	}
+
+	return ret;
+}
+
+void bcm_phy_get_stats(struct phy_device *phydev, u64 *shadow,
+		       struct ethtool_stats *stats, u64 *data)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(bcm_phy_hw_stats); i++)
+		data[i] = bcm_phy_get_stat(phydev, shadow, i);
+}
+EXPORT_SYMBOL_GPL(bcm_phy_get_stats);
+
 MODULE_DESCRIPTION("Broadcom PHY Library");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Broadcom Corporation");
diff --git a/drivers/net/phy/bcm-phy-lib.h b/drivers/net/phy/bcm-phy-lib.h
index a117f657c6d7..7c73808cbbde 100644
--- a/drivers/net/phy/bcm-phy-lib.h
+++ b/drivers/net/phy/bcm-phy-lib.h
@@ -42,4 +42,9 @@ int bcm_phy_downshift_get(struct phy_device *phydev, u8 *count);
 
 int bcm_phy_downshift_set(struct phy_device *phydev, u8 count);
 
+int bcm_phy_get_sset_count(struct phy_device *phydev);
+void bcm_phy_get_strings(struct phy_device *phydev, u8 *data);
+void bcm_phy_get_stats(struct phy_device *phydev, u64 *shadow,
+		       struct ethtool_stats *stats, u64 *data);
+
 #endif /* _LINUX_BCM_PHY_LIB_H */
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index f9f8aaf9c943..4f7d8be9ddbf 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -244,6 +244,9 @@
 #define LPI_FEATURE_EN_DIG1000X		0x4000
 
 /* Core register definitions*/
+#define MII_BRCM_CORE_BASE12	0x12
+#define MII_BRCM_CORE_BASE13	0x13
+#define MII_BRCM_CORE_BASE14	0x14
 #define MII_BRCM_CORE_BASE1E	0x1E
 #define MII_BRCM_CORE_EXPB0	0xB0
 #define MII_BRCM_CORE_EXPB1	0xB1
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v3 0/2] net: phy: broadcom: Support for PHY counters
From: Florian Fainelli @ 2016-11-29 17:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, andrew, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju, Florian Fainelli

Hi all,

This patch series adds support for reading the Broadcom PHYs internal counters.

Changes in v3:

- fixed the allocation of priv->stats in bcm7xxx

Changes in v2:

- fixed modular build reported by kbuild

- constify array of stats

Florian Fainelli (2):
  net: phy: broadcom: Add support code for reading PHY counters
  net: phy: bcm7xxx: Plug in support for reading PHY error counters

 drivers/net/phy/bcm-phy-lib.c | 70 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/phy/bcm-phy-lib.h |  5 ++++
 drivers/net/phy/bcm7xxx.c     | 35 ++++++++++++++++++++++
 include/linux/brcmphy.h       |  3 ++
 4 files changed, 113 insertions(+)

-- 
2.9.3

^ permalink raw reply

* [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: Lorenzo Colitti @ 2016-11-29 17:56 UTC (permalink / raw)
  To: netdev; +Cc: davem, erezsh, Lorenzo Colitti

Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
protocols.") made __build_flow_key call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..6402d74 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -531,13 +531,14 @@ static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
 static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
 			       const struct sock *sk)
 {
+	const struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 	int oif = skb->dev->ifindex;
 	u8 tos = RT_TOS(iph->tos);
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;
 
-	__build_flow_key(sock_net(sk), fl4, sk, iph, oif, tos, prot, mark, 0);
+	__build_flow_key(net, fl4, sk, iph, oif, tos, prot, mark, 0);
 }
 
 static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Vivien Didelot @ 2016-11-29 17:54 UTC (permalink / raw)
  To: Uwe Kleine-König, Andrew Lunn
  Cc: Rob Herring, Frank Rowand, Andreas Färber, netdev,
	linux-arm-kernel, Michal Hrusecki, Tomas Hlavacek,
	Bed??icha Ko??atu, Florian Fainelli, linux-kernel, devicetree
In-Reply-To: <2c59cc79-b6dc-9920-1725-a7785ff3b6bf@kleine-koenig.org>

Hi,

Uwe Kleine-König <uwe@kleine-koenig.org> writes:

> Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
> hardware is really a "marvell,mv88e6176".

I agree. It might be complex for a user to dig into the driver in order
to figure out how the switch ID is read and which compatible to choose.

I've sent a patch to change this https://lkml.org/lkml/2016/6/8/1198 but
Andrew had a stronger opinion on compatible strings, which makes sense.

>> Linus has said he does not like ARM devices because of all the busses
>> which are not enumerable. Here we have a device which with a little
>> bit of help we can enumerate. So we should. 
>
> If you write
>
> 	compatible = "marvell,mv88e6176", "marvell,mv88e6085";
>
> you can still enumerate in the same way as before.

So we don't break the existing DTS files, I like this.

The driver already prints info about the detected switch. Instead of
failing at probe, which seems against the notion of compatible and
breaks the existing behavior, it could report the eventual mismatch?

We have examples for both usage, still I don't know what the best
practices are. My _preference_ would go with enumerating them all.

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Florian Fainelli @ 2016-11-29 17:54 UTC (permalink / raw)
  To: Woojung.Huh, davem, andrew; +Cc: netdev, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D4096E268@CHN-SV-EXMX02.mchp-main.com>

On 11/28/2016 12:03 PM, Woojung.Huh@microchip.com wrote:
> From: Woojung Huh <woojung.huh@microchip.com>
> 
> Add LAN7801 MAC-only configuration support.
> 
> Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
> ---
>  drivers/net/usb/Kconfig   |   5 +++
>  drivers/net/usb/lan78xx.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/net/usb/lan78xx.h |  14 ++++++
>  3 files changed, 128 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
> index cdde590..3dd490f5 100644
> --- a/drivers/net/usb/Kconfig
> +++ b/drivers/net/usb/Kconfig
> @@ -114,6 +114,11 @@ config USB_LAN78XX
>  	help
>  	  This option adds support for Microchip LAN78XX based USB 2
>  	  & USB 3 10/100/1000 Ethernet adapters.
> +	  LAN7800 : USB 3 to 10/100/1000 Ethernet adapter
> +	  LAN7850 : USB 2 to 10/100/1000 Ethernet adapter
> +	  LAN7801 : USB 3 to 10/100/1000 Ethernet adapter (MAC only)
> +
> +	  Proper PHY driver is required for LAN7801.
>  
>  	  To compile this driver as a module, choose M here: the
>  	  module will be called lan78xx.
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 0c459e9..08f2895 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -40,7 +40,7 @@
>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>  #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
>  #define DRIVER_NAME	"lan78xx"
> -#define DRIVER_VERSION	"1.0.5"
> +#define DRIVER_VERSION	"1.0.6"
>  
>  #define TX_TIMEOUT_JIFFIES		(5 * HZ)
>  #define THROTTLE_JIFFIES		(HZ / 8)
> @@ -67,6 +67,7 @@
>  #define LAN78XX_USB_VENDOR_ID		(0x0424)
>  #define LAN7800_USB_PRODUCT_ID		(0x7800)
>  #define LAN7850_USB_PRODUCT_ID		(0x7850)
> +#define LAN7801_USB_PRODUCT_ID		(0x7801)
>  #define LAN78XX_EEPROM_MAGIC		(0x78A5)
>  #define LAN78XX_OTP_MAGIC		(0x78F3)
>  
> @@ -400,6 +401,21 @@ struct lan78xx_net {
>  	struct irq_domain_data	domain_data;
>  };
>  
> +/* define external phy id */
> +#define	PHY_LAN8835			(0x0007C130)
> +#define	PHY_KSZ9031RNX			(0x00221620)
> +
> +/* phyid : masked external phy id
> + * pre_config : if needed, configure MAC and/or external PHY
> + *		such as irq pin mux and RGMII timing.
> + */
> +struct ext_phy_config_table {
> +	int phyid;
> +	void (*pre_config)(struct lan78xx_net *dev,
> +			   struct phy_device *phydev,
> +			   phy_interface_t *interface);
> +};
> +
>  /* use ethtool to change the level for any given device */
>  static int msg_level = -1;
>  module_param(msg_level, int, 0);
> @@ -1697,6 +1713,7 @@ static int lan78xx_mdiobus_read(struct mii_bus *bus, int phy_id, int idx)
>  done:
>  	mutex_unlock(&dev->phy_mutex);
>  	usb_autopm_put_interface(dev->intf);
> +
>  	return ret;
>  }
>  
> @@ -1759,6 +1776,10 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
>  		/* set to internal PHY id */
>  		dev->mdiobus->phy_mask = ~(1 << 1);
>  		break;
> +	case ID_REV_CHIP_ID_7801_:
> +		/* scan thru PHYAD[2..0] */
> +		dev->mdiobus->phy_mask = ~(0xFF);
> +		break;
>  	}
>  
>  	ret = mdiobus_register(dev->mdiobus);
> @@ -1933,11 +1954,58 @@ static void lan78xx_remove_irq_domain(struct lan78xx_net *dev)
>  	dev->domain_data.irqdomain = NULL;
>  }
>  
> +static void lan8835_pre_config(struct lan78xx_net *dev,
> +			       struct phy_device *phydev,
> +			       phy_interface_t *interface)
> +{
> +	int buf;
> +	int ret;
> +
> +	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
> +	buf = phy_read_mmd_indirect(phydev, 32784, 3);
> +	buf &= ~0x1800;
> +	buf |= 0x0800;
> +	phy_write_mmd_indirect(phydev, 32784, 3, buf);

Using decimal numbers for register addresses is a bit unusual.

> +
> +	/* RGMII MAC TXC Delay Enable */
> +	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
> +				MAC_RGMII_ID_TXC_DELAY_EN_);
> +
> +	/* RGMII TX DLL Tune Adjust */
> +	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
> +
> +	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
> +}
> +
> +static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
> +				  struct phy_device *phydev,
> +				  phy_interface_t *interface)
> +{
> +	/* Micrel9301RNX PHY configuration */
> +	/* RGMII Control Signal Pad Skew */
> +	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
> +	/* RGMII RX Data Pad Skew */
> +	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
> +	/* RGMII RX Clock Pad Skew */
> +	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
> +
> +	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
> +}

This should really belong in the respective PHY drivers for these PHYs,
is there a particular reason you decided to do this here?

-- 
Florian

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-29 17:51 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	alexei.starovoitov@gmail.com
In-Reply-To: <20161129171020.6b8b552d@jkicinski-Precision-T1700>

> > You also need to wrap this under rcu_read_lock() (at least I haven't seen
> > it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
> > protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
> > disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
> > doesn't.

> My understanding was that Yuval is always doing full stop()/start() so
> there should be no RX packets in flight while the XDP prog is being
> changed.  But thinking about it again, perhaps is worth adding the
> optimization to forego the full qede_reload() in qede_xdp_set() if there
> is a program already loaded and just do the xchg()+put() (and add RCU
> protection on the fast path)?

Yeps. That the current state of the code.
I'll surely pursue this later, but I don't think all this added complexity
is required for the initial submission.

BTW, one of the problems [or perhaps 'lack of motivation' is a better term]
I had with the program switching scenario was that there was no sample
application that did it.
If it's really an interesting [basic] scenario, perhaps it's worthy to add
a sample user app. that will repeatedly switch the attached eBPF?

^ permalink raw reply

* Re: [PATCH net-next v2 1/1] driver: ipvlan: Use NF_IP_PRI_LAST as hook priority instead of INT_MAX
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-11-29 17:36 UTC (permalink / raw)
  To: Gao Feng; +Cc: David Miller, Eric Dumazet, linux-netdev
In-Reply-To: <CA+6hz4qXYzC-mEfTqa5HOf9GtL1S1Q=rcz_CtFW5mbvUtj7niA@mail.gmail.com>

On Mon, Nov 28, 2016 at 5:14 PM, Gao Feng <fgao@ikuai8.com> wrote:
>
> Hi Mahesh,
>
> On Tue, Nov 29, 2016 at 3:26 AM, Mahesh Bandewar (महेश बंडेवार)
> <maheshb@google.com> wrote:
> > On Sun, Nov 27, 2016 at 3:18 AM,  <fgao@ikuai8.com> wrote:
> >> From: Gao Feng <fgao@ikuai8.com>
> >>
> >> It is better to use NF_IP_PRI_LAST instead of INT_MAX as hook priority.
> >> The former is good at readability and easier to maintain.
> >>
> > This IPvlan hook has to be "absolute" last hook and at this moment
> > NF_IP_PRI_LAST is set as INT_MAX so it's not altering anything.
>
> Yes. It is same now.
> So I prefer to use NF_IP_PRI_LAST than INT_MAX.
> Because the nf_hook_ops belongs to the netfilter module. i think the
> ipvlan codes should follow its rule.
> Since netfilter has defined some specific priority enum value, why
> don't we follow it?
>
Making changes only to IPvlan is problem-prone as I have explained earlier.

> >
> > If for whatever reasons the value of NF_IP_PRI_LAST changes, there
> > could be random IPvlan failure. Since that possibility cannot be
> > denied and there are several places INT_MAX is still used as hook
> > priority, I don't see any gain in having this patch in fact there
> > could be future (possible) downside.
>
> If the netfilter module changed the value of NF_IP_PRI_LAST, it may
> decrease it and add one check for the hook priority.
> As a result, the ipvlan may fail to register because of invalid priority.
>
> When use INT_MAX not NF_IP_PRI_LAST, there is one assumption that the
> hook priority is never changed.
> I think it is not good as two different modules.
>
> Regards
> Feng
>
> >
> >> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> >> ---
> >>  v2: Add the lost header file. It is added in local but not in v1 patch
> >>  v1: Inital patch
> >>
> >>  drivers/net/ipvlan/ipvlan_main.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> >> index ab90b22..01c7446 100644
> >> --- a/drivers/net/ipvlan/ipvlan_main.c
> >> +++ b/drivers/net/ipvlan/ipvlan_main.c
> >> @@ -7,6 +7,7 @@
> >>   *
> >>   */
> >>
> >> +#include "linux/netfilter_ipv4.h"
> >>  #include "ipvlan.h"
> >>
> >>  static u32 ipvl_nf_hook_refcnt = 0;
> >> @@ -16,13 +17,13 @@
> >>                 .hook     = ipvlan_nf_input,
> >>                 .pf       = NFPROTO_IPV4,
> >>                 .hooknum  = NF_INET_LOCAL_IN,
> >> -               .priority = INT_MAX,
> >> +               .priority = NF_IP_PRI_LAST,
> >>         },
> >>         {
> >>                 .hook     = ipvlan_nf_input,
> >>                 .pf       = NFPROTO_IPV6,
> >>                 .hooknum  = NF_INET_LOCAL_IN,
> >> -               .priority = INT_MAX,
> >> +               .priority = NF_IP_PRI_LAST,
> >>         },
> >>  };
> >>
> >> --
> >> 1.9.1
> >>
> >>
>
>

^ permalink raw reply

* [PATCH net-next] bpf: add test for the verifier equal logic bug
From: Josef Bacik @ 2016-11-29 17:35 UTC (permalink / raw)
  To: davem, netdev, ast, jannh, daniel, kernel-team

This is a test to verify that

bpf: fix states equal logic for varlen access

actually fixed the problem.  The problem was if the register we added to our map
register was UNKNOWN in both the false and true branches and the only thing that
changed was the range then we'd incorrectly assume that the true branch was
valid, which it really wasnt.  This tests this case and properly fails without
my fix in place and passes with it in place.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3c4a1fb..5da2e9d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2660,6 +2660,29 @@ static struct bpf_test tests[] = {
 		.result = ACCEPT,
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS
 	},
+	{
+		"invalid map access from else condition",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, MAX_ENTRIES-1, 1),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, offsetof(struct test_val, foo)),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map2 = { 3 },
+		.errstr = "R0 unbounded memory access, make sure to bounds check any array access into a map",
+		.result = REJECT,
+		.errstr_unpriv = "R0 pointer arithmetic prohibited",
+		.result_unpriv = REJECT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.7.4

^ permalink raw reply related

* [PATCH net][v2] bpf: fix states equal logic for varlen access
From: Josef Bacik @ 2016-11-29 17:27 UTC (permalink / raw)
  To: davem, netdev, ast, jannh, daniel, kernel-team

If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
v1->v2:
- renamed and moved varlen_map_access variable.
- dropped the extra () in the second if statement.
- added the Fixes and Reported-by tag.

 kernel/bpf/verifier.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6a93615..8199821 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2454,6 +2454,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 			 struct bpf_verifier_state *old,
 			 struct bpf_verifier_state *cur)
 {
+	bool varlen_map_access = env->varlen_map_value_access;
 	struct bpf_reg_state *rold, *rcur;
 	int i;
 
@@ -2467,12 +2468,17 @@ static bool states_equal(struct bpf_verifier_env *env,
 		/* If the ranges were not the same, but everything else was and
 		 * we didn't do a variable access into a map then we are a-ok.
 		 */
-		if (!env->varlen_map_value_access &&
+		if (!varlen_map_access &&
 		    rold->type == rcur->type && rold->imm == rcur->imm)
 			continue;
 
+		/* If we didn't map access then again we don't care about the
+		 * mismatched range values and it's ok if our old type was
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 */
 		if (rold->type == NOT_INIT ||
-		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
+		    (!varlen_map_access && rold->type == UNKNOWN_VALUE &&
+		     rcur->type != NOT_INIT))
 			continue;
 
 		if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 0/6] Add QLogic FastLinQ iSCSI (qedi) driver.
From: Martin K. Petersen @ 2016-11-29 17:21 UTC (permalink / raw)
  To: Manish Rangankar
  Cc: martin.petersen, lduncan, cleech, linux-scsi, netdev,
	QLogic-Storage-Upstream, Yuval.Mintz
In-Reply-To: <1478588223-16183-1-git-send-email-manish.rangankar@cavium.com>

>>>>> "Manish" == Manish Rangankar <manish.rangankar@cavium.com> writes:

Manish> This series introduces hardware offload iSCSI initiator driver
Manish> for the 41000 Series Converged Network Adapters (579xx chip) by
Manish> Qlogic. The overall driver design includes a common module
Manish> ('qed') and protocol specific dependent modules ('qedi' for
Manish> iSCSI).

The SCSI portions look reasonably sane to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH net] openvswitch: Fix skb leak in IPv6 reassembly.
From: Pravin Shelar @ 2016-11-29 17:21 UTC (permalink / raw)
  To: Daniele Di Proietto
  Cc: Linux Kernel Network Developers, Florian Westphal, Joe Stringer
In-Reply-To: <CAExiqTKTbgOaXLjHMuvne__9hyNLT5kN72fOrEg2CF3VgFg3Cw@mail.gmail.com>

On Tue, Nov 29, 2016 at 12:54 AM, Daniele Di Proietto
<diproiettod@ovn.org> wrote:
>
>
> 2016-11-28 23:39 GMT-08:00 Pravin Shelar <pshelar@ovn.org>:
>>
>> On Mon, Nov 28, 2016 at 3:43 PM, Daniele Di Proietto
>> <diproiettod@ovn.org> wrote:
>> > If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it
>> > means that we still have a reference to the skb.  We should free it
>> > before returning from handle_fragments, as stated in the comment above.
>> >
>> > Fixes: daaa7d647f81 ("netfilter: ipv6: avoid nf_iterate recursion")
>> > CC: Florian Westphal <fw@strlen.de>
>> > CC: Pravin B Shelar <pshelar@ovn.org>
>> > CC: Joe Stringer <joe@ovn.org>
>> > Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org>
>> > ---
>> >  net/openvswitch/conntrack.c | 5 ++++-
>> >  1 file changed, 4 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> > index 31045ef..fecefa2 100644
>> > --- a/net/openvswitch/conntrack.c
>> > +++ b/net/openvswitch/conntrack.c
>> > @@ -370,8 +370,11 @@ static int handle_fragments(struct net *net, struct
>> > sw_flow_key *key,
>> >                 skb_orphan(skb);
>> >                 memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
>> >                 err = nf_ct_frag6_gather(net, skb, user);
>> > -               if (err)
>> > +               if (err) {
>> > +                       if (err != -EINPROGRESS)
>> > +                               kfree_skb(skb);
>> >                         return err;
>> > +               }
>> >
>>
>> This fixes the code. But the patch is adding yet another skb-kfree in
>> conntrack code. we could simplify it by reusing error handling in
>> do_execute_actions().
>> If you think that is too complicated for stable branch, I am fine with
>> this patch going in as it is.
>
>
> I thought it was simpler to handle this here rather than changing the
> interface of
> handle_fragments() and ovs_ct_execute(). Also this is more similar to the v4
> case,
> where ip_defrag() frees the skb in case of error.
>
> What do you think?

I agree this is simple enough for stable branch. I will check if ct
action error handling could be simplified bit.

Acked-by: Pravin B Shelar <pshelar@ovn.org>

^ permalink raw reply

* [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-29 17:15 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun

When multiple nexthops are available for a given route, the routing engine
chooses a nexthop by computing the flow hash through get_hash_from_flowi6
and by taking that value modulo the number of nexthops. The resulting value
indexes the nexthop to select. This method causes issues when a new nexthop
is added or one is removed (e.g. link failure). In that case, the number
of nexthops changes and potentially all the flows get re-routed to another
nexthop.

This patch implements a consistent hash method to select the nexthop in
case of ECMP. The idea is to generate K slices (or intervals) for each
route with multiple nexthops. The nexthops are randomly assigned to those
slices, in a uniform manner. The number K is configurable through a sysctl
net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
nexthop, the algorithm takes the flow hash and computes an index which is
the flow hash modulo K. As K = 2^x, the modulo can be computed using a
simple binary AND operation (idx = hash & (K - 1)). The resulting index
references the selected nexthop. The lookup time complexity is thus O(1).

When a nexthop is added, it steals K/N slices from the other nexthops,
where N is the new number of nexthops. The slices are stolen randomly and
uniformly from the other nexthops. When a nexthop is removed, the orphan
slices are randomly reassigned to the other nexthops.

The number of slices for a route also fixes the maximum number of nexthops
possible for that route.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 include/net/ip6_fib.h    |  25 +++++++
 include/net/netns/ipv6.h |   1 +
 net/ipv6/ip6_fib.c       | 174 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv6/route.c         |  58 ++++++++--------
 4 files changed, 229 insertions(+), 29 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index a74e2aa..29594cc 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -93,6 +93,30 @@ struct rt6key {
 
 struct fib6_table;
 
+/* Multipath nexthop.
+ * @nh is the actual nexthop
+ * @nslices is the number of slices assigned to this nexthop
+ */
+struct rt6_multi_nh {
+	struct rt6_info *nh;
+	unsigned int	nslices;
+};
+
+/* Multipath routes map.
+ * @nhs is an array of available nexthops
+ * @size is the number of elements in @nhs
+ * @nslices is the number of slices and the number of elements in @index
+ *	    and is always in the form 2^x to prevent using a division for
+ *	    the computation of the modulo.
+ * @index is an array mapping a slice index to a nexthop index in @nhs
+ */
+struct rt6_multi_map {
+	struct rt6_multi_nh	*nhs;
+	unsigned int		size;
+	unsigned int		nslices;
+	__u8			*index;
+};
+
 struct rt6_info {
 	struct dst_entry		dst;
 
@@ -113,6 +137,7 @@ struct rt6_info {
 	 */
 	struct list_head		rt6i_siblings;
 	unsigned int			rt6i_nsiblings;
+	struct rt6_multi_map		*rt6i_nh_map;
 
 	atomic_t			rt6i_ref;
 
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index de7745e..4967846 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -36,6 +36,7 @@ struct netns_sysctl_ipv6 {
 	int idgen_retries;
 	int idgen_delay;
 	int flowlabel_state_ranges;
+	int ip6_rt_ecmp_slices;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index ef54852..b6b1895 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -727,6 +727,166 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn,
 	}
 }
 
+static void fib6_mp_free(struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = rt->rt6i_nh_map;
+	struct rt6_info *sibling;
+
+	if (nh_map) {
+		list_for_each_entry(sibling, &rt->rt6i_siblings,
+				    rt6i_siblings) {
+			sibling->rt6i_nh_map = NULL;
+		}
+
+		rt->rt6i_nh_map = NULL;
+
+		kfree(nh_map->nhs);
+		kfree(nh_map->index);
+		kfree(nh_map);
+	}
+}
+
+static int fib6_mp_extend(struct net *net, struct rt6_info *sref,
+			  struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
+	struct rt6_multi_nh *tmp_nhs, *old_nhs;
+	unsigned int kslices;
+	unsigned int i, j;
+
+	if (!nh_map) {
+		__u8 *index;
+
+		nh_map = kmalloc(sizeof(*nh_map), GFP_ATOMIC);
+		if (!nh_map)
+			return -ENOMEM;
+
+		nh_map->size = 1;
+		nh_map->nslices = (1 << net->ipv6.sysctl.ip6_rt_ecmp_slices);
+
+		tmp_nhs = kmalloc(sizeof(*tmp_nhs), GFP_ATOMIC);
+		if (!tmp_nhs) {
+			kfree(nh_map);
+			return -ENOMEM;
+		}
+
+		tmp_nhs[0].nh = sref;
+		tmp_nhs[0].nslices = nh_map->nslices;
+		nh_map->nhs = tmp_nhs;
+
+		index = kmalloc_array(nh_map->nslices, sizeof(*index),
+				      GFP_ATOMIC);
+		if (!index) {
+			kfree(nh_map->nhs);
+			kfree(nh_map);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < nh_map->nslices; i++)
+			index[i] = 0;
+
+		nh_map->index = index;
+		sref->rt6i_nh_map = nh_map;
+	}
+
+	if (nh_map->size == nh_map->nslices)
+		return -ENOBUFS;
+
+	nh_map->size++;
+
+	tmp_nhs = kmalloc_array(nh_map->size, sizeof(*tmp_nhs), GFP_ATOMIC);
+	if (!tmp_nhs)
+		return -ENOMEM;
+
+	for (i = 0; i < nh_map->size - 1; i++)
+		tmp_nhs[i] = nh_map->nhs[i];
+
+	tmp_nhs[nh_map->size - 1].nh = rt;
+	tmp_nhs[nh_map->size - 1].nslices = 0;
+
+	kslices = nh_map->nslices / nh_map->size;
+
+	/* Loop over nexthops and steal a random slice until we have kslices for
+	 * the new nexthop. This algorithm ensures that flows are taken as
+	 * equally as possible from current nexthops.
+	 *
+	 * At each iteration, @j is the index in tmp_nhs of the nexthop
+	 * to steal from and @idx is the index of the slice to steal
+	 * among @j's slices.
+	 */
+	for (i = 0, j = 0; i < kslices; i++) {
+		unsigned int idx, k = 0;
+
+		if (tmp_nhs[j].nslices == 1)
+			continue;
+
+		idx = (prandom_u32() % tmp_nhs[j].nslices) + 1;
+		do {
+			if (nh_map->index[k++] == j)
+				idx--;
+		} while (idx);
+
+		nh_map->index[k - 1] = nh_map->size - 1;
+		tmp_nhs[nh_map->size - 1].nslices++;
+		tmp_nhs[j].nslices--;
+
+		j = (j + 1) % (nh_map->size - 1);
+	}
+
+	WARN_ON(tmp_nhs[nh_map->size - 1].nslices != kslices);
+
+	old_nhs = nh_map->nhs;
+	nh_map->nhs = tmp_nhs;
+	kfree(old_nhs);
+
+	rt->rt6i_nh_map = nh_map;
+
+	return 0;
+}
+
+static int fib6_mp_shrink(struct rt6_info *sref, struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
+	struct rt6_multi_nh *tmp_nhs, *old_nhs;
+	unsigned int i, j, idx = 0;
+
+	tmp_nhs = kmalloc_array(nh_map->size - 1, sizeof(*tmp_nhs), GFP_ATOMIC);
+	if (!tmp_nhs)
+		return -ENOMEM;
+
+	for (i = 0, j = 0; i < nh_map->size; i++) {
+		if (nh_map->nhs[i].nh != rt)
+			tmp_nhs[j++] = nh_map->nhs[i];
+		else
+			idx = i;
+	}
+
+	nh_map->size--;
+	old_nhs = nh_map->nhs;
+	nh_map->nhs = tmp_nhs;
+	kfree(old_nhs);
+
+	rt->rt6i_nh_map = NULL;
+
+	/* Update the slices index. For each slice mapping to the removed
+	 * nexthop, assign another random nexthop. For each slice mapping
+	 * to a nexthop that was after the removed nexthop, decrement the
+	 * nexthop index to reflect the shift. Nexthops that were before
+	 * the removed nexthop are unaffected.
+	 */
+	for (i = 0; i < nh_map->nslices; i++) {
+		if (nh_map->index[i] == idx) {
+			j = prandom_u32() % nh_map->size;
+			nh_map->index[i] = j;
+			nh_map->nhs[j].nslices++;
+		} else if (nh_map->index[i] > idx) {
+			nh_map->index[i]--;
+		}
+	}
+
+	return 0;
+}
+
 /*
  *	Insert routing information in a node.
  */
@@ -837,6 +997,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			}
 			sibling = sibling->dst.rt6_next;
 		}
+		err = fib6_mp_extend(info->nl_net, sibling, rt);
+		if (err < 0)
+			return err;
 		/* For each sibling in the list, increment the counter of
 		 * siblings. BUG() if counters does not match, list of siblings
 		 * is broken!
@@ -900,6 +1063,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			fn->fn_flags |= RTN_RTINFO;
 		}
 		nsiblings = iter->rt6i_nsiblings;
+		if (nsiblings)
+			fib6_mp_free(iter);
 		fib6_purge_rt(iter, fn, info->nl_net);
 		rt6_release(iter);
 
@@ -1407,13 +1572,20 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
 
 	/* Remove this entry from other siblings */
 	if (rt->rt6i_nsiblings) {
-		struct rt6_info *sibling, *next_sibling;
+		struct rt6_info *sibling, *next_sibling, *sref;
+
+		sref = list_first_entry(&rt->rt6i_siblings, struct rt6_info,
+					rt6i_siblings);
 
 		list_for_each_entry_safe(sibling, next_sibling,
 					 &rt->rt6i_siblings, rt6i_siblings)
 			sibling->rt6i_nsiblings--;
 		rt->rt6i_nsiblings = 0;
 		list_del_init(&rt->rt6i_siblings);
+		if (!sref->rt6i_nsiblings)
+			fib6_mp_free(sref);
+		else
+			fib6_mp_shrink(sref, rt);
 	}
 
 	/* Adjust walkers */
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b317bb1..e9f13e0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -427,39 +427,27 @@ static bool rt6_check_expired(const struct rt6_info *rt)
 	return false;
 }
 
-/* Multipath route selection:
- *   Hash based function using packet header and flowlabel.
- * Adapted from fib_info_hashfn()
- */
-static int rt6_info_hash_nhsfn(unsigned int candidate_count,
-			       const struct flowi6 *fl6)
-{
-	return get_hash_from_flowi6(fl6) % candidate_count;
-}
-
 static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
 					     struct flowi6 *fl6, int oif,
 					     int strict)
 {
-	struct rt6_info *sibling, *next_sibling;
-	int route_choosen;
+	struct rt6_multi_map *nh_map = match->rt6i_nh_map;
+	__u32 hash = get_hash_from_flowi6(fl6);
+	unsigned int slice, idx;
+	struct rt6_info *res;
 
-	route_choosen = rt6_info_hash_nhsfn(match->rt6i_nsiblings + 1, fl6);
-	/* Don't change the route, if route_choosen == 0
-	 * (siblings does not include ourself)
-	 */
-	if (route_choosen)
-		list_for_each_entry_safe(sibling, next_sibling,
-				&match->rt6i_siblings, rt6i_siblings) {
-			route_choosen--;
-			if (route_choosen == 0) {
-				if (rt6_score_route(sibling, oif, strict) < 0)
-					break;
-				match = sibling;
-				break;
-			}
-		}
-	return match;
+	WARN_ON(!nh_map);
+	if (!nh_map)
+		return match;
+
+	slice = hash & (nh_map->nslices - 1);
+	idx = nh_map->index[slice];
+	res = nh_map->nhs[idx].nh;
+
+	if (rt6_score_route(res, oif, strict) < 0)
+		res = match;
+
+	return res;
 }
 
 /*
@@ -3550,6 +3538,9 @@ int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
 	return 0;
 }
 
+static int one = 1;
+static int thirtyone = 31;
+
 struct ctl_table ipv6_route_table_template[] = {
 	{
 		.procname	=	"flush",
@@ -3621,6 +3612,15 @@ struct ctl_table ipv6_route_table_template[] = {
 		.mode		=	0644,
 		.proc_handler	=	proc_dointvec_ms_jiffies,
 	},
+	{
+		.procname	=	"ecmp_slices",
+		.data		=	&init_net.ipv6.sysctl.ip6_rt_ecmp_slices,
+		.maxlen		=	sizeof(int),
+		.mode		=	0644,
+		.proc_handler	=	proc_dointvec_minmax,
+		.extra1		=	&one,
+		.extra2		=	&thirtyone,
+	},
 	{ }
 };
 
@@ -3644,6 +3644,7 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 		table[7].data = &net->ipv6.sysctl.ip6_rt_mtu_expires;
 		table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
 		table[9].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
+		table[10].data = &net->ipv6.sysctl.ip6_rt_ecmp_slices;
 
 		/* Don't export sysctls to unprivileged users */
 		if (net->user_ns != &init_user_ns)
@@ -3707,6 +3708,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 	net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
 	net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
 	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
+	net->ipv6.sysctl.ip6_rt_ecmp_slices = 5;
 
 	net->ipv6.ip6_rt_gc_expire = 30*HZ;
 
-- 
2.7.3

^ permalink raw reply related

* [Patch net-next] audit: remove useless synchronize_net()
From: Cong Wang @ 2016-11-29 17:14 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Richard Guy Briggs

netlink kernel socket is protected by refcount, not RCU.
Its rcv path is neither protected by RCU. So the synchronize_net()
is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 kernel/audit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 92c463d..67b9fbd8 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1172,9 +1172,8 @@ static void __net_exit audit_net_exit(struct net *net)
 		audit_sock = NULL;
 	}
 
-	RCU_INIT_POINTER(aunet->nlsk, NULL);
-	synchronize_net();
 	netlink_kernel_release(sock);
+	aunet->nlsk = NULL;
 }
 
 static struct pernet_operations audit_net_ops __net_initdata = {
-- 
2.1.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox