Netdev List

Netdev List
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] page_pool proof-of-concept early code
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
  To: linux-mm, Alexander Duyck
  Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
	Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
	Tariq Toukan

This is an RFC patchset of my *work-in-progress* page_pool implemenation.
This is NOT ready for inclusion.  People asked to see the code, so here we go.

This patchset is focused providing a generic replacement for the
driver page recycle caches. Where mlx5 is the first user in patch-3.

Notice that patch-2 is more "MM-invasive" (modifies put_page) than
patch-4 which is less MM-agressive (scaled back based on input from
Mel Gorman).

I do know that all page-flags are used (for 32bit), thus I'm open to
suggestions/ideas on howto work-around this (need some way to identify
a page belongs to a page pool).


This patchset is the bare-minimum PoC that allows me to benchmarks
these ideas and see if performance is going in the right direction.
It is not safe, e.g. unloading the driver can crash the kernel.

---

Jesper Dangaard Brouer (4):
      doc: page_pool introduction documentation
      page_pool: basic implementation of page_pool
      mlx5: use page_pool
      page_pool: change refcnt model


 Documentation/vm/page_pool/introduction.rst       |   71 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    1 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   28 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   47 ++
 include/linux/mm.h                                |    1 
 include/linux/mm_types.h                          |   11 +
 include/linux/page-flags.h                        |   13 +
 include/linux/page_pool.h                         |  168 +++++++++
 include/linux/skbuff.h                            |    2 
 include/trace/events/mmflags.h                    |    3 
 mm/Makefile                                       |    3 
 mm/page_alloc.c                                   |    6 
 mm/page_pool.c                                    |  402 +++++++++++++++++++++
 mm/slub.c                                         |    4 
 mm/swap.c                                         |    3 
 15 files changed, 741 insertions(+), 22 deletions(-)
 create mode 100644 Documentation/vm/page_pool/introduction.rst
 create mode 100644 include/linux/page_pool.h
 create mode 100644 mm/page_pool.c

--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [RFC PATCH 1/4] doc: page_pool introduction documentation
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
  To: linux-mm, Alexander Duyck
  Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
	Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
	Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>

Copied from:
 https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/introduction.html
 ~/git/prototype-kernel/kernel/Documentation/vm/page_pool/introduction.rst

This will be updated from above links before upstream submit.
Also this need to be "linked" into new kernel doc system.
---
 Documentation/vm/page_pool/introduction.rst |   71 +++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 Documentation/vm/page_pool/introduction.rst

diff --git a/Documentation/vm/page_pool/introduction.rst b/Documentation/vm/page_pool/introduction.rst
new file mode 100644
index 000000000000..db03b02f218c
--- /dev/null
+++ b/Documentation/vm/page_pool/introduction.rst
@@ -0,0 +1,71 @@
+============
+Introduction
+============
+
+The page_pool is a generic API for drivers that have a need for a pool
+of recycling pages used for streaming DMA.
+
+
+Motivation
+==========
+
+The page_pool is primarily motivated by two things (1) performance
+and (2) changing the memory model for drivers.
+
+Drivers have developed performance workarounds when the speed of the
+page allocator and the DMA APIs became too slow for their HW
+needs. The page pool solves them on a general level providing
+performance gains and benefits that local driver recycling hacks
+cannot realize.
+
+A fundamental property is that pages are returned to the page_pool.
+This property allow a certain class of optimizations, which is to move
+setup and tear-down operations out of the fast-path, sometimes known as
+constructor/destruction operations.  DMA map/unmap is one example of
+operations this applies to.  Certain page alloc/free validations can
+also be avoided in the fast-path.  Another example could be
+pre-mapping pages into userspace, and clearing them (memset-zero)
+outside the fast-path.
+
+Memory model
+============
+
+Once drivers are converted to using page_pool API, then it will become
+easier change the underlying memory model backing the driver with
+pages (without changing the driver).
+
+One prime use-case is NIC zero-copy RX into userspace.  As DaveM
+describes in his `Google-plus post`_, the mapping and unmapping
+operations in the address space of the process has a cost that cancels
+out most of the gains of such zero-copy schemes.
+
+This mapping cost can solved the same way as the keeping DMA mapped
+trick.  By keeping the pages VM-mapped to userspace.  This is a layer
+that can be added later to the page_pool.  It will likely be
+beneficial to also consider using huge-pages (as backing) to reduce
+the TLB-stress.
+
+.. _Google-plus post:
+   https://plus.google.com/+DavidMiller/posts/EUDiGoXD6Xv
+
+Advantages
+==========
+
+Advantages of a recycling page pool as bullet points:
+
+1) Faster than going through page-allocator.  Given a specialized
+   allocator require less checks, and can piggyback on drivers
+   resource protection (for alloc-side).
+
+2) DMA IOMMU mapping cost is removed by keeping pages mapped.
+
+3) Makes DMA pages writable by predictable DMA unmap point.
+
+4) OOM protection at device level, as having a feedback-loop knows
+   number of outstanding pages.
+
+5) Flexible memory model allowing zero-copy RX, solving memory early
+   demux (does depend on HW filters into RX queues)
+
+6) Less fragmentation of the page buddy algorithm, when driver
+   maintains a steady-state working-set.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [RFC PATCH 2/4] page_pool: basic implementation of page_pool
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
  To: linux-mm, Alexander Duyck
  Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
	Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
	Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>

The focus in this patch is getting the API around page_pool figured out.

The internal data structures for returning page_pool pages is not optimal.
This implementation use ptr_ring for recycling, which is known not to scale
in case of multiple remote CPUs releasing/returning pages.

A bulking interface into the page allocator is also left for later. (This
requires cooperation will Mel Gorman, who just send me some PoC patches for this).
---
 include/linux/mm.h             |    6 +
 include/linux/mm_types.h       |   11 +
 include/linux/page-flags.h     |   13 +
 include/linux/page_pool.h      |  158 +++++++++++++++
 include/linux/skbuff.h         |    2 
 include/trace/events/mmflags.h |    3 
 mm/Makefile                    |    3 
 mm/page_alloc.c                |   10 +
 mm/page_pool.c                 |  423 ++++++++++++++++++++++++++++++++++++++++
 mm/slub.c                      |    4 
 10 files changed, 627 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/page_pool.h
 create mode 100644 mm/page_pool.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..11b4d8fb280b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include <linux/page_ext.h>
 #include <linux/err.h>
 #include <linux/page_ref.h>
+#include <linux/page_pool.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -765,6 +766,11 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
+	if (PagePool(page)) {
+		page_pool_put_page(page);
+		return;
+	}
+
 	if (put_page_testzero(page))
 		__put_page(page);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08d947fc4c59..c74dea967f99 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -47,6 +47,12 @@ struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
 	union {
+		/* DISCUSS: Considered moving page_pool pointer here,
+		 * but I'm unsure if 'mapping' is needed for userspace
+		 * mapping the page, as this is a use-case the
+		 * page_pool need to support in the future. (Basically
+		 * mapping a NIC RX ring into userspace).
+		 */
 		struct address_space *mapping;	/* If low bit clear, points to
 						 * inode address_space, or NULL.
 						 * If page mapped as anonymous
@@ -63,6 +69,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* sl[aou]b first free object */
+		dma_addr_t dma_addr;    /* used by page_pool */
 		/* page_deferred_list().prev	-- second tail page */
 	};
 
@@ -117,6 +124,8 @@ struct page {
 	 * avoid collision and false-positive PageTail().
 	 */
 	union {
+		/* XXX: Idea reuse lru list, in page_pool to align with PCP */
+
 		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone_lru_lock !
 					 * Can be used as a generic list
@@ -189,6 +198,8 @@ struct page {
 #endif
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
+		/* XXX: Sure page_pool will have no users of "private"? */
+		struct page_pool *pool;
 	};
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..253d7f7cf89f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,7 +91,8 @@ enum pageflags {
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
-	PG_unevictable,		/* Page is "unevictable"  */
+/*20*/	PG_unevictable,		/* Page is "unevictable"  */
+// XXX stable flag?
 #ifdef CONFIG_MMU
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
@@ -101,6 +102,8 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+	/* Question: can we squeeze in here and avoid CONFIG_64BIT hacks?*/
+	PG_pool, // XXX macros called: SetPagePool / PagePool
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
@@ -347,6 +350,12 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+// XXX: Define some macros for page_pool
+// XXX: avoiding atomic set_bit() operation (like slab)
+// XXX: PF_HEAD vs PF_ANY vs PF_NO_TAIL????
+__PAGEFLAG(Pool, pool, PF_ANY)
+
+
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
@@ -700,7 +709,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
- */
+ */ /* XXX add PG_pool here??? */
 #define PAGE_FLAGS_CHECK_AT_FREE \
 	(1UL << PG_lru	 | 1UL << PG_locked    | \
 	 1UL << PG_private | 1UL << PG_private_2 | \
diff --git a/include/linux/page_pool.h b/include/linux/page_pool.h
new file mode 100644
index 000000000000..6f8f2ff6d758
--- /dev/null
+++ b/include/linux/page_pool.h
@@ -0,0 +1,158 @@
+/*
+ * page_pool.h
+ *
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ *
+ *	This program is free software; you can redistribute it and/or modify it
+ *	under the terms of the GNU General Public License as published by the
+ *	Free Software Foundation; either version 2 of the License, or (at your
+ *	option) any later version.
+ *
+ * The page_pool is primarily motivated by two things (1) performance
+ * and (2) changing the memory model for drivers.
+ *
+ * Drivers have developed performance workarounds when the speed of
+ * the page allocator and the DMA APIs became too slow for their HW
+ * needs. The page pool solves them on a general level providing
+ * performance gains and benefits that local driver recycling hacks
+ * cannot realize.
+ *
+ * A fundamental property is that pages are returned to the page_pool.
+ * This property allow a certain class of optimizations, which is to
+ * move setup and tear-down operations out of the fast-path, sometimes
+ * known as constructor/destruction operations.  DMA map/unmap is one
+ * example of operations this applies to.  Certain page alloc/free
+ * validations can also be avoided in the fast-path.  Another example
+ * could be pre-mapping pages into userspace, and clearing them
+ * (memset-zero) outside the fast-path.
+ *
+ * This API is only meant for streaming DMA, which map/unmap frequently.
+ */
+#ifndef _LINUX_PAGE_POOL_H
+#define _LINUX_PAGE_POOL_H
+
+/*
+ * NOTES on page flags (PG_pool)... we might have a problem with
+ * enough page flags on 32 bit systems, example see PG_idle + PG_young
+ * include/linux/page_idle.h and CONFIG_IDLE_PAGE_TRACKING
+ */
+
+#include <linux/ptr_ring.h>
+
+//#include <linux/dma-mapping.h>
+#include <linux/dma-direction.h>
+
+// Not-used-atm #define PP_FLAG_NAPI 0x1
+#define PP_FLAG_ALL	0
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case.  As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count ____cacheline_aligned_in_smp;
+	u32 refill; /* not used atm */
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+/*
+ * Extensible params struct. Focus on currently implemented features,
+ * extend later. Restriction, subsequently added members value of zero
+ * must gives the previous behaviour. Avoids need to update every
+ * driver simultaniously (given likely in difference subsystems).
+ */
+struct page_pool_params {
+	u32		size; /* caller sets size of struct */
+	unsigned int	order;
+	unsigned long	flags;
+	/* Associated with a specific device, for DMA pre-mapping purposes */
+	struct device	*dev;
+	/* Numa node id to allocate from pages from */
+	int 		nid;
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	pool_size;
+	char		end_marker[0]; /* must be last struct member */
+};
+#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * Drivers allocation side usually already perform some kind
+	 * of resource protection.  Piggyback on this protection, and
+	 * require driver to protect allocation side.
+	 *
+	 * For NIC drivers this means, allocate a page_pool per
+	 * RX-queue. As the RX-queue is already protected by
+	 * Softirq/BH scheduling and napi_schedule. NAPI schedule
+	 * guarantee that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization
+	 * wise, because free's can happen on remote CPUs, with no
+	 * association with allocation resource.
+	 *
+	 * For now use ptr_ring, as it separates consumer and
+	 * producer, which is a common use-case. The ptr_ring is not
+	 * though as the final data structure, expecting this to
+	 * change into a more advanced data structure with more
+	 * integration with page_alloc.c and data structs per CPU for
+	 * returning pages in bulk.
+	 *
+	 */
+	struct ptr_ring ring;
+
+	/* TODO: Domain "id" add later, for RX zero-copy validation */
+
+	/* TODO: Need list pointers for keeping page_pool object on a
+	 * cleanup list, given pages can be "outstanding" even after
+	 * e.g. driver is unloaded.
+	 */
+};
+
+struct page* page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN | __GFP_COLD);
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page *page)
+{
+	__page_pool_put_page(page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page *page)
+{
+	__page_pool_put_page(page, true);
+}
+
+#endif /* _LINUX_PAGE_POOL_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac7fa34db8a7..84294278039d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2584,7 +2584,7 @@ static inline void __skb_frag_ref(skb_frag_t *frag)
  * @f: the fragment offset.
  *
  * Takes an additional reference on the @f'th paged fragment of @skb.
- */
+ */ // XXX
 static inline void skb_frag_ref(struct sk_buff *skb, int f)
 {
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab48a2fb..ee15ca659ea1 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -99,7 +99,8 @@
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},		\
 	{1UL << PG_reclaim,		"reclaim"	},		\
 	{1UL << PG_swapbacked,		"swapbacked"	},		\
-	{1UL << PG_unevictable,		"unevictable"	}		\
+	{1UL << PG_unevictable,		"unevictable"	},		\
+	{1UL << PG_pool,		"pool"		}		\
 IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..dbe5a7181e28 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,6 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+
+# Hack enable for compile testing
+obj-y += page_pool.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..655db05f0c1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3873,6 +3873,11 @@ EXPORT_SYMBOL(get_zeroed_page);
 
 void __free_pages(struct page *page, unsigned int order)
 {
+	if (PagePool(page)) {
+		page_pool_put_page(page);
+		return;
+	}
+
 	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_cold_page(page, false);
@@ -4000,6 +4005,11 @@ void __free_page_frag(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
+	if (PagePool(page)) {
+		page_pool_put_page(page);
+		return;
+	}
+
 	if (unlikely(put_page_testzero(page)))
 		__free_pages_ok(page, compound_order(page));
 }
diff --git a/mm/page_pool.c b/mm/page_pool.c
new file mode 100644
index 000000000000..74138d5fe86d
--- /dev/null
+++ b/mm/page_pool.c
@@ -0,0 +1,423 @@
+/*
+ * page_pool.c
+ */
+
+/* Using the page pool from a driver, involves
+ *
+ * 1. Creating/allocating a page_pool per RX ring for the NIC
+ * 2. Using pages from page_pool to populate RX ring
+ * 3. Page pool will call dma_map/unmap
+ * 4. Driver is responsible for dma_sync part
+ * 5. On page put/free the page is returned to the page_pool
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <linux/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+/*
+ * The struct page_pool (likely) cannot be embedded into another
+ * structure, because freeing this struct depend on outstanding pages,
+ * which can point back to the page_pool. Thus, don't export "init".
+ */
+int page_pool_init(struct page_pool *pool,
+		   const struct page_pool_params *params)
+{
+	int ring_qsize = 1024; /* Default */
+	int param_copy_sz;
+
+	if (!pool)
+		return -EFAULT;
+
+	/* Allow kernel devel trees and driver to progress at different rates */
+	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+	memset(&pool->p, 0, param_copy_sz);
+	if (params->size < param_copy_sz) {
+		/*
+		 * Older module calling newer kernel, handled by only
+		 * copying supplied size, and keep remaining params zero
+		 */
+		param_copy_sz = params->size;
+	} else if (params->size > param_copy_sz) {
+		/*
+		 * Newer module calling older kernel. Need to validate
+		 * no new features were requested.
+		 */
+		unsigned char *addr = (unsigned char*)params + param_copy_sz;
+		unsigned char *end  = (unsigned char*)params + params->size;
+
+		for (; addr < end; addr++) {
+			if (*addr != 0)
+				return -E2BIG;
+		}
+	}
+	memcpy(&pool->p, params, param_copy_sz);
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	/* ptr_ring is not meant as final struct, see page_pool.h */
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) {
+		return -ENOMEM;
+	}
+
+	/*
+	 * DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	if (params->size < offsetof(struct page_pool_params, nid)) {
+		WARN(1, "Fix page_pool_params->size code\n");
+		return NULL;
+	}
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* FIXME: use another test for safe-context, caller should
+	 * simply provide this guarantee
+	 */
+	if (likely(in_serving_softirq())) { // FIXME add use of PP_FLAG_NAPI
+		struct ptr_ring *r;
+
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill */
+		r = &pool->ring;
+		/* Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: ATM the ring consumer lock is not
+		 * really needed due to the softirq/NAPI protection,
+		 * but later MM-layer need the ability to reclaim
+		 * pages on the ring. Thus, keeping the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages(struct page_pool *pool,
+					    gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+	/*
+	 *  Discuss GFP flags: e.g
+	 *   __GFP_NOWARN + __GFP_NORETRY + __GFP_NOMEMALLOC
+	 */
+
+	/*
+	 * FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * need bulk allocation support from the page allocator code.
+	 *
+	 * For now, page pool recycle cache is not refilled.  Hint:
+	 * when pages are returned, they will go into the recycle
+	 * cache.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	/* FIXME: Add accounting of pages.
+	 *
+	 * TODO: Look into memcg_charge_slab/memcg_uncharge_slab
+	 *
+	 * What if page comes from pfmemalloc reserves?
+	 * Should we abort to help memory pressure? (test err code path!)
+	 * Code see SetPageSlabPfmemalloc(), __ClearPageSlabPfmemalloc()
+	 * and page_is_pfmemalloc(page)
+	 */
+
+	/* Setup DMA mapping:
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	page->dma_addr = dma;
+
+	/* IDEA: When page just alloc'ed is should/must have refcnt 1.
+	 * Should we do refcnt inc tricks to keep page mapped/owned by
+	 * page_pool infrastructure? (like page_frag code)
+	 */
+
+	/* TODO: Init fields in struct page. See slub code allocate_slab()
+	 *
+	 */
+	page->pool = pool;   /* Save pool the page MUST be returned to */
+	__SetPagePool(page); /* Mark page with flag */
+
+	return page;
+}
+
+
+/* For using page_pool replace: alloc_pages() API calls, but provide
+ * synchronization guarantee for allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+// Ideas taken from __free_slab()
+static void __page_pool_clean_page(struct page *page)
+{
+	struct page_pool *pool;
+
+	VM_BUG_ON_PAGE(!PagePool(page), page);
+
+	// mod_zone_page_state() ???
+
+	pool = page->pool;
+	__ClearPagePool(page);
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page->dma_addr,
+		       PAGE_SIZE << pool->p.order,
+                       pool->p.dma_dir);
+	page->dma_addr = 0;
+        /* Q: Use DMA macros???
+	 *
+	 * dma_unmap_page(pool->p.dev, dma_unmap_addr(page,dma_addr),
+	 *	       PAGE_SIZE << pool->p.order,
+	 *	       pool->p.dma_dir);
+	 * dma_unmap_addr_set(page, dma_addr, 0);
+	 */
+
+	/* FUTURE: Use Alex Duyck's DMA_ATTR_SKIP_CPU_SYNC changes
+	 *
+	 * dma_unmap_page_attrs(pool->p.dev, page->dma_addr,
+	 *		     PAGE_SIZE << pool->p.order,
+	 *		     pool->p.dma_dir,
+	 *		     DMA_ATTR_SKIP_CPU_SYNC);
+	 */
+
+	// page_mapcount_reset(page); // ??
+	// page->mapping = NULL;      // ??
+
+	// Not really needed, but good for provoking bugs
+	page->pool = (void *)0xDEADBEE0;
+
+	/* FIXME: Add accounting of pages here!
+	 *
+	 * Look into: memcg_uncharge_page_pool(page, order, pool);
+	 */
+
+	// FIXME: do we need this??? likely not as slub does not...
+//	if (unlikely(is_zone_device_page(page)))
+//		put_zone_device_page(page);
+
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page *page)
+{
+	struct page_pool *pool = page->pool;
+
+	__page_pool_clean_page(page);
+	/*
+	 * Given page pool state and flags were just cleared, the page
+	 * must be freed here.  Thus, code invariant assumes
+	 * refcnt==1, as __free_pages() call put_page_testzero().
+	 */
+	__free_pages(page, pool->p.order);
+}
+
+bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* TODO: Use smarter data structure for recycle cache.  Using
+	 * ptr_ring will not scale when multiple remote CPUs want to
+	 * recycle pages.
+	 */
+
+	/* Need BH protection when free occurs from userspace e.g
+	 * __kfree_skb() called via {tcp,inet,sock}_recvmsg
+	 *
+	 * Problematic for several reasons: (1) it is more costly,
+	 * (2) the BH unlock can cause (re)sched of softirq.
+	 *
+	 * BH protection not needed if current is serving softirq
+	 */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0) ? true : false;
+}
+
+/*
+ * Only allow direct recycling in very special circumstances, into the
+ * alloc cache.  E.g. XDP_DROP use-case.
+ *
+ * Caller must provide appropiate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	// BUG_ON(!in_serving_softirq());
+
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page *page, bool allow_direct)
+{
+	struct page_pool *pool = page->pool;
+
+	/* This is a fast-path optimization, that avoids an atomic
+	 * operation, in the case where a single object is (refcnt)
+	 * using the page.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier implicit paired with full MB of atomic ops */
+		smp_rmb();
+
+		if (allow_direct)
+			if (__page_pool_recycle_direct(page, pool))
+			    return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, do real __free_pages() */
+			__page_pool_return_page(page);
+		}
+		return;
+	}
+	/*
+	 * Many drivers splitting up the page into fragments, and some
+	 * want to keep doing this to save memory. The put_page_testzero()
+	 * function as a refcnt decrement, and should not return true.
+	 */
+	if (unlikely(put_page_testzero(page))) {
+		/*
+		 * Reaching refcnt zero should not be possible,
+		 * indicate code error.  Don't crash but warn, handle
+		 * case by not-recycling, but return page to page
+		 * allocator.
+		 */
+		WARN(1, "%s() violating page_pool invariance refcnt:%d\n",
+		     __func__, page_ref_count(page));
+		/* Cleanup state before directly returning page */
+		__page_pool_clean_page(page);
+		__put_page(page);
+	}
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+static void __destructor_put_page(void *ptr)
+{
+	struct page *page = ptr;
+
+	/* Verify the refcnt invariant of cached pages */
+	if (!(page_ref_count(page) == 1)) {
+		pr_crit("%s() page_pool refcnt %d violation\n",
+			__func__, page_ref_count(page));
+		BUG();
+	}
+	__page_pool_return_page(page);
+}
+
+/* Cleanup and release resources */
+void page_pool_destroy(struct page_pool *pool)
+{
+	/* Empty recycle ring */
+	ptr_ring_cleanup(&pool->ring, __destructor_put_page);
+
+	/* FIXME-mem-leak: cleanup array/stack cache
+	 * pool->alloc. Driver usually will destroy RX ring after
+	 * making sure nobody can alloc from it, thus it should be
+	 * safe to just empty cache here
+	 */
+
+	/* FIXME: before releasing the page_pool memory, we MUST make
+	 * sure no pages points back this page_pool.
+	 */
+	kfree(pool);
+}
+EXPORT_SYMBOL(page_pool_destroy);
diff --git a/mm/slub.c b/mm/slub.c
index 067598a00849..7de478c20464 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1572,8 +1572,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	page->objects = oo_objects(oo);
 
 	order = compound_order(page);
-	page->slab_cache = s;
-	__SetPageSlab(page);
+	page->slab_cache = s; // Example: Saving kmem_cache in struct page
+	__SetPageSlab(page); // Example: Setting flag
 	if (page_is_pfmemalloc(page))
 		SetPageSlabPfmemalloc(page);
 

^ permalink raw reply related

* [RFC PATCH 3/4] mlx5: use page_pool
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
  To: linux-mm, Alexander Duyck
  Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
	Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
	Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>

The mlx5 driver already have a driver local page recycle cache.  This
page cache is only efficient when the number of outstanding pages is
small, the queue based cache array size is 128.  Further more a single
page with elevated refcnt can block the queue.

Benchmarking on next-next at commit f5f99309fa74 ("sock: do not set
sk_err in sock_dequeue_err_skb"), which include Paolo's UDP
performance optimizations (commit fc13fd398625 ("Merge branch
'udp-fwd-mem-sched-on-dequeue'").  Showed a speedup of 29% for UDP
packets. Detailed ethtool stats showed mlx5 page recycler didn't
"work" in that benchmark.  The XDP_DROP use-case, showed a small perf
regression +2.7ns using page_pool.  This correspons well to the 28%
gain reported in commit 1bfecfca565c ("net/mlx5e: Build RX SKB on
demand").

UPDATE: On newer kernels, net-next at commit 52f40e9d65. The mlx5 page
recycle cache works again, and performance gain is gone.  Detailed
benchmarking show, RX-ksoftirq side is approx 10% faster, while UDP
socket delivery is same performance.

For TC early ingress drop there is a small performance regression of
approx +4 ns. There are pending page_pool optimization that will
close that gap.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    1 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   28 +++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   47 ++++++++++++++-------
 3 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 951dbd58594d..b30d5b08d6a6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -361,6 +361,7 @@ struct mlx5e_rq {
 	struct mlx5e_tstamp   *tstamp;
 	struct mlx5e_rq_stats  stats;
 	struct mlx5e_cq        cq;
+	struct page_pool      *page_pool;
 	struct mlx5e_page_cache page_cache;
 
 	mlx5e_fp_handle_rx_cqe handle_rx_cqe;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cbfa38fc72c0..cd71e5764ec1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,6 +34,7 @@
 #include <net/pkt_cls.h>
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
+#include <linux/page_pool.h>
 #include <linux/bpf.h>
 #include "en.h"
 #include "en_tc.h"
@@ -521,6 +522,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5e_priv *priv = c->priv;
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *rqc = param->rqc;
@@ -591,6 +593,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info),
 					    GFP_KERNEL, cpu_to_node(c->cpu));
+//		rq->dma_info = NULL; //HACK ALWAYS FAIL TEST
 		if (!rq->dma_info) {
 			err = -ENOMEM;
 			goto err_rq_wq_destroy;
@@ -618,6 +621,24 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
 		rq->buff.page_order = order_base_2(npages);
 
+		pp_params.size		= PAGE_POOL_PARAMS_SIZE;
+		pp_params.order		= rq->buff.page_order;
+		pp_params.dev		= c->pdev;
+		pp_params.nid		= cpu_to_node(c->cpu);
+		pp_params.dma_dir	= DMA_BIDIRECTIONAL;
+		pp_params.pool_size	= 2000;
+		pr_info("XXX: %s() pp_params.size=%d end=%lu\n",
+			__func__, pp_params.size,
+			offsetof(struct page_pool_params, end_marker));
+
+		rq->page_pool = page_pool_create(&pp_params);
+		if (IS_ERR_OR_NULL(rq->page_pool)) {
+			rq->page_pool = NULL;
+			kfree(rq->dma_info);
+			err = -ENOMEM;
+			goto err_rq_wq_destroy;
+		}
+
 		byte_count |= MLX5_HW_START_PADDING;
 		rq->mkey_be = c->mkey_be;
 	}
@@ -662,6 +683,13 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		kfree(rq->dma_info);
+		if (rq->page_pool)
+			page_pool_destroy(rq->page_pool);
+		else
+			// Can happen because mlx5 have some extra
+			// rq's for some other purposes... (explain?)
+			pr_err("XXX: %s() NULL pointer at rq->page_pool\n",
+			       __func__);
 	}
 
 	for (i = rq->page_cache.head; i != rq->page_cache.tail;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e2fb3ed1790..0512632b30fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -182,6 +182,7 @@ void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
 
 #define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
 
+// TODO: Remove mlx5-page-cache
 static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
 				      struct mlx5e_dma_info *dma_info)
 {
@@ -198,6 +199,7 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
 	return true;
 }
 
+// TODO: Remove mlx5-page-cache
 static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
 				      struct mlx5e_dma_info *dma_info)
 {
@@ -228,20 +230,27 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 {
 	struct page *page;
 
-	if (mlx5e_rx_cache_get(rq, dma_info))
-		return 0;
+//	if (mlx5e_rx_cache_get(rq, dma_info))
+//		return 0;
 
-	page = dev_alloc_pages(rq->buff.page_order);
+	//page = dev_alloc_pages(rq->buff.page_order);
+	page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!page))
 		return -ENOMEM;
 
 	dma_info->page = page;
-	dma_info->addr = dma_map_page(rq->pdev, page, 0,
-				      RQ_PAGE_SIZE(rq), rq->buff.map_dir);
-	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
-		put_page(page);
-		return -ENOMEM;
-	}
+	dma_info->addr = page->dma_addr;
+//	dma_info->addr = dma_map_page(rq->pdev, page, 0,
+//				      RQ_PAGE_SIZE(rq), rq->buff.map_dir);
+
+	/* DISCUSS: should this be moved into page_pool API?  Here we
+	 * sync entire page, but some drivers might want have more
+	 * control?  Like using the dma_sync_single_range_for_device()
+	 * like Alex is doing in the Intel drivers...
+	 */
+	dma_sync_single_for_device(rq->pdev, dma_info->addr,
+				   RQ_PAGE_SIZE(rq),
+				   DMA_FROM_DEVICE);
 
 	return 0;
 }
@@ -249,11 +258,21 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+//	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+//		return;
+	// TODO: use page_pool_recycle_direct(dma_info->page);
+	if (recycle) {
+		page_pool_recycle_direct(dma_info->page);
 		return;
+	}
+
+// page_pool take over dma_unmap
+//	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+//		       rq->buff.map_dir);
+	// XXX: do we need to call dma_sync_single_range_for_cpu here???
+	// dma_sync_single_range_for_cpu(rq->pdev, dma_info->addr,
+	//			      RQ_PAGE_SIZE(rq), rq->buff.map_dir);
 
-	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-		       rq->buff.map_dir);
 	put_page(dma_info->page);
 }
 
@@ -773,10 +792,6 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 		return NULL;
 	}
 
-	/* queue up for recycling ..*/
-	page_ref_inc(di->page);
-	mlx5e_page_release(rq, di, true);
-
 	skb_reserve(skb, MLX5_RX_HEADROOM);
 	skb_put(skb, cqe_bcnt);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [RFC PATCH 4/4] page_pool: change refcnt model
From: Jesper Dangaard Brouer @ 2016-12-20 13:28 UTC (permalink / raw)
  To: linux-mm, Alexander Duyck
  Cc: willemdebruijn.kernel, netdev, john.fastabend, Saeed Mahameed,
	Jesper Dangaard Brouer, bjorn.topel, Alexei Starovoitov,
	Tariq Toukan
In-Reply-To: <20161220132444.18788.50875.stgit@firesoul>

This is the direction the patch is going, after Mel's comments.

Most significantly: Change that refcnt must reach zero (and not 1)
before the page gets into the recycle ring.  Pages on the
pp_alloc_cache have refcnt==1 invariance, as this allow fast direct
recycling (allowed by XDP_DROP).

When mlx5 page recycle cache didn't work (at next-next at commit
f5f99309fa74) the benchmarks showed the gain was reduced to 14% by
this patch, or an added cost of approx 133 cycle (which were a higher
cycle cost than expected).

UPDATE: net-next at commit 52f40e9d65 this patch show no gain, perhaps
a small performance regression.  The accuracy of the UDP measurements
are not good enough to conclude on, ksoftirq +1.4% and UDP side -0.89%.
The TC ingress drop test is more significant and show 4.3% slower.

Thus, this patch makes page_pool slower than the driver specific page
recycle cache.  More optimizations are pending for the page_pool, thus
this can likely be regained.

The page_pool will still show benefit for use-case where the driver
page recycle cache doesn't work (>128 outstanding packets/pages).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/mm.h        |    5 --
 include/linux/page_pool.h |   10 +++
 mm/page_alloc.c           |   16 ++---
 mm/page_pool.c            |  141 +++++++++++++++++++--------------------------
 mm/swap.c                 |    3 +
 5 files changed, 79 insertions(+), 96 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 11b4d8fb280b..7315c1790f7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -766,11 +766,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
-	if (PagePool(page)) {
-		page_pool_put_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 
diff --git a/include/linux/page_pool.h b/include/linux/page_pool.h
index 6f8f2ff6d758..40da1fac573d 100644
--- a/include/linux/page_pool.h
+++ b/include/linux/page_pool.h
@@ -112,6 +112,7 @@ struct page_pool {
 	 * wise, because free's can happen on remote CPUs, with no
 	 * association with allocation resource.
 	 *
+	 * XXX: Mel says drop comment
 	 * For now use ptr_ring, as it separates consumer and
 	 * producer, which is a common use-case. The ptr_ring is not
 	 * though as the final data structure, expecting this to
@@ -145,6 +146,7 @@ void page_pool_destroy(struct page_pool *pool);
 /* Never call this directly, use helpers below */
 void __page_pool_put_page(struct page *page, bool allow_direct);
 
+/* XXX: Mel: needs descriptions*/
 static inline void page_pool_put_page(struct page *page)
 {
 	__page_pool_put_page(page, false);
@@ -155,4 +157,12 @@ static inline void page_pool_recycle_direct(struct page *page)
 	__page_pool_put_page(page, true);
 }
 
+/*
+ * Called when refcnt reach zero.  On failure page_pool state is
+ * cleared, and caller can return page to page allocator.
+ */
+bool page_pool_recycle(struct page *page);
+// XXX: compile out trick, let this return false compile time,
+// or let PagePool() check compile to false.
+
 #endif /* _LINUX_PAGE_POOL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 655db05f0c1c..5a68bdbc9dc1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1240,6 +1240,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
+	if (PagePool(page) && page_pool_recycle(page))
+		return;
+
 	if (!free_pages_prepare(page, order, true))
 		return;
 
@@ -2448,6 +2451,9 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	if (PagePool(page) && page_pool_recycle(page))
+		return;
+
 	if (!free_pcp_prepare(page))
 		return;
 
@@ -3873,11 +3879,6 @@ EXPORT_SYMBOL(get_zeroed_page);
 
 void __free_pages(struct page *page, unsigned int order)
 {
-	if (PagePool(page)) {
-		page_pool_put_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_cold_page(page, false);
@@ -4005,11 +4006,6 @@ void __free_page_frag(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (PagePool(page)) {
-		page_pool_put_page(page);
-		return;
-	}
-
 	if (unlikely(put_page_testzero(page)))
 		__free_pages_ok(page, compound_order(page));
 }
diff --git a/mm/page_pool.c b/mm/page_pool.c
index 74138d5fe86d..064034d89f8a 100644
--- a/mm/page_pool.c
+++ b/mm/page_pool.c
@@ -21,14 +21,15 @@
 #include <linux/dma-mapping.h>
 #include <linux/page-flags.h>
 #include <linux/mm.h> /* for __put_page() */
+#include "internal.h" /* for set_page_refcounted() */
 
 /*
  * The struct page_pool (likely) cannot be embedded into another
  * structure, because freeing this struct depend on outstanding pages,
  * which can point back to the page_pool. Thus, don't export "init".
  */
-int page_pool_init(struct page_pool *pool,
-		   const struct page_pool_params *params)
+static int page_pool_init(struct page_pool *pool,
+			  const struct page_pool_params *params)
 {
 	int ring_qsize = 1024; /* Default */
 	int param_copy_sz;
@@ -108,40 +109,33 @@ EXPORT_SYMBOL(page_pool_create);
 /* fast path */
 static struct page *__page_pool_get_cached(struct page_pool *pool)
 {
+	struct ptr_ring *r;
 	struct page *page;
 
-	/* FIXME: use another test for safe-context, caller should
-	 * simply provide this guarantee
-	 */
-	if (likely(in_serving_softirq())) { // FIXME add use of PP_FLAG_NAPI
-		struct ptr_ring *r;
-
-		if (likely(pool->alloc.count)) {
-			/* Fast-path */
-			page = pool->alloc.cache[--pool->alloc.count];
-			return page;
-		}
-		/* Slower-path: Alloc array empty, time to refill */
-		r = &pool->ring;
-		/* Open-coded bulk ptr_ring consumer.
-		 *
-		 * Discussion: ATM the ring consumer lock is not
-		 * really needed due to the softirq/NAPI protection,
-		 * but later MM-layer need the ability to reclaim
-		 * pages on the ring. Thus, keeping the locks.
-		 */
-		spin_lock(&r->consumer_lock);
-		while ((page = __ptr_ring_consume(r))) {
-			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
-				break;
-			pool->alloc.cache[pool->alloc.count++] = page;
-		}
-		spin_unlock(&r->consumer_lock);
+	/* Caller guarantee safe context for accessing alloc.cache */
+	if (likely(pool->alloc.count)) {
+		/* Fast-path */
+		page = pool->alloc.cache[--pool->alloc.count];
 		return page;
 	}
 
-	/* Slow-path: Get page from locked ring queue */
-	page = ptr_ring_consume(&pool->ring);
+	/* Slower-path: Alloc array empty, time to refill */
+	r = &pool->ring;
+	/* Open-coded bulk ptr_ring consumer.
+	 *
+	 * Discussion: ATM ring *consumer* lock is not really needed
+	 * due to caller protecton, but later MM-layer need the
+	 * ability to reclaim pages from ring. Thus, keeping locks.
+	 */
+	spin_lock(&r->consumer_lock);
+	while ((page = __ptr_ring_consume(r))) {
+		/* Pages on ring refcnt==0, on alloc.cache refcnt==1 */
+		set_page_refcounted(page);
+		if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+			break;
+		pool->alloc.cache[pool->alloc.count++] = page;
+	}
+	spin_unlock(&r->consumer_lock);
 	return page;
 }
 
@@ -290,15 +284,9 @@ static void __page_pool_clean_page(struct page *page)
 /* Return a page to the page allocator, cleaning up our state */
 static void __page_pool_return_page(struct page *page)
 {
-	struct page_pool *pool = page->pool;
-
+	VM_BUG_ON_PAGE(page_ref_count(page) != 0, page);
 	__page_pool_clean_page(page);
-	/*
-	 * Given page pool state and flags were just cleared, the page
-	 * must be freed here.  Thus, code invariant assumes
-	 * refcnt==1, as __free_pages() call put_page_testzero().
-	 */
-	__free_pages(page, pool->p.order);
+	__put_page(page);
 }
 
 bool __page_pool_recycle_into_ring(struct page_pool *pool,
@@ -332,70 +320,61 @@ bool __page_pool_recycle_into_ring(struct page_pool *pool,
  *
  * Caller must provide appropiate safe context.
  */
-static bool __page_pool_recycle_direct(struct page *page,
+// noinline /* hack for perf-record test */
+static
+bool __page_pool_recycle_direct(struct page *page,
 				       struct page_pool *pool)
 {
-	// BUG_ON(!in_serving_softirq());
+	VM_BUG_ON_PAGE(page_ref_count(page) != 1, page);
+	/* page refcnt==1 invarians on alloc.cache */
 
 	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
 		return false;
 
-	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
 	pool->alloc.cache[pool->alloc.count++] = page;
 	return true;
 }
 
-void __page_pool_put_page(struct page *page, bool allow_direct)
+/*
+ * Called when refcnt reach zero.  On failure page_pool state is
+ * cleared, and caller can return page to page allocator.
+ */
+bool page_pool_recycle(struct page *page)
 {
 	struct page_pool *pool = page->pool;
 
-	/* This is a fast-path optimization, that avoids an atomic
-	 * operation, in the case where a single object is (refcnt)
-	 * using the page.
-	 *
-	 * refcnt == 1 means page_pool owns page, and can recycle it.
-	 */
-	if (likely(page_ref_count(page) == 1)) {
-		/* Read barrier implicit paired with full MB of atomic ops */
-		smp_rmb();
-
-		if (allow_direct)
-			if (__page_pool_recycle_direct(page, pool))
-			    return;
+	VM_BUG_ON_PAGE(page_ref_count(page) != 0, page);
 
-		if (!__page_pool_recycle_into_ring(pool, page)) {
-			/* Cache full, do real __free_pages() */
-			__page_pool_return_page(page);
-		}
-		return;
-	}
-	/*
-	 * Many drivers splitting up the page into fragments, and some
-	 * want to keep doing this to save memory. The put_page_testzero()
-	 * function as a refcnt decrement, and should not return true.
-	 */
-	if (unlikely(put_page_testzero(page))) {
-		/*
-		 * Reaching refcnt zero should not be possible,
-		 * indicate code error.  Don't crash but warn, handle
-		 * case by not-recycling, but return page to page
-		 * allocator.
-		 */
-		WARN(1, "%s() violating page_pool invariance refcnt:%d\n",
-		     __func__, page_ref_count(page));
-		/* Cleanup state before directly returning page */
+	/* Pages on recycle ring have refcnt==0 */
+	if (!__page_pool_recycle_into_ring(pool, page)) {
 		__page_pool_clean_page(page);
-		__put_page(page);
+		return false;
 	}
+	return true;
+}
+EXPORT_SYMBOL(page_pool_recycle);
+
+void __page_pool_put_page(struct page *page, bool allow_direct)
+{
+	struct page_pool *pool = page->pool;
+
+	if (allow_direct && (page_ref_count(page) == 1))
+		if (__page_pool_recycle_direct(page, pool))
+			return;
+
+	if (put_page_testzero(page))
+		if (!page_pool_recycle(page))
+			__put_page(page);
+
 }
 EXPORT_SYMBOL(__page_pool_put_page);
 
-static void __destructor_put_page(void *ptr)
+void __destructor_return_page(void *ptr)
 {
 	struct page *page = ptr;
 
 	/* Verify the refcnt invariant of cached pages */
-	if (!(page_ref_count(page) == 1)) {
+	if (page_ref_count(page) != 0) {
 		pr_crit("%s() page_pool refcnt %d violation\n",
 			__func__, page_ref_count(page));
 		BUG();
@@ -407,7 +386,7 @@ static void __destructor_put_page(void *ptr)
 void page_pool_destroy(struct page_pool *pool)
 {
 	/* Empty recycle ring */
-	ptr_ring_cleanup(&pool->ring, __destructor_put_page);
+	ptr_ring_cleanup(&pool->ring, __destructor_return_page);
 
 	/* FIXME-mem-leak: cleanup array/stack cache
 	 * pool->alloc. Driver usually will destroy RX ring after
diff --git a/mm/swap.c b/mm/swap.c
index 4dcf852e1e6d..d71c896cb1a1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -96,6 +96,9 @@ static void __put_compound_page(struct page *page)
 
 void __put_page(struct page *page)
 {
+	if (PagePool(page) && page_pool_recycle(page))
+		return;
+
 	if (unlikely(PageCompound(page)))
 		__put_compound_page(page);
 	else

^ permalink raw reply related

* [PATCH] ethernet: sfc: Add Kconfig entry for vendor Solarflare
From: Tobias Klauser @ 2016-12-20 13:38 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers, ecree, bkenward

Since commit

  5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")

there are two drivers for Solarflare devices, but both still show up
directly beneath "Ethernet driver support" in the Kconfig. Follow the
pattern of other vendors and group them beneath an own vendor Kconfig
entry for Solarflare.

Cc: Edward Cree <ecree@solarflare.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
 drivers/net/ethernet/Kconfig     |  1 -
 drivers/net/ethernet/sfc/Kconfig | 21 +++++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 6e16e441f85e..e4c28fed61d5 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -166,7 +166,6 @@ source "drivers/net/ethernet/seeq/Kconfig"
 source "drivers/net/ethernet/silan/Kconfig"
 source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/sfc/Kconfig"
-source "drivers/net/ethernet/sfc/falcon/Kconfig"
 source "drivers/net/ethernet/sgi/Kconfig"
 source "drivers/net/ethernet/smsc/Kconfig"
 source "drivers/net/ethernet/stmicro/Kconfig"
diff --git a/drivers/net/ethernet/sfc/Kconfig b/drivers/net/ethernet/sfc/Kconfig
index 46f7be85f5a3..2c032629c369 100644
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@ -1,3 +1,20 @@
+#
+# Solarflare device configuration
+#
+
+config NET_VENDOR_SOLARFLARE
+	bool "Solarflare devices"
+	default y
+	---help---
+	  If you have a network (Ethernet) card belonging to this class, say Y.
+
+	  Note that the answer to this question doesn't directly affect the
+	  kernel: saying N will just cause the configurator to skip all
+	  the questions about Solarflare devices. If you say Y, you will be asked
+	  for your specific card in the following questions.
+
+if NET_VENDOR_SOLARFLARE
+
 config SFC
 	tristate "Solarflare SFC9000/SFC9100-family support"
 	depends on PCI
@@ -44,3 +61,7 @@ config SFC_MCDI_LOGGING
 	  Driver-Interface) commands and responses, allowing debugging of
 	  driver/firmware interaction.  The tracing is actually enabled by
 	  a sysfs file 'mcdi_logging' under the PCI device.
+
+source "drivers/net/ethernet/sfc/falcon/Kconfig"
+
+endif # NET_VENDOR_SOLARFLARE
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH perf/core REBASE 2/5] samples/bpf: Switch over to libbpf
From: Arnaldo Carvalho de Melo @ 2016-12-20 13:41 UTC (permalink / raw)
  To: Joe Stringer
  Cc: LKML, netdev, Wang Nan, ast, Daniel Borkmann,
	Arnaldo Carvalho de Melo
In-Reply-To: <CAPWQB7EAN7Fg8+vO3Tn7WYWcZW_JpKQFNCJzY_K100ckab6JRg@mail.gmail.com>

Em Thu, Dec 15, 2016 at 05:48:31PM -0800, Joe Stringer escreveu:
> On 15 December 2016 at 14:00, Joe Stringer <joe@ovn.org> wrote:
> > On 15 December 2016 at 10:34, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >> So, I'm stopping here so that I can push what I have to Ingo, then I'll get
> >> back to this, hopefully by then you beat me and I have just to retest 8-)

> > OK, thanks for the report. Looks like there was another difference
> > between the two libbpfs - one used total program size for its
> > load_program API; the actual kernel API uses instruction count. This
> > incremental should do the trick:

> > https://github.com/joestringer/linux/commit/6ff7726f20077bed66fb725f5189c13690154b6a
 
> The full branch with this change (fast-forward from your tmp branch)
> is available here:
> https://github.com/joestringer/linux/tree/submit/libbpf_samples_v5
 
> I tried running every selftest and BPF sample I could get my hands on;
> there's one or two that I couldn't run, but seemed more to do with my
> versions of TC/iproute and kernel config rather than libbpf changes.
> Let me know if you see any further trouble.

Finally getting back to this, now after I figured out how to get patches
out of github (wget commit + .patch) I applied this and at least the
samples/bpf/offwaketime seems to work as before, applying.

- Arnaldo

^ permalink raw reply

* [PATCH] netfilter: xt_connlimit: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller
  Cc: Geliang Tang, netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 net/netfilter/xt_connlimit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 2aff2b7..660b61d 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -218,7 +218,7 @@ count_tree(struct net *net, struct rb_root *root,
 		int diff;
 		bool addit;
 
-		rbconn = container_of(*rbnode, struct xt_connlimit_rb, node);
+		rbconn = rb_entry(*rbnode, struct xt_connlimit_rb, node);
 
 		parent = *rbnode;
 		diff = same_source_net(addr, mask, &rbconn->addr, family);
@@ -398,7 +398,7 @@ static void destroy_tree(struct rb_root *r)
 	struct rb_node *node;
 
 	while ((node = rb_first(r)) != NULL) {
-		rbconn = container_of(node, struct xt_connlimit_rb, node);
+		rbconn = rb_entry(node, struct xt_connlimit_rb, node);
 
 		rb_erase(node, r);
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH] net/mlx5: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
  To: Saeed Mahameed, Matan Barak, Leon Romanovsky
  Cc: Geliang Tang, netdev, linux-rdma, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index 3b026c1..7431f63 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -75,7 +75,7 @@ static void mlx5_fc_stats_insert(struct rb_root *root, struct mlx5_fc *counter)
 	struct rb_node *parent = NULL;
 
 	while (*new) {
-		struct mlx5_fc *this = container_of(*new, struct mlx5_fc, node);
+		struct mlx5_fc *this = rb_entry(*new, struct mlx5_fc, node);
 		int result = counter->id - this->id;
 
 		parent = *new;
-- 
2.9.3

^ permalink raw reply related

* [PATCH] net_sched: sch_netem: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
  To: Stephen Hemminger, Jamal Hadi Salim, David S. Miller
  Cc: Geliang Tang, netem, netdev, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 net/sched/sch_netem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 9f7b380..b7e4097 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -152,7 +152,7 @@ struct netem_skb_cb {

 static struct sk_buff *netem_rb_to_skb(struct rb_node *rb)
 {
-	return container_of(rb, struct sk_buff, rbnode);
+	return rb_entry(rb, struct sk_buff, rbnode);
 }

 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
-- 
2.9.3

^ permalink raw reply related

* [PATCH] RDS: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
  To: Santosh Shilimkar, David S. Miller
  Cc: Geliang Tang, netdev, linux-rdma, rds-devel, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 net/rds/rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 4c93bad..ea96114 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -135,7 +135,7 @@ void rds_rdma_drop_keys(struct rds_sock *rs)
 	/* Release any MRs associated with this socket */
 	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
 	while ((node = rb_first(&rs->rs_rdma_keys))) {
-		mr = container_of(node, struct rds_mr, r_rb_node);
+		mr = rb_entry(node, struct rds_mr, r_rb_node);
 		if (mr->r_trans == rs->rs_transport)
 			mr->r_invalidate = 0;
 		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
-- 
2.9.3

^ permalink raw reply related

* [PATCH] net_sched: sch_fq: use rb_entry()
From: Geliang Tang @ 2016-12-20 14:02 UTC (permalink / raw)
  To: Jamal Hadi Salim, David S. Miller; +Cc: Geliang Tang, netdev, linux-kernel
In-Reply-To: <ddabc96c798df194791134d8e070d728e2a7b59f.1482203698.git.geliangtang@gmail.com>

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 net/sched/sch_fq.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 86309a3..a4f738a 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -136,7 +136,7 @@ static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f)
 		struct fq_flow *aux;
 
 		parent = *p;
-		aux = container_of(parent, struct fq_flow, rate_node);
+		aux = rb_entry(parent, struct fq_flow, rate_node);
 		if (f->time_next_packet >= aux->time_next_packet)
 			p = &parent->rb_right;
 		else
@@ -188,7 +188,7 @@ static void fq_gc(struct fq_sched_data *q,
 	while (*p) {
 		parent = *p;
 
-		f = container_of(parent, struct fq_flow, fq_node);
+		f = rb_entry(parent, struct fq_flow, fq_node);
 		if (f->sk == sk)
 			break;
 
@@ -256,7 +256,7 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 	while (*p) {
 		parent = *p;
 
-		f = container_of(parent, struct fq_flow, fq_node);
+		f = rb_entry(parent, struct fq_flow, fq_node);
 		if (f->sk == sk) {
 			/* socket might have been reallocated, so check
 			 * if its sk_hash is the same.
@@ -424,7 +424,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 
 	q->time_next_delayed_flow = ~0ULL;
 	while ((p = rb_first(&q->delayed)) != NULL) {
-		struct fq_flow *f = container_of(p, struct fq_flow, rate_node);
+		struct fq_flow *f = rb_entry(p, struct fq_flow, rate_node);
 
 		if (f->time_next_packet > now) {
 			q->time_next_delayed_flow = f->time_next_packet;
@@ -563,7 +563,7 @@ static void fq_reset(struct Qdisc *sch)
 	for (idx = 0; idx < (1U << q->fq_trees_log); idx++) {
 		root = &q->fq_root[idx];
 		while ((p = rb_first(root)) != NULL) {
-			f = container_of(p, struct fq_flow, fq_node);
+			f = rb_entry(p, struct fq_flow, fq_node);
 			rb_erase(p, root);
 
 			fq_flow_purge(f);
@@ -593,7 +593,7 @@ static void fq_rehash(struct fq_sched_data *q,
 		oroot = &old_array[idx];
 		while ((op = rb_first(oroot)) != NULL) {
 			rb_erase(op, oroot);
-			of = container_of(op, struct fq_flow, fq_node);
+			of = rb_entry(op, struct fq_flow, fq_node);
 			if (fq_gc_candidate(of)) {
 				fcnt++;
 				kmem_cache_free(fq_flow_cachep, of);
@@ -606,7 +606,7 @@ static void fq_rehash(struct fq_sched_data *q,
 			while (*np) {
 				parent = *np;
 
-				nf = container_of(parent, struct fq_flow, fq_node);
+				nf = rb_entry(parent, struct fq_flow, fq_node);
 				BUG_ON(nf->sk == of->sk);
 
 				if (nf->sk > of->sk)
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH perf/core REBASE 3/5] tools lib bpf: Add bpf_prog_{attach,detach}
From: Arnaldo Carvalho de Melo @ 2016-12-20 14:18 UTC (permalink / raw)
  To: Joe Stringer; +Cc: linux-kernel, netdev, wangnan0, ast, daniel
In-Reply-To: <20161214224342.12858-4-joe@ovn.org>

Em Wed, Dec 14, 2016 at 02:43:40PM -0800, Joe Stringer escreveu:
> Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching
> eBPF programs to cgroups") added these functions to samples/libbpf, but
> during this merge all of the samples libbpf functionality is shifting to
> tools/lib/bpf. Shift these functions there.
> 
> Signed-off-by: Joe Stringer <joe@ovn.org>
> ---
> Arnaldo, this is a new patch you didn't previously review which I've
> prepared due to the conflict with net-next. I figured it's better to try
> to get samples/bpf properly switched over this window rather than defer the
> problem and end up having to deal with another merge problem next time
> around. I hope that is fine for you. If not, this patch onwards will need
> to be dropped
> 
> It's a simple copy/paste/delete with a minor change for sys_bpf() vs
> syscall().
> ---
>  samples/bpf/libbpf.c | 21 ---------------------
>  samples/bpf/libbpf.h |  3 ---
>  tools/lib/bpf/bpf.c  | 21 +++++++++++++++++++++
>  tools/lib/bpf/bpf.h  |  3 +++
>  4 files changed, 24 insertions(+), 24 deletions(-)
> 
> diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
> index 3391225ad7e9..d9af876b4a2c 100644
> --- a/samples/bpf/libbpf.c
> +++ b/samples/bpf/libbpf.c
> @@ -11,27 +11,6 @@
>  #include <arpa/inet.h>
>  #include "libbpf.h"
>  
> -int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
> -{
> -	union bpf_attr attr = {
> -		.target_fd = target_fd,
> -		.attach_bpf_fd = prog_fd,
> -		.attach_type = type,
> -	};
> -
> -	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));

This one makes it fail for CentOS 5 and 6, others may fail as well,
still building, investigating...

   4 10.853874028 centos:5: FAIL
...                         glibc: [ on  ]
...                          gtk2: [ OFF ]
...                      libaudit: [ on  ]
...                        libbfd: [ OFF ]
...                        libelf: [ on  ]
...                       libnuma: [ on  ]
...        numa_num_possible_cpus: [ OFF ]
...                       libperl: [ on  ]
...                     libpython: [ OFF ]
...                      libslang: [ on  ]
...                     libcrypto: [ on  ]
...                     libunwind: [ OFF ]
...            libdw-dwarf-unwind: [ OFF ]
...                          zlib: [ on  ]
...                          lzma: [ on  ]
...                     get_cpuid: [ OFF ]
...                           bpf: [ on  ]

Makefile.config:290: No libdw DWARF unwind found, Please install elfutils-devel/libdw-dev >= 0.158 and/or set LIBDW_DIR
Makefile.config:294: No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev
Makefile.config:363: DWARF support is off, BPF prologue is disabled
Makefile.config:417: No libunwind found. Please install libunwind-dev[el] >= 1.1 and/or set LIBUNWIND_DIR
Makefile.config:444: Disabling post unwind, no support found.
Makefile.config:530: GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev
Makefile.config:569: No timerfd support. Disables 'perf kvm stat live'
Makefile.config:589: No 'python-config' tool was found: disables Python support - please install python-devel/python-dev
Makefile.config:662: No bfd.h/libbfd found, please install binutils-dev[el]/zlib-static/libiberty-dev to gain symbol demangling
Makefile.config:708: Old numa library found, disables 'perf bench numa mem' benchmark, please install numactl-devel/libnuma-devel/libnuma-dev >= 2.0.8
Makefile.config:762: Your gcc lacks the __get_cpuid() builtin, disables support for auxtrace/Intel PT, please install a newer gcc
Makefile.config:792: No openjdk development package found, please install JDK package
  GEN      /tmp/build/perf/common-cmds.h
  MKDIR    /tmp/build/perf/fd/
  CC       /tmp/build/perf/fd/array.o
  CC       /tmp/build/perf/event-parse.o
  CC       /tmp/build/perf/exec-cmd.o
  MKDIR    /tmp/build/perf/fd/
  CC       /tmp/build/perf/help.o
  LD       /tmp/build/perf/fd/libapi-in.o
  MKDIR    /tmp/build/perf/fs/
  CC       /tmp/build/perf/fs/fs.o
  CC       /tmp/build/perf/event-plugin.o
  CC       /tmp/build/perf/pager.o
  MKDIR    /tmp/build/perf/fs/
  CC       /tmp/build/perf/fs/tracing_path.o
  CC       /tmp/build/perf/trace-seq.o
  CC       /tmp/build/perf/parse-options.o
  CC       /tmp/build/perf/parse-filter.o
  MKDIR    /tmp/build/perf/fs/
  LD       /tmp/build/perf/fs/libapi-in.o
  CC       /tmp/build/perf/cpu.o
  CC       /tmp/build/perf/debug.o
  CC       /tmp/build/perf/str_error_r.o
  CC       /tmp/build/perf/parse-utils.o
  CC       /tmp/build/perf/kbuffer-parse.o
  LD       /tmp/build/perf/libapi-in.o
  AR       /tmp/build/perf/libapi.a
  LD       /tmp/build/perf/libtraceevent-in.o
  CC       /tmp/build/perf/libbpf.o
  LINK     /tmp/build/perf/libtraceevent.a
  MKDIR    /tmp/build/perf/pmu-events/
  HOSTCC   /tmp/build/perf/pmu-events/json.o
  CC       /tmp/build/perf/run-command.o
  MKDIR    /tmp/build/perf/pmu-events/
  HOSTCC   /tmp/build/perf/pmu-events/jsmn.o
  MKDIR    /tmp/build/perf/pmu-events/
  HOSTCC   /tmp/build/perf/pmu-events/jevents.o
  CC       /tmp/build/perf/bpf.o
  PERF_VERSION = 4.9.0
  CC       /tmp/build/perf/sigchain.o
bpf.c: In function 'bpf_prog_attach':
bpf.c:174: error: unknown field 'target_fd' specified in initializer
cc1: warnings being treated as errors
bpf.c:174: warning: missing braces around initializer
bpf.c:174: warning: (near initialization for 'attr.<anonymous>')
bpf.c:175: error: unknown field 'attach_bpf_fd' specified in initializer
bpf.c:175: warning: excess elements in union initializer
bpf.c:175: warning: (near initialization for 'attr')
bpf.c:176: error: unknown field 'attach_type' specified in initializer
bpf.c:176: warning: excess elements in union initializer
bpf.c:176: warning: (near initialization for 'attr')
bpf.c: In function 'bpf_prog_detach':
bpf.c:185: error: unknown field 'target_fd' specified in initializer
bpf.c:185: warning: missing braces around initializer
bpf.c:185: warning: (near initialization for 'attr.<anonymous>')
bpf.c:186: error: unknown field 'attach_type' specified in initializer
bpf.c:186: warning: excess elements in union initializer
bpf.c:186: warning: (near initialization for 'attr')
mv: cannot stat `/tmp/build/perf/.bpf.o.tmp': No such file or directory
make[4]: *** [/tmp/build/perf/bpf.o] Error 1
make[3]: *** [/tmp/build/perf/libbpf-in.o] Error 2
make[2]: *** [/tmp/build/perf/libbpf.a] Error 2
make[2]: *** Waiting for unfinished jobs....
  CC       /tmp/build/perf/subcmd-config.o
  MKDIR    /tmp/build/perf/pmu-events/
  HOSTLD   /tmp/build/perf/pmu-events/jevents-in.o
  LD       /tmp/build/perf/libsubcmd-in.o
  AR       /tmp/build/perf/libsubcmd.a
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2
make: Leaving directory `/git/linux/tools/perf'

^ permalink raw reply

* Re: [PATCH] net/mlx5: use rb_entry()
From: Leon Romanovsky @ 2016-12-20 14:19 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Saeed Mahameed, Matan Barak, netdev, linux-rdma, linux-kernel
In-Reply-To: <8443fa3fa03d82c2b829375d8020762e5236dc6d.1482203930.git.geliangtang@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1041 bytes --]

On Tue, Dec 20, 2016 at 10:02:14PM +0800, Geliang Tang wrote:
> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
>
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks,
Acked-by: Leon Romanovsky <leonro@mellanox.com>

>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
> index 3b026c1..7431f63 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
> @@ -75,7 +75,7 @@ static void mlx5_fc_stats_insert(struct rb_root *root, struct mlx5_fc *counter)
>  	struct rb_node *parent = NULL;
>
>  	while (*new) {
> -		struct mlx5_fc *this = container_of(*new, struct mlx5_fc, node);
> +		struct mlx5_fc *this = rb_entry(*new, struct mlx5_fc, node);
>  		int result = counter->id - this->id;
>
>  		parent = *new;
> --
> 2.9.3
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] RDS: use rb_entry()
From: Leon Romanovsky @ 2016-12-20 14:20 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Santosh Shilimkar, David S. Miller, netdev, linux-rdma, rds-devel,
	linux-kernel
In-Reply-To: <2cd84448fe04ffb7023be892c5ed04bbfc759c87.1482204342.git.geliangtang@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Tue, Dec 20, 2016 at 10:02:18PM +0800, Geliang Tang wrote:
> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
>
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>
> ---
>  net/rds/rdma.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

>
> diff --git a/net/rds/rdma.c b/net/rds/rdma.c
> index 4c93bad..ea96114 100644
> --- a/net/rds/rdma.c
> +++ b/net/rds/rdma.c
> @@ -135,7 +135,7 @@ void rds_rdma_drop_keys(struct rds_sock *rs)
>  	/* Release any MRs associated with this socket */
>  	spin_lock_irqsave(&rs->rs_rdma_lock, flags);
>  	while ((node = rb_first(&rs->rs_rdma_keys))) {
> -		mr = container_of(node, struct rds_mr, r_rb_node);
> +		mr = rb_entry(node, struct rds_mr, r_rb_node);
>  		if (mr->r_trans == rs->rs_transport)
>  			mr->r_invalidate = 0;
>  		rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys);
> --
> 2.9.3
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: regression: ath_tx_edma_tasklet() Illegal idle entry in RCU read-side critical section
From: Valo, Kalle @ 2016-12-20 14:20 UTC (permalink / raw)
  To: Gabriel C
  Cc: paulmck@linux.vnet.ibm.com, Tobias Klausmann, lkml, ath9k-devel,
	linux-wireless@vger.kernel.org, ath9k-devel@lists.ath9k.org,
	netdev@vger.kernel.org, nbd@nbd.name
In-Reply-To: <20b64afc-8270-27d5-6d45-592ce6dd24c7@gmail.com>

Gabriel C <nix.or.die@gmail.com> writes:

> On 18.12.2016 21:14, Paul E. McKenney wrote:
>> On Sun, Dec 18, 2016 at 07:57:42PM +0000, Valo, Kalle wrote:
>>> Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de> writes:
>>>
>>>> A patch for this is already floating on the ML for a while now latest:
>>>> (ath9k: do not return early to fix rcu unlocking)
>>>
>>> It's here:
>>>
>>> https://patchwork.kernel.org/patch/9472709/
>>
>> Feel free to add:
>>
>> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>
>
> This patch fixes the bug for me.
>
> You can add my Tested-by if you wish.
>
> Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>

I added both of these to the commit log, thanks.

-- 
Kalle Valo

^ permalink raw reply

* Re: mlx4: Bug in XDP_TX + 16 rx-queues
From: David Miller @ 2016-12-20 14:22 UTC (permalink / raw)
  To: ttoukan.linux; +Cc: kafai, tariqt, saeedm, netdev, ast
In-Reply-To: <a586ecb7-e290-8c70-714b-2eea0b94c7fb@gmail.com>

From: Tariq Toukan <ttoukan.linux@gmail.com>
Date: Tue, 20 Dec 2016 14:02:05 +0200

> I can prepare and send it once the window opens again.

Bug fixes like this can and should be sent at any time whatsoever.

^ permalink raw reply

* Re: [PATCH for bnxt_re V3 00/21] Broadcom RoCE Driver (bnxt_re)
From: Doug Ledford @ 2016-12-20 14:28 UTC (permalink / raw)
  To: Selvin Xavier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	michael.chan-dY08KVG/lbpWk0Htik3J/w
In-Reply-To: <1482225211-22423-1-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 6375 bytes --]

On 12/20/2016 4:13 AM, Selvin Xavier wrote:
> This series introduces the RoCE driver for the Broadcom
> NetXtreme-E 10/25/40/50G RoCE HCAs.
> This driver is dependent on the bnxt_en NIC driver and is
> based on the bnxt_re branch in linux-rdma repository.
> bnxt_en changes required for this patch series are already
> available afore mentioned branch.
> 
> These changes are available for your reference in
> the bnxt_re_v3 branch of following repository.
> https://github.com/Broadcom/linux-rdma-nxt/
> 
> Doug,
> Please review and consider applying this to linux-rdma repository
> for 4.11 merge cycle.

I certainly won't get it done before the holiday break coming up (Red
Hat, and by extension, myself, have a 1 week shutdown over the holiday
season, so I'll be offline starting this Friday).

However, even though you changed the api filename, there is still a lot
of inconsistency in the naming of your files.  The module itself is
bnxt_re, not bnxtre or any other variant.  Please make all uses of
bnxtre or other variants match bnxt_re.  This includes the api file that
you just moved.  I also want the directory in drivers/infiniband/hw to
be bnxt_re not bnxtre.  Also, putting bnxt_re as part of the file name
for files already in the bnxt_re directory is redundant (all except for
bnxt_re.h, where it is appropriate).  Just name the files things like
main.c and ib_verbs.c.  And for the qplib files, drop the bnxt_ prefix
and just use qplib_*.

I'll make other comments as I sort through the files, but those are
things I would like to see changed so I wanted to get that feedback to
you sooner rather than later.

> Thanks,
> Selvin Xavier
> 
> v2->v3:
>   * Fix 0day build breakage
>   * Fix cocci, kbuild robot, sparse, smatch and checkpatch warnings
>   * Changed the filename bnxt_re_uverbs_abi.h  to bnxtre-abi.h
>   * Removed the __packed qualifier from the uverbs structure and adjusted
>     the structure alignment to 64bits.
>   * Added retry count to bail out in case of delayed or no response
>     to FW commands
>   * Removed the debugfs support from this patch series
>   * Changed some of the defines as inline functions based on Jason's comment
>   * Split two functions to get rid of switch within switch construct
>   * Removed bnxt_re_copy_to_udata as it is just a wrapper for ib_copy_to_udata
>   * Added maintainers information to MAINTAINERS file
> 
> v1-> v2:
>   * The license text in each file updated to reflect Dual license.
>   * Makefile and Kconfig changes are pushed to the last patch
>   * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
>   * Remove duplicate structure definitions from bnxt_re_hsi.h as
>     it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
>   * Removed some unused code reported during code review.
>   * Fixed few sparse warnings
> 
> 
> Selvin Xavier (21):
>   bnxt_re: Add bnxt_re RoCE driver files
>   bnxt_re: Introducing autogenerated Host Software Interface(hsi) file
>   bnxt_re: register with the NIC driver
>   bnxt_re: Enabling RoCE control path
>   bnxt_re: Adding Notification Queue support
>   bnxt_re: Support for PD, ucontext and mmap verbs
>   bnxt_re: Support for query and modify device verbs
>   bnxt_re: Adding support for port related verbs
>   bnxt_re: Support for GID related verbs
>   bnxt_re: Support for CQ verbs
>   bnxt_re: Support for AH verbs
>   bnxt_re: Support memory registration verbs
>   bnxt_re: Support QP verbs
>   bnxt_re: Support post_send verb
>   bnxt_re: Support post_recv
>   bnxt_re: Support poll_cq verb
>   bnxt_re: Handling dispatching of events to IB stack
>   bnxt_re: Support for DCB
>   bnxt_re: Set uverbs command mask
>   bnxt_re: Add QP event handling
>   bnxt_re: Add bnxt_re driver build support
> 
>  MAINTAINERS                                     |   11 +
>  drivers/infiniband/Kconfig                      |    2 +
>  drivers/infiniband/hw/Makefile                  |    1 +
>  drivers/infiniband/hw/bnxtre/Kconfig            |    9 +
>  drivers/infiniband/hw/bnxtre/Makefile           |    6 +
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c    | 2167 +++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h    |  441 ++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c  |  692 +++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.h  |  231 ++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c   |  825 ++++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_res.h   |  223 ++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c    |  838 ++++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.h    |  160 ++
>  drivers/infiniband/hw/bnxtre/bnxt_re.h          |  150 ++
>  drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h      | 2821 ++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 3206 +++++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h |  196 ++
>  drivers/infiniband/hw/bnxtre/bnxt_re_main.c     | 1340 ++++++++++
>  include/uapi/rdma/bnxtre-abi.h                  |   89 +
>  19 files changed, 13408 insertions(+)
>  create mode 100644 drivers/infiniband/hw/bnxtre/Kconfig
>  create mode 100644 drivers/infiniband/hw/bnxtre/Makefile
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_res.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_re.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
>  create mode 100644 drivers/infiniband/hw/bnxtre/bnxt_re_main.c
>  create mode 100644 include/uapi/rdma/bnxtre-abi.h
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply

* Re: [PATCH perf/core REBASE 3/5] tools lib bpf: Add bpf_prog_{attach,detach}
From: Arnaldo Carvalho de Melo @ 2016-12-20 14:32 UTC (permalink / raw)
  To: Joe Stringer; +Cc: linux-kernel, netdev, wangnan0, ast, daniel
In-Reply-To: <20161220141851.GB32756@kernel.org>

Em Tue, Dec 20, 2016 at 11:18:51AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Dec 14, 2016 at 02:43:40PM -0800, Joe Stringer escreveu:
> > Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching
> > eBPF programs to cgroups") added these functions to samples/libbpf, but
> > during this merge all of the samples libbpf functionality is shifting to
> > tools/lib/bpf. Shift these functions there.
> > 
> > Signed-off-by: Joe Stringer <joe@ovn.org>
> > ---
> > Arnaldo, this is a new patch you didn't previously review which I've
> > prepared due to the conflict with net-next. I figured it's better to try
> > to get samples/bpf properly switched over this window rather than defer the
> > problem and end up having to deal with another merge problem next time
> > around. I hope that is fine for you. If not, this patch onwards will need
> > to be dropped
> > 
> > It's a simple copy/paste/delete with a minor change for sys_bpf() vs
> > syscall().
> > ---
> >  samples/bpf/libbpf.c | 21 ---------------------
> >  samples/bpf/libbpf.h |  3 ---
> >  tools/lib/bpf/bpf.c  | 21 +++++++++++++++++++++
> >  tools/lib/bpf/bpf.h  |  3 +++
> >  4 files changed, 24 insertions(+), 24 deletions(-)
> > 
> > diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
> > index 3391225ad7e9..d9af876b4a2c 100644
> > --- a/samples/bpf/libbpf.c
> > +++ b/samples/bpf/libbpf.c
> > @@ -11,27 +11,6 @@
> >  #include <arpa/inet.h>
> >  #include "libbpf.h"
> >  
> > -int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
> > -{
> > -	union bpf_attr attr = {
> > -		.target_fd = target_fd,
> > -		.attach_bpf_fd = prog_fd,
> > -		.attach_type = type,
> > -	};
> > -
> > -	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
> 
> This one makes it fail for CentOS 5 and 6, others may fail as well,
> still building, investigating...

Ok, fixed it by making it follow the model of the other sys_bpf wrappers
setting up that bpf_attr union wrt initializing unamed struct members:

 int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
 {
-       union bpf_attr attr = {
-               .target_fd = target_fd,
-               .attach_bpf_fd = prog_fd,
-               .attach_type = type,
-       };
+       union bpf_attr attr;
+
+       bzero(&attr, sizeof(attr));
+       attr.target_fd     = target_fd;
+       attr.attach_bpf_fd = prog_fd;
+       attr.attach_type   = type;
 
        return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
 }
 
> 
>    4 10.853874028 centos:5: FAIL
> ...                         glibc: [ on  ]
> ...                          gtk2: [ OFF ]
> ...                      libaudit: [ on  ]
> ...                        libbfd: [ OFF ]
> ...                        libelf: [ on  ]
> ...                       libnuma: [ on  ]
> ...        numa_num_possible_cpus: [ OFF ]
> ...                       libperl: [ on  ]
> ...                     libpython: [ OFF ]
> ...                      libslang: [ on  ]
> ...                     libcrypto: [ on  ]
> ...                     libunwind: [ OFF ]
> ...            libdw-dwarf-unwind: [ OFF ]
> ...                          zlib: [ on  ]
> ...                          lzma: [ on  ]
> ...                     get_cpuid: [ OFF ]
> ...                           bpf: [ on  ]
> 
> Makefile.config:290: No libdw DWARF unwind found, Please install elfutils-devel/libdw-dev >= 0.158 and/or set LIBDW_DIR
> Makefile.config:294: No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev
> Makefile.config:363: DWARF support is off, BPF prologue is disabled
> Makefile.config:417: No libunwind found. Please install libunwind-dev[el] >= 1.1 and/or set LIBUNWIND_DIR
> Makefile.config:444: Disabling post unwind, no support found.
> Makefile.config:530: GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev
> Makefile.config:569: No timerfd support. Disables 'perf kvm stat live'
> Makefile.config:589: No 'python-config' tool was found: disables Python support - please install python-devel/python-dev
> Makefile.config:662: No bfd.h/libbfd found, please install binutils-dev[el]/zlib-static/libiberty-dev to gain symbol demangling
> Makefile.config:708: Old numa library found, disables 'perf bench numa mem' benchmark, please install numactl-devel/libnuma-devel/libnuma-dev >= 2.0.8
> Makefile.config:762: Your gcc lacks the __get_cpuid() builtin, disables support for auxtrace/Intel PT, please install a newer gcc
> Makefile.config:792: No openjdk development package found, please install JDK package
>   GEN      /tmp/build/perf/common-cmds.h
>   MKDIR    /tmp/build/perf/fd/
>   CC       /tmp/build/perf/fd/array.o
>   CC       /tmp/build/perf/event-parse.o
>   CC       /tmp/build/perf/exec-cmd.o
>   MKDIR    /tmp/build/perf/fd/
>   CC       /tmp/build/perf/help.o
>   LD       /tmp/build/perf/fd/libapi-in.o
>   MKDIR    /tmp/build/perf/fs/
>   CC       /tmp/build/perf/fs/fs.o
>   CC       /tmp/build/perf/event-plugin.o
>   CC       /tmp/build/perf/pager.o
>   MKDIR    /tmp/build/perf/fs/
>   CC       /tmp/build/perf/fs/tracing_path.o
>   CC       /tmp/build/perf/trace-seq.o
>   CC       /tmp/build/perf/parse-options.o
>   CC       /tmp/build/perf/parse-filter.o
>   MKDIR    /tmp/build/perf/fs/
>   LD       /tmp/build/perf/fs/libapi-in.o
>   CC       /tmp/build/perf/cpu.o
>   CC       /tmp/build/perf/debug.o
>   CC       /tmp/build/perf/str_error_r.o
>   CC       /tmp/build/perf/parse-utils.o
>   CC       /tmp/build/perf/kbuffer-parse.o
>   LD       /tmp/build/perf/libapi-in.o
>   AR       /tmp/build/perf/libapi.a
>   LD       /tmp/build/perf/libtraceevent-in.o
>   CC       /tmp/build/perf/libbpf.o
>   LINK     /tmp/build/perf/libtraceevent.a
>   MKDIR    /tmp/build/perf/pmu-events/
>   HOSTCC   /tmp/build/perf/pmu-events/json.o
>   CC       /tmp/build/perf/run-command.o
>   MKDIR    /tmp/build/perf/pmu-events/
>   HOSTCC   /tmp/build/perf/pmu-events/jsmn.o
>   MKDIR    /tmp/build/perf/pmu-events/
>   HOSTCC   /tmp/build/perf/pmu-events/jevents.o
>   CC       /tmp/build/perf/bpf.o
>   PERF_VERSION = 4.9.0
>   CC       /tmp/build/perf/sigchain.o
> bpf.c: In function 'bpf_prog_attach':
> bpf.c:174: error: unknown field 'target_fd' specified in initializer
> cc1: warnings being treated as errors
> bpf.c:174: warning: missing braces around initializer
> bpf.c:174: warning: (near initialization for 'attr.<anonymous>')
> bpf.c:175: error: unknown field 'attach_bpf_fd' specified in initializer
> bpf.c:175: warning: excess elements in union initializer
> bpf.c:175: warning: (near initialization for 'attr')
> bpf.c:176: error: unknown field 'attach_type' specified in initializer
> bpf.c:176: warning: excess elements in union initializer
> bpf.c:176: warning: (near initialization for 'attr')
> bpf.c: In function 'bpf_prog_detach':
> bpf.c:185: error: unknown field 'target_fd' specified in initializer
> bpf.c:185: warning: missing braces around initializer
> bpf.c:185: warning: (near initialization for 'attr.<anonymous>')
> bpf.c:186: error: unknown field 'attach_type' specified in initializer
> bpf.c:186: warning: excess elements in union initializer
> bpf.c:186: warning: (near initialization for 'attr')
> mv: cannot stat `/tmp/build/perf/.bpf.o.tmp': No such file or directory
> make[4]: *** [/tmp/build/perf/bpf.o] Error 1
> make[3]: *** [/tmp/build/perf/libbpf-in.o] Error 2
> make[2]: *** [/tmp/build/perf/libbpf.a] Error 2
> make[2]: *** Waiting for unfinished jobs....
>   CC       /tmp/build/perf/subcmd-config.o
>   MKDIR    /tmp/build/perf/pmu-events/
>   HOSTLD   /tmp/build/perf/pmu-events/jevents-in.o
>   LD       /tmp/build/perf/libsubcmd-in.o
>   AR       /tmp/build/perf/libsubcmd.a
> make[1]: *** [sub-make] Error 2
> make: *** [all] Error 2
> make: Leaving directory `/git/linux/tools/perf'
> 

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Niklas Cassel @ 2016-12-20 14:43 UTC (permalink / raw)
  To: Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, pavel, linux-kernel, netdev
In-Reply-To: <9a8911dfd2ed07391106e9cd0f90475742a798dc.1482238007.git.jpinto@synopsys.com>



On 12/20/2016 01:55 PM, Joao Pinto wrote:
> When the hardware is synthesized with multiple queues, all queues are
> disabled for default. This patch adds the rx queues configuration.
> This patch was successfully tested in a Synopsys QoS Reference design.
>
> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>  4 files changed, 38 insertions(+)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
> index b13a144..61bab50 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> @@ -454,6 +454,8 @@ struct stmmac_ops {
>  	void (*core_init)(struct mac_device_info *hw, int mtu);
>  	/* Enable and verify that the IPC module is supported */
>  	int (*rx_ipc)(struct mac_device_info *hw);
> +	/* Enable RX Queues */
> +	void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>  	/* Dump MAC registers */
>  	void (*dump_regs)(struct mac_device_info *hw);
>  	/* Handle extra events on specific interrupts hw dependent */
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> index 3e8d4fe..fd013bd 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> @@ -22,6 +22,7 @@
>  #define GMAC_HASH_TAB_32_63		0x00000014
>  #define GMAC_RX_FLOW_CTRL		0x00000090
>  #define GMAC_QX_TX_FLOW_CTRL(x)		(0x70 + x * 4)
> +#define GMAC_RXQ_CTRL0			0x000000a0
>  #define GMAC_INT_STATUS			0x000000b0
>  #define GMAC_INT_EN			0x000000b4
>  #define GMAC_PCS_BASE			0x000000e0
> @@ -44,6 +45,9 @@
>  
>  #define GMAC_MAX_PERFECT_ADDRESSES	128
>  
> +/* MAC RX Queue Enable*/
> +#define GMAC_RX_QUEUE_ENABLE(queue)	BIT(queue * 2)

Always have parentheses around a variable in a
macro, otherwise strange things could happen.
Imagine if you send 5 - 4 as argument,
it will then expand to 5 - 4 * 2 = -3,
instead of (5 - 4) * 2 = 2

> +
>  /* MAC Flow Control RX */
>  #define GMAC_RX_FLOW_CTRL_RFE		BIT(0)
>  
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> index eaed7cb..7ec1887 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
>  	writel(value, ioaddr + GMAC_INT_EN);
>  }
>  
> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
> +{
> +	void __iomem *ioaddr = hw->pcsr;
> +	u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
> +
> +	value |= GMAC_RX_QUEUE_ENABLE(queue);
> +
> +	writel(value, ioaddr + GMAC_RXQ_CTRL0);
> +}
> +
>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>  {
>  	void __iomem *ioaddr = hw->pcsr;
> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
>  static const struct stmmac_ops dwmac4_ops = {
>  	.core_init = dwmac4_core_init,
>  	.rx_ipc = dwmac4_rx_ipc_enable,
> +	.rx_queue_enable = dwmac4_rx_queue_enable,
>  	.dump_regs = dwmac4_dump_regs,
>  	.host_irq_status = dwmac4_irq_status,
>  	.flow_ctrl = dwmac4_flow_ctrl,
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 3e40578..e30034d 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
>  }
>  
>  /**
> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
> + *  @priv: driver private structure
> + *  Description: It is used for enabling the rx queues in the MAC
> + */
> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
> +{
> +	int rx_count = priv->dma_cap.number_rx_channel;

priv->dma_cap.number_rx_channel
actually contains the value from register
MAC_HW_Feature2, field RXCHCNT,
which is the number of DMA rx channels.

This is not the same as the number of MTL
receive queues, field RXQCNT in MAC_HW_Feature2.

I guess they will often have the same value,
but since there actually are two different fields
for them, I suppose that is not always the case.



Reading the comments in dwmac4_dma.*

#define DMA_CHANNEL_NB_MAX              1

"Only Channel 0 is actually configured and used"

"Following code only done for channel 0, other channels not yet supported"

Is there any point in actually enabling more than RX queue 0 if the
driver does not yet support more than one channel.
Can RXCHCNT ever be different from RXQCNT?
If so, when? Maybe when using an external DMA IP?


> +	int queue = 0;
> +
> +	/* If GMAC does not have multiqueues, then this is not necessary*/
> +	if (rx_count == 1)
> +		return;
> +
> +	for (queue = 0; queue < rx_count; queue++)
> +		priv->hw->mac->rx_queue_enable(priv->hw, queue);
> +}
> +
> +/**
>   *  stmmac_dma_operation_mode - HW DMA operation mode
>   *  @priv: driver private structure
>   *  Description: it is used for configuring the DMA operation mode register in
> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>  	/* Initialize the MAC Core */
>  	priv->hw->mac->core_init(priv->hw, dev->mtu);
>  
> +	/* Initialize MAC RX Queues */
> +	stmmac_mac_enable_rx_queues(priv);
> +
>  	ret = priv->hw->mac->rx_ipc(priv->hw);
>  	if (!ret) {
>  		netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Seraphin BONNAFFE @ 2016-12-20 14:52 UTC (permalink / raw)
  To: Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, niklas.cassel, pavel, linux-kernel, netdev
In-Reply-To: <9a8911dfd2ed07391106e9cd0f90475742a798dc.1482238007.git.jpinto@synopsys.com>

Hi Joao,

Please see my in-line comments.

Regards,
Séraphin
--
Seraphin BONNAFFE | Tel: +33244027061
STMicroelectronics | ADG | S/W Design

On 12/20/2016 01:55 PM, Joao Pinto wrote:
> When the hardware is synthesized with multiple queues, all queues are
> disabled for default. This patch adds the rx queues configuration.
> This patch was successfully tested in a Synopsys QoS Reference design.
>
> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>  4 files changed, 38 insertions(+)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
> index b13a144..61bab50 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> @@ -454,6 +454,8 @@ struct stmmac_ops {
>  	void (*core_init)(struct mac_device_info *hw, int mtu);
>  	/* Enable and verify that the IPC module is supported */
>  	int (*rx_ipc)(struct mac_device_info *hw);
> +	/* Enable RX Queues */
> +	void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>  	/* Dump MAC registers */
>  	void (*dump_regs)(struct mac_device_info *hw);
>  	/* Handle extra events on specific interrupts hw dependent */
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> index 3e8d4fe..fd013bd 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
> @@ -22,6 +22,7 @@
>  #define GMAC_HASH_TAB_32_63		0x00000014
>  #define GMAC_RX_FLOW_CTRL		0x00000090
>  #define GMAC_QX_TX_FLOW_CTRL(x)		(0x70 + x * 4)
> +#define GMAC_RXQ_CTRL0			0x000000a0
>  #define GMAC_INT_STATUS			0x000000b0
>  #define GMAC_INT_EN			0x000000b4
>  #define GMAC_PCS_BASE			0x000000e0
> @@ -44,6 +45,9 @@
>
>  #define GMAC_MAX_PERFECT_ADDRESSES	128
>
> +/* MAC RX Queue Enable*/
> +#define GMAC_RX_QUEUE_ENABLE(queue)	BIT(queue * 2)

According to the GMAC_RXQ0 register description from the dwc_ether_qos_databok, I guess this will enable the queues in AV only.
What if we would like to enable them in DCB (10)b ?


> +
>  /* MAC Flow Control RX */
>  #define GMAC_RX_FLOW_CTRL_RFE		BIT(0)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> index eaed7cb..7ec1887 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
>  	writel(value, ioaddr + GMAC_INT_EN);
>  }
>
> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
> +{
> +	void __iomem *ioaddr = hw->pcsr;
> +	u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
> +
> +	value |= GMAC_RX_QUEUE_ENABLE(queue);

What happen if for some reason the previous value of the register was 0xffff ?
The OR mask, by itself, won't clear the bits. AFAIU each queue is enabled with a 2-bit value, and (11)b don't have the same effect as (01)b, does it ?

I would suggest to clear the RXQ-enable bits before writing a new value.
Something like
   value &= GMAC_RX_QUEUE_CLEAR(queue)
   value |= GMAC_RX_QUEUE_ENABLE(queue);

What do you think about that ?

> +
> +	writel(value, ioaddr + GMAC_RXQ_CTRL0);
> +}
> +
>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>  {
>  	void __iomem *ioaddr = hw->pcsr;
> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
>  static const struct stmmac_ops dwmac4_ops = {
>  	.core_init = dwmac4_core_init,
>  	.rx_ipc = dwmac4_rx_ipc_enable,
> +	.rx_queue_enable = dwmac4_rx_queue_enable,
>  	.dump_regs = dwmac4_dump_regs,
>  	.host_irq_status = dwmac4_irq_status,
>  	.flow_ctrl = dwmac4_flow_ctrl,
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 3e40578..e30034d 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
>  }
>
>  /**
> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
> + *  @priv: driver private structure
> + *  Description: It is used for enabling the rx queues in the MAC
> + */
> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
> +{
> +	int rx_count = priv->dma_cap.number_rx_channel;
> +	int queue = 0;
> +
> +	/* If GMAC does not have multiqueues, then this is not necessary*/
> +	if (rx_count == 1)
> +		return;
> +
> +	for (queue = 0; queue < rx_count; queue++)
> +		priv->hw->mac->rx_queue_enable(priv->hw, queue);


Maybe it is worth checking if (priv->hw->mac->rx_queue_enable)
before actually calling this callback ?
I can see you have implemented it for dwmac4, but not for dwmac100 and dwmac100
In that case we may have a null pointer exception here.

> +}
> +
> +/**
>   *  stmmac_dma_operation_mode - HW DMA operation mode
>   *  @priv: driver private structure
>   *  Description: it is used for configuring the DMA operation mode register in
> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>  	/* Initialize the MAC Core */
>  	priv->hw->mac->core_init(priv->hw, dev->mtu);
>
> +	/* Initialize MAC RX Queues */
> +	stmmac_mac_enable_rx_queues(priv);
> +
>  	ret = priv->hw->mac->rx_ipc(priv->hw);
>  	if (!ret) {
>  		netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");
>

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Joao Pinto @ 2016-12-20 14:52 UTC (permalink / raw)
  To: Niklas Cassel, Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, pavel, linux-kernel, netdev
In-Reply-To: <f5cda4a7-8c41-86b7-b133-cf41eaec0370@axis.com>


Hi Niklas,

Às 2:43 PM de 12/20/2016, Niklas Cassel escreveu:
> 
> 
> On 12/20/2016 01:55 PM, Joao Pinto wrote:
>> When the hardware is synthesized with multiple queues, all queues are
>> disabled for default. This patch adds the rx queues configuration.
>> This patch was successfully tested in a Synopsys QoS Reference design.
>>
>> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
>> ---
>>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>>  4 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
>> index b13a144..61bab50 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>> @@ -454,6 +454,8 @@ struct stmmac_ops {
>>  	void (*core_init)(struct mac_device_info *hw, int mtu);
>>  	/* Enable and verify that the IPC module is supported */
>>  	int (*rx_ipc)(struct mac_device_info *hw);
>> +	/* Enable RX Queues */
>> +	void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>>  	/* Dump MAC registers */
>>  	void (*dump_regs)(struct mac_device_info *hw);
>>  	/* Handle extra events on specific interrupts hw dependent */
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> index 3e8d4fe..fd013bd 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> @@ -22,6 +22,7 @@
>>  #define GMAC_HASH_TAB_32_63		0x00000014
>>  #define GMAC_RX_FLOW_CTRL		0x00000090
>>  #define GMAC_QX_TX_FLOW_CTRL(x)		(0x70 + x * 4)
>> +#define GMAC_RXQ_CTRL0			0x000000a0
>>  #define GMAC_INT_STATUS			0x000000b0
>>  #define GMAC_INT_EN			0x000000b4
>>  #define GMAC_PCS_BASE			0x000000e0
>> @@ -44,6 +45,9 @@
>>  
>>  #define GMAC_MAX_PERFECT_ADDRESSES	128
>>  
>> +/* MAC RX Queue Enable*/
>> +#define GMAC_RX_QUEUE_ENABLE(queue)	BIT(queue * 2)
> 
> Always have parentheses around a variable in a
> macro, otherwise strange things could happen.
> Imagine if you send 5 - 4 as argument,
> it will then expand to 5 - 4 * 2 = -3,
> instead of (5 - 4) * 2 = 2

Right. I am going to do that.

> 
>> +
>>  /* MAC Flow Control RX */
>>  #define GMAC_RX_FLOW_CTRL_RFE		BIT(0)
>>  
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> index eaed7cb..7ec1887 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
>>  	writel(value, ioaddr + GMAC_INT_EN);
>>  }
>>  
>> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
>> +{
>> +	void __iomem *ioaddr = hw->pcsr;
>> +	u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
>> +
>> +	value |= GMAC_RX_QUEUE_ENABLE(queue);
>> +
>> +	writel(value, ioaddr + GMAC_RXQ_CTRL0);
>> +}
>> +
>>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>>  {
>>  	void __iomem *ioaddr = hw->pcsr;
>> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
>>  static const struct stmmac_ops dwmac4_ops = {
>>  	.core_init = dwmac4_core_init,
>>  	.rx_ipc = dwmac4_rx_ipc_enable,
>> +	.rx_queue_enable = dwmac4_rx_queue_enable,
>>  	.dump_regs = dwmac4_dump_regs,
>>  	.host_irq_status = dwmac4_irq_status,
>>  	.flow_ctrl = dwmac4_flow_ctrl,
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 3e40578..e30034d 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
>>  }
>>  
>>  /**
>> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
>> + *  @priv: driver private structure
>> + *  Description: It is used for enabling the rx queues in the MAC
>> + */
>> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
>> +{
>> +	int rx_count = priv->dma_cap.number_rx_channel;
> 
> priv->dma_cap.number_rx_channel
> actually contains the value from register
> MAC_HW_Feature2, field RXCHCNT,
> which is the number of DMA rx channels.
> 
> This is not the same as the number of MTL
> receive queues, field RXQCNT in MAC_HW_Feature2.
> 
> I guess they will often have the same value,
> but since there actually are two different fields
> for them, I suppose that is not always the case.

Yes, you typically have a match between channels and queues.
But I can use RXQCNT of course, I agree.

> 
> 
> 
> Reading the comments in dwmac4_dma.*
> 
> #define DMA_CHANNEL_NB_MAX              1
> 
> "Only Channel 0 is actually configured and used"
> 
> "Following code only done for channel 0, other channels not yet supported"
> 
> Is there any point in actually enabling more than RX queue 0 if the
> driver does not yet support more than one channel.
> Can RXCHCNT ever be different from RXQCNT?
> If so, when? Maybe when using an external DMA IP?

Yes, currently stmmac only supports 1 Channel. Bt it needs this feature if the
hardware is multi-channel. The hardware I have is multi-channel and so you have
to enable RX queue for it to work and that's why I made this fix.
In the future I will develope multi channel support for stmmac and this RX queue
enable will be already made.

> 
> 
>> +	int queue = 0;
>> +
>> +	/* If GMAC does not have multiqueues, then this is not necessary*/
>> +	if (rx_count == 1)
>> +		return;
>> +
>> +	for (queue = 0; queue < rx_count; queue++)
>> +		priv->hw->mac->rx_queue_enable(priv->hw, queue);
>> +}
>> +
>> +/**
>>   *  stmmac_dma_operation_mode - HW DMA operation mode
>>   *  @priv: driver private structure
>>   *  Description: it is used for configuring the DMA operation mode register in
>> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>>  	/* Initialize the MAC Core */
>>  	priv->hw->mac->core_init(priv->hw, dev->mtu);
>>  
>> +	/* Initialize MAC RX Queues */
>> +	stmmac_mac_enable_rx_queues(priv);
>> +
>>  	ret = priv->hw->mac->rx_ipc(priv->hw);
>>  	if (!ret) {
>>  		netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");
> 

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Joao Pinto @ 2016-12-20 14:59 UTC (permalink / raw)
  To: Seraphin BONNAFFE, Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, niklas.cassel, pavel, linux-kernel, netdev
In-Reply-To: <ba998f36-4ae2-10f3-2483-14b0bf351037@st.com>


Hi Séraphin,

Às 2:52 PM de 12/20/2016, Seraphin BONNAFFE escreveu:
> Hi Joao,
> 
> Please see my in-line comments.
> 
> Regards,
> Séraphin
> -- 
> Seraphin BONNAFFE | Tel: +33244027061
> STMicroelectronics | ADG | S/W Design
> 
> On 12/20/2016 01:55 PM, Joao Pinto wrote:
>> When the hardware is synthesized with multiple queues, all queues are
>> disabled for default. This patch adds the rx queues configuration.
>> This patch was successfully tested in a Synopsys QoS Reference design.
>>
>> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
>> ---
>>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>>  4 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
>> b/drivers/net/ethernet/stmicro/stmmac/common.h
>> index b13a144..61bab50 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>> @@ -454,6 +454,8 @@ struct stmmac_ops {
>>      void (*core_init)(struct mac_device_info *hw, int mtu);
>>      /* Enable and verify that the IPC module is supported */
>>      int (*rx_ipc)(struct mac_device_info *hw);
>> +    /* Enable RX Queues */
>> +    void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>>      /* Dump MAC registers */
>>      void (*dump_regs)(struct mac_device_info *hw);
>>      /* Handle extra events on specific interrupts hw dependent */
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> index 3e8d4fe..fd013bd 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>> @@ -22,6 +22,7 @@
>>  #define GMAC_HASH_TAB_32_63        0x00000014
>>  #define GMAC_RX_FLOW_CTRL        0x00000090
>>  #define GMAC_QX_TX_FLOW_CTRL(x)        (0x70 + x * 4)
>> +#define GMAC_RXQ_CTRL0            0x000000a0
>>  #define GMAC_INT_STATUS            0x000000b0
>>  #define GMAC_INT_EN            0x000000b4
>>  #define GMAC_PCS_BASE            0x000000e0
>> @@ -44,6 +45,9 @@
>>
>>  #define GMAC_MAX_PERFECT_ADDRESSES    128
>>
>> +/* MAC RX Queue Enable*/
>> +#define GMAC_RX_QUEUE_ENABLE(queue)    BIT(queue * 2)
> 
> According to the GMAC_RXQ0 register description from the dwc_ether_qos_databok,
> I guess this will enable the queues in AV only.
> What if we would like to enable them in DCB (10)b ?

Correct, I am enabling AV only. What you are saying makes perfect sense. What
about adding a variable to plat in order to configure the RX queue type?
Example:
plat->rx_queue_type = RX_QUEUE_DCB or plat->rx_queue_type = RX_QUEUE_AV


> 
> 
>> +
>>  /* MAC Flow Control RX */
>>  #define GMAC_RX_FLOW_CTRL_RFE        BIT(0)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> index eaed7cb..7ec1887 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw,
>> int mtu)
>>      writel(value, ioaddr + GMAC_INT_EN);
>>  }
>>
>> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
>> +{
>> +    void __iomem *ioaddr = hw->pcsr;
>> +    u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
>> +
>> +    value |= GMAC_RX_QUEUE_ENABLE(queue);
> 
> What happen if for some reason the previous value of the register was 0xffff ?
> The OR mask, by itself, won't clear the bits. AFAIU each queue is enabled with a
> 2-bit value, and (11)b don't have the same effect as (01)b, does it ?
> 
> I would suggest to clear the RXQ-enable bits before writing a new value.
> Something like
>   value &= GMAC_RX_QUEUE_CLEAR(queue)
>   value |= GMAC_RX_QUEUE_ENABLE(queue);
> 
> What do you think about that ?

According to the databook, these values are always reset, but we can force it to
clear as you suggest.

> 
>> +
>> +    writel(value, ioaddr + GMAC_RXQ_CTRL0);
>> +}
>> +
>>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>>  {
>>      void __iomem *ioaddr = hw->pcsr;
>> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct
>> stmmac_extra_stats *x)
>>  static const struct stmmac_ops dwmac4_ops = {
>>      .core_init = dwmac4_core_init,
>>      .rx_ipc = dwmac4_rx_ipc_enable,
>> +    .rx_queue_enable = dwmac4_rx_queue_enable,
>>      .dump_regs = dwmac4_dump_regs,
>>      .host_irq_status = dwmac4_irq_status,
>>      .flow_ctrl = dwmac4_flow_ctrl,
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 3e40578..e30034d 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv
>> *priv)
>>  }
>>
>>  /**
>> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
>> + *  @priv: driver private structure
>> + *  Description: It is used for enabling the rx queues in the MAC
>> + */
>> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
>> +{
>> +    int rx_count = priv->dma_cap.number_rx_channel;
>> +    int queue = 0;
>> +
>> +    /* If GMAC does not have multiqueues, then this is not necessary*/
>> +    if (rx_count == 1)
>> +        return;
>> +
>> +    for (queue = 0; queue < rx_count; queue++)
>> +        priv->hw->mac->rx_queue_enable(priv->hw, queue);
> 
> 
> Maybe it is worth checking if (priv->hw->mac->rx_queue_enable)
> before actually calling this callback ?
> I can see you have implemented it for dwmac4, but not for dwmac100 and dwmac100
> In that case we may have a null pointer exception here.

Yes, you are absolutely correct. Since I am using gmac4, I forgot that stmmac
can use other types.

> 
>> +}
>> +
>> +/**
>>   *  stmmac_dma_operation_mode - HW DMA operation mode
>>   *  @priv: driver private structure
>>   *  Description: it is used for configuring the DMA operation mode register in
>> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool
>> init_ptp)
>>      /* Initialize the MAC Core */
>>      priv->hw->mac->core_init(priv->hw, dev->mtu);
>>
>> +    /* Initialize MAC RX Queues */
>> +    stmmac_mac_enable_rx_queues(priv);
>> +
>>      ret = priv->hw->mac->rx_ipc(priv->hw);
>>      if (!ret) {
>>          netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");
>>

Thanks.

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Niklas Cassel @ 2016-12-20 15:05 UTC (permalink / raw)
  To: Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, pavel, linux-kernel, netdev
In-Reply-To: <020e1c77-504e-3cfc-e526-ce1645fd6b32@synopsys.com>



On 12/20/2016 03:52 PM, Joao Pinto wrote:
> Hi Niklas,
>
> Às 2:43 PM de 12/20/2016, Niklas Cassel escreveu:
>>
>> On 12/20/2016 01:55 PM, Joao Pinto wrote:
>>> When the hardware is synthesized with multiple queues, all queues are
>>> disabled for default. This patch adds the rx queues configuration.
>>> This patch was successfully tested in a Synopsys QoS Reference design.
>>>
>>> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
>>> ---
>>>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>>>  4 files changed, 38 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
>>> index b13a144..61bab50 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>>> @@ -454,6 +454,8 @@ struct stmmac_ops {
>>>  	void (*core_init)(struct mac_device_info *hw, int mtu);
>>>  	/* Enable and verify that the IPC module is supported */
>>>  	int (*rx_ipc)(struct mac_device_info *hw);
>>> +	/* Enable RX Queues */
>>> +	void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>>>  	/* Dump MAC registers */
>>>  	void (*dump_regs)(struct mac_device_info *hw);
>>>  	/* Handle extra events on specific interrupts hw dependent */
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>> index 3e8d4fe..fd013bd 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>> @@ -22,6 +22,7 @@
>>>  #define GMAC_HASH_TAB_32_63		0x00000014
>>>  #define GMAC_RX_FLOW_CTRL		0x00000090
>>>  #define GMAC_QX_TX_FLOW_CTRL(x)		(0x70 + x * 4)
>>> +#define GMAC_RXQ_CTRL0			0x000000a0
>>>  #define GMAC_INT_STATUS			0x000000b0
>>>  #define GMAC_INT_EN			0x000000b4
>>>  #define GMAC_PCS_BASE			0x000000e0
>>> @@ -44,6 +45,9 @@
>>>  
>>>  #define GMAC_MAX_PERFECT_ADDRESSES	128
>>>  
>>> +/* MAC RX Queue Enable*/
>>> +#define GMAC_RX_QUEUE_ENABLE(queue)	BIT(queue * 2)
>> Always have parentheses around a variable in a
>> macro, otherwise strange things could happen.
>> Imagine if you send 5 - 4 as argument,
>> it will then expand to 5 - 4 * 2 = -3,
>> instead of (5 - 4) * 2 = 2
> Right. I am going to do that.
>
>>> +
>>>  /* MAC Flow Control RX */
>>>  #define GMAC_RX_FLOW_CTRL_RFE		BIT(0)
>>>  
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>> index eaed7cb..7ec1887 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
>>>  	writel(value, ioaddr + GMAC_INT_EN);
>>>  }
>>>  
>>> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
>>> +{
>>> +	void __iomem *ioaddr = hw->pcsr;
>>> +	u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
>>> +
>>> +	value |= GMAC_RX_QUEUE_ENABLE(queue);
>>> +
>>> +	writel(value, ioaddr + GMAC_RXQ_CTRL0);
>>> +}
>>> +
>>>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>>>  {
>>>  	void __iomem *ioaddr = hw->pcsr;
>>> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
>>>  static const struct stmmac_ops dwmac4_ops = {
>>>  	.core_init = dwmac4_core_init,
>>>  	.rx_ipc = dwmac4_rx_ipc_enable,
>>> +	.rx_queue_enable = dwmac4_rx_queue_enable,
>>>  	.dump_regs = dwmac4_dump_regs,
>>>  	.host_irq_status = dwmac4_irq_status,
>>>  	.flow_ctrl = dwmac4_flow_ctrl,
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> index 3e40578..e30034d 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
>>>  }
>>>  
>>>  /**
>>> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
>>> + *  @priv: driver private structure
>>> + *  Description: It is used for enabling the rx queues in the MAC
>>> + */
>>> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
>>> +{
>>> +	int rx_count = priv->dma_cap.number_rx_channel;
>> priv->dma_cap.number_rx_channel
>> actually contains the value from register
>> MAC_HW_Feature2, field RXCHCNT,
>> which is the number of DMA rx channels.
>>
>> This is not the same as the number of MTL
>> receive queues, field RXQCNT in MAC_HW_Feature2.
>>
>> I guess they will often have the same value,
>> but since there actually are two different fields
>> for them, I suppose that is not always the case.
> Yes, you typically have a match between channels and queues.
> But I can use RXQCNT of course, I agree.
>
>>
>>
>> Reading the comments in dwmac4_dma.*
>>
>> #define DMA_CHANNEL_NB_MAX              1
>>
>> "Only Channel 0 is actually configured and used"
>>
>> "Following code only done for channel 0, other channels not yet supported"
>>
>> Is there any point in actually enabling more than RX queue 0 if the
>> driver does not yet support more than one channel.
>> Can RXCHCNT ever be different from RXQCNT?
>> If so, when? Maybe when using an external DMA IP?
> Yes, currently stmmac only supports 1 Channel. Bt it needs this feature if the
> hardware is multi-channel. The hardware I have is multi-channel and so you have
> to enable RX queue for it to work and that's why I made this fix.
> In the future I will develope multi channel support for stmmac and this RX queue
> enable will be already made.

I understand that for multi-queue hardware, RX queue 0 is default off,
but perhaps it is safer to only enable RX queue 0,
even if you have more than one RX queue.
(Only until you have implemented actual support for multi-queues
in the driver.)

But if you know that it's safe to enable all RX queues even if the
driver only uses RX queue 0, then perhaps it doesn't matter.


>
>>
>>> +	int queue = 0;
>>> +
>>> +	/* If GMAC does not have multiqueues, then this is not necessary*/
>>> +	if (rx_count == 1)
>>> +		return;
>>> +
>>> +	for (queue = 0; queue < rx_count; queue++)
>>> +		priv->hw->mac->rx_queue_enable(priv->hw, queue);
>>> +}
>>> +
>>> +/**
>>>   *  stmmac_dma_operation_mode - HW DMA operation mode
>>>   *  @priv: driver private structure
>>>   *  Description: it is used for configuring the DMA operation mode register in
>>> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>>>  	/* Initialize the MAC Core */
>>>  	priv->hw->mac->core_init(priv->hw, dev->mtu);
>>>  
>>> +	/* Initialize MAC RX Queues */
>>> +	stmmac_mac_enable_rx_queues(priv);
>>> +
>>>  	ret = priv->hw->mac->rx_ipc(priv->hw);
>>>  	if (!ret) {
>>>  		netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");

^ permalink raw reply

* Re: [PATCH] stmmac: enable rx queues
From: Joao Pinto @ 2016-12-20 15:09 UTC (permalink / raw)
  To: Niklas Cassel, Joao Pinto, peppe.cavallaro, davem
  Cc: hock.leong.kweh, pavel, linux-kernel, netdev
In-Reply-To: <a345494e-0e8b-9233-ed13-9dedc88328c2@axis.com>

Às 3:05 PM de 12/20/2016, Niklas Cassel escreveu:
> 
> 
> On 12/20/2016 03:52 PM, Joao Pinto wrote:
>> Hi Niklas,
>>
>> Às 2:43 PM de 12/20/2016, Niklas Cassel escreveu:
>>>
>>> On 12/20/2016 01:55 PM, Joao Pinto wrote:
>>>> When the hardware is synthesized with multiple queues, all queues are
>>>> disabled for default. This patch adds the rx queues configuration.
>>>> This patch was successfully tested in a Synopsys QoS Reference design.
>>>>
>>>> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
>>>> ---
>>>>  drivers/net/ethernet/stmicro/stmmac/common.h      |  2 ++
>>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |  4 ++++
>>>>  drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 11 +++++++++++
>>>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++++++++
>>>>  4 files changed, 38 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
>>>> index b13a144..61bab50 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>>>> @@ -454,6 +454,8 @@ struct stmmac_ops {
>>>>  	void (*core_init)(struct mac_device_info *hw, int mtu);
>>>>  	/* Enable and verify that the IPC module is supported */
>>>>  	int (*rx_ipc)(struct mac_device_info *hw);
>>>> +	/* Enable RX Queues */
>>>> +	void (*rx_queue_enable)(struct mac_device_info *hw, u32 queue);
>>>>  	/* Dump MAC registers */
>>>>  	void (*dump_regs)(struct mac_device_info *hw);
>>>>  	/* Handle extra events on specific interrupts hw dependent */
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> index 3e8d4fe..fd013bd 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
>>>> @@ -22,6 +22,7 @@
>>>>  #define GMAC_HASH_TAB_32_63		0x00000014
>>>>  #define GMAC_RX_FLOW_CTRL		0x00000090
>>>>  #define GMAC_QX_TX_FLOW_CTRL(x)		(0x70 + x * 4)
>>>> +#define GMAC_RXQ_CTRL0			0x000000a0
>>>>  #define GMAC_INT_STATUS			0x000000b0
>>>>  #define GMAC_INT_EN			0x000000b4
>>>>  #define GMAC_PCS_BASE			0x000000e0
>>>> @@ -44,6 +45,9 @@
>>>>  
>>>>  #define GMAC_MAX_PERFECT_ADDRESSES	128
>>>>  
>>>> +/* MAC RX Queue Enable*/
>>>> +#define GMAC_RX_QUEUE_ENABLE(queue)	BIT(queue * 2)
>>> Always have parentheses around a variable in a
>>> macro, otherwise strange things could happen.
>>> Imagine if you send 5 - 4 as argument,
>>> it will then expand to 5 - 4 * 2 = -3,
>>> instead of (5 - 4) * 2 = 2
>> Right. I am going to do that.
>>
>>>> +
>>>>  /* MAC Flow Control RX */
>>>>  #define GMAC_RX_FLOW_CTRL_RFE		BIT(0)
>>>>  
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>>> index eaed7cb..7ec1887 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
>>>> @@ -59,6 +59,16 @@ static void dwmac4_core_init(struct mac_device_info *hw, int mtu)
>>>>  	writel(value, ioaddr + GMAC_INT_EN);
>>>>  }
>>>>  
>>>> +static void dwmac4_rx_queue_enable(struct mac_device_info *hw, u32 queue)
>>>> +{
>>>> +	void __iomem *ioaddr = hw->pcsr;
>>>> +	u32 value = readl(ioaddr + GMAC_RXQ_CTRL0);
>>>> +
>>>> +	value |= GMAC_RX_QUEUE_ENABLE(queue);
>>>> +
>>>> +	writel(value, ioaddr + GMAC_RXQ_CTRL0);
>>>> +}
>>>> +
>>>>  static void dwmac4_dump_regs(struct mac_device_info *hw)
>>>>  {
>>>>  	void __iomem *ioaddr = hw->pcsr;
>>>> @@ -392,6 +402,7 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x)
>>>>  static const struct stmmac_ops dwmac4_ops = {
>>>>  	.core_init = dwmac4_core_init,
>>>>  	.rx_ipc = dwmac4_rx_ipc_enable,
>>>> +	.rx_queue_enable = dwmac4_rx_queue_enable,
>>>>  	.dump_regs = dwmac4_dump_regs,
>>>>  	.host_irq_status = dwmac4_irq_status,
>>>>  	.flow_ctrl = dwmac4_flow_ctrl,
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>>> index 3e40578..e30034d 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>>>> @@ -1271,6 +1271,24 @@ static void free_dma_desc_resources(struct stmmac_priv *priv)
>>>>  }
>>>>  
>>>>  /**
>>>> + *  stmmac_mac_enable_rx_queues - Enable MAC rx queues
>>>> + *  @priv: driver private structure
>>>> + *  Description: It is used for enabling the rx queues in the MAC
>>>> + */
>>>> +static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
>>>> +{
>>>> +	int rx_count = priv->dma_cap.number_rx_channel;
>>> priv->dma_cap.number_rx_channel
>>> actually contains the value from register
>>> MAC_HW_Feature2, field RXCHCNT,
>>> which is the number of DMA rx channels.
>>>
>>> This is not the same as the number of MTL
>>> receive queues, field RXQCNT in MAC_HW_Feature2.
>>>
>>> I guess they will often have the same value,
>>> but since there actually are two different fields
>>> for them, I suppose that is not always the case.
>> Yes, you typically have a match between channels and queues.
>> But I can use RXQCNT of course, I agree.
>>
>>>
>>>
>>> Reading the comments in dwmac4_dma.*
>>>
>>> #define DMA_CHANNEL_NB_MAX              1
>>>
>>> "Only Channel 0 is actually configured and used"
>>>
>>> "Following code only done for channel 0, other channels not yet supported"
>>>
>>> Is there any point in actually enabling more than RX queue 0 if the
>>> driver does not yet support more than one channel.
>>> Can RXCHCNT ever be different from RXQCNT?
>>> If so, when? Maybe when using an external DMA IP?
>> Yes, currently stmmac only supports 1 Channel. Bt it needs this feature if the
>> hardware is multi-channel. The hardware I have is multi-channel and so you have
>> to enable RX queue for it to work and that's why I made this fix.
>> In the future I will develope multi channel support for stmmac and this RX queue
>> enable will be already made.
> 
> I understand that for multi-queue hardware, RX queue 0 is default off,
> but perhaps it is safer to only enable RX queue 0,
> even if you have more than one RX queue.
> (Only until you have implemented actual support for multi-queues
> in the driver.)
> 
> But if you know that it's safe to enable all RX queues even if the
> driver only uses RX queue 0, then perhaps it doesn't matter.

I think it won't bring problems to enable all the available queues even if the
driver only uses queue 0. My QoS reference design has 4 RX queues and it works fine.

> 
> 
>>
>>>
>>>> +	int queue = 0;
>>>> +
>>>> +	/* If GMAC does not have multiqueues, then this is not necessary*/
>>>> +	if (rx_count == 1)
>>>> +		return;
>>>> +
>>>> +	for (queue = 0; queue < rx_count; queue++)
>>>> +		priv->hw->mac->rx_queue_enable(priv->hw, queue);
>>>> +}
>>>> +
>>>> +/**
>>>>   *  stmmac_dma_operation_mode - HW DMA operation mode
>>>>   *  @priv: driver private structure
>>>>   *  Description: it is used for configuring the DMA operation mode register in
>>>> @@ -1691,6 +1709,9 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>>>>  	/* Initialize the MAC Core */
>>>>  	priv->hw->mac->core_init(priv->hw, dev->mtu);
>>>>  
>>>> +	/* Initialize MAC RX Queues */
>>>> +	stmmac_mac_enable_rx_queues(priv);
>>>> +
>>>>  	ret = priv->hw->mac->rx_ipc(priv->hw);
>>>>  	if (!ret) {
>>>>  		netdev_warn(priv->dev, "RX IPC Checksum Offload disabled\n");
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox