* [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool
@ 2014-12-10 3:40 Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
` (7 more replies)
0 siblings, 8 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:40 UTC (permalink / raw)
To: netdev; +Cc: ast, davem, eric.dumazet, brouer
This patch series implements a means of allocating page fragments without
the need for the local_irq_save/restore in __netdev_alloc_frag. By doing
this I am able to decrease packet processing time by 11ns per packet in my
test environment.
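As a point of reference, the intended call site is a driver's NAPI poll
routine. A minimal sketch (the rx queue structure and function below are
made up for illustration, not taken from any of the drivers touched here):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-queue structure, for illustration only */
struct demo_rx_queue {
	struct napi_struct napi;
};

/* Copybreak allocation from NAPI poll (softirq) context; the skb head
 * comes from the per-cpu NAPI frag pool, so there is no
 * local_irq_save/restore on this path.
 */
static struct sk_buff *demo_copybreak_alloc(struct demo_rx_queue *rxq,
					    unsigned int len)
{
	return napi_alloc_skb(&rxq->napi, len);
}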
---
Alexander Duyck (6):
net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb
ethernet/intel: Use napi_alloc_skb
cxgb: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
ethernet/realtek: use napi_alloc_skb instead of netdev_alloc_skb_ip_align
ethernet/broadcom: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
drivers/net/ethernet/broadcom/b44.c | 2
drivers/net/ethernet/broadcom/bcm63xx_enet.c | 2
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2
drivers/net/ethernet/chelsio/cxgb/sge.c | 11 +
drivers/net/ethernet/intel/e1000/e1000_main.c | 2
drivers/net/ethernet/intel/e1000e/netdev.c | 2
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4
drivers/net/ethernet/intel/igb/igb_main.c | 3
drivers/net/ethernet/intel/ixgb/ixgb_main.c | 6 -
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4
drivers/net/ethernet/realtek/8139cp.c | 2
drivers/net/ethernet/realtek/8139too.c | 2
drivers/net/ethernet/realtek/r8169.c | 2
include/linux/skbuff.h | 11 +
net/core/dev.c | 2
net/core/skbuff.c | 191 +++++++++++++++++------
16 files changed, 178 insertions(+), 70 deletions(-)
--
^ permalink raw reply [flat|nested] 30+ messages in thread
* [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
@ 2014-12-10 3:40 ` Alexander Duyck
2014-12-10 4:16 ` Alexei Starovoitov
2014-12-10 16:02 ` Eric Dumazet
2014-12-10 3:40 ` [net-next PATCH 2/6] net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb Alexander Duyck
` (6 subsequent siblings)
7 siblings, 2 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:40 UTC (permalink / raw)
To: netdev; +Cc: ast, davem, eric.dumazet, brouer
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache. By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context. The advantage to this is that we do not need to call
local_irq_save/restore, which can be a significant savings.
I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag. First I updated the allocation to do either a 32K
allocation or an order 0 page. This is based on the changes in commit
d9b2938aa where it was found that latencies could be reduced in case of
failures. Then I also rewrote the logic to work from the end of the page to
the start. By doing this the size value doesn't have to be used unless we
have run out of space for page fragments. Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set. This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
include/linux/skbuff.h | 2 +
net/core/skbuff.c | 117 ++++++++++++++++++++++++++++++++----------------
2 files changed, 79 insertions(+), 40 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ef64cec..b2b53b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2164,6 +2164,8 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC);
}
+void *napi_alloc_frag(unsigned int fragsz);
+
/**
* __dev_alloc_pages - allocate page for network Rx
* @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a338fb..56ed17c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -336,59 +336,85 @@ struct netdev_alloc_cache {
unsigned int pagecnt_bias;
};
static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct netdev_alloc_cache, napi_alloc_cache);
-static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
+ gfp_t gfp_mask)
{
- struct netdev_alloc_cache *nc;
- void *data = NULL;
- int order;
- unsigned long flags;
+ const unsigned int order = NETDEV_FRAG_PAGE_MAX_ORDER;
+ struct page *page = NULL;
+ gfp_t gfp = gfp_mask;
+
+ if (order) {
+ gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
+ page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+ nc->frag.size = PAGE_SIZE << (page ? order : 0);
+ }
- local_irq_save(flags);
- nc = this_cpu_ptr(&netdev_alloc_cache);
- if (unlikely(!nc->frag.page)) {
+ if (unlikely(!page))
+ page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+ nc->frag.page = page;
+
+ return page;
+}
+
+static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache,
+ unsigned int fragsz, gfp_t gfp_mask)
+{
+ struct netdev_alloc_cache *nc = this_cpu_ptr(cache);
+ struct page *page = nc->frag.page;
+ unsigned int size;
+ int offset;
+
+ if (unlikely(!page)) {
refill:
- for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
- gfp_t gfp = gfp_mask;
+ page = __page_frag_refill(nc, gfp_mask);
+ if (!page)
+ return NULL;
+
+ /* if size can vary use frag.size else just use PAGE_SIZE */
+ size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
- if (order)
- gfp |= __GFP_COMP | __GFP_NOWARN;
- nc->frag.page = alloc_pages(gfp, order);
- if (likely(nc->frag.page))
- break;
- if (--order < 0)
- goto end;
- }
- nc->frag.size = PAGE_SIZE << order;
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
- atomic_add(NETDEV_PAGECNT_MAX_BIAS - 1,
- &nc->frag.page->_count);
- nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
- nc->frag.offset = 0;
+ atomic_add(size - 1, &page->_count);
+
+ /* reset page count bias and offset to start of new frag */
+ nc->pagecnt_bias = size;
+ nc->frag.offset = size;
}
- if (nc->frag.offset + fragsz > nc->frag.size) {
- if (atomic_read(&nc->frag.page->_count) != nc->pagecnt_bias) {
- if (!atomic_sub_and_test(nc->pagecnt_bias,
- &nc->frag.page->_count))
- goto refill;
- /* OK, page count is 0, we can safely set it */
- atomic_set(&nc->frag.page->_count,
- NETDEV_PAGECNT_MAX_BIAS);
- } else {
- atomic_add(NETDEV_PAGECNT_MAX_BIAS - nc->pagecnt_bias,
- &nc->frag.page->_count);
- }
- nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
- nc->frag.offset = 0;
+ offset = nc->frag.offset - fragsz;
+ if (unlikely(offset < 0)) {
+ if (!atomic_sub_and_test(nc->pagecnt_bias, &page->_count))
+ goto refill;
+
+ /* if size can vary use frag.size else just use PAGE_SIZE */
+ size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
+
+ /* OK, page count is 0, we can safely set it */
+ atomic_set(&page->_count, size);
+
+ /* reset page count bias and offset to start of new frag */
+ nc->pagecnt_bias = size;
+ offset = size - fragsz;
}
- data = page_address(nc->frag.page) + nc->frag.offset;
- nc->frag.offset += fragsz;
nc->pagecnt_bias--;
-end:
+ nc->frag.offset = offset;
+
+ return page_address(page) + offset;
+}
+
+static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+ unsigned long flags;
+ void *data;
+
+ local_irq_save(flags);
+ data = __alloc_page_frag(&netdev_alloc_cache, fragsz, gfp_mask);
local_irq_restore(flags);
return data;
}
@@ -406,6 +432,17 @@ void *netdev_alloc_frag(unsigned int fragsz)
}
EXPORT_SYMBOL(netdev_alloc_frag);
+static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+ return __alloc_page_frag(&napi_alloc_cache, fragsz, gfp_mask);
+}
+
+void *napi_alloc_frag(unsigned int fragsz)
+{
+ return __napi_alloc_frag(fragsz, GFP_ATOMIC | __GFP_COLD);
+}
+EXPORT_SYMBOL(napi_alloc_frag);
+
/**
* __netdev_alloc_skb - allocate an skbuff for rx on a specific device
* @dev: network device to receive on
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [net-next PATCH 2/6] net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
@ 2014-12-10 3:40 ` Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb Alexander Duyck
` (5 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:40 UTC (permalink / raw)
To: netdev; +Cc: ast, davem, eric.dumazet, brouer
This change pulls the core functionality out of __netdev_alloc_skb and
places it in a new function named __alloc_rx_skb. The reason for doing
this is to make these bits accessible to a new function __napi_alloc_skb.
In addition __alloc_rx_skb now has a new flags value that is used to
determine which page frag pool to allocate from. If the SKB_ALLOC_NAPI
flag is set then the NAPI pool is used. The advantage of this is that we
do not have to use local_irq_save/restore when accessing the NAPI pool from
NAPI context.
In my test setup I saw at least 11ns of savings using the napi_alloc_skb
function versus the netdev_alloc_skb function, most of this being due to
the fact that we didn't have to call local_irq_save/restore.
The main use case for napi_alloc_skb would be for things such as copybreak
or page fragment based receive paths where an skb is allocated after the
data has been received instead of before.
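To illustrate the second case, a page fragment based receive path would
look roughly like the following sketch (the function name, sizing and
the DMA step are made up for illustration, not taken from a real driver):

#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *demo_rx_frag(unsigned int len)
{
	unsigned int fragsz = SKB_DATA_ALIGN(NET_SKB_PAD + len) +
			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	struct sk_buff *skb;
	void *buf;

	/* softirq (NAPI) context only: uses the NAPI frag pool */
	buf = napi_alloc_frag(fragsz);
	if (!buf)
		return NULL;

	/* ... hardware/DMA fills buf + NET_SKB_PAD with len bytes ... */

	/* the skb is built only after the data has been received */
	skb = build_skb(buf, fragsz);
	if (!skb) {
		put_page(virt_to_head_page(buf));
		return NULL;
	}
	skb_reserve(skb, NET_SKB_PAD);
	skb_put(skb, len);

	return skb;
}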
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
include/linux/skbuff.h | 9 ++++++
net/core/dev.c | 2 +
net/core/skbuff.c | 74 +++++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 77 insertions(+), 8 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b2b53b0..af79302 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -151,6 +151,7 @@ struct net_device;
struct scatterlist;
struct pipe_inode_info;
struct iov_iter;
+struct napi_struct;
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack {
@@ -673,6 +674,7 @@ struct sk_buff {
#define SKB_ALLOC_FCLONE 0x01
#define SKB_ALLOC_RX 0x02
+#define SKB_ALLOC_NAPI 0x04
/* Returns true if the skb was allocated from PFMEMALLOC reserves */
static inline bool skb_pfmemalloc(const struct sk_buff *skb)
@@ -2165,6 +2167,13 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
}
void *napi_alloc_frag(unsigned int fragsz);
+struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
+ unsigned int length, gfp_t gfp_mask);
+static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
+ unsigned int length)
+{
+ return __napi_alloc_skb(napi, length, GFP_ATOMIC);
+}
/**
* __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 3f191da..80f798d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4172,7 +4172,7 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
struct sk_buff *skb = napi->skb;
if (!skb) {
- skb = netdev_alloc_skb_ip_align(napi->dev, GRO_MAX_HEAD);
+ skb = napi_alloc_skb(napi, GRO_MAX_HEAD);
napi->skb = skb;
}
return skb;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56ed17c..ae13ef6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -444,10 +444,13 @@ void *napi_alloc_frag(unsigned int fragsz)
EXPORT_SYMBOL(napi_alloc_frag);
/**
- * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
- * @dev: network device to receive on
+ * __alloc_rx_skb - allocate an skbuff for rx
* @length: length to allocate
* @gfp_mask: get_free_pages mask, passed to alloc_skb
+ * @flags: If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ * allocations in case we have to fallback to __alloc_skb()
+ * If SKB_ALLOC_NAPI is set, page fragment will be allocated
+ * from napi_cache instead of netdev_cache.
*
* Allocate a new &sk_buff and assign it a usage count of one. The
* buffer has unspecified headroom built in. Users should allocate
@@ -456,11 +459,11 @@ EXPORT_SYMBOL(napi_alloc_frag);
*
* %NULL is returned if there is no free memory.
*/
-struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
- unsigned int length, gfp_t gfp_mask)
+static struct sk_buff *__alloc_rx_skb(unsigned int length, gfp_t gfp_mask,
+ int flags)
{
struct sk_buff *skb = NULL;
- unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
+ unsigned int fragsz = SKB_DATA_ALIGN(length) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
if (fragsz <= PAGE_SIZE && !(gfp_mask & (__GFP_WAIT | GFP_DMA))) {
@@ -469,7 +472,9 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
if (sk_memalloc_socks())
gfp_mask |= __GFP_MEMALLOC;
- data = __netdev_alloc_frag(fragsz, gfp_mask);
+ data = (flags & SKB_ALLOC_NAPI) ?
+ __napi_alloc_frag(fragsz, gfp_mask) :
+ __netdev_alloc_frag(fragsz, gfp_mask);
if (likely(data)) {
skb = build_skb(data, fragsz);
@@ -477,17 +482,72 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
put_page(virt_to_head_page(data));
}
} else {
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ skb = __alloc_skb(length, gfp_mask,
SKB_ALLOC_RX, NUMA_NO_NODE);
}
+ return skb;
+}
+
+/**
+ * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
+ * @dev: network device to receive on
+ * @length: length to allocate
+ * @gfp_mask: get_free_pages mask, passed to alloc_skb
+ *
+ * Allocate a new &sk_buff and assign it a usage count of one. The
+ * buffer has NET_SKB_PAD headroom built in. Users should allocate
+ * the headroom they think they need without accounting for the
+ * built in space. The built in space is used for optimisations.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+ unsigned int length, gfp_t gfp_mask)
+{
+ struct sk_buff *skb;
+
+ length += NET_SKB_PAD;
+ skb = __alloc_rx_skb(length, gfp_mask, 0);
+
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
}
+
return skb;
}
EXPORT_SYMBOL(__netdev_alloc_skb);
+/**
+ * __napi_alloc_skb - allocate skbuff for rx in a specific NAPI instance
+ * @napi: napi instance this buffer was allocated for
+ * @length: length to allocate
+ * @gfp_mask: get_free_pages mask, passed to alloc_skb and alloc_pages
+ *
+ * Allocate a new sk_buff for use in NAPI receive. This buffer will
+ * attempt to allocate the head from a special reserved region used
+ * only for NAPI Rx allocation. By doing this we can save several
+ * CPU cycles by avoiding having to disable and re-enable IRQs.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
+ unsigned int length, gfp_t gfp_mask)
+{
+ struct sk_buff *skb;
+
+ length += NET_SKB_PAD + NET_IP_ALIGN;
+ skb = __alloc_rx_skb(length, gfp_mask, SKB_ALLOC_NAPI);
+
+ if (likely(skb)) {
+ skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
+ skb->dev = napi->dev;
+ }
+
+ return skb;
+}
+EXPORT_SYMBOL(__napi_alloc_skb);
+
void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size, unsigned int truesize)
{
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 2/6] net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb Alexander Duyck
@ 2014-12-10 3:40 ` Alexander Duyck
2014-12-11 21:43 ` Jeff Kirsher
2014-12-10 3:41 ` [net-next PATCH 4/6] cxgb: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align Alexander Duyck
` (4 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:40 UTC (permalink / raw)
To: netdev; +Cc: brouer, ast, Jeff Kirsher, eric.dumazet, davem
This change replaces calls to netdev_alloc_skb_ip_align with
napi_alloc_skb. The advantage of napi_alloc_skb is currently that
the page allocation doesn't make use of any irq disable calls.
There are a few spots where I couldn't replace the calls, as the buffer
allocation routine is called as part of init, which is outside of the
softirq context.
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
drivers/net/ethernet/intel/e1000/e1000_main.c | 2 +-
drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
drivers/net/ethernet/intel/igb/igb_main.c | 3 +--
drivers/net/ethernet/intel/ixgb/ixgb_main.c | 6 +++---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
6 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 862d198..83140cb 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4100,7 +4100,7 @@ static bool e1000_tbi_should_accept(struct e1000_adapter *adapter,
static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
unsigned int bufsz)
{
- struct sk_buff *skb = netdev_alloc_skb_ip_align(adapter->netdev, bufsz);
+ struct sk_buff *skb = napi_alloc_skb(&adapter->napi, bufsz);
if (unlikely(!skb))
adapter->alloc_rx_buff_failed++;
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 88936aa..5c82c80 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1016,7 +1016,7 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
*/
if (length < copybreak) {
struct sk_buff *new_skb =
- netdev_alloc_skb_ip_align(netdev, length);
+ napi_alloc_skb(&adapter->napi, length);
if (new_skb) {
skb_copy_to_linear_data_offset(new_skb,
-NET_IP_ALIGN,
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index f0d9d6c..3418665 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -308,8 +308,8 @@ static struct sk_buff *fm10k_fetch_rx_buffer(struct fm10k_ring *rx_ring,
#endif
/* allocate a skb to store the frags */
- skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
- FM10K_RX_HDR_LEN);
+ skb = napi_alloc_skb(&rx_ring->q_vector->napi,
+ FM10K_RX_HDR_LEN);
if (unlikely(!skb)) {
rx_ring->rx_stats.alloc_failed++;
return NULL;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index bfd60577..675a330 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6641,8 +6641,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
#endif
/* allocate a skb to store the frags */
- skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
- IGB_RX_HDR_LEN);
+ skb = napi_alloc_skb(&rx_ring->q_vector->napi, IGB_RX_HDR_LEN);
if (unlikely(!skb)) {
rx_ring->rx_stats.alloc_failed++;
return NULL;
diff --git a/drivers/net/ethernet/intel/ixgb/ixgb_main.c b/drivers/net/ethernet/intel/ixgb/ixgb_main.c
index 055961b..aa87605 100644
--- a/drivers/net/ethernet/intel/ixgb/ixgb_main.c
+++ b/drivers/net/ethernet/intel/ixgb/ixgb_main.c
@@ -1963,7 +1963,7 @@ ixgb_rx_checksum(struct ixgb_adapter *adapter,
* this should improve performance for small packets with large amounts
* of reassembly being done in the stack
*/
-static void ixgb_check_copybreak(struct net_device *netdev,
+static void ixgb_check_copybreak(struct napi_struct *napi,
struct ixgb_buffer *buffer_info,
u32 length, struct sk_buff **skb)
{
@@ -1972,7 +1972,7 @@ static void ixgb_check_copybreak(struct net_device *netdev,
if (length > copybreak)
return;
- new_skb = netdev_alloc_skb_ip_align(netdev, length);
+ new_skb = napi_alloc_skb(napi, length);
if (!new_skb)
return;
@@ -2064,7 +2064,7 @@ ixgb_clean_rx_irq(struct ixgb_adapter *adapter, int *work_done, int work_to_do)
goto rxdesc_done;
}
- ixgb_check_copybreak(netdev, buffer_info, length, &skb);
+ ixgb_check_copybreak(&adapter->napi, buffer_info, length, &skb);
/* Good Receive */
skb_put(skb, length);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 94feddf..d75d01b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1909,8 +1909,8 @@ static struct sk_buff *ixgbe_fetch_rx_buffer(struct ixgbe_ring *rx_ring,
#endif
/* allocate a skb to store the frags */
- skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
- IXGBE_RX_HDR_SIZE);
+ skb = napi_alloc_skb(&rx_ring->q_vector->napi,
+ IXGBE_RX_HDR_SIZE);
if (unlikely(!skb)) {
rx_ring->rx_stats.alloc_rx_buff_failed++;
return NULL;
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [net-next PATCH 4/6] cxgb: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
` (2 preceding siblings ...)
2014-12-10 3:40 ` [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb Alexander Duyck
@ 2014-12-10 3:41 ` Alexander Duyck
2014-12-10 3:41 ` [net-next PATCH 5/6] ethernet/realtek: use " Alexander Duyck
` (3 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:41 UTC (permalink / raw)
To: netdev; +Cc: ast, davem, eric.dumazet, brouer
In order to use napi_alloc_skb I needed to pass a pointer to struct adapter
instead of struct pci_dev. This allowed me to access &adapter->napi.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
drivers/net/ethernet/chelsio/cxgb/sge.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 86222a1..babe2a9 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -1025,7 +1025,7 @@ MODULE_PARM_DESC(copybreak, "Receive copy threshold");
/**
* get_packet - return the next ingress packet buffer
- * @pdev: the PCI device that received the packet
+ * @adapter: the adapter that received the packet
* @fl: the SGE free list holding the packet
* @len: the actual packet length, excluding any SGE padding
*
@@ -1037,14 +1037,15 @@ MODULE_PARM_DESC(copybreak, "Receive copy threshold");
* threshold and the packet is too big to copy, or (b) the packet should
* be copied but there is no memory for the copy.
*/
-static inline struct sk_buff *get_packet(struct pci_dev *pdev,
+static inline struct sk_buff *get_packet(struct adapter *adapter,
struct freelQ *fl, unsigned int len)
{
- struct sk_buff *skb;
const struct freelQ_ce *ce = &fl->centries[fl->cidx];
+ struct pci_dev *pdev = adapter->pdev;
+ struct sk_buff *skb;
if (len < copybreak) {
- skb = netdev_alloc_skb_ip_align(NULL, len);
+ skb = napi_alloc_skb(&adapter->napi, len);
if (!skb)
goto use_orig_buf;
@@ -1357,7 +1358,7 @@ static void sge_rx(struct sge *sge, struct freelQ *fl, unsigned int len)
struct sge_port_stats *st;
struct net_device *dev;
- skb = get_packet(adapter->pdev, fl, len - sge->rx_pkt_pad);
+ skb = get_packet(adapter, fl, len - sge->rx_pkt_pad);
if (unlikely(!skb)) {
sge->stats.rx_drops++;
return;
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [net-next PATCH 5/6] ethernet/realtek: use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
` (3 preceding siblings ...)
2014-12-10 3:41 ` [net-next PATCH 4/6] cxgb: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align Alexander Duyck
@ 2014-12-10 3:41 ` Alexander Duyck
2014-12-10 3:41 ` [net-next PATCH 6/6] ethernet/broadcom: Use " Alexander Duyck
` (2 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:41 UTC (permalink / raw)
To: netdev; +Cc: Realtek linux nic maintainers, ast, davem, eric.dumazet, brouer
This replaces most of the calls to netdev_alloc_skb_ip_align in the Realtek
drivers. The one instance I didn't replace, in 8139cp.c, is called as
part of init and as such is not always accessed from softirq
context.
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
drivers/net/ethernet/realtek/8139cp.c | 2 +-
drivers/net/ethernet/realtek/8139too.c | 2 +-
drivers/net/ethernet/realtek/r8169.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 75b1693..9c31e46 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -507,7 +507,7 @@ rx_status_loop:
netif_dbg(cp, rx_status, dev, "rx slot %d status 0x%x len %d\n",
rx_tail, status, len);
- new_skb = netdev_alloc_skb_ip_align(dev, buflen);
+ new_skb = napi_alloc_skb(napi, buflen);
if (!new_skb) {
dev->stats.rx_dropped++;
goto rx_next;
diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
index 63dc0f9..6d0b9df 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2037,7 +2037,7 @@ keep_pkt:
/* Malloc up new buffer, compatible with net-2e. */
/* Omit the four octet CRC from the length. */
- skb = netdev_alloc_skb_ip_align(dev, pkt_size);
+ skb = napi_alloc_skb(&tp->napi, pkt_size);
if (likely(skb)) {
#if RX_BUF_IDX == 3
wrap_copy(skb, rx_ring, ring_offset+4, pkt_size);
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1620184..54a3ec1 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7269,7 +7269,7 @@ static struct sk_buff *rtl8169_try_rx_copy(void *data,
data = rtl8169_align(data);
dma_sync_single_for_cpu(d, addr, pkt_size, DMA_FROM_DEVICE);
prefetch(data);
- skb = netdev_alloc_skb_ip_align(tp->dev, pkt_size);
+ skb = napi_alloc_skb(&tp->napi, pkt_size);
if (skb)
memcpy(skb->data, data, pkt_size);
dma_sync_single_for_device(d, addr, pkt_size, DMA_FROM_DEVICE);
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [net-next PATCH 6/6] ethernet/broadcom: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
` (4 preceding siblings ...)
2014-12-10 3:41 ` [net-next PATCH 5/6] ethernet/realtek: use " Alexander Duyck
@ 2014-12-10 3:41 ` Alexander Duyck
2014-12-10 9:50 ` David Laight
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
2014-12-10 18:32 ` [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool David Miller
7 siblings, 1 reply; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 3:41 UTC (permalink / raw)
To: netdev
Cc: Florian Fainelli, eric.dumazet, Ariel Elior, brouer,
Gary Zambrano, davem, ast
This patch replaces the calls to netdev_alloc_skb_ip_align in the
copybreak paths.
Cc: Gary Zambrano <zambrano@broadcom.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
drivers/net/ethernet/broadcom/b44.c | 2 +-
drivers/net/ethernet/broadcom/bcm63xx_enet.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c
index ffeaf47..d86d6ba 100644
--- a/drivers/net/ethernet/broadcom/b44.c
+++ b/drivers/net/ethernet/broadcom/b44.c
@@ -836,7 +836,7 @@ static int b44_rx(struct b44 *bp, int budget)
struct sk_buff *copy_skb;
b44_recycle_rx(bp, cons, bp->rx_prod);
- copy_skb = netdev_alloc_skb_ip_align(bp->dev, len);
+ copy_skb = napi_alloc_skb(&bp->napi, len);
if (copy_skb == NULL)
goto drop_it_no_recycle;
diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 3e8d1a8..21206d3 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -385,7 +385,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
if (len < copybreak) {
struct sk_buff *nskb;
- nskb = netdev_alloc_skb_ip_align(dev, len);
+ nskb = napi_alloc_skb(&priv->napi, len);
if (!nskb) {
/* forget packet, just rearm desc */
dev->stats.rx_dropped++;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index b4d71fd..1d1147c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1015,7 +1015,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
*/
if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
(len <= RX_COPY_THRESH)) {
- skb = netdev_alloc_skb_ip_align(bp->dev, len);
+ skb = napi_alloc_skb(&fp->napi, len);
if (skb == NULL) {
DP(NETIF_MSG_RX_ERR | NETIF_MSG_RX_STATUS,
"ERROR packet dropped because of alloc failure\n");
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
@ 2014-12-10 4:16 ` Alexei Starovoitov
2014-12-10 15:21 ` Alexander Duyck
2014-12-10 16:02 ` Eric Dumazet
1 sibling, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2014-12-10 4:16 UTC (permalink / raw)
To: Alexander Duyck
Cc: Network Development, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer
On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
> This patch splits the netdev_alloc_frag function up so that it can be used
> on one of two page frag pools instead of being fixed on the
> netdev_alloc_cache. By doing this we can add a NAPI specific function
> __napi_alloc_frag that accesses a pool that is only used from softirq
> context. The advantage to this is that we do not need to call
> local_irq_save/restore which can be a significant savings.
>
> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag. First I updated the allocation to do either a 32K
> allocation or an order 0 page. This is based on the changes in commmit
> d9b2938aa where it was found that latencies could be reduced in case of
thanks for explaining that piece of it.
> + struct page *page = NULL;
> + gfp_t gfp = gfp_mask;
> +
> + if (order) {
> + gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
> + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> + nc->frag.size = PAGE_SIZE << (page ? order : 0);
> + }
>
> - local_irq_save(flags);
> - nc = this_cpu_ptr(&netdev_alloc_cache);
> - if (unlikely(!nc->frag.page)) {
> + if (unlikely(!page))
> + page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
I'm guessing you're not combining this 'if' with the one above to
keep gfp untouched, so there is a 'warn' when it actually fails the 2nd time.
Tricky :)
Anyway looks good to me and I think I understand it enough to say:
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [net-next PATCH 6/6] ethernet/broadcom: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 3:41 ` [net-next PATCH 6/6] ethernet/broadcom: Use " Alexander Duyck
@ 2014-12-10 9:50 ` David Laight
2014-12-10 9:52 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2014-12-10 9:50 UTC (permalink / raw)
To: 'Alexander Duyck', netdev@vger.kernel.org
Cc: Florian Fainelli, eric.dumazet@gmail.com, Ariel Elior,
brouer@redhat.com, Gary Zambrano, davem@davemloft.net,
ast@plumgrid.com
From: Alexander Duyck
> This patch replaces the calls to netdev_alloc_skb_ip_align in the
> copybreak paths.
Why?
You still want the IP header to be aligned and you also want the
memcpy() to be copying aligned data.
I suspect this fails on both counts?
David
> Cc: Gary Zambrano <zambrano@broadcom.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Ariel Elior <ariel.elior@qlogic.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> ---
> drivers/net/ethernet/broadcom/b44.c | 2 +-
> drivers/net/ethernet/broadcom/bcm63xx_enet.c | 2 +-
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
> 3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c
> index ffeaf47..d86d6ba 100644
> --- a/drivers/net/ethernet/broadcom/b44.c
> +++ b/drivers/net/ethernet/broadcom/b44.c
> @@ -836,7 +836,7 @@ static int b44_rx(struct b44 *bp, int budget)
> struct sk_buff *copy_skb;
>
> b44_recycle_rx(bp, cons, bp->rx_prod);
> - copy_skb = netdev_alloc_skb_ip_align(bp->dev, len);
> + copy_skb = napi_alloc_skb(&bp->napi, len);
> if (copy_skb == NULL)
> goto drop_it_no_recycle;
>
> diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> index 3e8d1a8..21206d3 100644
> --- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> +++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> @@ -385,7 +385,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
> if (len < copybreak) {
> struct sk_buff *nskb;
>
> - nskb = netdev_alloc_skb_ip_align(dev, len);
> + nskb = napi_alloc_skb(&priv->napi, len);
> if (!nskb) {
> /* forget packet, just rearm desc */
> dev->stats.rx_dropped++;
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index b4d71fd..1d1147c 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -1015,7 +1015,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
> */
> if ((bp->dev->mtu > ETH_MAX_PACKET_SIZE) &&
> (len <= RX_COPY_THRESH)) {
> - skb = netdev_alloc_skb_ip_align(bp->dev, len);
> + skb = napi_alloc_skb(&fp->napi, len);
> if (skb == NULL) {
> DP(NETIF_MSG_RX_ERR | NETIF_MSG_RX_STATUS,
> "ERROR packet dropped because of alloc failure\n");
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [net-next PATCH 6/6] ethernet/broadcom: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 9:50 ` David Laight
@ 2014-12-10 9:52 ` David Laight
2014-12-10 15:16 ` Alexander Duyck
0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2014-12-10 9:52 UTC (permalink / raw)
To: David Laight, 'Alexander Duyck', netdev@vger.kernel.org
Cc: Florian Fainelli, eric.dumazet@gmail.com, Ariel Elior,
brouer@redhat.com, Gary Zambrano, davem@davemloft.net,
ast@plumgrid.com
From: David Laight
> From: Alexander Duyck
> > This patch replaces the calls to netdev_alloc_skb_ip_align in the
> > copybreak paths.
>
> Why?
>
> You still want the IP header to be aligned and you also want the
> memcpy() to be copying aligned data.
>
> I suspect this fails on both counts?
Or am I confused by the naming?
David
> David
>
> > Cc: Gary Zambrano <zambrano@broadcom.com>
> > Cc: Florian Fainelli <f.fainelli@gmail.com>
> > Cc: Ariel Elior <ariel.elior@qlogic.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> > ---
> > drivers/net/ethernet/broadcom/b44.c | 2 +-
> > drivers/net/ethernet/broadcom/bcm63xx_enet.c | 2 +-
> > drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
> > 3 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c
> > index ffeaf47..d86d6ba 100644
> > --- a/drivers/net/ethernet/broadcom/b44.c
> > +++ b/drivers/net/ethernet/broadcom/b44.c
> > @@ -836,7 +836,7 @@ static int b44_rx(struct b44 *bp, int budget)
> > struct sk_buff *copy_skb;
> >
> > b44_recycle_rx(bp, cons, bp->rx_prod);
> > - copy_skb = netdev_alloc_skb_ip_align(bp->dev, len);
> > + copy_skb = napi_alloc_skb(&bp->napi, len);
> > if (copy_skb == NULL)
> > goto drop_it_no_recycle;
...
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
` (5 preceding siblings ...)
2014-12-10 3:41 ` [net-next PATCH 6/6] ethernet/broadcom: Use " Alexander Duyck
@ 2014-12-10 14:15 ` Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue Jesper Dangaard Brouer
` (5 more replies)
2014-12-10 18:32 ` [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool David Miller
7 siblings, 6 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
Christoph Lameter
Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
Mathieu Desnoyers, Steven Rostedt
The network stack has some use-cases that put extreme demands
on the memory allocator. One use-case, 10Gbit/s wirespeed at the smallest
packet size[1], requires handling a packet every 67.2 ns (nanosec).
Micro benchmarking[2] the SLUB allocator (with 256-byte skb-sized
elements) shows that "fast-path" instant reuse only costs 19 ns, but a
pattern closer to network usage shows the cost rising to 45 ns.
This patchset introduces a quick mempool (qmempool), which when used
in front of the SKB (sk_buff) kmem_cache, saves 12 ns on a "fast-path"
drop in the iptables "raw" table, but more importantly saves 40 ns with
IP-forwarding, which was hitting the slower SLUB use-case.
One of the building blocks for achieving this speedup is a cmpxchg
based Lock-Free queue that supports bulking, named alf_queue for
Array-based Lock-Free queue. By bulking elements (pointers) from the
queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
elements.
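For reference, a minimal usage sketch of the alf_queue API from patch 1
(illustrative only; the queue size, bulk size and error handling are
arbitrary, and the preemption caveats documented in the header still apply):

#include <linux/alf_queue.h>
#include <linux/err.h>
#include <linux/slab.h>

static int demo_alf_bulk(void)
{
	struct alf_queue *q = alf_queue_alloc(1024, GFP_KERNEL);
	void *objs[16] = { NULL };	/* pointers to elements */
	void *out[16];
	int n;

	if (IS_ERR(q))
		return PTR_ERR(q);

	/* A single locked cmpxchg reserves room for all 16 elements */
	if (alf_mp_enqueue(q, objs, 16) != 16)
		pr_debug("queue full, caller keeps ownership of objs\n");

	/* Likewise, one cmpxchg claims up to 16 elements */
	n = alf_mc_dequeue(q, out, 16);

	alf_queue_free(q);
	return n;
}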
Patch1: alf_queue (Lock-Free queue)
Patch2: qmempool using alf_queue
Patch3: usage of qmempool for SKB caching
Notice, this patchset depends on the introduction of napi_alloc_skb(),
which is part of Alexander Duyck's patchset [3].
Different correctness tests and micro benchmarks are available via my
github repo "prototype-kernel"[4], where the alf_queue and qmempool are
also kept in sync with this patchset.
Links:
[1]: http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
[2]: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c
[3]: http://thread.gmane.org/gmane.linux.network/342347
[4]: https://github.com/netoptimizer/prototype-kernel
---
Jesper Dangaard Brouer (3):
net: use qmempool in-front of sk_buff kmem_cache
mm: qmempool - quick queue based memory pool
lib: adding an Array-based Lock-Free (ALF) queue
include/linux/alf_queue.h | 303 ++++++++++++++++++++++++++++++++++++++++++
include/linux/qmempool.h | 205 +++++++++++++++++++++++++++++
include/linux/skbuff.h | 4 -
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
mm/Kconfig | 12 ++
mm/Makefile | 1
mm/qmempool.c | 322 +++++++++++++++++++++++++++++++++++++++++++++
net/core/dev.c | 5 +
net/core/skbuff.c | 43 +++++-
11 files changed, 950 insertions(+), 7 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 include/linux/qmempool.h
create mode 100644 lib/alf_queue.c
create mode 100644 mm/qmempool.c
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
@ 2014-12-10 14:15 ` Jesper Dangaard Brouer
2014-12-11 19:15 ` David Miller
2014-12-10 14:15 ` [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool Jesper Dangaard Brouer
` (4 subsequent siblings)
5 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
Christoph Lameter
Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
Mathieu Desnoyers, Steven Rostedt
This Array-based Lock-Free (ALF) queue is a very fast bounded
Producer-Consumer queue, supporting bulking. The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amortized by utilizing bulk enqueue/dequeue.
Results on x86_64 CPU E5-2695, for variants:
MPMC = Multi-Producer-Multi-Consumer
SPSC = Single-Producer-Single-Consumer
(non-bulking): per-element cost for MPMC and SPSC
MPMC -- SPSC
simple : 9.519 ns -- 1.282 ns
multi(step:128): 12.905 ns -- 2.240 ns
The majority of the cost is associated with the locked cmpxchg in the
MPMC variant. Bulking helps amortize this cost:
(bulking) cost per element comparing MPMC -> SPSC:
MPMC -- SPSC
bulk2 : 5.849 ns -- 1.748 ns
bulk3 : 4.102 ns -- 1.531 ns
bulk4 : 3.281 ns -- 1.383 ns
bulk6 : 2.530 ns -- 1.238 ns
bulk8 : 2.125 ns -- 1.196 ns
bulk16: 1.552 ns -- 1.109 ns
Joint work with Hannes Frederic Sowa.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
Correctness of memory barriers on different archs needs to be evaluated.
The API might need some adjustments/discussions regarding:
1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.
2) some better way to detect misuse of the API, e.g. using a
single-enqueue variant function call on a multi-enqueue variant data
structure. Currently no detection happens.
include/linux/alf_queue.h | 303 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
4 files changed, 365 insertions(+), 0 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 lib/alf_queue.c
diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ * - Array based for cache-line optimization
+ * - Bounded by the array size
+ * - FIFO Producer/Consumer queue, no queue traversal supported
+ * - Very fast
+ * - Designed as a queue for pointers to objects
+ * - Bulk enqueue and dequeue support
+ * - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+ u32 head;
+ u32 tail;
+};
+
+struct alf_queue {
+ u32 size;
+ u32 mask;
+ u32 flags;
+ struct alf_actor producer ____cacheline_aligned_in_smp;
+ struct alf_actor consumer ____cacheline_aligned_in_smp;
+ void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ * 1. They can be reused for both "Single" and "Multi" variants
+ * 2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+ void **ptr, const u32 n)
+{
+ int i, index = p_head;
+
+ for (i = 0; i < n; i++, index++)
+ q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+ void **ptr, const u32 elems)
+{
+ int i, index = c_head;
+
+ for (i = 0; i < elems; i++, index++)
+ ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even though the current API has "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size, users of this API should check
+ * that the returned number of enqueued elements matches, to verify the
+ * enqueue was successful. This allows us to introduce a "variable"
+ * enqueue scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ do {
+ p_head = ACCESS_ONCE(q->producer.head);
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ }
+ while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Wait for other concurrent preceding enqueues not yet done,
+ * this part makes us non-wait-free and could be problematic
+ * in case of congestion with many CPUs
+ */
+ while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+ cpu_relax();
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ do {
+ c_head = ACCESS_ONCE(q->consumer.head);
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ }
+ while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+ /* LOAD the elems from the queue array.
+ * We don't need a smb_rmb() Read-Memory-Barrier here because
+ * the above cmpxchg is an implied full Memory-Barrier.
+ */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough. Paired with enqueue
+ * implicit full-MB in cmpxchg.
+ */
+ smp_wmb();
+
+ /* Wait for other concurrent preceding dequeues not yet done */
+ while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+ cpu_relax();
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x) \
+ do { \
+ if (unlikely(!(x))) { \
+ pr_crit("Assertion failed %s:%d: \"%s\"\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ } \
+ } while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ p_head = q->producer.head;
+ smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+ q->producer.head = p_next;
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ c_head = q->consumer.head;
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+ q->consumer.head = c_next;
+
+ smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough.
+ */
+ smp_wmb();
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+ /* The empty (and initial) state is when the consumer has caught
+ * up with the producer.
+ *
+ * DOUBLE-CHECK: Should we use producer.head, as this indicates
+ * a producer is in-progress(?)
+ */
+ return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+ u32 c_head = ACCESS_ONCE(q->consumer.head);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+ u32 elems;
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ elems = p_tail - c_head;
+ /* Thus, same as:
+ * elems = (p_tail - c_head) & q->mask;
+ */
+ return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+ u32 p_head = ACCESS_ONCE(q->producer.head);
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 space;
+
+ /* The max avail space is q->size and
+ * the empty state is when (consumer == producer)
+ */
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ space = q->size + c_tail - p_head;
+ /* Thus, same as:
+ * space = (q->size + c_tail - p_head) & q->mask;
+ */
+ return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
bool
#
+# ALF queue
+#
+config ALF_QUEUE
+ bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+ default y
+ help
+ This Array-based Lock-Free (ALF) queue, is a very fast
+ bounded Producer-Consumer queue, supporting bulking. The
+ MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+ cmpxchg, but the cost can be amortized by utilizing bulk
+ enqueue/dequeue.
+
+#
# Generic 64-bit atomic support is selected if needed
#
config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
obj-$(CONFIG_NLATTR) += nlattr.o
+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
obj-$(CONFIG_LRU_CACHE) += lru_cache.o
obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ * - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+ struct alf_queue *q;
+ size_t mem_size;
+
+ if (!(is_power_of_2(size)) || size > 65536)
+ return ERR_PTR(-EINVAL);
+
+ /* The ring array is allocated together with the queue struct */
+ mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+ q = kzalloc(mem_size, gfp);
+ if (!q)
+ return ERR_PTR(-ENOMEM);
+
+ q->size = size;
+ q->mask = size - 1;
+
+ return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+ kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");
--
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue Jesper Dangaard Brouer
@ 2014-12-10 14:15 ` Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache Jesper Dangaard Brouer
` (3 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
Christoph Lameter
Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
Mathieu Desnoyers, Steven Rostedt
A quick queue-based memory pool that functions as a cache in front
of the kmem_cache SLAB/SLUB allocators, which allows faster than
SLAB/SLUB reuse/caching of fixed-size memory elements.
The speed gain comes from the shared storage using a Lock-Free
queue that supports bulk refilling of elements (to a percpu cache)
with a single cmpxchg. Thus, the (lock-prefixed) cmpxchg cost is
amortized over the bulk size.
Qmempool cannot easily replace all kmem_cache usage, because it is
restricted in which contexts it can be used, as the Lock-Free
queue is not preemption safe. E.g. it only supports GFP_ATOMIC
allocations from SLAB.
This version is optimized for usage from softirq context, and cannot
be used from hardirq context. Usage from non-softirq context requires
local_bh_{disable,enable}, which have a fairly high cost.
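For reference, setting up a pool with the qmempool_create() prototype
below looks roughly like this (the queue sizes and the backing
kmem_cache are arbitrary, and the alloc/free helpers are left out of
this sketch):

#include <linux/qmempool.h>
#include <linux/slab.h>

/* localq of 32 elems per CPU, sharedq of 1024 elems, prealloc of 64,
 * backed by an existing kmem_cache; only GFP_ATOMIC is supported for
 * the pool's own allocations from SLAB.
 */
static struct qmempool *demo_pool_setup(struct kmem_cache *kmem)
{
	return qmempool_create(32, 1024, 64, kmem, GFP_ATOMIC);
}

static void demo_pool_teardown(struct qmempool *pool)
{
	qmempool_destroy(pool);
}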
Performance micro benchmarks against SLUB. The first test is fast-path
reuse of the same element. The second test allocates 256 elements before
freeing them again; this pattern comes from how NIC ring queue
cleanups often run.
On CPU E5-2695, CONFIG_PREEMPT=y, showing cost of alloc+free:
SLUB - softirq - non-softirq
fastpath-reuse: 19.563 ns - 7.837 ns - 18.536 ns
N(256)-pattern: 45.039 ns - 11.782 ns - 24.186 ns
A significant win for usage from softirq, and a smaller win for
non-softirq, which requires taking local_bh_{disable,enable}.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/linux/qmempool.h | 205 +++++++++++++++++++++++++++++
mm/Kconfig | 12 ++
mm/Makefile | 1
mm/qmempool.c | 322 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 540 insertions(+), 0 deletions(-)
create mode 100644 include/linux/qmempool.h
create mode 100644 mm/qmempool.c
diff --git a/include/linux/qmempool.h b/include/linux/qmempool.h
new file mode 100644
index 0000000..922ed27
--- /dev/null
+++ b/include/linux/qmempool.h
@@ -0,0 +1,205 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * A quick queue-based memory pool that functions as a cache in front
+ * of the kmem_cache SLAB/SLUB allocators, allowing faster reuse/caching
+ * of fixed-size memory elements than SLAB/SLUB.
+ *
+ * The speed gain comes from the shared storage using a Lock-Free
+ * queue that supports refilling a bulk of elements (into a percpu
+ * cache) with a single cmpxchg. Thus, the lock-prefixed cmpxchg cost
+ * is amortized over the bulk size.
+ *
+ * The Lock-Free queue is based on an array (of pointers to elements).
+ * This makes access more cache optimal, as e.g. on 64bit 8 pointers
+ * can be stored per cache-line (which is superior to a linked list
+ * approach). Only storing the pointers to elements is also
+ * beneficial as we don't touch the elements' data.
+ *
+ * Qmempool cannot easily replace all kmem_cache usage, because it is
+ * restricted in which contexts it can be used, as the Lock-Free
+ * queue is not preemption safe. This version is optimized for usage
+ * from softirq context, and cannot be used from hardirq context.
+ *
+ * Only supports GFP_ATOMIC allocations from SLAB.
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ * for licensing details see kernel-base/COPYING
+ */
+
+#ifndef _LINUX_QMEMPOOL_H
+#define _LINUX_QMEMPOOL_H
+
+#include <linux/alf_queue.h>
+#include <linux/prefetch.h>
+#include <linux/hardirq.h>
+
+/* Bulking is an essential part of the performance gains as this
+ * amortizes the cost of the cmpxchg ops used when accessing sharedq
+ */
+#define QMEMPOOL_BULK 16
+#define QMEMPOOL_REFILL_MULTIPLIER 2
+
+struct qmempool_percpu {
+ struct alf_queue *localq;
+};
+
+struct qmempool {
+ /* The shared queue (sharedq) is a Multi-Producer-Multi-Consumer
+ * queue where access is protected by an atomic cmpxchg operation.
+ * The queue supports bulk transfers, which amortize the cost
+ * of the atomic cmpxchg operation.
+ */
+ struct alf_queue *sharedq;
+
+ /* Per CPU local "cache" queues for faster atomic free access.
+ * The local queues (localq) are Single-Producer-Single-Consumer
+ * queues as they are per CPU.
+ */
+ struct qmempool_percpu __percpu *percpu;
+
+ /* Backed by some SLAB kmem_cache */
+ struct kmem_cache *kmem;
+
+ /* Setup */
+ uint32_t prealloc;
+ gfp_t gfp_mask;
+};
+
+extern void qmempool_destroy(struct qmempool *pool);
+extern struct qmempool *qmempool_create(
+ uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+ struct kmem_cache *kmem, gfp_t gfp_mask);
+
+extern void *__qmempool_alloc_from_sharedq(
+ struct qmempool *pool, gfp_t gfp_mask, struct alf_queue *localq);
+extern void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+ struct alf_queue *localq);
+
+/* The percpu variables (SPSC queues) need preempt protection, and
+ * the shared MPMC queue also needs protection against the same CPU
+ * accessing the same queue.
+ *
+ * Specialize and optimize the qmempool to run from softirq.
+ * Don't allow qmempool to be used from interrupt context.
+ *
+ * IDEA: When used from softirq, take advantage of the protection
+ * softirq gives. A softirq will never preempt another softirq,
+ * running on the same CPU. The only event that can preempt a softirq
+ * is an interrupt handler (and perhaps we don't need to support
+ * calling qmempool from an interrupt). Another softirq, even the
+ * same one, can run on another CPU however, but these helpers are
+ * only protecting our percpu variables.
+ *
+ * Thus, our percpu variables are safe if the current CPU is the one
+ * serving the softirq (tested via in_serving_softirq()), like:
+ *
+ * if (!in_serving_softirq())
+ * local_bh_disable();
+ *
+ * This makes qmempool very fast when accessed from softirq, but
+ * slower when accessed outside softirq. The other contexts need to
+ * disable bottom-halves "bh" via local_bh_{disable,enable} (which has
+ * been measured to add a cost of 7.5ns on CPU E5-2695).
+ *
+ * MUST not be used from interrupt context, when relying on softirq usage.
+ */
+static inline int __qmempool_preempt_disable(void)
+{
+ int in_serving_softirq = in_serving_softirq();
+
+ if (!in_serving_softirq)
+ local_bh_disable();
+
+ return in_serving_softirq;
+}
+
+static inline void __qmempool_preempt_enable(int in_serving_softirq)
+{
+ if (!in_serving_softirq)
+ local_bh_enable();
+}
+
+/* Elements - alloc and free functions are inlined here for
+ * performance reasons, as the per CPU lockless access should be as
+ * fast as possible.
+ */
+
+/* Main allocation function
+ *
+ * Caller must make sure this is called from a preemption-safe context
+ */
+static inline void *main_qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ /* NUMA considerations: for now the NUMA node is not handled;
+ * this could be handled via e.g. numa_mem_id()
+ */
+ void *elem;
+ struct qmempool_percpu *cpu;
+ int num;
+
+ /* 1. attempt get element from local per CPU queue */
+ cpu = this_cpu_ptr(pool->percpu);
+ num = alf_sc_dequeue(cpu->localq, (void **)&elem, 1);
+ if (num == 1) /* Success: alloc elem by deq from localq cpu cache */
+ return elem;
+
+ /* 2. attempt get element from shared queue. This involves
+ * refilling the localq for next round. Side-effect can be
+ * alloc from SLAB.
+ */
+ elem = __qmempool_alloc_from_sharedq(pool, gfp_mask, cpu->localq);
+ return elem;
+}
+
+static inline void *__qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ void *elem;
+ int state;
+
+ state = __qmempool_preempt_disable();
+ elem = main_qmempool_alloc(pool, gfp_mask);
+ __qmempool_preempt_enable(state);
+ return elem;
+}
+
+static inline void *__qmempool_alloc_softirq(struct qmempool *pool,
+ gfp_t gfp_mask)
+{
+ return main_qmempool_alloc(pool, gfp_mask);
+}
+
+/* Main free function */
+static inline void __qmempool_free(struct qmempool *pool, void *elem)
+{
+ struct qmempool_percpu *cpu;
+ int num;
+ int state;
+
+ /* NUMA considerations: how do we make sure to avoid caching
+ * elements from a different NUMA node?
+ */
+ state = __qmempool_preempt_disable();
+
+ /* 1. attempt to free/return element to local per CPU queue */
+ cpu = this_cpu_ptr(pool->percpu);
+ num = alf_sp_enqueue(cpu->localq, &elem, 1);
+ if (num == 1) /* success: element free'ed by enqueue to localq */
+ goto done;
+
+ /* 2. localq cannot store more elements, need to return some
+ * from localq to sharedq, to make room. Side-effect can be
+ * free to SLAB.
+ */
+ __qmempool_free_to_sharedq(elem, pool, cpu->localq);
+
+done:
+ __qmempool_preempt_enable(state);
+}
+
+/* API users can choose to use "__" prefixed versions for inlining */
+extern void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask);
+extern void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask);
+extern void qmempool_free(struct qmempool *pool, void *elem);
+
+#endif /* _LINUX_QMEMPOOL_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..abaa94c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,15 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.
A sane initial value is 80 MB.
+
+config QMEMPOOL
+ bool "Quick queue based mempool (qmempool)"
+ default y
+ select ALF_QUEUE
+ help
+ A mempool designed for faster than SLAB/kmem_cache
+ reuse/caching of fixed size memory elements. Works as a
+ caching layer in-front of existing kmem_cache SLABs. Speed
+ is achieved by _bulk_ refilling percpu local cache, from a
+ Lock-Free queue requiring a single (locked) cmpxchg per bulk
+ transfer, thus amortizing the cost of the cmpxchg.
diff --git a/mm/Makefile b/mm/Makefile
index 8405eb0..49c1e18 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -69,3 +69,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_QMEMPOOL) += qmempool.o
diff --git a/mm/qmempool.c b/mm/qmempool.c
new file mode 100644
index 0000000..d6debcc
--- /dev/null
+++ b/mm/qmempool.c
@@ -0,0 +1,322 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/percpu.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>
+
+/* Due to hotplug CPU support, we need access to all qmempools
+ * in-order to cleanup elements in localq for the CPU going offline.
+ *
+ * TODO: implement HOTPLUG_CPU
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(qmempool_list);
+static DEFINE_SPINLOCK(qmempool_list_lock);
+#endif
+ */
+
+void qmempool_destroy(struct qmempool *pool)
+{
+ void *elem = NULL;
+ int j;
+
+ if (pool->percpu) {
+ for_each_possible_cpu(j) {
+ struct qmempool_percpu *cpu =
+ per_cpu_ptr(pool->percpu, j);
+
+ while (alf_mc_dequeue(cpu->localq, &elem, 1) == 1)
+ kmem_cache_free(pool->kmem, elem);
+ BUG_ON(!alf_queue_empty(cpu->localq));
+ alf_queue_free(cpu->localq);
+ }
+ free_percpu(pool->percpu);
+ }
+
+ if (pool->sharedq) {
+ while (alf_mc_dequeue(pool->sharedq, &elem, 1) == 1)
+ kmem_cache_free(pool->kmem, elem);
+ BUG_ON(!alf_queue_empty(pool->sharedq));
+ alf_queue_free(pool->sharedq);
+ }
+
+ kfree(pool);
+}
+EXPORT_SYMBOL(qmempool_destroy);
+
+struct qmempool *
+qmempool_create(uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+ struct kmem_cache *kmem, gfp_t gfp_mask)
+{
+ struct qmempool *pool;
+ int i, j, num;
+ void *elem;
+
+ /* Validate constraints, e.g. due to bulking */
+ if (localq_sz < QMEMPOOL_BULK) {
+ pr_err("%s() localq size(%d) too small for bulking\n",
+ __func__, localq_sz);
+ return NULL;
+ }
+ if (sharedq_sz < (QMEMPOOL_BULK * QMEMPOOL_REFILL_MULTIPLIER)) {
+ pr_err("%s() sharedq size(%d) too small for bulk refill\n",
+ __func__, sharedq_sz);
+ return NULL;
+ }
+ if (!is_power_of_2(localq_sz) || !is_power_of_2(sharedq_sz)) {
+ pr_err("%s() queue sizes (%d/%d) must be power-of-2\n",
+ __func__, localq_sz, sharedq_sz);
+ return NULL;
+ }
+ if (prealloc > sharedq_sz) {
+ pr_err("%s() prealloc(%d) req > sharedq size(%d)\n",
+ __func__, prealloc, sharedq_sz);
+ return NULL;
+ }
+ if ((prealloc % QMEMPOOL_BULK) != 0) {
+ pr_warn("%s() prealloc(%d) should be div by BULK size(%d)\n",
+ __func__, prealloc, QMEMPOOL_BULK);
+ }
+ if (!kmem) {
+ pr_err("%s() kmem_cache is a NULL ptr\n", __func__);
+ return NULL;
+ }
+
+ pool = kzalloc(sizeof(*pool), gfp_mask);
+ if (!pool)
+ return NULL;
+ pool->kmem = kmem;
+ pool->gfp_mask = gfp_mask;
+
+ /* MPMC (Multi-Producer-Multi-Consumer) queue */
+ pool->sharedq = alf_queue_alloc(sharedq_sz, gfp_mask);
+ if (IS_ERR_OR_NULL(pool->sharedq)) {
+ pr_err("%s() failed to create shared queue(%d) ERR_PTR:0x%p\n",
+ __func__, sharedq_sz, pool->sharedq);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+
+ pool->prealloc = prealloc;
+ for (i = 0; i < prealloc; i++) {
+ elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+ if (!elem) {
+ pr_err("%s() kmem_cache out of memory?!\n", __func__);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+ /* Could use the SP version given it is not visible yet */
+ num = alf_mp_enqueue(pool->sharedq, &elem, 1);
+ BUG_ON(num <= 0);
+ }
+
+ pool->percpu = alloc_percpu(struct qmempool_percpu);
+ if (pool->percpu == NULL) {
+ pr_err("%s() failed to alloc percpu\n", __func__);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+
+ /* SPSC (Single-Consumer-Single-Producer) queue per CPU */
+ for_each_possible_cpu(j) {
+ struct qmempool_percpu *cpu = per_cpu_ptr(pool->percpu, j);
+
+ cpu->localq = alf_queue_alloc(localq_sz, gfp_mask);
+ if (IS_ERR_OR_NULL(cpu->localq)) {
+ pr_err("%s() failed alloc localq(sz:%d) on cpu:%d\n",
+ __func__, localq_sz, j);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+ }
+
+ return pool;
+}
+EXPORT_SYMBOL(qmempool_create);
+
+/* Element handling
+ */
+
+/* This function is called when sharedq runs out of elements.
+ * Thus, sharedq needs to be refilled (enq) with elems from slab.
+ *
+ * Caller must assure this is called in a preemption-safe context due
+ * to the alf_mp_enqueue() call.
+ */
+void *__qmempool_alloc_from_slab(struct qmempool *pool, gfp_t gfp_mask)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ void *elem;
+ int num, i, j;
+
+ /* Cannot use SLAB that can sleep if (gfp_mask & __GFP_WAIT),
+ * else preemption disable/enable scheme becomes too complicated
+ */
+ BUG_ON(gfp_mask & __GFP_WAIT);
+
+ elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+ if (elem == NULL) /* slab depleted, no reason to call below allocs */
+ return NULL;
+
+ /* SLAB considerations, we need a kmem_cache interface that
+ * supports allocating a bulk of elements.
+ */
+
+ for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+ for (j = 0; j < QMEMPOOL_BULK; j++) {
+ elems[j] = kmem_cache_alloc(pool->kmem, gfp_mask);
+ /* Handle if slab gives us NULL elem */
+ if (elems[j] == NULL) {
+ pr_err("%s() ARGH - slab returned NULL",
+ __func__);
+ num = alf_mp_enqueue(pool->sharedq, elems, j-1);
+ BUG_ON(num == 0); //FIXME handle
+ return elem;
+ }
+ }
+ num = alf_mp_enqueue(pool->sharedq, elems, QMEMPOOL_BULK);
+ /* FIXME: There is a theoretical chance that multiple
+ * CPUs enter here, refilling sharedq at the same time;
+ * thus we must handle the "full" situation. For now die
+ * hard, so someone will need to fix this.
+ */
+ BUG_ON(num == 0); /* sharedq should have room */
+ }
+
+ /* What about refilling localq here? (else it will happen on
+ * next cycle, and will cost an extra cmpxchg).
+ */
+ return elem;
+}
+
+/* This function is called when the localq runs out of elements.
+ * Thus, localq is refilled (enq) with elements (deq) from sharedq.
+ *
+ * Caller must assure this is called in a preemption-safe context due
+ * to the alf_mc_dequeue() call.
+ */
+void *__qmempool_alloc_from_sharedq(struct qmempool *pool, gfp_t gfp_mask,
+ struct alf_queue *localq)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ void *elem;
+ int num;
+
+ /* Costs atomic "cmpxchg", but amortize cost by bulk dequeue */
+ num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+ if (likely(num > 0)) {
+ /* Consider prefetching data part of elements here, it
+ * should be an optimal place to hide memory prefetching.
+ * Especially given the localq is known to be an empty FIFO
+ * which guarantees the order objs are accessed in.
+ */
+ elem = elems[0]; /* extract one element */
+ if (num > 1) {
+ num = alf_sp_enqueue(localq, &elems[1], num-1);
+ /* Refill localq, should be empty, must succeed */
+ BUG_ON(num == 0);
+ }
+ return elem;
+ }
+ /* Use slab if sharedq runs out of elements */
+ elem = __qmempool_alloc_from_slab(pool, gfp_mask);
+ return elem;
+}
+EXPORT_SYMBOL(__qmempool_alloc_from_sharedq);
+
+/* Called when sharedq is full. Thus, also make room in sharedq,
+ * besides freeing the "elems" given.
+ */
+bool __qmempool_free_to_slab(struct qmempool *pool, void **elems, int n)
+{
+ int num, i, j;
+ /* SLAB considerations, we could use kmem_cache interface that
+ * supports returning a bulk of elements.
+ */
+
+ /* free these elements for real */
+ for (i = 0; i < n; i++)
+ kmem_cache_free(pool->kmem, elems[i]);
+
+ /* Make room in sharedq for next round */
+ for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+ num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+ for (j = 0; j < num; j++)
+ kmem_cache_free(pool->kmem, elems[j]);
+ }
+ return true;
+}
+
+/* This function is called when the localq is full. Thus, elements
+ * from localq need to be dequeued and returned (enqueued) to
+ * sharedq (or, if sharedq is full, freed to slab).
+ *
+ * MUST be called from a preemptive safe context.
+ */
+void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+ struct alf_queue *localq)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ int num_enq, num_deq;
+
+ elems[0] = elem;
+ /* Make room in localq */
+ num_deq = alf_sc_dequeue(localq, &elems[1], QMEMPOOL_BULK-1);
+ if (unlikely(num_deq == 0))
+ goto failed;
+ num_deq++; /* count first 'elem' */
+
+ /* Successfully dequeued 'num_deq' elements from localq, "free"
+ * these elems by enqueuing to sharedq
+ */
+ num_enq = alf_mp_enqueue(pool->sharedq, elems, num_deq);
+ if (likely(num_enq == num_deq)) /* Success enqueued to sharedq */
+ return;
+
+ /* If sharedq is full (num_enq == 0) dequeue elements will be
+ * returned directly to the SLAB allocator.
+ *
+ * Note: This usage of the alf_queue API depends on the enqueue
+ * being "fixed", i.e. only enqueueing if all elements can fit;
+ * this is an API detail that might change.
+ */
+
+ __qmempool_free_to_slab(pool, elems, num_deq);
+ return;
+failed:
+ /* dequeuing from a full localq should always be possible */
+ BUG();
+}
+EXPORT_SYMBOL(__qmempool_free_to_sharedq);
+
+/* API users can choose to use "__" prefixed versions for inlining */
+void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ return __qmempool_alloc(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc);
+
+void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask)
+{
+ return __qmempool_alloc_softirq(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc_softirq);
+
+void qmempool_free(struct qmempool *pool, void *elem)
+{
+ return __qmempool_free(pool, elem);
+}
+EXPORT_SYMBOL(qmempool_free);
+
+MODULE_DESCRIPTION("Quick queue based mempool (qmempool)");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");
--
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool Jesper Dangaard Brouer
@ 2014-12-10 14:15 ` Jesper Dangaard Brouer
2014-12-10 14:22 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) David Laight
` (2 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
Christoph Lameter
Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
Mathieu Desnoyers, Steven Rostedt
This patch uses qmempool for faster than SLAB caching of SKBs.
Only use this caching in connection with napi_alloc_skb() which runs
in softirq context. This softirq context provides the needed
protection for qmempool and the underlying alf_queue.
Current caching settings are max 32 elements per CPU, which is 8192
bytes given the SKB is SLAB_HWCACHE_ALIGN'ed to a 256 byte slab object.
The shared queue limit is max 1024 elements, which corresponds to a
worst-case 262KB of memory usage. Systems with NR_CPUS <= 8 will get a
smaller max shared queue.
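For reference, the memory math works out as follows (assuming the
256 byte SLAB_HWCACHE_ALIGN'ed sk_buff object implied by the numbers
above, and the constants from the skbuff.c hunk below):

  per-CPU localq: 32 elems * 256 B = 8192 B per CPU
  sharedq max:    1024 elems * 256 B = 262144 B ~= 262 KB
  small systems:  roundup_pow_of_two(NR_CPUS * 32 * 2) elems instead of 1024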
Benchmarked on a E5-2695 12-cores (no-HT) with ixgbe.
Baseline is Alex'es napi_alloc_skb patchset.
Single flow/CPU, early drop in iptables RAW table (fast path compare):
* baseline: 3,159,160 pps
* qmempool: 3,282,508 pps
- Diff to baseline: +123348 pps => -11.89 ns
IP-forward single flow/cpu (slower path compare):
* baseline: 1,137,284 pps
* qmempool: 1,191,856 pps
- Diff to baseline: +54572 pps => -40.26 ns
Some of the improvements also come from qmempool_{alloc,free} having
a smaller code size than kmem_cache_{alloc,free}, which helps reduce
instruction-cache misses.
Also did some scaling tests to stress qmempool sharedq allocs (which
stress the alf_queue's concurrency handling).
IP-forward MULTI flow/cpu (12 CPUs E5-2695 no-HT, 12 HWQs):
* baseline: 11,946,666 pps
* qmempool: 11,988,042 pps
- Diff to baseline: +41376 pps => -0.29 ns
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/linux/skbuff.h | 4 +++-
net/core/dev.c | 5 ++++-
net/core/skbuff.c | 43 ++++++++++++++++++++++++++++++++++++++-----
3 files changed, 45 insertions(+), 7 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index af79302..8881215 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -152,6 +152,7 @@ struct scatterlist;
struct pipe_inode_info;
struct iov_iter;
struct napi_struct;
+struct qmempool;
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack {
@@ -557,8 +558,8 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
+ qmempool:1,
xmit_more:1;
- /* one bit hole */
kmemcheck_bitfield_end(flags1);
/* fields enclosed in headers_start/headers_end are copied
@@ -755,6 +756,7 @@ void skb_tx_error(struct sk_buff *skb);
void consume_skb(struct sk_buff *skb);
void __kfree_skb(struct sk_buff *skb);
extern struct kmem_cache *skbuff_head_cache;
+extern struct qmempool *skbuff_head_pool;
void kfree_skb_partial(struct sk_buff *skb, bool head_stolen);
bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/dev.c b/net/core/dev.c
index 80f798d..0c95fbd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -135,6 +135,7 @@
#include <linux/if_macvlan.h>
#include <linux/errqueue.h>
#include <linux/hrtimer.h>
+#include <linux/qmempool.h>
#include "net-sysfs.h"
@@ -4125,7 +4126,9 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
case GRO_MERGED_FREE:
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
- kmem_cache_free(skbuff_head_cache, skb);
+ (skb->qmempool) ?
+ qmempool_free(skbuff_head_pool, skb) :
+ kmem_cache_free(skbuff_head_cache, skb);
else
__kfree_skb(skb);
break;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ae13ef6..a96ce75 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -74,10 +74,24 @@
#include <asm/uaccess.h>
#include <trace/events/skb.h>
#include <linux/highmem.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>
struct kmem_cache *skbuff_head_cache __read_mostly;
static struct kmem_cache *skbuff_fclone_cache __read_mostly;
+/* Keep max 32 skbs per CPU = 8192 bytes per CPU (as the skb is
+ * SLAB_HWCACHE_ALIGN'ed). The sharedq cache is limited to max 1024
+ * elems, which is max 262KB of skb memory; on small systems it is
+ * sized as 32*2 elems * NR_CPUS.
+ */
+struct qmempool *skbuff_head_pool __read_mostly;
+#define QMEMPOOL_LOCALQ 32
+#define QMEMPOOL_SCALE (QMEMPOOL_LOCALQ * 2)
+#define QMEMPOOL_SYSTEM_SIZE roundup_pow_of_two(NR_CPUS * QMEMPOOL_SCALE)
+#define QMEMPOOL_SHAREDQ min(1024UL, QMEMPOOL_SYSTEM_SIZE)
+#define QMEMPOOL_PREALLOC 0
+
/**
* skb_panic - private function for out-of-line support
* @skb: buffer
@@ -278,13 +292,14 @@ nodata:
EXPORT_SYMBOL(__alloc_skb);
/**
- * build_skb - build a network buffer
+ * __build_skb - build a network buffer
* @data: data buffer provided by caller
* @frag_size: size of fragment, or 0 if head was kmalloced
*
* Allocate a new &sk_buff. Caller provides space holding head and
* skb_shared_info. @data must have been allocated by kmalloc() only if
* @frag_size is 0, otherwise data should come from the page allocator.
+ * @flags: FIXME-DESCRIBE
* The return is the new skb buffer.
* On a failure the return is %NULL, and @data is not freed.
* Notes :
@@ -295,13 +310,16 @@ EXPORT_SYMBOL(__alloc_skb);
* before giving packet to stack.
* RX rings only contains data buffers, not full skbs.
*/
-struct sk_buff *build_skb(void *data, unsigned int frag_size)
+static struct sk_buff *__build_skb(void *data, unsigned int frag_size,
+ int flags)
{
struct skb_shared_info *shinfo;
struct sk_buff *skb;
unsigned int size = frag_size ? : ksize(data);
- skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+ skb = (flags & SKB_ALLOC_NAPI) ?
+ qmempool_alloc_softirq(skbuff_head_pool, GFP_ATOMIC) :
+ kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
if (!skb)
return NULL;
@@ -310,6 +328,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->truesize = SKB_TRUESIZE(size);
skb->head_frag = frag_size != 0;
+ skb->qmempool = !!(flags & SKB_ALLOC_NAPI);
atomic_set(&skb->users, 1);
skb->head = data;
skb->data = data;
@@ -326,6 +345,11 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
return skb;
}
+
+struct sk_buff *build_skb(void *data, unsigned int frag_size)
+{
+ return __build_skb(data, frag_size, 0);
+}
EXPORT_SYMBOL(build_skb);
struct netdev_alloc_cache {
@@ -477,7 +501,7 @@ static struct sk_buff *__alloc_rx_skb(unsigned int length, gfp_t gfp_mask,
__netdev_alloc_frag(fragsz, gfp_mask);
if (likely(data)) {
- skb = build_skb(data, fragsz);
+ skb = __build_skb(data, fragsz, flags);
if (unlikely(!skb))
put_page(virt_to_head_page(data));
}
@@ -637,7 +661,9 @@ static void kfree_skbmem(struct sk_buff *skb)
switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ (skb->qmempool) ?
+ qmempool_free(skbuff_head_pool, skb) :
+ kmem_cache_free(skbuff_head_cache, skb);
return;
case SKB_FCLONE_ORIG:
@@ -862,6 +888,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
C(end);
C(head);
C(head_frag);
+ C(qmempool);
C(data);
C(truesize);
atomic_set(&n->users, 1);
@@ -3370,6 +3397,12 @@ void __init skb_init(void)
0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC,
NULL);
+ /* connect qmempools to slabs */
+ skbuff_head_pool = qmempool_create(QMEMPOOL_LOCALQ,
+ QMEMPOOL_SHAREDQ,
+ QMEMPOOL_PREALLOC,
+ skbuff_head_cache, GFP_ATOMIC);
+ BUG_ON(skbuff_head_pool == NULL);
}
/**
--
^ permalink raw reply related [flat|nested] 30+ messages in thread
* RE: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
` (2 preceding siblings ...)
2014-12-10 14:15 ` [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache Jesper Dangaard Brouer
@ 2014-12-10 14:22 ` David Laight
2014-12-10 14:40 ` Jesper Dangaard Brouer
2014-12-10 15:17 ` Christoph Lameter
2014-12-10 19:51 ` Christoph Lameter
5 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2014-12-10 14:22 UTC (permalink / raw)
To: 'Jesper Dangaard Brouer', netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Christoph Lameter
Cc: linux-api@vger.kernel.org, Eric Dumazet, David S. Miller,
Hannes Frederic Sowa, Alexander Duyck, Alexei Starovoitov,
Paul E. McKenney, Mathieu Desnoyers, Steven Rostedt
From: Jesper Dangaard Brouer
> The network stack have some use-cases that puts some extreme demands
> on the memory allocator. One use-case, 10Gbit/s wirespeed at smallest
> packet size[1], requires handling a packet every 67.2 ns (nanosec).
>
> Micro benchmarking[2] the SLUB allocator (with skb size 256bytes
> elements), show "fast-path" instant reuse only costs 19 ns, but a
> closer to network usage pattern show the cost rise to 45 ns.
>
> This patchset introduce a quick mempool (qmempool), which when used
> in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on "fast-path"
> drop in iptables "raw" table, but more importantly saves 40 ns with
> IP-forwarding, which were hitting the slower SLUB use-case.
>
>
> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue. By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.
It seems to me that these improvements could be added to the
underlying allocator itself.
Nesting allocators doesn't really seem right to me.
David
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 14:22 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) David Laight
@ 2014-12-10 14:40 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 14:40 UTC (permalink / raw)
To: David Laight
Cc: brouer, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, Christoph Lameter, linux-api@vger.kernel.org,
Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
Mathieu Desnoyers, Steven Rostedt
On Wed, 10 Dec 2014 14:22:22 +0000
David Laight <David.Laight@ACULAB.COM> wrote:
> From: Jesper Dangaard Brouer
> > The network stack have some use-cases that puts some extreme demands
> > on the memory allocator. One use-case, 10Gbit/s wirespeed at smallest
> > packet size[1], requires handling a packet every 67.2 ns (nanosec).
> >
> > Micro benchmarking[2] the SLUB allocator (with skb size 256bytes
> > elements), show "fast-path" instant reuse only costs 19 ns, but a
> > closer to network usage pattern show the cost rise to 45 ns.
> >
> > This patchset introduce a quick mempool (qmempool), which when used
> > in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on "fast-path"
> > drop in iptables "raw" table, but more importantly saves 40 ns with
> > IP-forwarding, which were hitting the slower SLUB use-case.
> >
> >
> > One of the building blocks for achieving this speedup is a cmpxchg
> > based Lock-Free queue that supports bulking, named alf_queue for
> > Array-based Lock-Free queue. By bulking elements (pointers) from the
> > queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> > elements.
>
> It seems to me that these improvements could be added to the
> underlying allocator itself.
> Nesting allocators doesn't really seem right to me.
Yes, I would very much like to see these ideas integrated into the
underlying allocators (hence addressing the mm-list).
This patchset demonstrates that it is possible to do something faster
than the existing SLUB allocator. Which the network stack have a need
for.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
--
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 6/6] ethernet/broadcom: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align
2014-12-10 9:52 ` David Laight
@ 2014-12-10 15:16 ` Alexander Duyck
0 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 15:16 UTC (permalink / raw)
To: David Laight, 'Alexander Duyck', netdev@vger.kernel.org
Cc: Florian Fainelli, eric.dumazet@gmail.com, Ariel Elior,
brouer@redhat.com, Gary Zambrano, davem@davemloft.net,
ast@plumgrid.com
On 12/10/2014 01:52 AM, David Laight wrote:
> From: David Laight
>> From: Alexander Duyck
>>> This patch replaces the calls to netdev_alloc_skb_ip_align in the
>>> copybreak paths.
>> Why?
>>
>> You still want the IP header to be aligned and you also want the
>> memcpy() to be copying aligned data.
>>
>> I suspect this fails on both counts?
> Or am I confused by the naming?
>
> David
>
The bit you are missing is that napi_alloc_skb is always IP aligned.
The general idea with napi_alloc_skb is that it is a cheaper way to
allocate frames that are about to be passed up the stack via a NAPI or
GRO receive, so the IP-aligned part is a given.
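Conceptually, and only as a hedged sketch of the relationship (not the
exact implementation in the patchset), both helpers leave the data
pointer IP aligned; napi_alloc_skb just avoids the irq save/restore by
using a frag cache that is only touched from softirq/NAPI context:

  /* both reserve headroom so the IP header ends up aligned */
  skb = netdev_alloc_skb_ip_align(dev, len); /* usable from any context */
  skb = napi_alloc_skb(napi, len);           /* NAPI/softirq context only */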
- Alex
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
` (3 preceding siblings ...)
2014-12-10 14:22 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) David Laight
@ 2014-12-10 15:17 ` Christoph Lameter
2014-12-10 15:33 ` Jesper Dangaard Brouer
2014-12-10 19:51 ` Christoph Lameter
5 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-12-10 15:17 UTC (permalink / raw)
To: Jesper Dangaard Brouer; +Cc: netdev, linux-kernel, linux-mm, linux-api
On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
> Patch1: alf_queue (Lock-Free queue)
For some reason that key patch is not in my linux-mm archives nor in my
inbox.
--
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 4:16 ` Alexei Starovoitov
@ 2014-12-10 15:21 ` Alexander Duyck
0 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 15:21 UTC (permalink / raw)
To: Alexei Starovoitov, Alexander Duyck
Cc: Network Development, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer
On 12/09/2014 08:16 PM, Alexei Starovoitov wrote:
> On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck
> <alexander.h.duyck@redhat.com> wrote:
>> This patch splits the netdev_alloc_frag function up so that it can be used
>> on one of two page frag pools instead of being fixed on the
>> netdev_alloc_cache. By doing this we can add a NAPI specific function
>> __napi_alloc_frag that accesses a pool that is only used from softirq
>> context. The advantage to this is that we do not need to call
>> local_irq_save/restore which can be a significant savings.
>>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag. First I updated the allocation to do either a 32K
>> allocation or an order 0 page. This is based on the changes in commmit
>> d9b2938aa where it was found that latencies could be reduced in case of
> thanks for explaining that piece of it.
>
>> + struct page *page = NULL;
>> + gfp_t gfp = gfp_mask;
>> +
>> + if (order) {
>> + gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>> + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
>> + nc->frag.size = PAGE_SIZE << (page ? order : 0);
>> + }
>>
>> - local_irq_save(flags);
>> - nc = this_cpu_ptr(&netdev_alloc_cache);
>> - if (unlikely(!nc->frag.page)) {
>> + if (unlikely(!page))
>> + page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
> I'm guessing you're not combining this 'if' with above one to
> keep gfp untouched, so there is a 'warn' when it actually fails 2nd time.
> Tricky :)
> Anyway looks good to me and I think I understand it enough to say:
> Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Thanks. Yes, the compiler is smart enough to combine the frag.size
update and the second check into one if order is non-zero. The other
trick here is that if order is 0 then that whole block disappears and I
don't have to touch frag.size or gfp at all, and the code gets much
simpler as the *page = NULL falls through and cancels out the 'if' as a
compile-time check.
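Restating the two-step allocation as a standalone sketch (simplified
from the hunk quoted above; nc is the per-cpu frag cache and
NETDEV_FRAG_PAGE_MAX_ORDER the existing order-3 constant, both assumed
from that context):

  struct page *page = NULL;
  gfp_t gfp = gfp_mask;                   /* keep the caller's mask */

  if (NETDEV_FRAG_PAGE_MAX_ORDER) {
          /* cheap, silent attempt at the 32K (order-3) page */
          gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
          page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
                                  NETDEV_FRAG_PAGE_MAX_ORDER);
          nc->frag.size = PAGE_SIZE << (page ? NETDEV_FRAG_PAGE_MAX_ORDER : 0);
  }
  if (unlikely(!page))
          /* order-0 retry with the original mask, so a real OOM warns */
          page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);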
- Alex
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 15:17 ` Christoph Lameter
@ 2014-12-10 15:33 ` Jesper Dangaard Brouer
[not found] ` <20141210163321.0e4e4fd2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-10 15:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev, linux-kernel, linux-mm, linux-api, brouer
On Wed, 10 Dec 2014 09:17:32 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:
> On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > Patch1: alf_queue (Lock-Free queue)
>
> For some reason that key patch is not in my linux-mm archives nor in my
> inbox.
That is very strange! I did notice that it was somehow delayed in
showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
and didn't show up on netdev either...
Inserting it below (hoping my email client does not screw it up):
[PATCH RFC] lib: adding an Array-based Lock-Free (ALF) queue
From: Jesper Dangaard Brouer <brouer@redhat.com>
This Array-based Lock-Free (ALF) queue is a very fast bounded
Producer-Consumer queue, supporting bulking. The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amortized by utilizing bulk enqueue/dequeue.
Results on x86_64 CPU E5-2695, for variants:
MPMC = Multi-Producer-Multi-Consumer
SPSC = Single-Producer-Single-Consumer
(non-bulking): per-element cost, MPMC and SPSC
MPMC -- SPSC
simple : 9.519 ns -- 1.282 ns
multi(step:128): 12.905 ns -- 2.240 ns
The majority of the cost is associated with the locked cmpxchg in the
MPMC variant. Bulking helps amortize this cost:
(bulking) cost per element comparing MPMC -> SPSC:
MPMC -- SPSC
bulk2 : 5.849 ns -- 1.748 ns
bulk3 : 4.102 ns -- 1.531 ns
bulk4 : 3.281 ns -- 1.383 ns
bulk6 : 2.530 ns -- 1.238 ns
bulk8 : 2.125 ns -- 1.196 ns
bulk16: 1.552 ns -- 1.109 ns
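A usage sketch from the caller's side (the queue size and bulk size
below are arbitrary, error handling is trimmed, and the caller is
assumed to provide the preemption protection described in the header):

  struct alf_queue *q = alf_queue_alloc(1024, GFP_KERNEL);
  void *objs[16];   /* pointers to the elements being handed off */
  int n;

  if (IS_ERR(q))
          return PTR_ERR(q);

  /* enqueue aborts (returns 0) if the full bulk does not fit */
  n = alf_mp_enqueue(q, objs, 16);

  /* dequeue returns up to 16 elements, 0 if the queue is empty */
  n = alf_mc_dequeue(q, objs, 16);

  alf_queue_free(q);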
Joint work with Hannes Frederic Sowa.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
----
Correctness of the memory barriers on different archs needs to be evaluated.
The API might need some adjustments/discussions regarding:
1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.
2) some better way to detect misuse of the API, e.g. calling a
single-enqueue variant function on a multi-enqueue variant data
structure. Currently no detection happens.
---
include/linux/alf_queue.h | 303 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
4 files changed, 365 insertions(+), 0 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 lib/alf_queue.c
diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ * - Array based for cache-line optimization
+ * - Bounded by the array size
+ * - FIFO Producer/Consumer queue, no queue traversal supported
+ * - Very fast
+ * - Designed as a queue for pointers to objects
+ * - Bulk enqueue and dequeue support
+ * - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+ u32 head;
+ u32 tail;
+};
+
+struct alf_queue {
+ u32 size;
+ u32 mask;
+ u32 flags;
+ struct alf_actor producer ____cacheline_aligned_in_smp;
+ struct alf_actor consumer ____cacheline_aligned_in_smp;
+ void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ * 1. They can be reused for both "Single" and "Multi" variants
+ * 2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+ void **ptr, const u32 n)
+{
+ int i, index = p_head;
+
+ for (i = 0; i < n; i++, index++)
+ q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+ void **ptr, const u32 elems)
+{
+ int i, index = c_head;
+
+ for (i = 0; i < elems; i++, index++)
+ ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even though the current API has a "fixed" semantic of aborting if it
+ * cannot enqueue the full bulk size, users of this API should check
+ * that the returned number of enqueued elements matches, to verify the
+ * enqueue was successful. This allows us to introduce a "variable"
+ * enqueue scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ do {
+ p_head = ACCESS_ONCE(q->producer.head);
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ }
+ while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Wait for other concurrent preceding enqueues not yet done,
+ * this part makes us non-wait-free and could be problematic
+ * in case of congestion with many CPUs
+ */
+ while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+ cpu_relax();
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ do {
+ c_head = ACCESS_ONCE(q->consumer.head);
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ }
+ while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+ /* LOAD the elems from the queue array.
+ * We don't need an smp_rmb() Read-Memory-Barrier here because
+ * the above cmpxchg is an implied full Memory-Barrier.
+ */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough. Paired with enqueue
+ * implicit full-MB in cmpxchg.
+ */
+ smp_wmb();
+
+ /* Wait for other concurrent preceding dequeues not yet done */
+ while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+ cpu_relax();
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x) \
+ do { \
+ if (unlikely(!(x))) { \
+ pr_crit("Assertion failed %s:%d: \"%s\"\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ } \
+ } while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ p_head = q->producer.head;
+ smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+ q->producer.head = p_next;
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ c_head = q->consumer.head;
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+ q->consumer.head = c_next;
+
+ smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough.
+ */
+ smp_wmb();
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+ /* The empty (and initial) state is when the consumer has caught
+ * up with the producer.
+ *
+ * DOUBLE-CHECK: Should we use producer.head, as this indicate
+ * a producer is in-progress(?)
+ */
+ return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+ u32 c_head = ACCESS_ONCE(q->consumer.head);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+ u32 elems;
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ elems = p_tail - c_head;
+ /* Thus, same as:
+ * elems = (p_tail - c_head) & q->mask;
+ */
+ return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+ u32 p_head = ACCESS_ONCE(q->producer.head);
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 space;
+
+ /* The max avail space is q->size and
+ * the empty state is when (consumer == producer)
+ */
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ space = q->size + c_tail - p_head;
+ /* Thus, same as:
+ * space = (q->size + c_tail - p_head) & q->mask;
+ */
+ return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
bool
#
+# ALF queue
+#
+config ALF_QUEUE
+ bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+ default y
+ help
+ This Array-based Lock-Free (ALF) queue, is a very fast
+ bounded Producer-Consumer queue, supporting bulking. The
+ MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+ cmpxchg, but the cost can be amortized by utilizing bulk
+ enqueue/dequeue.
+
+#
# Generic 64-bit atomic support is selected if needed
#
config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
obj-$(CONFIG_NLATTR) += nlattr.o
+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
obj-$(CONFIG_LRU_CACHE) += lru_cache.o
obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ * - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+ struct alf_queue *q;
+ size_t mem_size;
+
+ if (!(is_power_of_2(size)) || size > 65536)
+ return ERR_PTR(-EINVAL);
+
+ /* The ring array is allocated together with the queue struct */
+ mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+ q = kzalloc(mem_size, gfp);
+ if (!q)
+ return ERR_PTR(-ENOMEM);
+
+ q->size = size;
+ q->mask = size - 1;
+
+ return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+ kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");
--
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
2014-12-10 4:16 ` Alexei Starovoitov
@ 2014-12-10 16:02 ` Eric Dumazet
2014-12-10 17:06 ` Alexander Duyck
1 sibling, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2014-12-10 16:02 UTC (permalink / raw)
To: Alexander Duyck; +Cc: netdev, ast, davem, brouer
On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag. First I updated the allocation to do either a 32K
> allocation or an order 0 page. This is based on the changes in commmit
> d9b2938aa where it was found that latencies could be reduced in case of
> failures.
GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.
I have no idea how expensive it is to attempt order-3, order-2, order-1
allocations with GFP_ATOMIC.
I did an interesting experiment on the mlx4 driver, allocating the
pages needed to store the fragments, using a small layer before the
alloc_page() that is normally used:
- Attempt order-9 allocations, and use split_page() to give the
individual pages.
Boost in performance is 10% on TCP bulk receive, because of fewer TLB
misses.
With the huge amounts of memory these days, alloc_page() tends to give
pages spread all over memory, with poor TLB locality.
With this strategy, a 1024 RX ring is backed by 2 huge pages only.
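A rough sketch of the layering being described (this is not the actual
mlx4 change; the helper struct and function names are made up purely to
illustrate the order-9 allocation plus split_page() with an order-0
fallback):

  struct rx_page_cache {          /* hypothetical driver-private cache */
          struct page *pages;     /* first page of the split order-9 chunk */
          unsigned int idx;
          unsigned int count;
  };

  static struct page *rx_cache_alloc_page(struct rx_page_cache *c, gfp_t gfp)
  {
          if (c->idx >= c->count) {
                  struct page *p;

                  /* cheap attempt at a 2MB chunk, don't retry or warn */
                  p = alloc_pages(gfp | __GFP_NOWARN | __GFP_NORETRY, 9);
                  if (p) {
                          split_page(p, 9); /* 512 independent order-0 pages */
                          c->pages = p;
                          c->idx = 0;
                          c->count = 1 << 9;
                  }
          }
          if (c->idx < c->count)
                  return c->pages + c->idx++;

          return alloc_page(gfp);         /* fallback, original behavior */
  }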
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
[not found] ` <20141210163321.0e4e4fd2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-12-10 16:17 ` Christoph Lameter
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2014-12-10 16:17 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, linux-kernel, linux-mm, linux-api
On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
> That is very strange! I did notice that it was somehow delayed in
> showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
> and didn't show up on netdev either...
It finally got through.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 16:02 ` Eric Dumazet
@ 2014-12-10 17:06 ` Alexander Duyck
2014-12-10 17:13 ` Eric Dumazet
0 siblings, 1 reply; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 17:06 UTC (permalink / raw)
To: Eric Dumazet, Alexander Duyck; +Cc: netdev, ast, davem, brouer
On 12/10/2014 08:02 AM, Eric Dumazet wrote:
> On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag. First I updated the allocation to do either a 32K
>> allocation or an order 0 page. This is based on the changes in commmit
>> d9b2938aa where it was found that latencies could be reduced in case of
>> failures.
>
> GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.
>
> I have no idea how expensive it is to attempt order-3, order-2, order-1
> allocations with GFP_ATOMIC.
The most likely case is the successful first allocation so I didn't see
much point in trying to optimize for the failure cases. I personally
prefer to see a fast failure rather than one that is dragged out over
several failed allocation attempts. In addition I can get away with
several optimization tricks that I cannot with the loop.
> I did an interesting experiment on mlx4 driver, allocating the pages
> needed to store the fragments, using a small layer before the
> alloc_page() that is normally used :
>
> - Attempt order-9 allocations, and use split_page() to give the
> individual pages.
>
> Boost in performance is 10% on TCP bulk receive, because of less TLB
> misses.
>
> With huge amount of memory these days, alloc_page() tend to give pages
> spread all over memory, with poor TLB locality.
>
> With this strategy, a 1024 RX ring is backed by 2 huge pages only.
That is an interesting idea. I wonder if there would be a similar
benefit for small packets. If nothing else I might try a few
experiments with ixgbe to see if I can take advantage of something similar.
- Alex
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 17:06 ` Alexander Duyck
@ 2014-12-10 17:13 ` Eric Dumazet
2014-12-10 17:16 ` Alexander Duyck
0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2014-12-10 17:13 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Alexander Duyck, netdev, ast, davem, brouer
On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:
> That is an interesting idea. I wonder if there would be a similar
> benefit for small packets. If nothing else I might try a few
> experiments with ixgbe to see if I can take advantage of something similar.
I was planning to submit a core infrastructure when net-next reopens, a
simple layer to use instead of alloc_page(), and immediately usable in
Intel drivers.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
2014-12-10 17:13 ` Eric Dumazet
@ 2014-12-10 17:16 ` Alexander Duyck
0 siblings, 0 replies; 30+ messages in thread
From: Alexander Duyck @ 2014-12-10 17:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Alexander Duyck, netdev, ast, davem, brouer
On 12/10/2014 09:13 AM, Eric Dumazet wrote:
> On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:
>
>> That is an interesting idea. I wonder if there would be a similar
>> benefit for small packets. If nothing else I might try a few
>> experiments with ixgbe to see if I can take advantage of something similar.
> I was planning to submit a core infrastructure when net-next reopens, a
> simple layer to use instead of alloc_page(), and immediately usable in
> Intel drivers.
Sounds good. I look forward to it.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
` (6 preceding siblings ...)
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
@ 2014-12-10 18:32 ` David Miller
7 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2014-12-10 18:32 UTC (permalink / raw)
To: alexander.h.duyck; +Cc: netdev, ast, eric.dumazet, brouer
From: Alexander Duyck <alexander.h.duyck@redhat.com>
Date: Tue, 09 Dec 2014 19:40:35 -0800
> This patch series implements a means of allocating page fragments without
> the need for the local_irq_save/restore in __netdev_alloc_frag. By doing
> this I am able to decrease packet processing time by 11ns per packet in my
> test environment.
Looks great, series applied, thanks Alex.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
` (4 preceding siblings ...)
2014-12-10 15:17 ` Christoph Lameter
@ 2014-12-10 19:51 ` Christoph Lameter
2014-12-11 10:18 ` Jesper Dangaard Brouer
5 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2014-12-10 19:51 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, linux-kernel, linux-mm, linux-api, Eric Dumazet,
David S. Miller, Hannes Frederic Sowa, Alexander Duyck,
Alexei Starovoitov, Paul E. McKenney, Mathieu Desnoyers,
Steven Rostedt
On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue. By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.
This is a bit of an issue since the design of the SLUB allocator is such
that you should pick up an object, apply some processing and then take the
next one. The fetching of an object warms up the first cacheline and this
is tied into the way free objects are linked in SLUB.
So a bulk fetch from SLUB will not be that effective and will cause the touching
of many cachelines if we are dealing with just a few objects. If we are
looking at whole slab pages with all objects then SLUB can be effective
since we do not have to build up the linked pointer structure in each
page. SLAB has a different architecture there and a bulk fetch there is
possible without touching objects even for small sets since the freelist
management is separate from the objects.
If you do this bulking then you will later access cache-cold objects?
Doesn't that negate the benefit that you gain? Or are these objects written
to by hardware and therefore by necessity cache cold?
We could provide a faster bulk alloc/free function.
int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
size_t objects, void **array)
and this could be optimized by each slab allocator to provide fast
population of objects in that array. We then assume that the number of
objects is in the hundreds or so, right?
The corresponding free function
void kmem_cache_free_array(struct kmem_cache *s,
size_t objects, void **array)
I think the queue management of the array can be improved by using a
similar technique as used the SLUB allocator using the cmpxchg_local.
cmpxchg_local is much faster than a full cmpxchg and we are operating on
per cpu structures anyways. So the overhead could still be reduced.
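To make that concrete, a caller of the proposed array interface might look like the
sketch below. The two kmem_cache_*_array() functions exist only as proposals in this
thread, and the return-value convention (number of objects actually allocated) is an
assumption of the sketch, not something the proposal specifies.

/* Hypothetical use of the proposed bulk slab API from an RX/TX path. */
#define RX_BULK 64      /* matches a typical NAPI RX budget */

static int refill_rx_buffers(struct kmem_cache *cache, void **bufs)
{
        int got;

        /* One call populates the whole array instead of RX_BULK separate
         * kmem_cache_alloc() round trips.  Assumes the proposed function
         * returns how many objects it managed to allocate. */
        got = kmem_cache_alloc_array(cache, GFP_ATOMIC, RX_BULK, bufs);
        if (got <= 0)
                return -ENOMEM;

        /* ... attach bufs[0..got-1] to RX descriptors ... */
        return got;
}

static void free_tx_completions(struct kmem_cache *cache,
                                void **done, size_t n)
{
        /* Return a whole TX clean-up batch (up to 256 in e.g. ixgbe)
         * in a single call. */
        kmem_cache_free_array(cache, n, done);
}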
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
2014-12-10 19:51 ` Christoph Lameter
@ 2014-12-11 10:18 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-11 10:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: netdev, linux-kernel, linux-mm, linux-api, Eric Dumazet,
David S. Miller, Hannes Frederic Sowa, Alexander Duyck,
Alexei Starovoitov, Paul E. McKenney, Mathieu Desnoyers,
Steven Rostedt, brouer
On Wed, 10 Dec 2014 13:51:32 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:
> On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > One of the building blocks for achieving this speedup is a cmpxchg
> > based Lock-Free queue that supports bulking, named alf_queue for
> > Array-based Lock-Free queue. By bulking elements (pointers) from the
> > queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> > elements.
>
> This is a bit of an issue since the design of the SLUB allocator is such
> that you should pick up an object, apply some processing and then take the
> next one. The fetching of an object warms up the first cacheline and this
> is tied into the way free objects are linked in SLUB.
>
> So a bulk fetch from SLUB will not be that effective and will cause the touching
> of many cachelines if we are dealing with just a few objects. If we are
> looking at whole slab pages with all objects then SLUB can be effective
> since we do not have to build up the linked pointer structure in each
> page. SLAB has a different architecture there and a bulk fetch there is
> possible without touching objects even for small sets since the freelist
> management is separate from the objects.
>
> If you do this bulking then you will later access cache-cold objects?
> Doesn't that negate the benefit that you gain? Or are these objects written
> to by hardware and therefore by necessity cache cold?
Cache warmup is a concern, but perhaps it's the caller's responsibility
to prefetch for their use-case. For qmempool I do have patches that
prefetch elems when going from the sharedq to the localq (per CPU), but
I didn't see much gain, and I could prove my point (of being faster than
slab) without it. And I would use/need the slab bulk interface to add
elems to the sharedq, which I consider semi-cache-cold.
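For what it's worth, the prefetch step mentioned here amounts to something like the
sketch below; the helper name and its placement are made up, while prefetch() is the
stock primitive from <linux/prefetch.h>.

#include <linux/prefetch.h>

/* Hypothetical: after bulk-dequeuing n pointers from the shared MPMC
 * queue into the per-CPU localq, start pulling in their first
 * cachelines before the consumer touches them. */
static inline void qmempool_prefetch_elems(void **elems, unsigned int n)
{
        unsigned int i;

        for (i = 0; i < n; i++)
                prefetch(elems[i]);
}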
> We could provide a faster bulk alloc/free function.
>
> int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
> size_t objects, void **array)
I like it :-)
> and this could be optimized by each slab allocator to provide fast
> population of objects in that array. We then assume that the number of
> objects is in the hundreds or so, right?
I'm already seeing a benefit with 16-packet alloc/free "bulking".
On RX we have a "budget" of 64 packets/descriptors (taken from the NIC
RX ring) that need SKBs.
On TX packets are put into the TX ring, and later at TX completion the
TX ring is cleaned up, as many as 256 (as e.g. in the ixgbe driver).
Scientific articles on userspace networking (like netmap) report that
they need at least 8 packet bulking to see wirespeed 10G at 64 bytes.
> The corresponding free function
>
> void kmem_cache_free_array(struct kmem_cache *s,
> size_t objects, void **array)
>
>
> I think the queue management of the array can be improved by using a
> similar technique as used the SLUB allocator using the cmpxchg_local.
> cmpxchg_local is much faster than a full cmpxchg and we are operating on
> per cpu structures anyways. So the overhead could still be reduced.
I think you missed that the per-CPU localq is already not using cmpxchg
(it is a SPSC queue). The sharedq (MPMC queue) does need and use the
locked cmpxchg.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue
2014-12-10 14:15 ` [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue Jesper Dangaard Brouer
@ 2014-12-11 19:15 ` David Miller
0 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2014-12-11 19:15 UTC (permalink / raw)
To: brouer-H+wXaHxf7aLQT0dZR+AlfA
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cl-vYTEC60ixJUAvxtiuMwx3w,
linux-api-u79uwXL29TY76Z2rM5mHXA,
eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r,
alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
ast-uqk4Ao+rVK5Wk0Htik3J/w,
paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
mathieu.desnoyers-vg+e7yoeK/dWk0Htik3J/w,
rostedt-nx8X9YLhiw1AfugRpC6u6w
From: Jesper Dangaard Brouer <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Date: Wed, 10 Dec 2014 15:15:26 +0100
> +static inline int
> +alf_mp_enqueue(const u32 n;
> + struct alf_queue *q, void *ptr[n], const u32 n)
> +{
...
> +/* Main Multi-Consumer DEQUEUE */
> +static inline int
> +alf_mc_dequeue(const u32 n;
> + struct alf_queue *q, void *ptr[n], const u32 n)
> +{
I would seriously consider not inlining these.
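For readers following along, a caller of the two bulk helpers quoted above would look
roughly like this sketch. The batch size and the assumption that both functions return
the number of elements actually transferred are mine, not the RFC's.

#define BATCH   16

/* Producer side: one cmpxchg on the shared queue covers up to BATCH slots.
 * Assumed to return how many elements were actually enqueued; the caller
 * keeps ownership of any that did not fit. */
static int producer_flush(struct alf_queue *q, void **batch, u32 n)
{
        return alf_mp_enqueue(q, batch, n);
}

/* Consumer side: pull up to BATCH pointers in a single operation, so the
 * cmpxchg cost is amortized over the whole batch. */
static int consumer_pull(struct alf_queue *q, void **batch)
{
        return alf_mc_dequeue(q, batch, BATCH);
}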
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb
2014-12-10 3:40 ` [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb Alexander Duyck
@ 2014-12-11 21:43 ` Jeff Kirsher
0 siblings, 0 replies; 30+ messages in thread
From: Jeff Kirsher @ 2014-12-11 21:43 UTC (permalink / raw)
To: Alexander Duyck; +Cc: netdev, brouer, ast, eric.dumazet, davem
[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]
On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
> This change replaces calls to netdev_alloc_skb_ip_align with
> napi_alloc_skb. The advantage of napi_alloc_skb is currently the fact that
> the page allocation doesn't make use of any irq disable calls.
>
> There are a few spots where I couldn't replace the calls, as the buffer
> allocation routine is called as a part of init which is outside of the
> softirq context.
>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> drivers/net/ethernet/intel/e1000/e1000_main.c | 2 +-
> drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
> drivers/net/ethernet/intel/igb/igb_main.c | 3 +--
> drivers/net/ethernet/intel/ixgb/ixgb_main.c | 6 +++---
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
> 6 files changed, 10 insertions(+), 11 deletions(-)
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
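As a rough illustration of the conversion pattern this patch describes, a
copybreak-style RX helper running inside a NAPI poll routine changes along these
lines. The helper itself is made up; napi_alloc_skb(), skb_copy_to_linear_data() and
skb_put() are the real APIs, with napi_alloc_skb() introduced earlier in this series.

/* Illustrative copybreak path run from a NAPI poll handler. */
static struct sk_buff *rx_copybreak_skb(struct napi_struct *napi,
                                        const void *data, unsigned int len)
{
        struct sk_buff *skb;

        /*
         * Previously: skb = netdev_alloc_skb_ip_align(netdev, len);
         * Now the allocation comes from the NAPI (softirq-only) page frag
         * pool, so no local_irq_save/restore is needed.  This is only
         * valid in softirq context, which is why init-time buffer
         * allocation paths keep the old call.
         */
        skb = napi_alloc_skb(napi, len);
        if (!skb)
                return NULL;

        skb_copy_to_linear_data(skb, data, len);
        skb_put(skb, len);
        return skb;
}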
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2014-12-11 21:43 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-10 3:40 [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag Alexander Duyck
2014-12-10 4:16 ` Alexei Starovoitov
2014-12-10 15:21 ` Alexander Duyck
2014-12-10 16:02 ` Eric Dumazet
2014-12-10 17:06 ` Alexander Duyck
2014-12-10 17:13 ` Eric Dumazet
2014-12-10 17:16 ` Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 2/6] net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb Alexander Duyck
2014-12-10 3:40 ` [net-next PATCH 3/6] ethernet/intel: Use napi_alloc_skb Alexander Duyck
2014-12-11 21:43 ` Jeff Kirsher
2014-12-10 3:41 ` [net-next PATCH 4/6] cxgb: Use napi_alloc_skb instead of netdev_alloc_skb_ip_align Alexander Duyck
2014-12-10 3:41 ` [net-next PATCH 5/6] ethernet/realtek: use " Alexander Duyck
2014-12-10 3:41 ` [net-next PATCH 6/6] ethernet/broadcom: Use " Alexander Duyck
2014-12-10 9:50 ` David Laight
2014-12-10 9:52 ` David Laight
2014-12-10 15:16 ` Alexander Duyck
2014-12-10 14:15 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue Jesper Dangaard Brouer
2014-12-11 19:15 ` David Miller
2014-12-10 14:15 ` [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool Jesper Dangaard Brouer
2014-12-10 14:15 ` [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache Jesper Dangaard Brouer
2014-12-10 14:22 ` [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) David Laight
2014-12-10 14:40 ` Jesper Dangaard Brouer
2014-12-10 15:17 ` Christoph Lameter
2014-12-10 15:33 ` Jesper Dangaard Brouer
[not found] ` <20141210163321.0e4e4fd2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-12-10 16:17 ` Christoph Lameter
2014-12-10 19:51 ` Christoph Lameter
2014-12-11 10:18 ` Jesper Dangaard Brouer
2014-12-10 18:32 ` [net-next PATCH 0/6] net: Alloc NAPI page frags from their own pool David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).