From: Jesper Dangaard Brouer <brouer@redhat.com>
To: netdev@vger.kernel.org
Cc: lirongqing@baidu.com, linyunsheng@huawei.com,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Saeed Mahameed <saeedm@mellanox.com>,
	mhocko@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, brouer@redhat.com
Subject: Re: [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
Date: Wed, 18 Dec 2019 08:44:37 +0100	[thread overview]
Message-ID: <20191218084437.6db92d32@carbon> (raw)
In-Reply-To: <157662465670.141017.5588210646574716982.stgit@firesoul>

On Wed, 18 Dec 2019 00:17:36 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> 
> The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> recycle non-reusable pages") was to have RX-pages that belong to the
> same NUMA node as the CPU processing RX-packets during softirq/NAPI,
> as illustrated by the performance measurements.
> 
> This patch moves the NUMA checks out of the fast-path, and at the same
> time solves the NUMA_NO_NODE issue.
> 
> First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> will look up the current CPU nid (NUMA ID) via numa_mem_id(), which is
> used as the preferred nid.  It is only in rare situations, e.g. when a
> NUMA zone runs dry, that a page doesn't get allocated from the
> preferred nid.  The page_pool API allows drivers to control the nid
> themselves via controlling pool->p.nid.
> 
> This patch moves the NUMA check to when the alloc cache is refilled,
> via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
> placing pages from a remote NUMA node into the ptr_ring, as the
> dequeue/consume step will check the NUMA node. All current drivers
> using page_pool will alloc/refill the RX-ring from the same CPU that
> runs the softirq/NAPI process.
> 
> Drivers that control the nid explicitly must also use
> page_pool_update_nid() when changing the nid at runtime.  To speed up
> the transition to a new nid, the alloc cache is now flushed on nid
> changes.  This forces pages to come from the ptr_ring, which does the
> appropriate nid check.
> 
> For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> node, the ptr_ring will be emptied in chunks of 65
> (PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
> through to the real page-allocator with the new nid derived from
> numa_mem_id(). We accept that transitioning the alloc cache doesn't
> happen immediately.
> 
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/page_pool.c |   64 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 49 insertions(+), 15 deletions(-)
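
As a side-note on the nid resolution mentioned in the description
above, here is a minimal sketch of the allocation step (the real call
sits in __page_pool_alloc_pages_slow(); pp_alloc_one is a made-up
name for illustration):

static struct page *pp_alloc_one(struct page_pool *pool, gfp_t gfp)
{
	/* With pool->p.nid == NUMA_NO_NODE, alloc_pages_node()
	 * substitutes numa_mem_id(), i.e. the nid of the CPU
	 * currently executing, as the preferred node.
	 */
	return alloc_pages_node(pool->p.nid, gfp, pool->p.order);
}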

I'm going to send a V4, because GCC doesn't generate the optimal ASM
code for the fast-path (details below).

> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..37316ea66937 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -96,19 +96,22 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>  }
>  EXPORT_SYMBOL(page_pool_create);
>  
> +static void __page_pool_return_page(struct page_pool *pool, struct page *page);
> +
>  /* fast path */
>  static struct page *__page_pool_get_cached(struct page_pool *pool)
>  {
>  	struct ptr_ring *r = &pool->ring;
> +	struct page *first_page, *page;
>  	bool refill = false;
> -	struct page *page;
> +	int i, curr_nid;
>  
>  	/* Test for safe-context, caller should provide this guarantee */
>  	if (likely(in_serving_softirq())) {
>  		if (likely(pool->alloc.count)) {
>  			/* Fast-path */
> -			page = pool->alloc.cache[--pool->alloc.count];
> -			return page;
> +			first_page = pool->alloc.cache[--pool->alloc.count];
> +			return first_page;
>  		}
>  		refill = true;
>  	}

The compiler (gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)) doesn't
generate optimal ASM code for the likely fast-path where
pool->alloc.cache contains an element.  It does "isolate" it and
return (retq) early in this likely case, BUT it chooses %r15 (a
call-preserved register, aka callee-saved register).  Later, other
call-preserved registers are used as well, which leads to pushing all
of these registers (%rbx, %rbp, and %r12-%r15) onto the stack.

0000000000000af0 <page_pool_alloc_pages>:
{
 af0:   e8 00 00 00 00          callq  af5 <page_pool_alloc_pages+0x5>
                        af1: R_X86_64_PLT32     __fentry__-0x4
 af5:   41 57                   push   %r15
 af7:   41 56                   push   %r14
 af9:   41 89 f6                mov    %esi,%r14d
 afc:   41 55                   push   %r13
 afe:   41 54                   push   %r12
 b00:   55                      push   %rbp
 b01:   48 89 fd                mov    %rdi,%rbp
 b04:   53                      push   %rbx
 b05:   48 83 ec 08             sub    $0x8,%rsp
 b09:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # b10 <page_pool_alloc_pages+0x20>
                        b0c: R_X86_64_PC32      __preempt_count-0x4
        if (likely(in_serving_softirq())) {
 b10:   f6 c4 01                test   $0x1,%ah
 b13:   74 60                   je     b75 <page_pool_alloc_pages+0x85>
                if (likely(pool->alloc.count)) {
 b15:   8b 87 c0 00 00 00       mov    0xc0(%rdi),%eax
 b1b:   85 c0                   test   %eax,%eax
 b1d:   0f 84 94 01 00 00       je     cb7 <page_pool_alloc_pages+0x1c7>
                        first_page = pool->alloc.cache[--pool->alloc.count];
 b23:   83 e8 01                sub    $0x1,%eax
 b26:   89 87 c0 00 00 00       mov    %eax,0xc0(%rdi)
 b2c:   4c 8b bc c7 c8 00 00    mov    0xc8(%rdi,%rax,8),%r15
 b33:   00 
        if (page)
 b34:   4d 85 ff                test   %r15,%r15
 b37:   74 23                   je     b5c <page_pool_alloc_pages+0x6c>
}
 b39:   48 83 c4 08             add    $0x8,%rsp
 b3d:   4c 89 f8                mov    %r15,%rax
 b40:   5b                      pop    %rbx
 b41:   5d                      pop    %rbp
 b42:   41 5c                   pop    %r12
 b44:   41 5d                   pop    %r13
 b46:   41 5e                   pop    %r14
 b48:   41 5f                   pop    %r15
 b4a:   c3                      retq   
 [...]
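
One way to coax GCC into keeping the fast-path on caller-saved
registers is to move the refill logic out-of-line, so no value is live
across a call in the hot path.  A rough sketch of that restructure
(not necessarily the exact code V4 will carry; the refill loop into
pool->alloc.cache is elided here):

static noinline
struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
					  bool refill)
{
	struct ptr_ring *r = &pool->ring;
	struct page *page;
	int curr_nid;

	if (__ptr_ring_empty(r))
		return NULL;

	/* Softirq guarantees CPU, and thus NUMA node, is stable */
	curr_nid = numa_mem_id();

	spin_lock(&r->consumer_lock);
	page = __ptr_ring_consume(r);
	if (page && unlikely(page_to_nid(page) != curr_nid)) {
		__page_pool_return_page(pool, page);
		page = NULL;
	}
	/* refill of pool->alloc.cache elided in this sketch;
	 * the 'refill' parameter would gate that loop
	 */
	spin_unlock(&r->consumer_lock);
	return page;
}

static struct page *__page_pool_get_cached(struct page_pool *pool)
{
	bool refill = false;
	struct page *page;

	/* Test for safe-context, caller should provide this guarantee */
	if (likely(in_serving_softirq())) {
		if (likely(pool->alloc.count)) {
			/* Fast-path: compiles to load+test+retq,
			 * no push/pop of call-preserved registers
			 */
			page = pool->alloc.cache[--pool->alloc.count];
			return page;
		}
		refill = true;
	}

	return page_pool_refill_alloc_cache(pool, refill);
}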


> @@ -117,17 +120,42 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>  	if (__ptr_ring_empty(r))
>  		return NULL;
>  
> -	/* Slow-path: Get page from locked ring queue,
> -	 * refill alloc array if requested.
> +	/* Softirq guarantees CPU, and thus NUMA node, is stable. This
> +	 * assumes the CPU refilling the driver RX-ring also runs RX-NAPI.
> +	 */
> +	curr_nid = numa_mem_id();
> +
> +	/* Slower-path: Get pages from locked ring queue */
>  	spin_lock(&r->consumer_lock);
> -	page = __ptr_ring_consume(r);
> -	if (refill)
> -		pool->alloc.count = __ptr_ring_consume_batched(r,
> -							pool->alloc.cache,
> -							PP_ALLOC_CACHE_REFILL);
> +	first_page = __ptr_ring_consume(r);
> +
> +	/* Fallback to page-allocator if NUMA node doesn't match */
> +	if (first_page && unlikely(page_to_nid(first_page) != curr_nid)) {
> +		__page_pool_return_page(pool, first_page);
> +		first_page = NULL;
> +	}
> +
> +	if (unlikely(!refill))
> +		goto out;
> +
> +	/* Refill alloc array, but only if NUMA node match */
> +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> +		page = __ptr_ring_consume(r);
> +		if (unlikely(!page))
> +			break;
> +
> +		if (likely(page_to_nid(page) == curr_nid)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +		} else {
> +			/* Release page to page-allocator, assume
> +			 * refcnt == 1 invariant of cached pages
> +			 */
> +			__page_pool_return_page(pool, page);
> +		}
> +	}
> +out:
>  	spin_unlock(&r->consumer_lock);
> -	return page;
> +	return first_page;
>  }
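
For completeness, the driver-side contract mentioned in the
description: a driver that pins pool->p.nid explicitly is expected to
call page_pool_update_nid() when its RX-IRQ (and thus the NAPI poll)
migrates to a CPU on a different NUMA node.  A hypothetical sketch
(the mydrv_* names are made up):

static int mydrv_napi_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_rxq *rxq = container_of(napi, struct mydrv_rxq, napi);
	int curr_nid = numa_mem_id();

	/* NUMA migration detected: retarget the pool, which also
	 * flushes the alloc cache so the nid check in the ptr_ring
	 * consume step takes effect quickly.
	 */
	if (unlikely(rxq->page_pool->p.nid != curr_nid))
		page_pool_update_nid(rxq->page_pool, curr_nid);

	return mydrv_rx_process(rxq, budget);
}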

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Thread overview: 24+ messages
2019-12-17 23:17 [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition Jesper Dangaard Brouer
2019-12-18  7:44 ` Jesper Dangaard Brouer [this message]
2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
2019-12-18 14:27     ` Re: " Li,Rongqing
2019-12-19 12:00       ` Jesper Dangaard Brouer
2019-12-19 12:47         ` Re: " Li,Rongqing
2019-12-19  1:52     ` Yunsheng Lin
2019-12-19 12:15       ` Jesper Dangaard Brouer
2019-12-19 12:09     ` Michal Hocko
2019-12-19 13:35       ` Jesper Dangaard Brouer
2019-12-19 14:52         ` Michal Hocko
2019-12-19 15:28           ` Ilias Apalodimas
2019-12-19 14:20   ` [net-next v5 " Jesper Dangaard Brouer
2019-12-20 10:23     ` Ilias Apalodimas
2019-12-20 10:41       ` Jesper Dangaard Brouer
2019-12-20 10:49         ` Ilias Apalodimas
2019-12-20 15:22           ` Jesper Dangaard Brouer
2019-12-20 16:06             ` Ilias Apalodimas
2019-12-23  7:57               ` Ilias Apalodimas
2019-12-23 16:52                 ` Jesper Dangaard Brouer
2019-12-23 22:10                   ` Saeed Mahameed
2019-12-24  9:34                     ` Ilias Apalodimas
2019-12-24  7:41                   ` Ilias Apalodimas
2019-12-20 21:27       ` Saeed Mahameed
