From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: <netdev@vger.kernel.org>, <lirongqing@baidu.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
Saeed Mahameed <saeedm@mellanox.com>, <mhocko@kernel.org>,
<peterz@infradead.org>, <linux-kernel@vger.kernel.org>,
brouer@redhat.com
Subject: Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
Date: Thu, 19 Dec 2019 13:15:00 +0100
Message-ID: <20191219131500.47970427@carbon>
In-Reply-To: <40fb6aff-beec-f186-2bc0-187ad370cf0b@huawei.com>

On Thu, 19 Dec 2019 09:52:14 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2019/12/18 16:01, Jesper Dangaard Brouer wrote:
> > The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> > not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> >
> > The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> > recycle non-reusable pages") was to have RX-pages that belong to the
> > same NUMA node as the CPU processing the RX-packet during softirq/NAPI,
> > as illustrated by the performance measurements.
> >
> > This patch moves the NAPI checks out of fast-path, and at the same time
> > solves the NUMA_NO_NODE issue.
> >
> > First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> > will look up the current CPU nid (NUMA ID) via numa_mem_id(), which is
> > used as the preferred nid. It is only in rare situations, e.g. when a
> > NUMA zone runs dry, that a page doesn't get allocated from the
> > preferred nid. The page_pool API allows drivers to control the nid
> > themselves via pool->p.nid.
> >
> > This patch moves the NAPI check to when alloc cache is refilled, via
> > dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
> > pages from remote NUMA into the ptr_ring, as the dequeue/consume step
> > will check the NUMA node. All current drivers using page_pool will
> > alloc/refill RX-ring from same CPU running softirq/NAPI process.
> >
> > Drivers that control the nid explicitly also use page_pool_update_nid
> > when changing the nid at runtime. To speed up the transition to the new
> > nid, the alloc cache is now flushed on nid changes. This forces pages to
> > come from the ptr_ring, which does the appropriate nid check.
> >
> > For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> > node, the ptr_ring will be emptied in chunks of 65
> > (PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
> > through to the real page-allocator with the new nid derived from
> > numa_mem_id(). We accept that transitioning the alloc cache doesn't
> > happen immediately.
> >
> > Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> > Reported-by: Li RongQing <lirongqing@baidu.com>
> > Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> > net/core/page_pool.c | 82 ++++++++++++++++++++++++++++++++++++++------------
> > 1 file changed, 63 insertions(+), 19 deletions(-)
> >
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..bd4f8b2c46b6 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -96,10 +96,61 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
> > }
> > EXPORT_SYMBOL(page_pool_create);
> >
> > +static void __page_pool_return_page(struct page_pool *pool, struct page *page);
>
> Is it possible to avoid the forward declaration by moving
> __page_pool_return_page()? Maybe it is OK since this patch is targeting
> net-next?
>
> > +
> > +noinline
> > +static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
> > +						 bool refill)
> > +{
> > +	struct ptr_ring *r = &pool->ring;
> > +	struct page *first_page, *page;
> > +	int i, curr_nid;
> > +
> > +	/* Quicker fallback, avoid locks when ring is empty */
> > +	if (__ptr_ring_empty(r))
> > +		return NULL;
> > +
> > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > +	 */
> > +	curr_nid = numa_mem_id();
> > +
> > +	/* Slower-path: Get pages from locked ring queue */
> > +	spin_lock(&r->consumer_lock);
> > +	first_page = __ptr_ring_consume(r);
> > +
> > +	/* Fallback to page-allocator if NUMA node doesn't match */
> > +	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
> > +		__page_pool_return_page(pool, first_page);
> > +		first_page = NULL;
> > +	}
> > +
> > +	if (unlikely(!refill))
> > +		goto out;
> > +
> > +	/* Refill alloc array, but only if NUMA node match */
> > +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> > +		page = __ptr_ring_consume(r);
> > +		if (unlikely(!page))
> > +			break;
> > +
> > +		if (likely(page_to_nid(page) == curr_nid)) {
> > +			pool->alloc.cache[pool->alloc.count++] = page;
> > +		} else {
> > +			/* Release page to page-allocator, assume
> > +			 * refcnt == 1 invariant of cached pages
> > +			 */
> > +			__page_pool_return_page(pool, page);
> > +		}
> > +	}
>
> The above code seems not to clear all the pages in the ptr_ring that
> are not on the local node in some cases?
>
> I am not so familiar with asm, but does the code below make sense and
> generate better asm code?

I'm not too concerned with ASM-level optimization for this function
call, as it only happens once every 64 packets.

> 	struct page *page = NULL;
>
> 	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL || !refill) {
> 		page = __ptr_ring_consume(r);
>
> 		if (unlikely(!page || !refill))
> 			break;
>
> 		if (likely(page_to_nid(page) == curr_nid)) {
> 			pool->alloc.cache[pool->alloc.count++] = page;
> 		} else {
> 			/* Release page to page-allocator, assume
> 			 * refcnt == 1 invariant of cached pages
> 			 */
> 			__page_pool_return_page(pool, page);
> 		}
> 	}
>
> out:
> 	if (likely(refill && pool->alloc.count > 0))
> 		page = pool->alloc.cache[--pool->alloc.count];
>
> 	spin_unlock(&r->consumer_lock);
>
> 	return page;
>
>
> "The above code has not been compiled or tested yet."
>
> The above will clear all the pages in the ptr_ring that are not on the
> local node and treat the refill and !refill cases consistently.

I don't want to empty the entire ptr_ring in one go. That is
problematic, because we are running in softirq with bh + preemption
disabled. Returning 1024 pages will undoubtedly trigger some page
buddy coalescing work. That is why I chose to return at most 65 pages
(I felt this detail was important enough to mention in the description
above).
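
To make the bounded-drain argument concrete, here is a minimal userspace
sketch, not kernel code. All names (fake_pool, fake_page, bounded_refill,
calls_to_drain, the FAKE_* constants) are hypothetical stand-ins; the ring
is a plain non-wrapping array and "returning a page to the page-allocator"
is only counted. It assumes the same shape as the refill loop in the patch:
one first page plus at most 64 more per call.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_RING_SIZE    1024	/* assumed ring capacity for the example */
#define FAKE_CACHE_REFILL 64	/* mirrors PP_ALLOC_CACHE_REFILL */
#define FAKE_CACHE_SIZE   128	/* assumed alloc cache capacity */

struct fake_page { int nid; };

struct fake_pool {
	struct fake_page pages[FAKE_RING_SIZE];	/* backing storage */
	struct fake_page *ring[FAKE_RING_SIZE];	/* non-wrapping "ptr_ring" */
	size_t head, tail;			/* consume at head */
	struct fake_page *cache[FAKE_CACHE_SIZE];
	size_t cache_count;
	size_t returned;	/* pages handed back to the "page-allocator" */
};

/* Fill the ring with n pages that all live on NUMA node nid. */
static void fake_pool_init(struct fake_pool *p, size_t n, int nid)
{
	size_t i;

	assert(n <= FAKE_RING_SIZE);
	p->head = 0;
	p->tail = n;
	p->cache_count = 0;
	p->returned = 0;
	for (i = 0; i < n; i++) {
		p->pages[i].nid = nid;
		p->ring[i] = &p->pages[i];
	}
}

static struct fake_page *fake_ring_consume(struct fake_pool *p)
{
	return p->head == p->tail ? NULL : p->ring[p->head++];
}

/* Consume at most FAKE_CACHE_REFILL + 1 ring entries per call, so the
 * work done in one call stays bounded even if every page is remote.
 */
static struct fake_page *bounded_refill(struct fake_pool *p, int curr_nid)
{
	struct fake_page *first = fake_ring_consume(p);
	struct fake_page *page;
	int i;

	if (first && first->nid != curr_nid) {
		p->returned++;		/* remote page: give it back */
		first = NULL;
	}

	for (i = 0; i < FAKE_CACHE_REFILL; i++) {
		page = fake_ring_consume(p);
		if (!page)
			break;
		if (page->nid == curr_nid) {
			assert(p->cache_count < FAKE_CACHE_SIZE);
			p->cache[p->cache_count++] = page;
		} else {
			p->returned++;
		}
	}
	return first;
}

/* Count how many refill calls it takes until the ring is empty. */
static unsigned int calls_to_drain(struct fake_pool *p, int curr_nid)
{
	unsigned int calls = 0;

	while (p->head != p->tail) {
		bounded_refill(p, curr_nid);
		p->cache_count = 0;	/* pretend the cached pages were used */
		calls++;
	}
	return calls;
}
```

In this model, a full ring of 1024 remote pages is released in 16 separate
calls (15 calls of 65 pages, then one of 49) instead of one long burst.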

I do acknowledge that the code can be improved. What I don't like about
my own code is that I handle the 'first_page' as a special case. Your
code did solve that case, so I'll try to improve my code and send V5.

>
> But for the refill case, the pool->alloc.count may be PP_ALLOC_CACHE_REFILL - 1
> after page_pool_refill_alloc_cache() returns.
>
>
> > +out:
> > +	spin_unlock(&r->consumer_lock);
> > +	return first_page;
> > +}
> > +
> >  /* fast path */
> >  static struct page *__page_pool_get_cached(struct page_pool *pool)
> >  {
> > -	struct ptr_ring *r = &pool->ring;
> >  	bool refill = false;
> >  	struct page *page;
> >
> > @@ -113,20 +164,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
> >  		refill = true;
> >  	}
> >
> > -	/* Quicker fallback, avoid locks when ring is empty */
> > -	if (__ptr_ring_empty(r))
> > -		return NULL;
> > -
> > -	/* Slow-path: Get page from locked ring queue,
> > -	 * refill alloc array if requested.
> > -	 */
> > -	spin_lock(&r->consumer_lock);
> > -	page = __ptr_ring_consume(r);
> > -	if (refill)
> > -		pool->alloc.count = __ptr_ring_consume_batched(r,
> > -							pool->alloc.cache,
> > -							PP_ALLOC_CACHE_REFILL);
> > -	spin_unlock(&r->consumer_lock);
> > +	page = page_pool_refill_alloc_cache(pool, refill);
> >  	return page;
> >  }
> >
> > @@ -311,13 +349,10 @@ static bool __page_pool_recycle_direct(struct page *page,
> >
> >  /* page is NOT reusable when:
> >   * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> > - * 2) belongs to a different NUMA node than pool->p.nid.
> > - *
> > - * To update pool->p.nid users must call page_pool_update_nid.
> >   */
> >  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> >  {
> > -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> > +	return !page_is_pfmemalloc(page);
> >  }
> >
> >  void __page_pool_put_page(struct page_pool *pool, struct page *page,
> > @@ -484,7 +519,16 @@ EXPORT_SYMBOL(page_pool_destroy);
> >  /* Caller must provide appropriate safe context, e.g. NAPI. */
> >  void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >  {
> > +	struct page *page;
> > +
> > +	WARN_ON(!in_serving_softirq());
> >  	trace_page_pool_update_nid(pool, new_nid);
> >  	pool->p.nid = new_nid;
> > +
> > +	/* Flush pool alloc cache, as refill will check NUMA node */
> > +	while (pool->alloc.count) {
> > +		page = pool->alloc.cache[--pool->alloc.count];
> > +		__page_pool_return_page(pool, page);
> > +	}
> >  }
> >  EXPORT_SYMBOL(page_pool_update_nid);
> >
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer