From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: netdev@vger.kernel.org, lirongqing@baidu.com,
linyunsheng@huawei.com, Saeed Mahameed <saeedm@mellanox.com>,
mhocko@kernel.org, peterz@infradead.org,
linux-kernel@vger.kernel.org, brouer@redhat.com
Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
Date: Mon, 23 Dec 2019 17:52:57 +0100 [thread overview]
Message-ID: <20191223175257.164557cd@carbon> (raw)
In-Reply-To: <20191223075700.GA5333@apalos.home>
On Mon, 23 Dec 2019 09:57:00 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> Hi Jesper,
>
> Looking at the overall path again, i still need we need to reconsider
> pool->p.nid semantics.
>
> As i said i like the patch and the whole functionality and code seems fine,
> but here's the current situation.
> If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> page_pool_update_nid() the whole behavior feels a liitle odd.
As soon as driver uses page_pool_update_nid() than means they want to
control the NUMA placement explicitly. As soon as that happens, it is
the drivers responsibility and choice, and page_pool API must respect
that (and not automatically change that behind drivers back).
> page_pool_update_nid() first check will always be true since .nid =
> NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> overwriting what the user initially coded in.
>
> This is close to what i proposed in the previous mails on this
> thread. Always store a real nid even if the user explicitly requests
> NUMA_NO_NODE.
>
> So semantics is still a problem. I'll stick to what we initially
> suggested.
> 1. We either *always* store a real nid
> or
> 2. If NUMA_NO_NODE is present ignore every other check and recycle
> the memory blindly.
>
Hmm... I actually disagree with both 1 and 2.
My semantics proposal:
If driver configures page_pool with NUMA_NO_NODE, then page_pool tried
to help get the best default performance. (Which according to
performance measurements is to have RX-pages belong to the NUMA node
RX-processing runs on).
The reason I want this behavior is that during driver init/boot, it can
easily happen that a driver allocates RX-pages from wrong NUMA node.
This will cause a performance slowdown, that normally doesn't happen,
because without a cache (like page_pool) RX-pages would fairly quickly
transition over to the RX NUMA node (instead we keep recycling these,
in your case #2, where you suggest recycle blindly in case of
NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
driver developers.
--Jesper
> On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > wrote:
> > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > >
> > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > Brouer wrote:
> > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > >
> > > > > > Hi Jesper,
> > > > > >
> > > > > > I like the overall approach since this moves the check out
> > > > > > of the hotpath. @Saeed, since i got no hardware to test
> > > > > > this on, would it be possible to check that it still works
> > > > > > fine for mlx5?
> > > > > >
> > > > > > [...]
> > > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > > + struct page *page;
> > > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > > +
> > > > > > > + /* Quicker fallback, avoid locks when ring is
> > > > > > > empty */
> > > > > > > + if (__ptr_ring_empty(r))
> > > > > > > + return NULL;
> > > > > > > +
> > > > > > > + /* Softirq guarantee CPU and thus NUMA node is
> > > > > > > stable. This,
> > > > > > > + * assumes CPU refilling driver RX-ring will
> > > > > > > also run RX-NAPI.
> > > > > > > + */
> > > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > numa_mem_id() : pool->p.nid;
> > > > > >
> > > > > > One of the use cases for this is that during the allocation
> > > > > > we are not guaranteed to pick up the correct NUMA node.
> > > > > > This will get automatically fixed once the driver starts
> > > > > > recycling packets.
> > > > > >
> > > > > > I don't feel strongly about this, since i don't usually
> > > > > > like hiding value changes from the user but, would it make
> > > > > > sense to move this into __page_pool_alloc_pages_slow() and
> > > > > > change the pool->p.nid?
> > > > > >
> > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > numa_mem_id() regardless, why not store the actual node in
> > > > > > our page pool information? You can then skip this and check
> > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > configured.
> > > > >
> > > > > This single code line helps support that drivers can control
> > > > > the nid themselves. This is a feature that is only used my
> > > > > mlx5 AFAIK.
> > > > >
> > > > > I do think that is useful to allow the driver to "control"
> > > > > the nid, as pinning/preferring the pages to come from the
> > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > installed in does have benefits.
> > > >
> > > > Sure you can keep the if statement as-is, it won't break
> > > > anything. Would we want to store the actual numa id in
> > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?
> > >
> > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > > dynamic. If someone moves an RX IRQ to another CPU on another
> > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > transitioned automatically.
> > Ok this assumed that drivers were going to use
> > page_pool_nid_changed(), but with the current code we don't have to
> > force them to do that. Let's keep this as-is.
> >
> > I'll be running a few more tests and wait in case Saeed gets a
> > chance to test it and send my reviewed-by
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
next prev parent reply other threads:[~2019-12-23 16:53 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-17 23:17 [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition Jesper Dangaard Brouer
2019-12-18 7:44 ` Jesper Dangaard Brouer
2019-12-18 8:01 ` [net-next v4 " Jesper Dangaard Brouer
2019-12-18 14:27 ` 答复: " Li,Rongqing
2019-12-19 12:00 ` Jesper Dangaard Brouer
2019-12-19 12:47 ` 答复: " Li,Rongqing
2019-12-19 1:52 ` Yunsheng Lin
2019-12-19 12:15 ` Jesper Dangaard Brouer
2019-12-19 12:09 ` Michal Hocko
2019-12-19 13:35 ` Jesper Dangaard Brouer
2019-12-19 14:52 ` Michal Hocko
2019-12-19 15:28 ` Ilias Apalodimas
2019-12-19 14:20 ` [net-next v5 " Jesper Dangaard Brouer
2019-12-20 10:23 ` Ilias Apalodimas
2019-12-20 10:41 ` Jesper Dangaard Brouer
2019-12-20 10:49 ` Ilias Apalodimas
2019-12-20 15:22 ` Jesper Dangaard Brouer
2019-12-20 16:06 ` Ilias Apalodimas
2019-12-23 7:57 ` Ilias Apalodimas
2019-12-23 16:52 ` Jesper Dangaard Brouer [this message]
2019-12-23 22:10 ` Saeed Mahameed
2019-12-24 9:34 ` Ilias Apalodimas
2019-12-24 7:41 ` Ilias Apalodimas
2019-12-20 21:27 ` Saeed Mahameed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191223175257.164557cd@carbon \
--to=brouer@redhat.com \
--cc=ilias.apalodimas@linaro.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linyunsheng@huawei.com \
--cc=lirongqing@baidu.com \
--cc=mhocko@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=saeedm@mellanox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.