All of lore.kernel.org
 help / color / mirror / Atom feed
From: Byungchul Park <byungchul@sk.com>
To: Mina Almasry <almasrymina@google.com>
Cc: linux-mm@kvack.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel_team@skhynix.com,
	harry.yoo@oracle.com, ast@kernel.org, daniel@iogearbox.net,
	davem@davemloft.net, kuba@kernel.org, hawk@kernel.org,
	john.fastabend@gmail.com, sdf@fomichev.me, saeedm@nvidia.com,
	leon@kernel.org, tariqt@nvidia.com, mbloch@nvidia.com,
	andrew+netdev@lunn.ch, edumazet@google.com, pabeni@redhat.com,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, horms@kernel.org, jackmanb@google.com,
	hannes@cmpxchg.org, ziy@nvidia.com, ilias.apalodimas@linaro.org,
	willy@infradead.org, brauner@kernel.org, kas@kernel.org,
	yuzhao@google.com, usamaarif642@gmail.com,
	baolin.wang@linux.alibaba.com, toke@redhat.com,
	asml.silence@gmail.com, bpf@vger.kernel.org,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH] mm, page_pool: introduce a new page type for page pool in page type
Date: Wed, 23 Jul 2025 13:46:10 +0900	[thread overview]
Message-ID: <20250723044610.GA80428@system.software.com> (raw)
In-Reply-To: <CAHS8izM11fxu6jHZw5VJsHXeZ+Tk+6ZBGDk0vHiOoHyXZoOvOg@mail.gmail.com>

On Tue, Jul 22, 2025 at 03:17:15PM -0700, Mina Almasry wrote:
> On Sun, Jul 20, 2025 at 10:49 PM Byungchul Park <byungchul@sk.com> wrote:
> >
> > Hi,
> >
> > I focused on converting the existing APIs accessing ->pp_magic field to
> > page type APIs.  However, yes.  Additional works would better be
> > considered on top like:
> >
> >    1. Adjust how to store and retrieve dma index.  Maybe network guys
> >       can work better on top.
> >
> >    2. Move the sanity check for page pool in mm/page_alloc.c to on free.
> >
> >    Byungchul
> >
> > ---8<---
> > From 7d207a1b3e9f4ff2a72f5b54b09e3ed0c4aaaca3 Mon Sep 17 00:00:00 2001
> > From: Byungchul Park <byungchul@sk.com>
> > Date: Mon, 21 Jul 2025 14:05:20 +0900
> > Subject: [PATCH] mm, page_pool: introduce a new page type for page pool in page type
> >
> > ->pp_magic field in struct page is current used to identify if a page
> > belongs to a page pool.  However, page type e.i. PGTY_netpp can be used
> > for that purpose.
> >
> > Use the page type APIs e.g. PageNetpp(), __SetPageNetpp(), and
> > __ClearPageNetpp() instead, and remove the existing APIs accessing
> > ->pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
> > netmem_clear_pp_magic() since they are totally replaced.
> >
> > This work was inspired by the following link by Pavel:
> >
> > [1] https://lore.kernel.org/all/582f41c0-2742-4400-9c81-0d46bf4e8314@gmail.com/
> >
> > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  2 +-
> >  include/linux/mm.h                            | 28 ++-----------------
> >  include/linux/page-flags.h                    |  6 ++++
> >  include/net/netmem.h                          |  2 +-
> >  mm/page_alloc.c                               |  4 +--
> >  net/core/netmem_priv.h                        | 16 ++---------
> >  net/core/page_pool.c                          | 10 +++++--
> >  7 files changed, 24 insertions(+), 44 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > index 5d51600935a6..def274f5c1ca 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > @@ -707,7 +707,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
> >                                 xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
> >                                 page = xdpi.page.page;
> >
> > -                               /* No need to check page_pool_page_is_pp() as we
> > +                               /* No need to check PageNetpp() as we
> >                                  * know this is a page_pool page.
> >                                  */
> >                                 page_pool_recycle_direct(pp_page_to_nmdesc(page)->pp,
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ae50c1641bed..736061749535 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4135,10 +4135,9 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >   * DMA mapping IDs for page_pool
> >   *
> >   * When DMA-mapping a page, page_pool allocates an ID (from an xarray) and
> > - * stashes it in the upper bits of page->pp_magic. We always want to be able to
> > - * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP
> > - * pages can have arbitrary kernel pointers stored in the same field as pp_magic
> > - * (since it overlaps with page->lru.next), so we must ensure that we cannot
> > + * stashes it in the upper bits of page->pp_magic. Non-PP pages can have
> > + * arbitrary kernel pointers stored in the same field as pp_magic (since
> > + * it overlaps with page->lru.next), so we must ensure that we cannot
> >   * mistake a valid kernel pointer with any of the values we write into this
> >   * field.
> >   *
> > @@ -4168,25 +4167,4 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >
> >  #define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
> >                                   PP_DMA_INDEX_SHIFT)
> > -
> > -/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
> > - * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
> > - * the head page of compound page and bit 1 for pfmemalloc page, as well as the
> > - * bits used for the DMA index. page_is_pfmemalloc() is checked in
> > - * __page_pool_put_page() to avoid recycling the pfmemalloc page.
> > - */
> > -#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
> > -
> > -#ifdef CONFIG_PAGE_POOL
> > -static inline bool page_pool_page_is_pp(const struct page *page)
> > -{
> > -       return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
> > -}
> > -#else
> > -static inline bool page_pool_page_is_pp(const struct page *page)
> > -{
> > -       return false;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 4fe5ee67535b..906ba7c9e372 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -957,6 +957,7 @@ enum pagetype {
> >         PGTY_zsmalloc           = 0xf6,
> >         PGTY_unaccepted         = 0xf7,
> >         PGTY_large_kmalloc      = 0xf8,
> > +       PGTY_netpp              = 0xf9,
> >
> >         PGTY_mapcount_underflow = 0xff
> >  };
> > @@ -1101,6 +1102,11 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> >  PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
> >  FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
> >
> > +/*
> > + * Marks page_pool allocated pages.
> > + */
> > +PAGE_TYPE_OPS(Netpp, netpp, netpp)
> > +
> >  /**
> >   * PageHuge - Determine if the page belongs to hugetlbfs
> >   * @page: The page to test.
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index f7dacc9e75fd..3667334e16e7 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -298,7 +298,7 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
> >   */
> >  #define pp_page_to_nmdesc(p)                                           \
> >  ({                                                                     \
> > -       DEBUG_NET_WARN_ON_ONCE(!page_pool_page_is_pp(p));               \
> > +       DEBUG_NET_WARN_ON_ONCE(!PageNetpp(p));                          \
> >         __pp_page_to_nmdesc(p);                                         \
> >  })
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2ef3c07266b3..71c7666e48a9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -898,7 +898,7 @@ static inline bool page_expected_state(struct page *page,
> >  #ifdef CONFIG_MEMCG
> >                         page->memcg_data |
> >  #endif
> > -                       page_pool_page_is_pp(page) |
> > +                       PageNetpp(page) |
> >                         (page->flags & check_flags)))
> >                 return false;
> >
> > @@ -925,7 +925,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> >         if (unlikely(page->memcg_data))
> >                 bad_reason = "page still charged to cgroup";
> >  #endif
> > -       if (unlikely(page_pool_page_is_pp(page)))
> > +       if (unlikely(PageNetpp(page)))
> >                 bad_reason = "page_pool leak";
> >         return bad_reason;
> >  }
> > diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
> > index cd95394399b4..39a97703d9ed 100644
> > --- a/net/core/netmem_priv.h
> > +++ b/net/core/netmem_priv.h
> > @@ -8,21 +8,11 @@ static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
> >         return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
> >  }
> >
> > -static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
> > -{
> > -       __netmem_clear_lsb(netmem)->pp_magic |= pp_magic;
> > -}
> > -
> > -static inline void netmem_clear_pp_magic(netmem_ref netmem)
> > -{
> > -       WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
> > -
> > -       __netmem_clear_lsb(netmem)->pp_magic = 0;
> > -}
> > -
> >  static inline bool netmem_is_pp(netmem_ref netmem)
> >  {
> > -       return (netmem_get_pp_magic(netmem) & PP_MAGIC_MASK) == PP_SIGNATURE;
> > +       if (netmem_is_net_iov(netmem))
> > +               return true;
> 
> As Pavel alludes, this is dubious, and at least it's difficult to
> reason about it.
> 
> There could be net_iovs that are not attached to pp, and should not be
> treated as pp memory. These are in the devmem (and future net_iov) tx
> paths.
> 
> We need a way to tell if a net_iov is pp or not. A couple of options:
> 
> 1. We could have it such that if net_iov->pp is set, then the
> netmem_is_pp == true, otherwise false.
> 2. We could implement a page-flags equivalent for net_iov.
> 
> Option #1 is simpler and is my preferred. To do that properly, you need to:
> 
> 1. Make sure everywhere net_iovs are allocated that pp=NULL in the
> non-pp case and pp=non NULL in the pp case. those callsites are
> net_devmem_bind_dmabuf (devmem rx & tx path), io_zcrx_create_area
> (io_uring rx path).
> 
> 2. Change netmem_is_pp to check net_iov->pp in the net_iov case.

Good idea, but I'm not sure if I could work on it without consuming your
additional review efforts.  Can anyone add net_iov_is_pp() helper?

Or use the page type, Netpp, as an additional way to identify if it's a
pp page for system memory, keeping the current way using ->pp_magic.
So the page type, Netpp, is used for system memory, and ->pp_magic is
used for net_iov.  The clean up for ->pp_magic can be done if needed.

	Byungchul

> --
> Thanks,
> Mina

  reply	other threads:[~2025-07-23  4:46 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-21  5:49 [PATCH] mm, page_pool: introduce a new page type for page pool in page type Byungchul Park
2025-07-21  8:05 ` David Hildenbrand
2025-07-21  8:19   ` Byungchul Park
2025-07-21  8:49     ` David Hildenbrand
2025-07-22  1:04       ` Byungchul Park
2025-07-28 10:57       ` Byungchul Park
2025-07-21 11:12 ` Pavel Begunkov
2025-07-22  1:03   ` Byungchul Park
2025-07-22 22:20   ` Mina Almasry
2025-07-22 22:17 ` Mina Almasry
2025-07-23  4:46   ` Byungchul Park [this message]
2025-07-24 21:23     ` Mina Almasry
2025-07-25  0:26       ` Byungchul Park
2025-07-25  6:50   ` Byungchul Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250723044610.GA80428@system.software.com \
    --to=byungchul@sk.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=asml.silence@gmail.com \
    --cc=ast@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bpf@vger.kernel.org \
    --cc=brauner@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=edumazet@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=jackmanb@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=kas@kernel.org \
    --cc=kernel_team@skhynix.com \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mbloch@nvidia.com \
    --cc=mhocko@suse.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=rppt@kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=surenb@google.com \
    --cc=tariqt@nvidia.com \
    --cc=toke@redhat.com \
    --cc=usamaarif642@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.