Re: [Intel-wired-lan] [PATCH net-next v5 03/14] page_pool: avoid calling no-op externals when possible

From: Yunsheng Lin <linyunsheng@huawei.com>
To: Alexander Lobakin <aleksander.lobakin@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	Maciej Fijalkowski <maciej.fijalkowski@intel.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Larysa Zaremba <larysa.zaremba@intel.com>,
	netdev@vger.kernel.org, Alexander Duyck <alexanderduyck@fb.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-kernel@vger.kernel.org,
	Michal Kubiak <michal.kubiak@intel.com>,
	intel-wired-lan@lists.osuosl.org,
	David Christensen <drc@linux.vnet.ibm.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next v5 03/14] page_pool: avoid calling no-op externals when possible
Date: Sat, 25 Nov 2023 21:04:28 +0800	[thread overview]
Message-ID: <6bd14aa9-fa65-e4f6-579c-3a1064b2a382@huawei.com> (raw)
In-Reply-To: <20231124154732.1623518-4-aleksander.lobakin@intel.com>

On 2023/11/24 23:47, Alexander Lobakin wrote:
> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
> even on DMA-coherent platforms (like x86) with no active IOMMU or
> swiotlb, just for the call ladder.
> Indeed, it's
> 
> page_pool_put_page()
>   page_pool_put_defragged_page()                  <- external
>     __page_pool_put_page()
>       page_pool_dma_sync_for_device()             <- non-inline
>         dma_sync_single_range_for_device()
>           dma_sync_single_for_device()            <- external
>             dma_direct_sync_single_for_device()
>               dev_is_dma_coherent()               <- exit
> 
> For the inline functions, no guarantees the compiler won't uninline them
> (they're clearly not one-liners and sometimes the compilers manage to
> uninline 2 + 2). The first external call is necessary, but the rest 2+
> are done for nothing each time, plus a bunch of checks here and there.
> Since Page Pool mappings are long-term and for one "device + addr" pair
> dma_need_sync() will always return the same value (basically, whether it
> belongs to an swiotlb pool), addresses can be tested once right after
> they're obtained and the result can be reused until the page is
> unmapped.
> Do this test after a successful DMA mapping and reuse the lowest bit of
> page::dma_addr to store the result, since Page Pool addresses are always
> page-aligned and the lowest bits are unused.
> page_pool_{g,s}et_dma_addr() will then take this bit into account and
> exclude it from the actual address. Call page_pool_dma_sync_for_device()
> only if the bit is set, shaving off several function calls back and
> forth. If pool::dma_sync is not set, i.e. the driver didn't ask to
> perform syncs, don't do this test and never touch the lowest bit.
> On my x86_64, this gives from 2% to 5% performance benefit with no
> negative impact for cases when IOMMU is on and the shortcut can't be
> used.
> 
> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com> # per-page flag

Sorry for not remembering the suggestion:(

If most of cases is that some pages need syncing and some page does not
need syncing for the same device, then perhaps per-page flag is more
beneficial, if not, maybe per-device flag is enough too?

What i wonder is that is there any reason this kind trick isn't done
in the dma layer instead of each subsystem or driver using dma api
duplicating this kind of trick?

At least xsk_buff seems to using the similar trick?

You may have explained that, it has been a litte long between v4 and v5,
so I may have forgotten about that.

> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/net/page_pool/helpers.h | 60 +++++++++++++++++++++++++++++----
>  net/core/page_pool.c            | 10 +++---
>  2 files changed, 58 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 4ebd544ae977..28bec368b8e9 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -52,6 +52,8 @@
>  #ifndef _NET_PAGE_POOL_HELPERS_H
>  #define _NET_PAGE_POOL_HELPERS_H
>  
> +#include <linux/dma-mapping.h>
> +
>  #include <net/page_pool/types.h>
>  
>  #ifdef CONFIG_PAGE_POOL_STATS
> @@ -354,6 +356,11 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>  	page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct);
>  }
>  
> +/* Occupies page::dma_addr[0:0] and indicates whether the mapping needs to be
> + * synchronized (either for device, for CPU, or both).
> + */
> +#define PAGE_POOL_DMA_ADDR_NEED_SYNC		BIT(0)

It seems we have less dma space from 32 bit arch with 64 bit dma address:(
But from my last searching in google, it seems we still have enough bit for
that:

arm: Large Physical Address Extension or LPAE, 40 bits of phys addr.
arc: Physical Address Extension or PAE, 40 bits of phys addr.
mips: eXtended Physical Addressing or PXA, 40 bits of phys addr.
powerpc: does not seems to have a name for the feature, and have 36
         bits of phys addr.
riscv: large physical address, 34 bits of phys addr.
x86: Physical Address Extension or PAE, 36 bits of phys addr.

> +
>  /**
>   * page_pool_get_dma_addr() - Retrieve the stored DMA address.
>   * @page:	page allocated from a page pool
> @@ -363,27 +370,66 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>   */
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret = page->dma_addr & ~PAGE_POOL_DMA_ADDR_NEED_SYNC;
>  
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
> -		ret <<= PAGE_SHIFT;
> +		ret <<= PAGE_SHIFT - 1;
>  
>  	return ret;
>  }
>  
> -static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> +/**
> + * page_pool_set_dma_addr - save the obtained DMA address for @page
> + * @pool: &page_pool this page belongs to and was mapped by
> + * @page: page to save the address for
> + * @addr: DMA address to save
> + *
> + * If the pool was created with the DMA synchronization enabled, it will test
> + * whether the address needs to be synced and store the result as well to
> + * conserve CPU cycles.
> + *
> + * Return: false if the address doesn't fit into the corresponding field and
> + * thus can't be stored, true on success.
> + */
> +static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
> +					  struct page *page,
> +					  dma_addr_t addr)
>  {
> +	unsigned long val = addr;
> +
> +	if (unlikely(!addr)) {
> +		page->dma_addr = 0;
> +		return true;
> +	}

The above seems unrelated change?

> +
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
> -		page->dma_addr = addr >> PAGE_SHIFT;
> +		val = addr >> (PAGE_SHIFT - 1);
>  
>  		/* We assume page alignment to shave off bottom bits,
>  		 * if this "compression" doesn't work we need to drop.
>  		 */
> -		return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT;
> +		if (addr != (dma_addr_t)val << (PAGE_SHIFT - 1))
> +			return false;
>  	}
>  
> -	page->dma_addr = addr;
> -	return false;
> +	if (pool->dma_sync && dma_need_sync(pool->p.dev, addr))
> +		val |= PAGE_POOL_DMA_ADDR_NEED_SYNC;
> +
> +	page->dma_addr = val;
> +
> +	return true;
> +}
> +

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan