From: Paolo Abeni <pabeni@redhat.com>
To: Eric Dumazet <edumazet@google.com>
Cc: patchwork-bot+netdevbpf@kernel.org,
netdev <netdev@vger.kernel.org>,
David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Alexander Duyck <alexanderduyck@fb.com>
Subject: Re: [PATCH net-next v4] net: skb: introduce and use a single page frag cache
Date: Mon, 03 Oct 2022 23:40:43 +0200 [thread overview]
Message-ID: <d15aab527c1979e0bf539e8e1609f0770b4170fc.camel@redhat.com> (raw)
In-Reply-To: <06a1cdb330980e3df051c95ae089fd77afee839b.camel@redhat.com>
On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > >
> > > > Before the patch, cpus servicing NIC interrupts were allocating
> > > > SLAB/SLUB objects for incoming packets,
> > > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > > were processed. SLAB/SLUB caches
> > > > were efficient (hit ratio close to 100%)
> > >
> > > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> >
> > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > host with 256 cpus...
>
> The most similar host I can easily grab is a 2 numa nodes AMD with 16
> cores/32 threads each.
>
> I tried tcp_rr with different number of flows in (1-2048) range with
> both slub (I tried first that because is the default allocator in my
> builds) and slab but so far I can't reproduce it: results differences
> between pre-patch and post-patch kernel are within noise and numastat
> never show misses.
>
> I'm likely missing some incredient to the recipe. I'll try next to pin
> the netperf/netserver processes on a numa node different from the NIC's
> one and to increase the number of concurrent flows.
I'm still stuck trying to reproduce the issue. I tried pinning and
increasing the flows numbers, but I could not find a scenario with a
clear regression. I see some contention, but it's related to the timer
wheel spinlock, and independent from my patch.
Which kind of delta should I observe? Could you please share any
additional setup hints?
> I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> freeing deferral to per-cpu lists") the CPUs servicing the NIC
> interrupts both allocate and (defer) free the memory for the incoming
> packets, so they should not have to access remote caches ?!? Or at
> least isn't the allocator behavior always asymmetric, with rx alloc, rx
> free, tx free on the same core and tx alloc possibly on a different
> one?
>
> >
> > We could first try in tcp_stream_alloc_skb()
>
> I'll try to test something alike the following - after reproducing the
> issue.
> ---
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f15d5b62539b..d5e9be98e8bd 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
> return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> }
>
> +#if PAGE_SIZE == SZ_4K
> +
> +#define NAPI_HAS_SMALL_PAGE_FRAG 1
> +
> +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
> +
> +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> +{
> + if (size <= SKB_WITH_OVERHEAD(1024))
> + return __alloc_skb_fclone_frag(size, priority);
> +
> + return alloc_skb_fclone(size, priority);
> +}
> +
> +#else
> +#define NAPI_HAS_SMALL_PAGE_FRAG 0
> +
> +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> +{
> + return alloc_skb_fclone(size, priority);
> +}
> +
> +#endif
> +
> struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
> void skb_headers_offset_update(struct sk_buff *skb, int off);
> int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 81d63f95e865..0c63653c9951 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
> #define NAPI_SKB_CACHE_BULK 16
> #define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2)
>
> -#if PAGE_SIZE == SZ_4K
> +#if NAPI_HAS_SMALL_PAGE_FRAG
>
> -#define NAPI_HAS_SMALL_PAGE_FRAG 1
> #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc)
>
> /* specialized page frag allocator using a single order 0 page
> @@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
> nc->offset = offset;
> return nc->va + offset;
> }
> +
> #else
>
> /* the small page is actually unused in this build; add dummy helpers
> * to please the compiler and avoid later preprocessor's conditionals
> */
> -#define NAPI_HAS_SMALL_PAGE_FRAG 0
> #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) false
>
> struct page_frag_1k {
> @@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> }
> EXPORT_SYMBOL(__alloc_skb);
>
> +#if NAPI_HAS_SMALL_PAGE_FRAG
> +/* optimized skb fast-clone allocation variant, using the small
> + * page frag cache
> + */
> +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
> +{
> + struct sk_buff_fclones *fclones;
> + struct napi_alloc_cache *nc;
> + struct sk_buff *skb;
> + bool pfmemalloc;
> + void *data;
> +
> + /* Get the HEAD */
> + skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
> + if (unlikely(!skb))
> + return NULL;
> + prefetchw(skb);
> +
> + /* We can access the napi cache only in BH context */
> + nc = this_cpu_ptr(&napi_alloc_cache);
> + local_bh_disable();
For the record, the above is wrong, this_cpu_ptr must be after the
local_bh_disable() call.
Thanks,
Paolo
next prev parent reply other threads:[~2022-10-03 21:40 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-28 8:43 [PATCH net-next v4] net: skb: introduce and use a single page frag cache Paolo Abeni
2022-09-28 14:11 ` Eric Dumazet
2022-09-28 14:24 ` Alexander H Duyck
2022-09-30 2:21 ` patchwork-bot+netdevbpf
2022-09-30 16:43 ` Eric Dumazet
2022-09-30 17:30 ` Paolo Abeni
2022-09-30 17:45 ` Eric Dumazet
2022-10-02 14:55 ` Paolo Abeni
2022-10-03 21:40 ` Paolo Abeni [this message]
2022-10-03 22:30 ` Eric Dumazet
2022-10-05 7:33 ` Paolo Abeni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d15aab527c1979e0bf539e8e1609f0770b4170fc.camel@redhat.com \
--to=pabeni@redhat.com \
--cc=alexanderduyck@fb.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=patchwork-bot+netdevbpf@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).