public inbox for netdev@vger.kernel.org
* [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
@ 2026-02-28 14:12 Eric Dumazet
  2026-02-28 19:51 ` Kuniyuki Iwashima
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-02-28 14:12 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, netdev, eric.dumazet,
	Eric Dumazet, Vlastimil Babka, David Rientjes, Roman Gushchin

skbuff_small_head is used both on receive and send paths,
serving potentially 80 million allocations and frees per second.

Tuning it on large servers has been problematic, especially
on AMD Turin platforms, where "lock cmpxchg16b" latency can
be over 30,000 cycles.

Switching to SLUB sheaves fixes the issue nicely.

tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
on AMD Turin.

Other platforms show benefits with tcp_rr with more than 30,000
flows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
---
 net/core/skbuff.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 513cbfed19bc34bbb6767cdd7a50dad68be430fb..79eb7eb6eea9aa4a76c555e6ddd33bf0bc84c921 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5174,6 +5174,19 @@ static void skb_extensions_init(void) {}
 
 void __init skb_init(void)
 {
+	struct kmem_cache_args kmem_args_small_head = {
+		.align		= 0,
+		.ctor		= NULL,
+		/* usercopy should only access first SKB_SMALL_HEAD_HEADROOM
+		 * bytes.
+		 * struct skb_shared_info is located at the end of skb->head,
+		 * and should not be copied to/from user.
+		 */
+		.useroffset	= 0,
+		.usersize	= SKB_SMALL_HEAD_HEADROOM,
+		.sheaf_capacity = 32,
+	};
+
 	net_hotdata.skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
@@ -5189,17 +5202,13 @@ void __init skb_init(void)
 						0,
 						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 						NULL);
-	/* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
-	 * struct skb_shared_info is located at the end of skb->head,
-	 * and should not be copied to/from user.
-	 */
-	net_hotdata.skb_small_head_cache = kmem_cache_create_usercopy("skbuff_small_head",
-						SKB_SMALL_HEAD_CACHE_SIZE,
-						0,
-						SLAB_HWCACHE_ALIGN | SLAB_PANIC,
-						0,
-						SKB_SMALL_HEAD_HEADROOM,
-						NULL);
+
+	net_hotdata.skb_small_head_cache = kmem_cache_create(
+			"skbuff_small_head",
+			SKB_SMALL_HEAD_CACHE_SIZE,
+			&kmem_args_small_head,
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+
 	skb_extensions_init();
 }
 
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
  2026-02-28 14:12 [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head Eric Dumazet
@ 2026-02-28 19:51 ` Kuniyuki Iwashima
  2026-03-01  8:32 ` Jason Xing
  2026-03-01 11:24 ` Vlastimil Babka
  2 siblings, 0 replies; 5+ messages in thread
From: Kuniyuki Iwashima @ 2026-02-28 19:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Vlastimil Babka, David Rientjes,
	Roman Gushchin

On Sat, Feb 28, 2026 at 6:12 AM Eric Dumazet <edumazet@google.com> wrote:
>
> skbuff_small_head is used both on receive and send paths,
> serving potentially 80 million allocations and frees per second.
>
> Tuning it on large servers has been problematic, especially
> on AMD Turin platforms, where "lock cmpxchg16b" latency can
> be over 30,000 cycles.
>
> Switching to SLUB sheaves fixes the issue nicely.
>
> tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
> on AMD Turin.
>
> Other platforms show benefits with tcp_rr with more than 30,000
> flows.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Thanks !


* Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
  2026-02-28 14:12 [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head Eric Dumazet
  2026-02-28 19:51 ` Kuniyuki Iwashima
@ 2026-03-01  8:32 ` Jason Xing
  2026-03-01 11:24 ` Vlastimil Babka
  2 siblings, 0 replies; 5+ messages in thread
From: Jason Xing @ 2026-03-01  8:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, netdev, eric.dumazet, Vlastimil Babka,
	David Rientjes, Roman Gushchin

On Sat, Feb 28, 2026 at 10:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> skbuff_small_head is used both on receive and send paths,
> serving potentially 80 million allocations and frees per second.
>
> Tuning it on large servers has been problematic, especially
> on AMD Turin platforms, where "lock cmpxchg16b" latency can
> be over 30,000 cycles.
>
> Switching to SLUB sheaves fixes the issue nicely.
>
> tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
> on AMD Turin.
>
> Other platforms show benefits with tcp_rr with more than 30,000
> flows.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

> ---
>  net/core/skbuff.c | 31 ++++++++++++++++++++-----------
>  1 file changed, 20 insertions(+), 11 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 513cbfed19bc34bbb6767cdd7a50dad68be430fb..79eb7eb6eea9aa4a76c555e6ddd33bf0bc84c921 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5174,6 +5174,19 @@ static void skb_extensions_init(void) {}
>
>  void __init skb_init(void)
>  {
> +       struct kmem_cache_args kmem_args_small_head = {
> +               .align          = 0,
> +               .ctor           = NULL,
> +               /* usercopy should only access first SKB_SMALL_HEAD_HEADROOM
> +                * bytes.
> +                * struct skb_shared_info is located at the end of skb->head,
> +                * and should not be copied to/from user.
> +                */
> +               .useroffset     = 0,
> +               .usersize       = SKB_SMALL_HEAD_HEADROOM,
> +               .sheaf_capacity = 32,
> +       };
> +
>         net_hotdata.skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",

Looking at the comment on kmem_cache_create_usercopy() [1], I realize
that skbuff_head_cache could also be switched to kmem_cache_create()?

[1]
"This is a legacy wrapper, new code should use either
KMEM_CACHE_USERCOPY()...kmem_cache_create()..."

Thanks,
Jason


* Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
  2026-02-28 14:12 [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head Eric Dumazet
  2026-02-28 19:51 ` Kuniyuki Iwashima
  2026-03-01  8:32 ` Jason Xing
@ 2026-03-01 11:24 ` Vlastimil Babka
  2026-03-01 16:30   ` Eric Dumazet
  2 siblings, 1 reply; 5+ messages in thread
From: Vlastimil Babka @ 2026-03-01 11:24 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, netdev, eric.dumazet,
	Vlastimil Babka, David Rientjes, Roman Gushchin, Harry Yoo,
	Hao Li

On 2/28/26 15:12, Eric Dumazet wrote:
> skbuff_small_head is used both on receive and send paths,
> serving potentially 80 million allocations and frees per second.
> 
> Tuning it on large servers has been problematic, especially
> on AMD Turin platforms, where "lock cmpxchg16b" latency can
> be over 30,000 cycles.

Huh, really? That sounds insane. Any pointers about that?

> Switching to SLUB sheaves fixes the issue nicely.
> 
> tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
> on AMD Turin.
> 
> Other platforms show benefits with tcp_rr with more than 30,000
> flows.

That's nice, thanks!

However I must point out some caveats. I assume you did this on 6.19, where
sheaves are still opt-in. But also, when you opt in, the pre-existing
per-cpu caching layer of percpu slab and percpu partial slabs is also still
there, so effectively the amount of percpu cached slab objects increases,
which can be the main performance difference for some workloads, and not the
difference between the sheaves and percpu (partial) slabs implementations.

Note: hopefully for your workload it's really the implementation.
"(lock) cmpxchg16b" should be avoided until you start freeing NUMA-remote
(relative to the freeing cpu) objects in significant volumes.

In 7.0-rc1 sheaves are enabled for every cache automatically, and cpu
(partial) caches are gone completely. Their size is calculated to roughly
match the average amount of percpu caching the old scheme achieved (but that
effectively depended on the workload too, so can't be exactly translated),
and the result is visible in /sys/kernel/slab/$cache/sheaf_capacity.
The args.sheaf_capacity field can override that automatic sizing, if the
specified value is larger.
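
For example, the effective capacity can be inspected like this (the path
assumes the cache name used by this patch, and the file is only present on
kernels where sheaves are enabled for that cache):

```shell
# Read the auto-calculated (or overridden) sheaf capacity, if exposed.
f=/sys/kernel/slab/skbuff_small_head/sheaf_capacity
if [ -r "$f" ]; then
    cat "$f"
else
    echo "sheaf_capacity not exposed on this kernel"
fi
```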

So what I would suggest is checking the performance between 6.19 and 7.0-rc1
without this patch (hope there won't be any other factors in the upgrade
influencing this much), noting the auto-calculated capacity. If it still
looks good, you don't need to do anything; otherwise you can try making the
capacity larger and see what happens.

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  net/core/skbuff.c | 31 ++++++++++++++++++++-----------
>  1 file changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 513cbfed19bc34bbb6767cdd7a50dad68be430fb..79eb7eb6eea9aa4a76c555e6ddd33bf0bc84c921 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5174,6 +5174,19 @@ static void skb_extensions_init(void) {}
>  
>  void __init skb_init(void)
>  {
> +	struct kmem_cache_args kmem_args_small_head = {
> +		.align		= 0,
> +		.ctor		= NULL,
> +		/* usercopy should only access first SKB_SMALL_HEAD_HEADROOM
> +		 * bytes.
> +		 * struct skb_shared_info is located at the end of skb->head,
> +		 * and should not be copied to/from user.
> +		 */
> +		.useroffset	= 0,
> +		.usersize	= SKB_SMALL_HEAD_HEADROOM,
> +		.sheaf_capacity = 32,
> +	};
> +
>  	net_hotdata.skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> @@ -5189,17 +5202,13 @@ void __init skb_init(void)
>  						0,
>  						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>  						NULL);
> -	/* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
> -	 * struct skb_shared_info is located at the end of skb->head,
> -	 * and should not be copied to/from user.
> -	 */
> -	net_hotdata.skb_small_head_cache = kmem_cache_create_usercopy("skbuff_small_head",
> -						SKB_SMALL_HEAD_CACHE_SIZE,
> -						0,
> -						SLAB_HWCACHE_ALIGN | SLAB_PANIC,
> -						0,
> -						SKB_SMALL_HEAD_HEADROOM,
> -						NULL);
> +
> +	net_hotdata.skb_small_head_cache = kmem_cache_create(
> +			"skbuff_small_head",
> +			SKB_SMALL_HEAD_CACHE_SIZE,
> +			&kmem_args_small_head,
> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
> +
>  	skb_extensions_init();
>  }
>  



* Re: [PATCH net-next] net: adopt SLUB sheaves for skbuff_small_head
  2026-03-01 11:24 ` Vlastimil Babka
@ 2026-03-01 16:30   ` Eric Dumazet
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-03-01 16:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, netdev, eric.dumazet, Vlastimil Babka,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li

On Sun, Mar 1, 2026 at 12:24 PM Vlastimil Babka <vbabka@suse.com> wrote:
>
> On 2/28/26 15:12, Eric Dumazet wrote:
> > skbuff_small_head is used both on receive and send paths,
> > serving potentially 80 million allocations and frees per second.
> >
> > Tuning it on large servers has been problematic, especially
> > on AMD Turin platforms, where "lock cmpxchg16b" latency can
> > be over 30,000 cycles.
>
> Huh, really? That sounds insane. Any pointers about that?

Yes, obviously on semi-contended cache lines.

>
> > Switching to SLUB sheaves fixes the issue nicely.
> >
> > tcp_rr benchmark with 10,000 flows goes from 25 Mpps to 40 Mpps
> > on AMD Turin.
> >
> > Other platforms show benefits with tcp_rr with more than 30,000
> > flows.
>
> That's nice, thanks!
>
> However I must point out some caveats. I assume you did this on 6.19, where
> sheaves are still opt-in. But also, when you opt in, the pre-existing
> per-cpu caching layer of percpu slab and percpu partial slabs is also still
> there, so effectively the amount of percpu cached slab objects increases,
> which can be the main performance difference for some workloads, and not the
> difference between the sheaves and percpu (partial) slabs implementations.
>

Tests are on 6.18 LTS kernel, on which our latest production kernel is based.

> Note: hopefully for your workload it's really the implementation.
> "(lock) cmpxchg16b" should be avoided until you start freeing NUMA-remote
> (relative to the freeing cpu) objects in significant volumes.

Right, __slab_free() is absolutely not a 'slow path' when we have
~80,000 in-flight objects on a 512-cpu host.

>
> In 7.0-rc1 sheaves are enabled for every cache automatically, and cpu
> (partial) caches are gone completely. Their size is calculated to roughly
> match the average amount of percpu caching the old scheme achieved (but that
> effectively depended on the workload too, so can't be exactly translated),
> and the result is visible in /sys/kernel/slab/$cache/sheaf_capacity.
> The args.sheaf_capacity field can override that automatic sizing, if the
> specified value is larger.
>

Nice, I did not know that (I am not following lkml traffic).

> So what I would suggest is checking the performance between 6.19 and 7.0-rc1
> without this patch (hope there won't be any other factors in the upgrade
> influencing this much), noting the auto-calculated capacity. If it still
> looks good, you don't need to do anything; otherwise you can try making the
> capacity larger and see what happens.

I cannot test this yet using 7.0-rc1.

I guess we will carry this patch privately, and will come back in a few
months when I can get our infra ready.

Thanks.

