* [PATCH net-next 0/3] gro: some minor optimization
@ 2022-02-03 15:48 Paolo Abeni
2022-02-03 15:48 ` [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Paolo Abeni @ 2022-02-03 15:48 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet
This series collects a few small optimizations for the GRO engine.
I measured a ~10% performance improvement in micro-benchmarks
around dev_gro_receive(), but the deltas are within the noise range
in throughput tests.
Still, with big TCP coming, every cycle saved in the GRO engine
will count - I hope ;)

The only change from the RFC is in patch 2, as per Alexander's feedback.
Paolo Abeni (3):
net: gro: avoid re-computing truesize twice on recycle
net: gro: minor optimization for dev_gro_receive()
net: gro: register gso and gro offload on separate lists
include/linux/netdevice.h | 3 +-
include/net/gro.h | 52 +++++++++---------
net/core/gro.c | 109 +++++++++++++++++++++-----------------
3 files changed, 91 insertions(+), 73 deletions(-)
--
2.34.1
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle
  2022-02-03 15:48 [PATCH net-next 0/3] gro: some minor optimization Paolo Abeni
@ 2022-02-03 15:48 ` Paolo Abeni
  2022-02-03 16:16   ` Alexander Lobakin
  2022-02-03 15:48 ` [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
  2022-02-03 15:48 ` [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni
  2 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2022-02-03 15:48 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet

After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
carring sock reference") and commit af352460b465 ("net: fix GRO
skb truesize update") the truesize of freed skb is properly updated
by the GRO engine, we don't need anymore resetting it at recycle time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/core/gro.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/gro.c b/net/core/gro.c
index a11b286d1495..d43d42215bdb 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -634,7 +634,6 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 
 	skb->encapsulation = 0;
 	skb_shinfo(skb)->gso_type = 0;
-	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 	if (unlikely(skb->slow_gro)) {
 		skb_orphan(skb);
 		skb_ext_reset(skb);
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle
  2022-02-03 15:48 ` [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
@ 2022-02-03 16:16   ` Alexander Lobakin
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander Lobakin @ 2022-02-03 16:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Alexander Lobakin, netdev, David S. Miller, Jakub Kicinski, Eric Dumazet

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 3 Feb 2022 16:48:21 +0100

> After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
> carring sock reference") and commit af352460b465 ("net: fix GRO
> skb truesize update") the truesize of freed skb is properly updated
                                        ^^^^^
One nit here, I'd change this to "truesize of skb with stolen head" or
so. It took me a bit of time to get why we should update the truesize
of skb already freed (:
Right, napi_reuse_skb() makes use of stolen-data skbs.

> by the GRO engine, we don't need anymore resetting it at recycle time.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/core/gro.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/net/core/gro.c b/net/core/gro.c
> index a11b286d1495..d43d42215bdb 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -634,7 +634,6 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
>
>  	skb->encapsulation = 0;
>  	skb_shinfo(skb)->gso_type = 0;
> -	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  	if (unlikely(skb->slow_gro)) {
>  		skb_orphan(skb);
>  		skb_ext_reset(skb);
> --
> 2.34.1

Thanks,
Al

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() 2022-02-03 15:48 [PATCH net-next 0/3] gro: some minor optimization Paolo Abeni 2022-02-03 15:48 ` [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni @ 2022-02-03 15:48 ` Paolo Abeni 2022-02-03 16:09 ` Alexander Lobakin 2022-02-03 16:39 ` Alexander H Duyck 2022-02-03 15:48 ` [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni 2 siblings, 2 replies; 12+ messages in thread From: Paolo Abeni @ 2022-02-03 15:48 UTC (permalink / raw) To: netdev; +Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet While inspecting some perf report, I noticed that the compiler emits suboptimal code for the napi CB initialization, fetching and storing multiple times the memory for flags bitfield. This is with gcc 10.3.1, but I observed the same with older compiler versions. We can help the compiler to do a nicer work clearing several fields at once using an u32 alias. The generated code is quite smaller, with the same number of conditional. Before: objdump -t net/core/gro.o | grep " F .text" 0000000000000bb0 l F .text 0000000000000357 dev_gro_receive After: 0000000000000bb0 l F .text 000000000000033c dev_gro_receive RFC -> v1: - use __struct_group to delimt the zeroed area (Alexander) Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- include/net/gro.h | 52 +++++++++++++++++++++++++---------------------- net/core/gro.c | 18 +++++++--------- 2 files changed, 35 insertions(+), 35 deletions(-) diff --git a/include/net/gro.h b/include/net/gro.h index 8f75802d50fd..fa1bb0f0ad28 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -29,46 +29,50 @@ struct napi_gro_cb { /* Number of segments aggregated. 
*/ u16 count; - /* Start offset for remote checksum offload */ - u16 gro_remcsum_start; + /* Used in ipv6_gro_receive() and foo-over-udp */ + u16 proto; /* jiffies when first packet was created/queued */ unsigned long age; - /* Used in ipv6_gro_receive() and foo-over-udp */ - u16 proto; + /* portion of the cb set to zero at every gro iteration */ + __struct_group(/* no tag */, zeroed, /* no attrs */, + + /* Start offset for remote checksum offload */ + u16 gro_remcsum_start; - /* This is non-zero if the packet may be of the same flow. */ - u8 same_flow:1; + /* This is non-zero if the packet may be of the same flow. */ + u8 same_flow:1; - /* Used in tunnel GRO receive */ - u8 encap_mark:1; + /* Used in tunnel GRO receive */ + u8 encap_mark:1; - /* GRO checksum is valid */ - u8 csum_valid:1; + /* GRO checksum is valid */ + u8 csum_valid:1; - /* Number of checksums via CHECKSUM_UNNECESSARY */ - u8 csum_cnt:3; + /* Number of checksums via CHECKSUM_UNNECESSARY */ + u8 csum_cnt:3; - /* Free the skb? */ - u8 free:2; + /* Free the skb? */ + u8 free:2; #define NAPI_GRO_FREE 1 #define NAPI_GRO_FREE_STOLEN_HEAD 2 - /* Used in foo-over-udp, set in udp[46]_gro_receive */ - u8 is_ipv6:1; + /* Used in foo-over-udp, set in udp[46]_gro_receive */ + u8 is_ipv6:1; - /* Used in GRE, set in fou/gue_gro_receive */ - u8 is_fou:1; + /* Used in GRE, set in fou/gue_gro_receive */ + u8 is_fou:1; - /* Used to determine if flush_id can be ignored */ - u8 is_atomic:1; + /* Used to determine if flush_id can be ignored */ + u8 is_atomic:1; - /* Number of gro_receive callbacks this packet already went through */ - u8 recursion_counter:4; + /* Number of gro_receive callbacks this packet already went through */ + u8 recursion_counter:4; - /* GRO is done by frag_list pointer chaining. */ - u8 is_flist:1; + /* GRO is done by frag_list pointer chaining. 
*/ + u8 is_flist:1; + ); /* used to support CHECKSUM_COMPLETE for tunneling protocols */ __wsum csum; diff --git a/net/core/gro.c b/net/core/gro.c index d43d42215bdb..fc56be9408c7 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -435,6 +435,9 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head) napi_gro_complete(napi, oldest); } +#define zeroed_len (offsetof(struct napi_gro_cb, zeroed_end) - \ + offsetof(struct napi_gro_cb, zeroed_start)) + static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb) { u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1); @@ -459,29 +462,22 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff skb_set_network_header(skb, skb_gro_offset(skb)); skb_reset_mac_len(skb); - NAPI_GRO_CB(skb)->same_flow = 0; + BUILD_BUG_ON(sizeof_field(struct napi_gro_cb, zeroed) != sizeof(u32)); + BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed), + sizeof(u32))); /* Avoid slow unaligned acc */ + *(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0; NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb); - NAPI_GRO_CB(skb)->free = 0; - NAPI_GRO_CB(skb)->encap_mark = 0; - NAPI_GRO_CB(skb)->recursion_counter = 0; - NAPI_GRO_CB(skb)->is_fou = 0; NAPI_GRO_CB(skb)->is_atomic = 1; - NAPI_GRO_CB(skb)->gro_remcsum_start = 0; /* Setup for GRO checksum validation */ switch (skb->ip_summed) { case CHECKSUM_COMPLETE: NAPI_GRO_CB(skb)->csum = skb->csum; NAPI_GRO_CB(skb)->csum_valid = 1; - NAPI_GRO_CB(skb)->csum_cnt = 0; break; case CHECKSUM_UNNECESSARY: NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1; - NAPI_GRO_CB(skb)->csum_valid = 0; break; - default: - NAPI_GRO_CB(skb)->csum_cnt = 0; - NAPI_GRO_CB(skb)->csum_valid = 0; } pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive, -- 2.34.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() 2022-02-03 15:48 ` [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni @ 2022-02-03 16:09 ` Alexander Lobakin 2022-02-03 16:39 ` Alexander H Duyck 1 sibling, 0 replies; 12+ messages in thread From: Alexander Lobakin @ 2022-02-03 16:09 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet From: Paolo Abeni <pabeni@redhat.com> Date: Thu, 3 Feb 2022 16:48:22 +0100 > While inspecting some perf report, I noticed that the compiler > emits suboptimal code for the napi CB initialization, fetching > and storing multiple times the memory for flags bitfield. > This is with gcc 10.3.1, but I observed the same with older compiler > versions. > > We can help the compiler to do a nicer work clearing several > fields at once using an u32 alias. The generated code is quite > smaller, with the same number of conditional. > > Before: > objdump -t net/core/gro.o | grep " F .text" > 0000000000000bb0 l F .text 0000000000000357 dev_gro_receive > > After: > 0000000000000bb0 l F .text 000000000000033c dev_gro_receive > > RFC -> v1: > - use __struct_group to delimt the zeroed area (Alexander) "delimit"? > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > include/net/gro.h | 52 +++++++++++++++++++++++++---------------------- > net/core/gro.c | 18 +++++++--------- > 2 files changed, 35 insertions(+), 35 deletions(-) > > diff --git a/include/net/gro.h b/include/net/gro.h > index 8f75802d50fd..fa1bb0f0ad28 100644 > --- a/include/net/gro.h > +++ b/include/net/gro.h > @@ -29,46 +29,50 @@ struct napi_gro_cb { > /* Number of segments aggregated. 
*/ > u16 count; > > - /* Start offset for remote checksum offload */ > - u16 gro_remcsum_start; > + /* Used in ipv6_gro_receive() and foo-over-udp */ > + u16 proto; > > /* jiffies when first packet was created/queued */ > unsigned long age; > > - /* Used in ipv6_gro_receive() and foo-over-udp */ > - u16 proto; > + /* portion of the cb set to zero at every gro iteration */ > + __struct_group(/* no tag */, zeroed, /* no attrs */, Oh, only after sending this in my reply I noticed that there's a shorthand for this struct_group(name, ...) means exactly __struct_group(/* no tag */, name, /* no attrs */, ...) Sorry for that. I got confused by that Kees used __struct_group() directly in one place although there was a possibility to use short struct_group() instead. Since there should already be a v2 (see below), probably worth changing. > + > + /* Start offset for remote checksum offload */ > + u16 gro_remcsum_start; > > - /* This is non-zero if the packet may be of the same flow. */ > - u8 same_flow:1; > + /* This is non-zero if the packet may be of the same flow. */ > + u8 same_flow:1; > > - /* Used in tunnel GRO receive */ > - u8 encap_mark:1; > + /* Used in tunnel GRO receive */ > + u8 encap_mark:1; > > - /* GRO checksum is valid */ > - u8 csum_valid:1; > + /* GRO checksum is valid */ > + u8 csum_valid:1; > > - /* Number of checksums via CHECKSUM_UNNECESSARY */ > - u8 csum_cnt:3; > + /* Number of checksums via CHECKSUM_UNNECESSARY */ > + u8 csum_cnt:3; > > - /* Free the skb? */ > - u8 free:2; > + /* Free the skb? 
*/ > + u8 free:2; > #define NAPI_GRO_FREE 1 > #define NAPI_GRO_FREE_STOLEN_HEAD 2 > > - /* Used in foo-over-udp, set in udp[46]_gro_receive */ > - u8 is_ipv6:1; > + /* Used in foo-over-udp, set in udp[46]_gro_receive */ > + u8 is_ipv6:1; > > - /* Used in GRE, set in fou/gue_gro_receive */ > - u8 is_fou:1; > + /* Used in GRE, set in fou/gue_gro_receive */ > + u8 is_fou:1; > > - /* Used to determine if flush_id can be ignored */ > - u8 is_atomic:1; > + /* Used to determine if flush_id can be ignored */ > + u8 is_atomic:1; > > - /* Number of gro_receive callbacks this packet already went through */ > - u8 recursion_counter:4; > + /* Number of gro_receive callbacks this packet already went through */ > + u8 recursion_counter:4; > > - /* GRO is done by frag_list pointer chaining. */ > - u8 is_flist:1; > + /* GRO is done by frag_list pointer chaining. */ > + u8 is_flist:1; > + ); > > /* used to support CHECKSUM_COMPLETE for tunneling protocols */ > __wsum csum; > diff --git a/net/core/gro.c b/net/core/gro.c > index d43d42215bdb..fc56be9408c7 100644 > --- a/net/core/gro.c > +++ b/net/core/gro.c > @@ -435,6 +435,9 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head) > napi_gro_complete(napi, oldest); > } > > +#define zeroed_len (offsetof(struct napi_gro_cb, zeroed_end) - \ > + offsetof(struct napi_gro_cb, zeroed_start)) > + A leftover from the RFC I guess? It's not used anywhere in the code, so it doesn't break build. 
> static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb) > { > u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1); > @@ -459,29 +462,22 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff > > skb_set_network_header(skb, skb_gro_offset(skb)); > skb_reset_mac_len(skb); > - NAPI_GRO_CB(skb)->same_flow = 0; > + BUILD_BUG_ON(sizeof_field(struct napi_gro_cb, zeroed) != sizeof(u32)); > + BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed), > + sizeof(u32))); /* Avoid slow unaligned acc */ > + *(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0; Looks relatively elegant and self-explanatory to me, nice. > NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb); > - NAPI_GRO_CB(skb)->free = 0; > - NAPI_GRO_CB(skb)->encap_mark = 0; > - NAPI_GRO_CB(skb)->recursion_counter = 0; > - NAPI_GRO_CB(skb)->is_fou = 0; > NAPI_GRO_CB(skb)->is_atomic = 1; > - NAPI_GRO_CB(skb)->gro_remcsum_start = 0; > > /* Setup for GRO checksum validation */ > switch (skb->ip_summed) { > case CHECKSUM_COMPLETE: > NAPI_GRO_CB(skb)->csum = skb->csum; > NAPI_GRO_CB(skb)->csum_valid = 1; > - NAPI_GRO_CB(skb)->csum_cnt = 0; > break; > case CHECKSUM_UNNECESSARY: > NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1; > - NAPI_GRO_CB(skb)->csum_valid = 0; > break; > - default: > - NAPI_GRO_CB(skb)->csum_cnt = 0; > - NAPI_GRO_CB(skb)->csum_valid = 0; > } > > pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive, > -- > 2.34.1 Al ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() 2022-02-03 15:48 ` [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni 2022-02-03 16:09 ` Alexander Lobakin @ 2022-02-03 16:39 ` Alexander H Duyck 2022-02-03 17:44 ` Paolo Abeni 1 sibling, 1 reply; 12+ messages in thread From: Alexander H Duyck @ 2022-02-03 16:39 UTC (permalink / raw) To: Paolo Abeni, netdev Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet On Thu, 2022-02-03 at 16:48 +0100, Paolo Abeni wrote: > While inspecting some perf report, I noticed that the compiler > emits suboptimal code for the napi CB initialization, fetching > and storing multiple times the memory for flags bitfield. > This is with gcc 10.3.1, but I observed the same with older compiler > versions. > > We can help the compiler to do a nicer work clearing several > fields at once using an u32 alias. The generated code is quite > smaller, with the same number of conditional. > > Before: > objdump -t net/core/gro.o | grep " F .text" > 0000000000000bb0 l F .text 0000000000000357 dev_gro_receive > > After: > 0000000000000bb0 l F .text 000000000000033c dev_gro_receive > > RFC -> v1: > - use __struct_group to delimt the zeroed area (Alexander) > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > include/net/gro.h | 52 +++++++++++++++++++++++++-------------------- > -- > net/core/gro.c | 18 +++++++--------- > 2 files changed, 35 insertions(+), 35 deletions(-) > > diff --git a/include/net/gro.h b/include/net/gro.h > index 8f75802d50fd..fa1bb0f0ad28 100644 > --- a/include/net/gro.h > +++ b/include/net/gro.h > @@ -29,46 +29,50 @@ struct napi_gro_cb { > /* Number of segments aggregated. 
*/ > u16 count; > > - /* Start offset for remote checksum offload */ > - u16 gro_remcsum_start; > + /* Used in ipv6_gro_receive() and foo-over-udp */ > + u16 proto; > > /* jiffies when first packet was created/queued */ > unsigned long age; > > - /* Used in ipv6_gro_receive() and foo-over-udp */ > - u16 proto; > + /* portion of the cb set to zero at every gro iteration */ > + __struct_group(/* no tag */, zeroed, /* no attrs */, Any specific reason for using __struct_group here rather than the struct_group macro instead? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() 2022-02-03 16:39 ` Alexander H Duyck @ 2022-02-03 17:44 ` Paolo Abeni 0 siblings, 0 replies; 12+ messages in thread From: Paolo Abeni @ 2022-02-03 17:44 UTC (permalink / raw) To: Alexander H Duyck, netdev Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet On Thu, 2022-02-03 at 08:39 -0800, Alexander H Duyck wrote: > On Thu, 2022-02-03 at 16:48 +0100, Paolo Abeni wrote: > > While inspecting some perf report, I noticed that the compiler > > emits suboptimal code for the napi CB initialization, fetching > > and storing multiple times the memory for flags bitfield. > > This is with gcc 10.3.1, but I observed the same with older compiler > > versions. > > > > We can help the compiler to do a nicer work clearing several > > fields at once using an u32 alias. The generated code is quite > > smaller, with the same number of conditional. > > > > Before: > > objdump -t net/core/gro.o | grep " F .text" > > 0000000000000bb0 l F .text 0000000000000357 dev_gro_receive > > > > After: > > 0000000000000bb0 l F .text 000000000000033c dev_gro_receive > > > > RFC -> v1: > > - use __struct_group to delimt the zeroed area (Alexander) > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > > --- > > include/net/gro.h | 52 +++++++++++++++++++++++++-------------------- > > -- > > net/core/gro.c | 18 +++++++--------- > > 2 files changed, 35 insertions(+), 35 deletions(-) > > > > diff --git a/include/net/gro.h b/include/net/gro.h > > index 8f75802d50fd..fa1bb0f0ad28 100644 > > --- a/include/net/gro.h > > +++ b/include/net/gro.h > > @@ -29,46 +29,50 @@ struct napi_gro_cb { > > /* Number of segments aggregated. 
*/ > > u16 count; > > > > - /* Start offset for remote checksum offload */ > > - u16 gro_remcsum_start; > > + /* Used in ipv6_gro_receive() and foo-over-udp */ > > + u16 proto; > > > > /* jiffies when first packet was created/queued */ > > unsigned long age; > > > > - /* Used in ipv6_gro_receive() and foo-over-udp */ > > - u16 proto; > > + /* portion of the cb set to zero at every gro iteration */ > > + __struct_group(/* no tag */, zeroed, /* no attrs */, > > Any specific reason for using __struct_group here rather than the > struct_group macro instead? Just sheer ignorance on my side. I'll fix on the next iteration. Thanks! Paolo ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists 2022-02-03 15:48 [PATCH net-next 0/3] gro: some minor optimization Paolo Abeni 2022-02-03 15:48 ` [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni 2022-02-03 15:48 ` [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni @ 2022-02-03 15:48 ` Paolo Abeni 2022-02-03 16:11 ` Eric Dumazet 2 siblings, 1 reply; 12+ messages in thread From: Paolo Abeni @ 2022-02-03 15:48 UTC (permalink / raw) To: netdev; +Cc: David S. Miller, Jakub Kicinski, Alexander Lobakin, Eric Dumazet So that we know each element in gro_list has valid gro callbacks (and the same for gso). This allows dropping a bunch of conditional in fastpath. Before: objdump -t net/core/gro.o | grep " F .text" 0000000000000bb0 l F .text 000000000000033c dev_gro_receive After: 0000000000000bb0 l F .text 0000000000000325 dev_gro_receive Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- include/linux/netdevice.h | 3 +- net/core/gro.c | 90 +++++++++++++++++++++++---------------- 2 files changed, 56 insertions(+), 37 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3213c7227b59..406cb457d788 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2564,7 +2564,8 @@ struct packet_offload { __be16 type; /* This is really htons(ether_type). 
*/ u16 priority; struct offload_callbacks callbacks; - struct list_head list; + struct list_head gro_list; + struct list_head gso_list; }; /* often modified stats are per-CPU, other are shared (netdev->stats) */ diff --git a/net/core/gro.c b/net/core/gro.c index fc56be9408c7..bd619d494fdd 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -10,10 +10,21 @@ #define GRO_MAX_HEAD (MAX_HEADER + 128) static DEFINE_SPINLOCK(offload_lock); -static struct list_head offload_base __read_mostly = LIST_HEAD_INIT(offload_base); +static struct list_head gro_offload_base __read_mostly = LIST_HEAD_INIT(gro_offload_base); +static struct list_head gso_offload_base __read_mostly = LIST_HEAD_INIT(gso_offload_base); /* Maximum number of GRO_NORMAL skbs to batch up for list-RX */ int gro_normal_batch __read_mostly = 8; +#define offload_list_insert(head, poff, list) \ +({ \ + struct packet_offload *elem; \ + list_for_each_entry(elem, head, list) { \ + if ((poff)->priority < elem->priority) \ + break; \ + } \ + list_add_rcu(&(poff)->list, elem->list.prev); \ +}) + /** * dev_add_offload - register offload handlers * @po: protocol offload declaration @@ -28,18 +39,33 @@ int gro_normal_batch __read_mostly = 8; */ void dev_add_offload(struct packet_offload *po) { - struct packet_offload *elem; - spin_lock(&offload_lock); - list_for_each_entry(elem, &offload_base, list) { - if (po->priority < elem->priority) - break; - } - list_add_rcu(&po->list, elem->list.prev); + if (po->callbacks.gro_receive && po->callbacks.gro_complete) + offload_list_insert(&gro_offload_base, po, gro_list); + else if (po->callbacks.gro_complete) + pr_warn("missing gro_receive callback"); + else if (po->callbacks.gro_receive) + pr_warn("missing gro_complete callback"); + + if (po->callbacks.gso_segment) + offload_list_insert(&gso_offload_base, po, gso_list); spin_unlock(&offload_lock); } EXPORT_SYMBOL(dev_add_offload); +#define offload_list_remove(type, head, poff, list) \ +({ \ + struct packet_offload *elem; \ + 
list_for_each_entry(elem, head, list) { \ + if ((poff) == elem) { \ + list_del_rcu(&(poff)->list); \ + break; \ + } \ + } \ + if (elem != (poff)) \ + pr_warn("dev_remove_offload: %p not found in %s list\n", (poff), type); \ +}) + /** * __dev_remove_offload - remove offload handler * @po: packet offload declaration @@ -55,20 +81,12 @@ EXPORT_SYMBOL(dev_add_offload); */ static void __dev_remove_offload(struct packet_offload *po) { - struct list_head *head = &offload_base; - struct packet_offload *po1; - spin_lock(&offload_lock); + if (po->callbacks.gro_receive) + offload_list_remove("gro", &gro_offload_base, po, gso_list); - list_for_each_entry(po1, head, list) { - if (po == po1) { - list_del_rcu(&po->list); - goto out; - } - } - - pr_warn("dev_remove_offload: %p not found\n", po); -out: + if (po->callbacks.gso_segment) + offload_list_remove("gso", &gso_offload_base, po, gro_list); spin_unlock(&offload_lock); } @@ -111,8 +129,8 @@ struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb, __skb_pull(skb, vlan_depth); rcu_read_lock(); - list_for_each_entry_rcu(ptype, &offload_base, list) { - if (ptype->type == type && ptype->callbacks.gso_segment) { + list_for_each_entry_rcu(ptype, &gso_offload_base, gso_list) { + if (ptype->type == type) { segs = ptype->callbacks.gso_segment(skb, features); break; } @@ -250,7 +268,7 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb) { struct packet_offload *ptype; __be16 type = skb->protocol; - struct list_head *head = &offload_base; + struct list_head *head = &gro_offload_base; int err = -ENOENT; BUILD_BUG_ON(sizeof(struct napi_gro_cb) > sizeof(skb->cb)); @@ -261,8 +279,8 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb) } rcu_read_lock(); - list_for_each_entry_rcu(ptype, head, list) { - if (ptype->type != type || !ptype->callbacks.gro_complete) + list_for_each_entry_rcu(ptype, head, gro_list) { + if (ptype->type != type) continue; err = 
INDIRECT_CALL_INET(ptype->callbacks.gro_complete, @@ -273,7 +291,7 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb) rcu_read_unlock(); if (err) { - WARN_ON(&ptype->list == head); + WARN_ON(&ptype->gro_list == head); kfree_skb(skb); return; } @@ -442,7 +460,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff { u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1); struct gro_list *gro_list = &napi->gro_hash[bucket]; - struct list_head *head = &offload_base; + struct list_head *head = &gro_offload_base; struct packet_offload *ptype; __be16 type = skb->protocol; struct sk_buff *pp = NULL; @@ -456,8 +474,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff gro_list_prepare(&gro_list->list, skb); rcu_read_lock(); - list_for_each_entry_rcu(ptype, head, list) { - if (ptype->type != type || !ptype->callbacks.gro_receive) + list_for_each_entry_rcu(ptype, head, gro_list) { + if (ptype->type != type) continue; skb_set_network_header(skb, skb_gro_offset(skb)); @@ -487,7 +505,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff } rcu_read_unlock(); - if (&ptype->list == head) + if (&ptype->gro_list == head) goto normal; if (PTR_ERR(pp) == -EINPROGRESS) { @@ -543,11 +561,11 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff struct packet_offload *gro_find_receive_by_type(__be16 type) { - struct list_head *offload_head = &offload_base; + struct list_head *offload_head = &gro_offload_base; struct packet_offload *ptype; - list_for_each_entry_rcu(ptype, offload_head, list) { - if (ptype->type != type || !ptype->callbacks.gro_receive) + list_for_each_entry_rcu(ptype, offload_head, gro_list) { + if (ptype->type != type) continue; return ptype; } @@ -557,11 +575,11 @@ EXPORT_SYMBOL(gro_find_receive_by_type); struct packet_offload *gro_find_complete_by_type(__be16 type) { - struct list_head *offload_head = &offload_base; + 
struct list_head *offload_head = &gro_offload_base; struct packet_offload *ptype; - list_for_each_entry_rcu(ptype, offload_head, list) { - if (ptype->type != type || !ptype->callbacks.gro_complete) + list_for_each_entry_rcu(ptype, offload_head, gro_list) { + if (ptype->type != type) continue; return ptype; } -- 2.34.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists 2022-02-03 15:48 ` [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni @ 2022-02-03 16:11 ` Eric Dumazet 2022-02-03 16:26 ` Alexander Lobakin 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2022-02-03 16:11 UTC (permalink / raw) To: Paolo Abeni; +Cc: netdev, David S. Miller, Jakub Kicinski, Alexander Lobakin On Thu, Feb 3, 2022 at 7:48 AM Paolo Abeni <pabeni@redhat.com> wrote: > > So that we know each element in gro_list has valid gro callbacks > (and the same for gso). This allows dropping a bunch of conditional > in fastpath. > > Before: > objdump -t net/core/gro.o | grep " F .text" > 0000000000000bb0 l F .text 000000000000033c dev_gro_receive > > After: > 0000000000000bb0 l F .text 0000000000000325 dev_gro_receive > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > include/linux/netdevice.h | 3 +- > net/core/gro.c | 90 +++++++++++++++++++++++---------------- > 2 files changed, 56 insertions(+), 37 deletions(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 3213c7227b59..406cb457d788 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2564,7 +2564,8 @@ struct packet_offload { > __be16 type; /* This is really htons(ether_type). */ > u16 priority; > struct offload_callbacks callbacks; > - struct list_head list; > + struct list_head gro_list; > + struct list_head gso_list; > }; > On the other hand, this makes this object bigger, increasing the risk of spanning cache lines. It would be nice to group all struct packet_offload together in the same section to increase data locality. I played in the past with a similar idea, but splitting struct packet_offload in two structures, one for GRO, one for GSO. (Note that GSO is hardly ever use with modern NIC) But the gains were really marginal. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists
  2022-02-03 16:11 ` Eric Dumazet
@ 2022-02-03 16:26   ` Alexander Lobakin
  2022-02-03 16:41     ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Lobakin @ 2022-02-03 16:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Lobakin, Paolo Abeni, netdev, David S. Miller, Jakub Kicinski

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 3 Feb 2022 08:11:43 -0800

> On Thu, Feb 3, 2022 at 7:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > So that we know each element in gro_list has valid gro callbacks
> > (and the same for gso). This allows dropping a bunch of conditionals
> > in the fastpath.
> >
> > Before:
> > objdump -t net/core/gro.o | grep " F .text"
> > 0000000000000bb0 l     F .text  000000000000033c dev_gro_receive
> >
> > After:
> > 0000000000000bb0 l     F .text  0000000000000325 dev_gro_receive
> >
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> >  include/linux/netdevice.h |  3 +-
> >  net/core/gro.c            | 90 +++++++++++++++++++++++----------------
> >  2 files changed, 56 insertions(+), 37 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 3213c7227b59..406cb457d788 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2564,7 +2564,8 @@ struct packet_offload {
> >         __be16                   type;   /* This is really htons(ether_type). */
> >         u16                      priority;
> >         struct offload_callbacks callbacks;
> > -       struct list_head         list;
> > +       struct list_head         gro_list;
> > +       struct list_head         gso_list;
> >  };
> >
>
> On the other hand, this makes this object bigger, increasing the risk
> of spanning cache lines.

As you said, GSO callbacks are barely used with most modern NICs.
Plus, GRO and GSO callbacks are used in completely different
operations.
`gro_list` occupies the same place where the `list` previously was.
Does it make a lot of sense to care about `gso_list` being placed
in a cacheline separate from `gro_list`?

> It would be nice to group all struct packet_offload together in the
> same section to increase data locality.
>
> I played in the past with a similar idea, but splitting struct
> packet_offload in two structures, one for GRO, one for GSO.
> (Note that GSO is hardly ever used with modern NICs.)
>
> But the gains were really marginal.

Thanks,
Al

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists
  2022-02-03 16:26   ` Alexander Lobakin
@ 2022-02-03 16:41     ` Eric Dumazet
  2022-02-03 18:07       ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-02-03 16:41 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: Paolo Abeni, netdev, David S. Miller, Jakub Kicinski

On Thu, Feb 3, 2022 at 8:28 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Thu, 3 Feb 2022 08:11:43 -0800
>
> > On Thu, Feb 3, 2022 at 7:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > So that we know each element in gro_list has valid gro callbacks
> > > (and the same for gso). This allows dropping a bunch of conditionals
> > > in the fastpath.
> > >
> > > Before:
> > > objdump -t net/core/gro.o | grep " F .text"
> > > 0000000000000bb0 l     F .text  000000000000033c dev_gro_receive
> > >
> > > After:
> > > 0000000000000bb0 l     F .text  0000000000000325 dev_gro_receive
> > >
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/linux/netdevice.h |  3 +-
> > >  net/core/gro.c            | 90 +++++++++++++++++++++++----------------
> > >  2 files changed, 56 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 3213c7227b59..406cb457d788 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -2564,7 +2564,8 @@ struct packet_offload {
> > >         __be16                   type;   /* This is really htons(ether_type). */
> > >         u16                      priority;
> > >         struct offload_callbacks callbacks;
> > > -       struct list_head         list;
> > > +       struct list_head         gro_list;
> > > +       struct list_head         gso_list;
> > >  };
> > >
> >
> > On the other hand, this makes this object bigger, increasing the risk
> > of spanning cache lines.
>
> As you said, GSO callbacks are barely used with most modern NICs.
> Plus, GRO and GSO callbacks are used in completely different
> operations.
> `gro_list` occupies the same place where the `list` previously was.
> Does it make a lot of sense to care about `gso_list` being placed
> in a cacheline separate from `gro_list`?

Not sure what you are asking in this last sentence.

Putting together all struct packet_offload and struct net_offload
would reduce the number of cache lines, but making these objects
bigger conflicts with that.

Another approach to avoiding the conditional tests would be to provide
non-NULL values for all "struct offload_callbacks" members, just in
case we missed something.

I think the test over ptype->type == type should be enough, right?

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists
  2022-02-03 16:41     ` Eric Dumazet
@ 2022-02-03 18:07       ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2022-02-03 18:07 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Lobakin; +Cc: netdev, David S. Miller, Jakub Kicinski

On Thu, 2022-02-03 at 08:41 -0800, Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 8:28 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Thu, 3 Feb 2022 08:11:43 -0800
> >
> > > On Thu, Feb 3, 2022 at 7:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > So that we know each element in gro_list has valid gro callbacks
> > > > (and the same for gso). This allows dropping a bunch of conditionals
> > > > in the fastpath.
> > > >
> > > > Before:
> > > > objdump -t net/core/gro.o | grep " F .text"
> > > > 0000000000000bb0 l     F .text  000000000000033c dev_gro_receive
> > > >
> > > > After:
> > > > 0000000000000bb0 l     F .text  0000000000000325 dev_gro_receive
> > > >
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > ---
> > > >  include/linux/netdevice.h |  3 +-
> > > >  net/core/gro.c            | 90 +++++++++++++++++++++++----------------
> > > >  2 files changed, 56 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index 3213c7227b59..406cb457d788 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -2564,7 +2564,8 @@ struct packet_offload {
> > > >         __be16                   type;   /* This is really htons(ether_type). */
> > > >         u16                      priority;
> > > >         struct offload_callbacks callbacks;
> > > > -       struct list_head         list;
> > > > +       struct list_head         gro_list;
> > > > +       struct list_head         gso_list;
> > > >  };
> > > >
> > >
> > > On the other hand, this makes this object bigger, increasing the risk
> > > of spanning cache lines.
> >
> > As you said, GSO callbacks are barely used with most modern NICs.
> > Plus, GRO and GSO callbacks are used in completely different
> > operations.
> > `gro_list` occupies the same place where the `list` previously was.
> > Does it make a lot of sense to care about `gso_list` being placed
> > in a cacheline separate from `gro_list`?
>
> Not sure what you are asking in this last sentence.
>
> Putting together all struct packet_offload and struct net_offload
> would reduce the number of cache lines, but making these objects
> bigger conflicts with that.
>
> Another approach to avoiding the conditional tests would be to provide
> non-NULL values for all "struct offload_callbacks" members, just in
> case we missed something.
>
> I think the test over ptype->type == type should be enough, right?

I'll try this latter approach, it looks simpler. Also I'll keep this
patch separate, as the others look less controversial.

Thanks!

Paolo

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-02-03 18:07 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-03 15:48 [PATCH net-next 0/3] gro: some minor optimization Paolo Abeni
2022-02-03 15:48 ` [PATCH net-next 1/3] net: gro: avoid re-computing truesize twice on recycle Paolo Abeni
2022-02-03 16:16   ` Alexander Lobakin
2022-02-03 15:48 ` [PATCH net-next 2/3] net: gro: minor optimization for dev_gro_receive() Paolo Abeni
2022-02-03 16:09   ` Alexander Lobakin
2022-02-03 16:39     ` Alexander H Duyck
2022-02-03 17:44       ` Paolo Abeni
2022-02-03 15:48 ` [PATCH net-next 3/3] net: gro: register gso and gro offload on separate lists Paolo Abeni
2022-02-03 16:11   ` Eric Dumazet
2022-02-03 16:26     ` Alexander Lobakin
2022-02-03 16:41       ` Eric Dumazet
2022-02-03 18:07         ` Paolo Abeni
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).