* [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP
@ 2023-04-27 19:24 Eric Dumazet
2023-04-27 21:08 ` David Ahern
0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2023-04-27 19:24 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet, David Ahern, Xin Long,
Willem de Bruijn, Coco Li
David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
using hugepages, and skb length bigger than ~68 KB.
skb_copy_ubufs() assumed it could copy all payload using up to
MAX_SKB_FRAGS order-0 pages.
This assumption broke when BIG TCP was able to put up to 512 KB per skb.
We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
and limit gso_max_size to 180000.
A solution is to use higher order pages if needed.
Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536")
Reported-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Coco Li <lixiaoyan@google.com>
---
net/core/skbuff.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2112146092bfe24061b24b9e4684bcc031a045b9..891ecca40b29af1e15a89745f7fc630b19ea0202 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1758,7 +1758,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
{
int num_frags = skb_shinfo(skb)->nr_frags;
struct page *page, *head = NULL;
- int i, new_frags;
+ int i, order, psize, new_frags;
u32 d_off;
if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
@@ -1767,9 +1767,17 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
if (!num_frags)
goto release;
- new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* We might have to allocate high order pages, so compute what minimum
+ * page order is needed.
+ */
+ order = 0;
+ while ((PAGE_SIZE << order) * MAX_SKB_FRAGS < __skb_pagelen(skb))
+ order++;
+ psize = (PAGE_SIZE << order);
+
+ new_frags = (__skb_pagelen(skb) + psize - 1) >> (PAGE_SHIFT + order);
for (i = 0; i < new_frags; i++) {
- page = alloc_page(gfp_mask);
+ page = alloc_pages(gfp_mask, order);
if (!page) {
while (head) {
struct page *next = (struct page *)page_private(head);
@@ -1796,11 +1804,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
vaddr = kmap_atomic(p);
while (done < p_len) {
- if (d_off == PAGE_SIZE) {
+ if (d_off == psize) {
d_off = 0;
page = (struct page *)page_private(page);
}
- copy = min_t(u32, PAGE_SIZE - d_off, p_len - done);
+ copy = min_t(u32, psize - d_off, p_len - done);
memcpy(page_address(page) + d_off,
vaddr + p_off + done, copy);
done += copy;
@@ -1816,7 +1824,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
/* skb frags point to kernel buffers */
for (i = 0; i < new_frags - 1; i++) {
- __skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
+ __skb_fill_page_desc(skb, i, head, 0, psize);
head = (struct page *)page_private(head);
}
__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
--
2.40.1.495.gc816e09b53d-goog
* Re: [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP
2023-04-27 19:24 [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP Eric Dumazet
@ 2023-04-27 21:08 ` David Ahern
2023-04-27 21:21 ` Willem de Bruijn
0 siblings, 1 reply; 5+ messages in thread
From: David Ahern @ 2023-04-27 21:08 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Xin Long, Willem de Bruijn, Coco Li
On 4/27/23 1:24 PM, Eric Dumazet wrote:
> David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
> using hugepages, and skb length bigger than ~68 KB.
>
> skb_copy_ubufs() assumed it could copy all payload using up to
> MAX_SKB_FRAGS order-0 pages.
>
> This assumption broke when BIG TCP was able to put up to 512 KB per skb.
Just an FYI: the problem was triggered at 128 KB.
>
> We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
> and limit gso_max_size to 180000.
>
> A solution is to use higher order pages if needed.
>
> Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536")
> Reported-by: David Ahern <dsahern@kernel.org>
> Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Coco Li <lixiaoyan@google.com>
> ---
> net/core/skbuff.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
Reviewed-by: David Ahern <dsahern@kernel.org>
Tested-by: David Ahern <dsahern@kernel.org>
Thanks, Eric. With ConnectX-6 NICs connected back-to-back and S/W-only
settings (max GRO/GSO size and 9000 MTU), I was able to hit ~160G with both
IPv4 and IPv6. I added nr_frags to the net_dev_xmit tracepoint to verify
various permutations; no issues with the patch.
* Re: [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP
2023-04-27 21:08 ` David Ahern
@ 2023-04-27 21:21 ` Willem de Bruijn
[not found] ` <CADvbK_eX0t9KGuhdERweue3BkefVtzZx-ZQBVhfe8jb1aO6eZw@mail.gmail.com>
0 siblings, 1 reply; 5+ messages in thread
From: Willem de Bruijn @ 2023-04-27 21:21 UTC (permalink / raw)
To: David Ahern, Eric Dumazet, David S . Miller, Jakub Kicinski,
Paolo Abeni
Cc: netdev, eric.dumazet, Xin Long, Willem de Bruijn, Coco Li
David Ahern wrote:
> On 4/27/23 1:24 PM, Eric Dumazet wrote:
> > David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
> > using hugepages, and skb length bigger than ~68 KB.
> >
> > skb_copy_ubufs() assumed it could copy all payload using up to
> > MAX_SKB_FRAGS order-0 pages.
> >
> > This assumption broke when BIG TCP was able to put up to 512 KB per skb.
>
> Just an FYI - the problem was triggered at 128kB.
>
> >
> > We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
> > and limit gso_max_size to 180000.
> >
> > A solution is to use higher order pages if needed.
> >
> > Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536")
> > Reported-by: David Ahern <dsahern@kernel.org>
> > Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Xin Long <lucien.xin@gmail.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Coco Li <lixiaoyan@google.com>
> > ---
> > net/core/skbuff.c | 20 ++++++++++++++------
> > 1 file changed, 14 insertions(+), 6 deletions(-)
> >
>
>
> Reviewed-by: David Ahern <dsahern@kernel.org>
> Tested-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
* Re: [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP
[not found] ` <CADvbK_eX0t9KGuhdERweue3BkefVtzZx-ZQBVhfe8jb1aO6eZw@mail.gmail.com>
@ 2023-04-28 2:13 ` David Ahern
2023-04-28 4:19 ` Eric Dumazet
1 sibling, 0 replies; 5+ messages in thread
From: David Ahern @ 2023-04-28 2:13 UTC (permalink / raw)
To: Xin Long, Willem de Bruijn
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
netdev, eric.dumazet, Willem de Bruijn, Coco Li
On 4/27/23 6:40 PM, Xin Long wrote:
>
> I just ran David's test scripts in a metal machine:
>
> There seems to be a memleak with this patch, and performance is impaired too:
>
> # free -h
>                total        used        free      shared  buff/cache   available
> Mem:            31Gi       999Mi        30Gi        12Mi       303Mi        30Gi
I can confirm the memleak; thanks for noticing that. It is really easy
to reproduce on a real NIC by just running tcpdump with large TSO packets.
>
> I could also see some call trace:
>
> [ 271.416989] warn_alloc: 640 callbacks suppressed
> [ 271.417006] lt-iperf3: page allocation failure: order:1,
> mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-1
> [ 271.432684] CPU: 1 PID: 2218 Comm: lt-iperf3 Tainted: G S
> 6.3.0.net0 #10
> [ 271.440783] Hardware name: Supermicro
> X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.2 06/04/2015
> [ 271.449831] Call Trace:
> [ 271.452276] <IRQ>
> [ 271.454286] dump_stack_lvl+0x36/0x50
> [ 271.457953] warn_alloc+0x11b/0x190
> [ 271.461445] __alloc_pages_slowpath.constprop.119+0xcb9/0xd40
> [ 271.467192] __alloc_pages+0x32d/0x340
> [ 271.470944] skb_copy_ubufs+0x11b/0x630
I did not trigger this ... perhaps due to memory pressure from the leak?
Performance-wise, veth hits skb_copy_ubufs() when the packet crosses
the namespace, so ZC performance (with and without hugepages) lags
non-ZC for the same-host use case.
* Re: [PATCH net] tcp: fix skb_copy_ubufs() vs BIG TCP
[not found] ` <CADvbK_eX0t9KGuhdERweue3BkefVtzZx-ZQBVhfe8jb1aO6eZw@mail.gmail.com>
2023-04-28 2:13 ` David Ahern
@ 2023-04-28 4:19 ` Eric Dumazet
1 sibling, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2023-04-28 4:19 UTC (permalink / raw)
To: Xin Long
Cc: Willem de Bruijn, David Ahern, David S . Miller, Jakub Kicinski,
Paolo Abeni, netdev, eric.dumazet, Willem de Bruijn, Coco Li
On Fri, Apr 28, 2023 at 2:40 AM Xin Long <lucien.xin@gmail.com> wrote:
>
>
>
> On Thu, Apr 27, 2023 at 5:21 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>>
>> David Ahern wrote:
>> > On 4/27/23 1:24 PM, Eric Dumazet wrote:
>> > > David Ahern reported crashes in skb_copy_ubufs() caused by TCP tx zerocopy
>> > > using hugepages, and skb length bigger than ~68 KB.
>> > >
>> > > skb_copy_ubufs() assumed it could copy all payload using up to
>> > > MAX_SKB_FRAGS order-0 pages.
>> > >
>> > > This assumption broke when BIG TCP was able to put up to 512 KB per skb.
>> >
>> > Just an FYI - the problem was triggered at 128kB.
>> >
>> > >
>> > > We did not hit this bug at Google because we use CONFIG_MAX_SKB_FRAGS=45
>> > > and limit gso_max_size to 180000.
>> > >
>> > > A solution is to use higher order pages if needed.
>> > >
>> > > Fixes: 7c4e983c4f3c ("net: allow gso_max_size to exceed 65536")
>> > > Reported-by: David Ahern <dsahern@kernel.org>
>> > > Link: https://lore.kernel.org/netdev/c70000f6-baa4-4a05-46d0-4b3e0dc1ccc8@gmail.com/T/
>> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > > Cc: Xin Long <lucien.xin@gmail.com>
>> > > Cc: Willem de Bruijn <willemb@google.com>
>> > > Cc: Coco Li <lixiaoyan@google.com>
>> > > ---
>> > > net/core/skbuff.c | 20 ++++++++++++++------
>> > > 1 file changed, 14 insertions(+), 6 deletions(-)
>> > >
>> >
>> >
>> > Reviewed-by: David Ahern <dsahern@kernel.org>
>> > Tested-by: David Ahern <dsahern@kernel.org>
>>
>> Reviewed-by: Willem de Bruijn <willemb@google.com>
>
>
> I just ran David's test scripts in a metal machine:
>
> There seems to be a memleak with this patch, and performance is impaired too:
Oops, it seems I forgot __GFP_COMP.
I will test the following on top of v1:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 891ecca40b29af1e15a89745f7fc630b19ea0202..26a586007d8b1ae39ab7a09eecf8575e04dadfeb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1777,7 +1777,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
new_frags = (__skb_pagelen(skb) + psize - 1) >> (PAGE_SHIFT + order);
for (i = 0; i < new_frags; i++) {
- page = alloc_pages(gfp_mask, order);
+ page = alloc_pages(gfp_mask | __GFP_COMP, order);
if (!page) {
while (head) {
struct page *next = (struct page *)page_private(head);
>
> # free -h
> total used free shared buff/cache available
> Mem: 31Gi 999Mi 30Gi 12Mi 303Mi 30Gi
>
> 1 =>
> # ./src/iperf3 -c 172.16.253.2 --zc_api
> with zerocopy
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-10.00 sec 23.0 GBytes 19.8 Gbits/sec 18 sender
> [ 5] 0.00-10.00 sec 23.0 GBytes 19.8 Gbits/sec receiver
>
> # free -h
> total used free shared buff/cache available
> Mem: 31Gi 12Gi 18Gi 12Mi 277Mi 18Gi
>
> 2 =>
> # ./src/iperf3 -c 172.16.253.2 --zc_api
> with zerocopy
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-10.00 sec 8.50 GBytes 7.30 Gbits/sec 14986 sender
> [ 5] 0.00-10.00 sec 8.50 GBytes 7.30 Gbits/sec receiver
>
> # free -h
> total used free shared buff/cache available
> Mem: 31Gi 16Gi 15Gi 5.0Mi 216Mi 14Gi
>
> 3 =>
> # ./src/iperf3 -c 172.16.253.2 --zc_api
> with zerocopy
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-10.03 sec 8.19 GBytes 7.02 Gbits/sec 92410 sender
> [ 5] 0.00-10.03 sec 8.19 GBytes 7.02 Gbits/sec receiver
>
> # free -h
> total used free shared buff/cache available
> Mem: 31Gi 16Gi 15Gi 4.0Mi 94Mi 14Gi
>
>
> I could also see some call trace:
>
> [ 271.416989] warn_alloc: 640 callbacks suppressed
> [ 271.417006] lt-iperf3: page allocation failure: order:1, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-1
Seems orthogonal to the patch, right?
I have not changed the gfp flags, so a memory allocation error triggers
this trace no matter the order of the page allocations.
We might add __GFP_NOWARN to silence this.
> [ 271.432684] CPU: 1 PID: 2218 Comm: lt-iperf3 Tainted: G S 6.3.0.net0 #10
> [ 271.440783] Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.2 06/04/2015
> [ 271.449831] Call Trace:
> [ 271.452276] <IRQ>
> [ 271.454286] dump_stack_lvl+0x36/0x50
> [ 271.457953] warn_alloc+0x11b/0x190
> [ 271.461445] __alloc_pages_slowpath.constprop.119+0xcb9/0xd40
> [ 271.467192] __alloc_pages+0x32d/0x340
> [ 271.470944] skb_copy_ubufs+0x11b/0x630
> [ 271.474781] ? ip_protocol_deliver_rcu+0x40/0x2d0
> [ 271.479488] __netif_receive_skb_core+0xcb0/0x1060
> [ 271.484280] ? ip_local_deliver+0x6e/0x120
> [ 271.488371] ? ip_rcv_finish_core.isra.22+0x438/0x480
> [ 271.493424] ? ip_rcv+0x53/0x100
> [ 271.496657] __netif_receive_skb_one_core+0x3c/0xa0
> [ 271.501537] process_backlog+0xb7/0x160
> [ 271.505375] __napi_poll+0x2b/0x1b0
> [ 271.508860] net_rx_action+0x25a/0x340
> [ 271.512612] ? __note_gp_changes+0x15f/0x170
> [ 271.516884] __do_softirq+0xb8/0x2a7
> [ 271.520478] do_softirq+0x5b/0x70
> [ 271.523800] </IRQ>
> [ 271.525897] <TASK>
> [ 271.527995] __local_bh_enable_ip+0x5f/0x70
> [ 271.532174] __dev_queue_xmit+0x34c/0xcc0
> [ 271.536186] ? eth_header+0x2a/0xd0
> [ 271.539678] ip_finish_output2+0x183/0x530
> [ 271.543771] ip_output+0x75/0x110
> [ 271.547089] ? __pfx_ip_finish_output+0x10/0x10
> [ 271.551620] __ip_queue_xmit+0x175/0x410
> [ 271.555546] __tcp_transmit_skb+0xa91/0xbf0
> [ 271.559724] ? kmalloc_reserve+0x8e/0xf0
> [ 271.563652] tcp_write_xmit+0x229/0x12c0
> [ 271.567578] __tcp_push_pending_frames+0x36/0x100
> [ 271.572282] tcp_sendmsg_locked+0x48c/0xc80
> [ 271.576474] ? sock_has_perm+0x7c/0xa0
> [ 271.580223] tcp_sendmsg+0x2b/0x40
> [ 271.583629] sock_sendmsg+0x8f/0xa0
> [ 271.587120] ? sockfd_lookup_light+0x12/0x70
> [ 271.591392] __sys_sendto+0xfe/0x170
> [ 271.594964] ? _copy_to_user+0x26/0x40
> [ 271.598716] ? poll_select_finish+0x123/0x220
> [ 271.603075] ? kern_select+0xc4/0x110
> [ 271.606734] __x64_sys_sendto+0x28/0x30
> [ 271.610574] do_syscall_64+0x3e/0x90
> [ 271.614153] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ 271.619205] RIP: 0033:0x7ff965530871
> [ 271.622784] Code: 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 b5 4e 29 00 41 89 ca 8b 00 85 c0 75 1c 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 67 c3 66 0f 1f 44 00 00 41 56 41 89 ce 41 55
> [ 271.641521] RSP: 002b:00007fffcde17878 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> [ 271.649088] RAX: ffffffffffffffda RBX: 0000000000fde620 RCX: 00007ff965530871
> [ 271.656211] RDX: 0000000000020000 RSI: 00007ff965200000 RDI: 0000000000000005
> [ 271.663336] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
> [ 271.670474] R10: 0000000004000000 R11: 0000000000000246 R12: 0000000000000000
> [ 271.677602] R13: 0000000000000009 R14: 0000000000000000 R15: 0000000000fdd2a0
> [ 271.684727] </TASK>
> [ 271.686921] Mem-Info:
> [ 271.689211] active_anon:319 inactive_anon:23067 isolated_anon:0
> [ 271.689211] active_file:22074 inactive_file:21543 isolated_file:0
> [ 271.689211] unevictable:0 dirty:0 writeback:0
> [ 271.689211] slab_reclaimable:10462 slab_unreclaimable:40073
> [ 271.689211] mapped:14400 shmem:1348 pagetables:2320
> [ 271.689211] sec_pagetables:0 bounce:0
> [ 271.689211] kernel_misc_reclaimable:0
> [ 271.689211] free:3955680 free_pcp:3032 free_cma:0
>
>
> Just as a comparison:
> With no zc_api:
> # ./src/iperf3 -c 172.16.253.2
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-10.00 sec 50.0 GBytes 42.9 Gbits/sec 0 sender
>
> With my improper patch:
Your patch is not what we want, as it would involve very high-order
allocations, which are bound to fail.
Thanks.
> # ./src/iperf3 -c 172.16.253.2 --zc_api
> with zerocopy
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-10.00 sec 31.8 GBytes 27.3 Gbits/sec 40 sender
>
> After running it a couple of times, neither shows a memleak nor a call trace.
>
> Thanks.