From: Jason Wang <jasowang@redhat.com>
To: Eric Dumazet <edumazet@google.com>,
Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
netdev <netdev@vger.kernel.org>,
Alexander Duyck <alexanderduyck@fb.com>,
Paolo Abeni <pabeni@redhat.com>, Greg Thelen <gthelen@google.com>,
"David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
su-lifan@linux.alibaba.com, "dust.li" <dust.li@linux.alibaba.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs
Date: Fri, 2 Apr 2021 10:52:12 +0800 [thread overview]
Message-ID: <4a29cb99-749d-e697-34a6-db50361cedff@redhat.com> (raw)
In-Reply-To: <CANn89iJA=RMKFNk5LFUxuQPUsXBoL2UbcskkX94UeFZo-B9LEw@mail.gmail.com>
在 2021/4/1 下午5:58, Eric Dumazet 写道:
> On Thu, Apr 1, 2021 at 11:04 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>> On Thu, 1 Apr 2021 15:14:18 +0800, Jason Wang <jasowang@redhat.com> wrote:
>>> 在 2021/3/31 下午4:11, Michael S. Tsirkin 写道:
>>>> On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
>>>>> On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>>>>>> On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>>>
>>>>>>> Both virtio net and napi_get_frags() allocate skbs
>>>>>>> with a very small skb->head
>>>>>>>
>>>>>>> While using page fragments instead of a kmalloc backed skb->head might give
>>>>>>> a small performance improvement in some cases, there is a huge risk of
>>>>>>> under estimating memory usage.
>>>>>>>
>>>>>>> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
>>>>>>> per page (order-3 page in x86), or even 64 on PowerPC
>>>>>>>
>>>>>>> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
>>>>>>> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
>>>>>>>
>>>>>>> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
>>>>>>> would still be there on arches with PAGE_SIZE >= 32768
>>>>>>>
>>>>>>> This patch makes sure that small skb head are kmalloc backed, so that
>>>>>>> other objects in the slab page can be reused instead of being held as long
>>>>>>> as skbs are sitting in socket queues.
>>>>>>>
>>>>>>> Note that we might in the future use the sk_buff napi cache,
>>>>>>> instead of going through a more expensive __alloc_skb()
>>>>>>>
>>>>>>> Another idea would be to use separate page sizes depending
>>>>>>> on the allocated length (to never have more than 4 frags per page)
>>>>>>>
>>>>>>> I would like to thank Greg Thelen for his precious help on this matter,
>>>>>>> analysing crash dumps is always a time consuming task.
>>>>>> This patch causes a performance degradation of about 10% in the scenario of
>>>>>> virtio-net + GRO.
>>>>>>
>>>>>> For GRO, there is no way to merge skbs based on frags with this patch, only
>>>>>> frag_list can be used to link skbs. The problem that this cause are that compared
>>>>>> to the GRO package merged into the frags way, the current skb needs to call
>>>>>> kfree_skb_list to release each skb, resulting in performance degradation.
>>>>>>
>>>>>> virtio-net will store some data onto the linear space after receiving it. In
>>>>>> addition to the header, there are also some payloads, so "headlen <= offset"
>>>>>> fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
>>>>>>
>>>>> Thanks for the report.
>>>>>
>>>>> There is no way we can make things both fast for existing strategies
>>>>> used by _insert_your_driver
>>>>> and malicious usages of data that can sit for seconds/minutes in socket queues.
>>>>>
>>>>> I think that if you want to gain this 10% back, you have to change
>>>>> virtio_net to meet optimal behavior.
>>>>>
>>>>> Normal drivers make sure to not pull payload in skb->head, only headers.
>>>> Hmm we do have hdr_len field, but seem to ignore it on RX.
>>>> Jason do you see any issues with using it for the head len?
>>>
>>> This might work only if the device sets a correct hdr_len. I'm not sure
>>> all of the devices can do this properly. E.g for tap, we use
>>> skb_headlen() in virtio_net_hdr_from_skb() which depends highly on the
>>> behaviour of the underlayer layers (device driver or GRO). And we only
>>> set this hint for GSO packet but virtio-net may tries to do GRO for non
>>> GSO packets.
>>>
>>> Thanks
>> hi, Jason
>>
>> I personally prefer to use build_skb to create skb, so the problem here is
>> actually gone.
>>
>> The premise of this is that the buffer added by add_recvbuf_mergeable must
>> retain a skb_shared_info. Of course, then rx frags coalescing won't
>> work. But I consider that suppose the size of the mrg_avg_pkt_len 1500, so we
>> can still store 17 * 1500 = 24k packets in a skb. If the packet is really big,
>> the mrg_avg_pkt_len will also increase, and the buffer allocated later will
>> increase. When the mrg_avg_pkt_len is greater than PAGE_SIZE/2, rx frags
>> coalesce is no longer needed. Because we can't allocate two bufs with a value of
>> mrg_avg_pkt_len on the same page.
>>
> For the record I implemented build_skb() 10 years ago, so you can
> trust me when I
> am saying this will not help.
>
> Using build_skb() will waste additional skb_shared_info per MSS.
> That's an increase of 20% of memory, for nothing at all.
So I wonder something like the following like this help. We know the
frag size, that means, if we know there's sufficient tailroom we can use
build_skb() without reserving dedicated room for skb_shared_info.
Thanks
>
> Also there are cases when this won't be possible, say if you use an MTU of 4000
>
>
>
>
>
>> Thanks.
>>
>>>
>>>>
>>>>> Optimal GRO packets are when payload is in page fragments.
>>>>>
>>>>> (I am speaking not only for raw performance, but ability for systems
>>>>> to cope with network outages and sudden increase of memory usage in
>>>>> out of order queues)
>>>>>
>>>>> This has been quite clearly stated in my changelog.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>>>>>> {
>>>>>> struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
>>>>>> unsigned int offset = skb_gro_offset(skb);
>>>>>> unsigned int headlen = skb_headlen(skb);
>>>>>>
>>>>>> .......
>>>>>>
>>>>>> if (headlen <= offset) { // virtio-net will fail
>>>>>> ........ // merge by frags
>>>>>> goto done;
>>>>>> } else if (skb->head_frag) { // skb->head_frag is fail when use kmalloc() for skb->head allocation
>>>>>> ........ // merge by frags
>>>>>> goto done;
>>>>>> }
>>>>>>
>>>>>> merge:
>>>>>> ......
>>>>>>
>>>>>> if (NAPI_GRO_CB(p)->last == p)
>>>>>> skb_shinfo(p)->frag_list = skb;
>>>>>> else
>>>>>> NAPI_GRO_CB(p)->last->next = skb;
>>>>>>
>>>>>> ......
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>>
>>>>>> test cmd:
>>>>>> for i in $(seq 1 4)
>>>>>> do
>>>>>> redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h <ip> -p 6379 2>&1 | grep 'per second' &
>>>>>> done
>>>>>>
>>>>>> Reported-by: su-lifan@linux.alibaba.com
>>>>>>
>>>>>>> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
>>>>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>>>> Cc: Alexander Duyck <alexanderduyck@fb.com>
>>>>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Cc: Greg Thelen <gthelen@google.com>
>>>>>>> ---
>>>>>>> net/core/skbuff.c | 9 +++++++--
>>>>>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
>>>>>>> --- a/net/core/skbuff.c
>>>>>>> +++ b/net/core/skbuff.c
>>>>>>> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
>>>>>>> struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>>>> gfp_t gfp_mask)
>>>>>>> {
>>>>>>> - struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
>>>>>>> + struct napi_alloc_cache *nc;
>>>>>>> struct sk_buff *skb;
>>>>>>> void *data;
>>>>>>>
>>>>>>> len += NET_SKB_PAD + NET_IP_ALIGN;
>>>>>>>
>>>>>>> - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>>>>>>> + /* If requested length is either too small or too big,
>>>>>>> + * we use kmalloc() for skb->head allocation.
>>>>>>> + */
>>>>>>> + if (len <= SKB_WITH_OVERHEAD(1024) ||
>>>>>>> + len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>>>>>>> (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>>>>>>> skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>>>>>>> if (!skb)
>>>>>>> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>>>> goto skb_success;
>>>>>>> }
>>>>>>>
>>>>>>> + nc = this_cpu_ptr(&napi_alloc_cache);
>>>>>>> len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>>>>> len = SKB_DATA_ALIGN(len);
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.0.284.gd98b1dd5eaa7-goog
>>>>>>>
next prev parent reply other threads:[~2021-04-02 2:52 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-13 16:18 [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs Eric Dumazet
2021-01-13 18:00 ` Alexander Duyck
2021-01-13 19:19 ` Michael S. Tsirkin
2021-01-13 22:23 ` David Laight
2021-01-14 5:16 ` Eric Dumazet
2021-01-14 9:29 ` David Laight
2021-01-14 19:00 ` patchwork-bot+netdevbpf
[not found] ` <1617007696.5731978-1-xuanzhuo@linux.alibaba.com>
2021-03-29 9:06 ` Eric Dumazet
2021-03-31 8:11 ` Michael S. Tsirkin
2021-03-31 8:36 ` Eric Dumazet
2021-03-31 8:46 ` Eric Dumazet
2021-03-31 8:49 ` Eric Dumazet
2021-03-31 8:54 ` Eric Dumazet
[not found] ` <1617248264.4993114-2-xuanzhuo@linux.alibaba.com>
2021-04-01 5:06 ` Eric Dumazet
[not found] ` <1617357110.3822439-1-xuanzhuo@linux.alibaba.com>
2021-04-02 12:52 ` Eric Dumazet
2021-04-01 13:51 ` Michael S. Tsirkin
2021-04-01 14:08 ` Eric Dumazet
2021-04-01 7:14 ` Jason Wang
[not found] ` <1617267183.5697193-1-xuanzhuo@linux.alibaba.com>
2021-04-01 9:58 ` Eric Dumazet
2021-04-02 2:52 ` Jason Wang [this message]
[not found] ` <1617361253.1788838-2-xuanzhuo@linux.alibaba.com>
2021-04-02 12:53 ` Eric Dumazet
2021-04-06 2:04 ` Jason Wang
[not found] ` <1617190239.1035674-1-xuanzhuo@linux.alibaba.com>
2021-03-31 12:08 ` Eric Dumazet
2021-04-01 13:36 ` Michael S. Tsirkin
2022-09-07 20:19 ` Paolo Abeni
2022-09-07 20:40 ` Eric Dumazet
2022-09-08 10:48 ` Paolo Abeni
2022-09-08 12:20 ` Eric Dumazet
2022-09-08 14:26 ` Paolo Abeni
2022-09-08 16:00 ` Eric Dumazet
2022-09-07 21:36 ` Alexander H Duyck
2022-09-08 11:00 ` Paolo Abeni
2022-09-08 14:53 ` Alexander H Duyck
2022-09-08 18:01 ` Paolo Abeni
2022-09-08 19:26 ` Alexander Duyck
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4a29cb99-749d-e697-34a6-db50361cedff@redhat.com \
--to=jasowang@redhat.com \
--cc=alexanderduyck@fb.com \
--cc=davem@davemloft.net \
--cc=dust.li@linux.alibaba.com \
--cc=edumazet@google.com \
--cc=eric.dumazet@gmail.com \
--cc=gthelen@google.com \
--cc=kuba@kernel.org \
--cc=mst@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=su-lifan@linux.alibaba.com \
--cc=xuanzhuo@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.