From: Leon Romanovsky <leon@kernel.org>
To: Jakub Kicinski <kuba@kernel.org>
Cc: davem@davemloft.net, netdev@vger.kernel.org, willemb@google.com,
eric.dumazet@gmail.com, dsahern@gmail.com,
yoshfuji@linux-ipv6.org, brouer@redhat.com,
Dave Jones <dsj@fb.com>
Subject: Re: [PATCH net-next v5] net: ip: avoid OOM kills with large UDP sends over loopback
Date: Sun, 27 Jun 2021 11:11:04 +0300 [thread overview]
Message-ID: <YNgymLc+1iUatk9F@unreal> (raw)
In-Reply-To: <20210624175724.2389359-1-kuba@kernel.org>
On Thu, Jun 24, 2021 at 10:57:24AM -0700, Jakub Kicinski wrote:
> Dave observed number of machines hitting OOM on the UDP send
> path. The workload seems to be sending large UDP packets over
> loopback. Since loopback has MTU of 64k kernel will try to
> allocate an skb with up to 64k of head space. This has a good
> chance of failing under memory pressure. What's worse if
> the message length is <32k the allocation may trigger an
> OOM killer.
>
> This is entirely avoidable, we can use an skb with page frags.
>
> af_unix solves a similar problem by limiting the head
> length to SKB_MAX_ALLOC. This seems like a good and simple
> approach. It means that UDP messages > 16kB will now
> use fragments if underlying device supports SG, if extra
> allocator pressure causes regressions in real workloads
> we can switch to trying the large allocation first and
> falling back.
>
> v4: pre-calculate all the additions to alloclen so
> we can be sure it won't go over order-2
> v5: s/< MAX/<= MAX/ ; reorder variable declaration
Jakub,
Can you please put changelog under "---" below SOB?
It gives no value to see all vX in the final git log.
Thanks
>
> Reported-by: Dave Jones <dsj@fb.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> net/ipv4/ip_output.c | 32 ++++++++++++++++++--------------
> net/ipv6/ip6_output.c | 32 +++++++++++++++++---------------
> 2 files changed, 35 insertions(+), 29 deletions(-)
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c3efc7d658f6..f5c398036efa 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1050,11 +1050,11 @@ static int __ip_append_data(struct sock *sk,
> if (copy < length)
> copy = maxfraglen - skb->len;
> if (copy <= 0) {
> + unsigned int alloclen, alloc_extra;
> char *data;
> unsigned int datalen;
> unsigned int fraglen;
> unsigned int fraggap;
> - unsigned int alloclen;
> unsigned int pagedlen;
> struct sk_buff *skb_prev;
> alloc_new_skb:
> @@ -1074,35 +1074,39 @@ static int __ip_append_data(struct sock *sk,
> fraglen = datalen + fragheaderlen;
> pagedlen = 0;
>
> + alloc_extra = hh_len + 15;
> + alloc_extra += exthdrlen;
> +
> + /* The last fragment gets additional space at tail.
> + * Note, with MSG_MORE we overallocate on fragments,
> + * because we have no idea what fragment will be
> + * the last.
> + */
> + if (datalen == length + fraggap)
> + alloc_extra += rt->dst.trailer_len;
> +
> if ((flags & MSG_MORE) &&
> !(rt->dst.dev->features&NETIF_F_SG))
> alloclen = mtu;
> - else if (!paged)
> + else if (!paged &&
> + (fraglen + alloc_extra <= SKB_MAX_ALLOC ||
> + !(rt->dst.dev->features & NETIF_F_SG)))
> alloclen = fraglen;
> else {
> alloclen = min_t(int, fraglen, MAX_HEADER);
> pagedlen = fraglen - alloclen;
> }
>
> - alloclen += exthdrlen;
> -
> - /* The last fragment gets additional space at tail.
> - * Note, with MSG_MORE we overallocate on fragments,
> - * because we have no idea what fragment will be
> - * the last.
> - */
> - if (datalen == length + fraggap)
> - alloclen += rt->dst.trailer_len;
> + alloclen += alloc_extra;
>
> if (transhdrlen) {
> - skb = sock_alloc_send_skb(sk,
> - alloclen + hh_len + 15,
> + skb = sock_alloc_send_skb(sk, alloclen,
> (flags & MSG_DONTWAIT), &err);
> } else {
> skb = NULL;
> if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
> 2 * sk->sk_sndbuf)
> - skb = alloc_skb(alloclen + hh_len + 15,
> + skb = alloc_skb(alloclen,
> sk->sk_allocation);
> if (unlikely(!skb))
> err = -ENOBUFS;
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index ff4f9ebcf7f6..afe887dd5e35 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1551,11 +1551,11 @@ static int __ip6_append_data(struct sock *sk,
> copy = maxfraglen - skb->len;
>
> if (copy <= 0) {
> + unsigned int alloclen, alloc_extra;
> char *data;
> unsigned int datalen;
> unsigned int fraglen;
> unsigned int fraggap;
> - unsigned int alloclen;
> unsigned int pagedlen;
> alloc_new_skb:
> /* There's no room in the current skb */
> @@ -1582,17 +1582,28 @@ static int __ip6_append_data(struct sock *sk,
> fraglen = datalen + fragheaderlen;
> pagedlen = 0;
>
> + alloc_extra = hh_len;
> + alloc_extra += dst_exthdrlen;
> + alloc_extra += rt->dst.trailer_len;
> +
> + /* We just reserve space for fragment header.
> + * Note: this may be overallocation if the message
> + * (without MSG_MORE) fits into the MTU.
> + */
> + alloc_extra += sizeof(struct frag_hdr);
> +
> if ((flags & MSG_MORE) &&
> !(rt->dst.dev->features&NETIF_F_SG))
> alloclen = mtu;
> - else if (!paged)
> + else if (!paged &&
> + (fraglen + alloc_extra <= SKB_MAX_ALLOC ||
> + !(rt->dst.dev->features & NETIF_F_SG)))
> alloclen = fraglen;
> else {
> alloclen = min_t(int, fraglen, MAX_HEADER);
> pagedlen = fraglen - alloclen;
> }
> -
> - alloclen += dst_exthdrlen;
> + alloclen += alloc_extra;
>
> if (datalen != length + fraggap) {
> /*
> @@ -1602,30 +1613,21 @@ static int __ip6_append_data(struct sock *sk,
> datalen += rt->dst.trailer_len;
> }
>
> - alloclen += rt->dst.trailer_len;
> fraglen = datalen + fragheaderlen;
>
> - /*
> - * We just reserve space for fragment header.
> - * Note: this may be overallocation if the message
> - * (without MSG_MORE) fits into the MTU.
> - */
> - alloclen += sizeof(struct frag_hdr);
> -
> copy = datalen - transhdrlen - fraggap - pagedlen;
> if (copy < 0) {
> err = -EINVAL;
> goto error;
> }
> if (transhdrlen) {
> - skb = sock_alloc_send_skb(sk,
> - alloclen + hh_len,
> + skb = sock_alloc_send_skb(sk, alloclen,
> (flags & MSG_DONTWAIT), &err);
> } else {
> skb = NULL;
> if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
> 2 * sk->sk_sndbuf)
> - skb = alloc_skb(alloclen + hh_len,
> + skb = alloc_skb(alloclen,
> sk->sk_allocation);
> if (unlikely(!skb))
> err = -ENOBUFS;
> --
> 2.31.1
>
prev parent reply other threads:[~2021-06-27 8:11 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-24 17:57 [PATCH net-next v5] net: ip: avoid OOM kills with large UDP sends over loopback Jakub Kicinski
2021-06-27 8:11 ` Leon Romanovsky [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YNgymLc+1iUatk9F@unreal \
--to=leon@kernel.org \
--cc=brouer@redhat.com \
--cc=davem@davemloft.net \
--cc=dsahern@gmail.com \
--cc=dsj@fb.com \
--cc=eric.dumazet@gmail.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=willemb@google.com \
--cc=yoshfuji@linux-ipv6.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).