From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: brouer@redhat.com, David Miller <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>, Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
Date: Thu, 8 Dec 2016 10:46:20 +0100
Message-ID: <20161208104620.5fc691b8@redhat.com>
In-Reply-To: <1481131173.4930.36.camel@edumazet-glaptop3.roam.corp.google.com>
On Wed, 07 Dec 2016 09:19:33 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Under UDP flood, many softirq producers try to add packets to
> UDP receive queue, and one user thread is burning one cpu trying
> to dequeue packets as fast as possible.
>
> Two parts of the per packet cost are :
> - copying payload from kernel space to user space,
> - freeing memory pieces associated with skb.
>
> If socket is under pressure, softirq handler(s) can try to pull in
> skb->head the payload of the packet if it fits.
>
> Meaning the softirq handler(s) can free/reuse the page fragment
> immediately, instead of letting udp_recvmsg() do this hundreds of usec
> later, possibly from another node.
>
>
> Additional gains :
> - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
> - We avoid cache line misses at copyout() time and consume_skb() time,
> and avoid one put_page() with potential alien freeing on NUMA hosts.
>
> This comes at the cost of a copy, bounded to available tail room, which
> is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
> than necessary)
>
> This patch gave me about 5 % increase in throughput in my tests.
Hmmm... I'm not thrilled to have such a heuristic that changes memory
behavior once half of the queue size (sk->sk_rcvbuf) is reached.

Most of the win comes from doing a local atomic page-refcnt decrement
as opposed to a remote-CPU refcnt decrement. As you noticed, the
benefit is quite high, saving 241 cycles (see [1]). And your patch is
"using" these cycles to copy the packet instead.
This might not be a win in the future. I'm working on a more generic
solution (page_pool) that (as one objective) targets this remote refcnt.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench03.c
Measured on: i7-4790K CPU @ 4.00GHz
Same CPU release cost : 251 cycles
Remote CPU release cost: 492 cycles
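For reference, the measurement boils down to something like the sketch
below. This is not the real page_bench03.c, just the idea it measures:
allocate a page, then time the final put_page() either on the allocating
CPU or on a remote CPU (here reached via an IPI). The real benchmark
loops and averages; this one-shot version and the module name are purely
illustrative.

	/* Sketch only: single-shot local vs remote put_page() timing */
	#include <linux/module.h>
	#include <linux/mm.h>
	#include <linux/smp.h>
	#include <asm/timex.h>

	struct put_req {
		struct page *page;
		cycles_t cycles;	/* filled in by the remote CPU */
	};

	static void remote_put(void *info)
	{
		struct put_req *req = info;
		cycles_t t0 = get_cycles();

		put_page(req->page);	/* last reference: page freed here */
		req->cycles = get_cycles() - t0;
	}

	static int __init page_put_bench_init(void)
	{
		struct put_req req;
		cycles_t t0, local;
		/* Assumes >= 2 online CPUs; real code must handle wrap-around */
		int remote_cpu = cpumask_next(raw_smp_processor_id(),
					      cpu_online_mask);

		/* Local release: alloc and free on the same CPU */
		req.page = alloc_page(GFP_KERNEL);
		if (!req.page)
			return -ENOMEM;
		t0 = get_cycles();
		put_page(req.page);
		local = get_cycles() - t0;

		/* Remote release: alloc here, final put_page() elsewhere */
		req.page = alloc_page(GFP_KERNEL);
		if (!req.page)
			return -ENOMEM;
		smp_call_function_single(remote_cpu, remote_put, &req, 1);

		pr_info("put_page(): local %llu cycles, remote %llu cycles\n",
			(unsigned long long)local,
			(unsigned long long)req.cycles);
		return 0;
	}

	static void __exit page_put_bench_exit(void)
	{
	}

	module_init(page_put_bench_init);
	module_exit(page_put_bench_exit);
	MODULE_LICENSE("GPL");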
> skb_condense() helper could probably be used in other contexts.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> ---
[...]
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b45cd1494243fc99686016949f4546dbba11f424..84151cf40aebb973bad5bee3ee4be0758084d83c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4931,3 +4931,31 @@ struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
> EXPORT_SYMBOL(pskb_extract);
> +
> +/**
> + * skb_condense - try to get rid of fragments/frag_list if possible
> + * @skb: buffer
> + *
> + * Can be used to save memory before skb is added to a busy queue.
> + * If packet has bytes in frags and enough tail room in skb->head,
> + * pull all of them, so that we can free the frags right now and adjust
> + * truesize.
> + * Notes:
> + * We do not reallocate skb->head thus can not fail.
> + * Caller must re-evaluate skb->truesize if needed.
> + */
> +void skb_condense(struct sk_buff *skb)
> +{
> + if (!skb->data_len ||
> + skb->data_len > skb->end - skb->tail ||
> + skb_cloned(skb))
> + return;
So this is only active depending on how the driver constructed the SKB,
but all callers end up doing a function call (not inlined).
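Something along these lines could keep the common fully-linear case
inline and only pay for the call when there are frag bytes to pull.
Purely a sketch, not part of the patch, and skb_try_condense() is a
name I just made up:

	/* Sketch: inline wrapper (e.g. in linux/skbuff.h) so callers with
	 * a purely linear skb never leave the fast path.  skb_condense()
	 * re-checks tailroom and clone status, so this only short-circuits
	 * the obvious no-op case.
	 */
	static inline void skb_try_condense(struct sk_buff *skb)
	{
		if (!skb->data_len)	/* no frag bytes: nothing to pull */
			return;
		skb_condense(skb);	/* out-of-line slow path from the patch */
	}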
> + /* Nice, we can free page frag(s) right now */
> + __pskb_pull_tail(skb, skb->data_len);
> +
> + /* Now adjust skb->truesize, since __pskb_pull_tail() does
> + * not do this.
> + */
> + skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> +}
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 16d88ba9ff1c402f77063cfb5eea2708d86da2fc..f5628ada47b53f0d92d08210e5d7e4132a107f73 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
[...]
> @@ -1208,6 +1208,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> if (rmem > sk->sk_rcvbuf)
> goto drop;
>
> + /* Under mem pressure, it might be helpful to help udp_recvmsg()
> + * having linear skbs :
> + * - Reduce memory overhead and thus increase receive queue capacity
> + * - Less cache line misses at copyout() time
> + * - Less work at consume_skb() (less alien page frag freeing)
> + */
> + if (rmem > (sk->sk_rcvbuf >> 1))
> + skb_condense(skb);
> + size = skb->truesize;
> +
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer