optimizations to sk_buff handling in rds_tcp_data

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* optimizations to sk_buff handling in rds_tcp_data_ready
@ 2016-04-07 13:11 Sowmini Varadhan
  2016-04-07 14:16 ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Sowmini Varadhan @ 2016-04-07 13:11 UTC (permalink / raw)
  To: alexei.starovoitov, tom, netdev, sowmini.varadhan

I was going back to Alexei's comment here:
  http://permalink.gmane.org/gmane.linux.network/387806
In rds-stress profiling,  we are indeed seeing the pksb_pull 
(from rds_tcp_data_recv) show up in the perf profile.

At least for rds-tcp, we cannot re-use the skb even if
it is not shared, because what we need to do is to carve out
the interesting bits (start at offset, and trim it to to_copy)
and queue up those interesting bits on the PF_RDS socket,
while allowing tcp_data_read to go back and read the next 
tcp segment (which may be part of the next rds dgram).

But, when  pskb_expand_head is invoked in the call-stack
  pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
it will memcpy the offset bytes to the start of data. At least 
for the rds_tcp_data_recv, we are not interested in being able 
to do a *skb_push after the *skb_pull, so we dont really care 
about the intactness of these bytes in offset.
Thus what I am finding is that when delta > 0, if we skip the 
following in pskb_expand_head (for the rds-tcp recv case only!)

        /* Copy only real data... and, alas, header. This should be
         * optimized for the cases when header is void.
         */
        memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);

and also (only for this case!) this one in __pskb_pull_tail

 if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
                BUG();

I am able to get a 40% improvement in the measured IOPS (req/response 
transactions per second, using 8K byte requests, 256 byte responses,
16 concurrent threads), so this optimization seems worth doing.

Does my analysis above make sense? If yes, the question is, how to
achieve this bypass in a neat way.  Clearly there are many callers of
pskb_expand_head who will expect to find the skb_header_len bytes at
the start of data, but we also dont want to duplicate code in these
functions. One thought I had is to pass a flag arouund saying "caller
doesnt care about retaining offset bytes", and use this flag
- in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
  pass delta to pskb_expand_head,
- in pskb_expand_head, only do the memcpy listed above 
  if delta <= 0
Any other ideas for how to achieve this?

--Sowmini

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
  2016-04-07 13:11 optimizations to sk_buff handling in rds_tcp_data_ready Sowmini Varadhan
@ 2016-04-07 14:16 ` Eric Dumazet
  2016-04-07 19:29   ` Sowmini Varadhan
  2016-04-10  0:18   ` Sowmini Varadhan
  0 siblings, 2 replies; 5+ messages in thread
From: Eric Dumazet @ 2016-04-07 14:16 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: alexei.starovoitov, tom, netdev

On Thu, 2016-04-07 at 09:11 -0400, Sowmini Varadhan wrote:
> I was going back to Alexei's comment here:
>   http://permalink.gmane.org/gmane.linux.network/387806
> In rds-stress profiling,  we are indeed seeing the pksb_pull 
> (from rds_tcp_data_recv) show up in the perf profile.
> 
> At least for rds-tcp, we cannot re-use the skb even if
> it is not shared, because what we need to do is to carve out
> the interesting bits (start at offset, and trim it to to_copy)
> and queue up those interesting bits on the PF_RDS socket,
> while allowing tcp_data_read to go back and read the next 
> tcp segment (which may be part of the next rds dgram).
> 
> But, when  pskb_expand_head is invoked in the call-stack
>   pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
> it will memcpy the offset bytes to the start of data. At least 
> for the rds_tcp_data_recv, we are not interested in being able 
> to do a *skb_push after the *skb_pull, so we dont really care 
> about the intactness of these bytes in offset.
> Thus what I am finding is that when delta > 0, if we skip the 
> following in pskb_expand_head (for the rds-tcp recv case only!)
> 
>         /* Copy only real data... and, alas, header. This should be
>          * optimized for the cases when header is void.
>          */
>         memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
> 
> and also (only for this case!) this one in __pskb_pull_tail
> 
>  if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
>                 BUG();
> 
> I am able to get a 40% improvement in the measured IOPS (req/response 
> transactions per second, using 8K byte requests, 256 byte responses,
> 16 concurrent threads), so this optimization seems worth doing.
> 
> Does my analysis above make sense? If yes, the question is, how to
> achieve this bypass in a neat way.  Clearly there are many callers of
> pskb_expand_head who will expect to find the skb_header_len bytes at
> the start of data, but we also dont want to duplicate code in these
> functions. One thought I had is to pass a flag arouund saying "caller
> doesnt care about retaining offset bytes", and use this flag
> - in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
>   pass delta to pskb_expand_head,
> - in pskb_expand_head, only do the memcpy listed above 
>   if delta <= 0
> Any other ideas for how to achieve this?

Use skb split like TCP in output path ?

Really, pskb_expand_head() is not supposed to copy payload ;)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
  2016-04-07 14:16 ` Eric Dumazet
@ 2016-04-07 19:29   ` Sowmini Varadhan
  2016-04-10  0:18   ` Sowmini Varadhan
  1 sibling, 0 replies; 5+ messages in thread
From: Sowmini Varadhan @ 2016-04-07 19:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: alexei.starovoitov, tom, netdev

On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?

That almost looks like what I want, but skb_split modifies both
skb and skb1, and I want to leave skb untouched (otherwise 
I will mess up the book-keeping in tcp_read_sock). But skb_split 
is a good template- I think it could even be extended to avoid copying
the frags that we'd later trim off in rds_tcp_data_recv anyway, 
let me see what I can come up with, based on that code.

> Really, pskb_expand_head() is not supposed to copy payload ;)

That would make my world very easy! But it has callers from all
over the kernel, e.g., skb_realloc_headroom, and changing it is
obviously risky

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
  2016-04-07 14:16 ` Eric Dumazet
  2016-04-07 19:29   ` Sowmini Varadhan
@ 2016-04-10  0:18   ` Sowmini Varadhan
  2016-04-10 23:54     ` Eric Dumazet
  1 sibling, 1 reply; 5+ messages in thread
From: Sowmini Varadhan @ 2016-04-10  0:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?
> Really, pskb_expand_head() is not supposed to copy payload ;)

Question- how come skb_split doesnt have to deal with frag_list
and do a skb_walk_frags()? Couldn't the split-line be somewhere
in the frag_list? Also even for the skb_split_inside_header,
dont we have to set
  skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
and cut loose the skb_shinfo(skb)->frag_list?

As I try to mimic skb_split in some new set of "skb_carve"
funtions, I'm running into all  the various frag_list cases. I'm 
afraid I might end up needing most of the stuff under the "Pure 
masohism" (sic) comment in __pskb_pull_tail(). 

--Sowmini

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
  2016-04-10  0:18   ` Sowmini Varadhan
@ 2016-04-10 23:54     ` Eric Dumazet
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2016-04-10 23:54 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: netdev

On Sat, 2016-04-09 at 20:18 -0400, Sowmini Varadhan wrote:
> On (04/07/16 07:16), Eric Dumazet wrote:
> > Use skb split like TCP in output path ?
> > Really, pskb_expand_head() is not supposed to copy payload ;)
> 
> Question- how come skb_split doesnt have to deal with frag_list
> and do a skb_walk_frags()? Couldn't the split-line be somewhere
> in the frag_list? Also even for the skb_split_inside_header,
> dont we have to set
>   skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
> and cut loose the skb_shinfo(skb)->frag_list?
> 
> As I try to mimic skb_split in some new set of "skb_carve"
> funtions, I'm running into all  the various frag_list cases. I'm 
> afraid I might end up needing most of the stuff under the "Pure 
> masohism" (sic) comment in __pskb_pull_tail(). 
> 

The helper was written for TCP I guess. TCP in output path never uses
skb_shinfo(skb)->frag_list : It is guaranteed to be NULL for all skbs.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-04-10 23:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-07 13:11 optimizations to sk_buff handling in rds_tcp_data_ready Sowmini Varadhan
2016-04-07 14:16 ` Eric Dumazet
2016-04-07 19:29   ` Sowmini Varadhan
2016-04-10  0:18   ` Sowmini Varadhan
2016-04-10 23:54     ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).