Re: [PATCH net-next] tcp: add tcp_add_backlog()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Yuchung Cheng <ycheng@google.com>
Subject: Re: [PATCH net-next] tcp: add tcp_add_backlog()
Date: Fri, 23 Sep 2016 12:12:06 -0300	[thread overview]
Message-ID: <20160923151205.GC17222@localhost.localdomain> (raw)
In-Reply-To: <1474641392.28155.27.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Sep 23, 2016 at 07:36:32AM -0700, Eric Dumazet wrote:
> On Fri, 2016-09-23 at 11:09 -0300, Marcelo Ricardo Leitner wrote:
> > On Fri, Sep 23, 2016 at 06:42:51AM -0700, Eric Dumazet wrote:
> > > On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
> > > 
> > > > Aye. In that case, what about using tail instead of end?
> > > 
> > > 
> > > What do you mean exactly ?
> > 
> > Something like:
> > -skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> > +skb->truesize = SKB_TRUESIZE(skb_tail_offset(skb));
> 
> Certainly not ;)
> 
> This would be lying.

Yep, but also so is adding txbuf to the equation to account for that,
right? :-) Unless you're considering that acks should/can be sort of
accounted by the txbuf instead, then it makes sense to sum the buffer
sizes in there.

> We really want a precise memory accounting to avoid OOM.

Indeed

> 
> Some USB drivers use 8KB for their skb->head, you do not want to pretend
> its 66+NET_SKB_PAF=F bytes just because there is no payload in the
> packet.

Oh.

> 
> > 
> > And define skb_tail_offset() to something similar skb_end_offset(), so
> > that it would account only for the data and not unused space in the
> > buffer.
> > 
> > > 
> > > >  Because
> > > > accounting for something that we have to tweak the limits to accept is
> > > > like adding a constant to both sides of the equation.
> > > > But perhaps that would cut out too much of the fat which could be used
> > > > later by the stack.
> > > 
> > > Are you facing a particular problem with current code ?
> > > 
> > 
> > For TCP, no, just wondering. :-)
> > 
> > I'm having similar issues with SCTP: if the socket gets backlogged, the
> > buffer accounting gets pretty messy. SCTP calculates the rwnd to be just
> > rcvbuf/2, but this ratio is often different in backlog and it causes the
> > advertized window to be too big, resulting in packet drops in the
> > backlog.
> > 
> > SCTP has some way to identify and compensate this "extra" rwnd, via
> > rwnd_press, and will shrink it if it detects that the window is bigger
> > than the buffer available. But as the socket is backlogged, it's doesn't
> > kick in soon enough to prevent such drops.
> > 
> > It's not just a matter of adjusting the overhead ratio (rcvbuf/2)
> > because with SCTP the packets may have different sizes, so a packet with
> > a chunk of 100 bytes will have a ratio and another with 1000 bytes will
> > have another, within the same association.
> > 
> > So I'm leaning towards on updating truesize before adding to the
> > backlog, but to account just for the actual packet, regardless of the
> > buffer that was used for it. It still has the overhead ratio issue with
> > different packet sizes, though, but smaller.
> > 
> > Note that SCTP doesn't have buffer auto-tuning yet.
> 
> Also for TCP, we might need to use sk->sk_wmem_queued instead of
> sk->sk_sndbuf

This is interesting. Then it would only stretch the backlog limit if
there is a heavy tx going on.

> 
> This is because SACK processing can suddenly split skbs in 1-MSS pieces.
> 
> Problem is that for very large BDP, we can end up with thousands of skb
> in backlog. So I am also considering to try to coalesce stupid ACK sent
> by non GRO receivers or simply the verbose SACK blocks...
> 
> eg if backlog is under pressure and its tail is : 
> 
> ACK 1   <sack 4000:5000>
> 
> and the incoming packet is :
> 
> ACK 1  <sack 4000:6000>
> 
> Then we could replace the tail by the incoming packet with minimal
> impact.
> 
> Since we might receive hundred of 'sequential' SACK blocks, this would
> help to reduce time taken by the application to process the (now
> smaller) backlog
> 

That would be cool.

Thanks,
Marcelo

     prev parent reply	other threads:[~2016-09-23 15:12 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-27 14:37 [PATCH net-next] tcp: add tcp_add_backlog() Eric Dumazet
2016-08-27 16:13 ` Yuchung Cheng
2016-08-27 16:25   ` Eric Dumazet
2016-08-29 16:53     ` Yuchung Cheng
2016-08-27 18:24 ` Neal Cardwell
2016-08-29  4:20 ` David Miller
2016-08-29 18:51 ` Marcelo Ricardo Leitner
2016-08-29 19:22   ` Eric Dumazet
2016-08-29 19:33     ` Marcelo Ricardo Leitner
2016-09-22 22:34 ` Marcelo Ricardo Leitner
2016-09-22 23:21   ` Eric Dumazet
2016-09-23 12:45     ` Marcelo Ricardo Leitner
2016-09-23 13:42       ` Eric Dumazet
2016-09-23 14:09         ` Marcelo Ricardo Leitner
2016-09-23 14:36           ` Eric Dumazet
2016-09-23 14:43             ` David Laight
2016-09-23 15:12             ` Marcelo Ricardo Leitner [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160923151205.GC17222@localhost.localdomain \
    --to=marcelo.leitner@gmail.com \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=ycheng@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.