Re: Generalizing mmap'ed sockets

From: Rick Jones <rick.jones2@hp.com>
To: David Miller <davem@davemloft.net>
Cc: therbert@google.com, netdev@vger.kernel.org
Subject: Re: Generalizing mmap'ed sockets
Date: Fri, 19 Nov 2010 14:47:01 -0800
Message-ID: <4CE6FE65.9040302@hp.com>
In-Reply-To: <20101119.140818.242132853.davem@davemloft.net>

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:58:21 -0800
> 
> 
>>David Miller wrote:
>>
>>>From: Rick Jones <rick.jones2@hp.com>
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote?  For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring
>>>entries are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA has completed if it were UDP)?
> 
> 
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer.  Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.

I'm not worried about the ACK/kfree_skb() moving the pointer.  I'm simply 
worried about what the application should infer from the pointer's movement. 
That is, if the design is documented as "Movement of the consumer pointer implies
that the corresponding data has been ACKed by the remote TCP", that locks
the design into a semantic I don't know that it will always want to maintain,
because there may end up being some cases where the stack might indeed want to
copy and so not maintain that "pointer update means the remote TCP has the data"
semantic.
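
A minimal sketch of the weaker reading, in which the application infers from
the consumer pointer only that ring entries are reusable and nothing about
what the remote TCP has received.  The struct layout and the names tx_ring,
tx_ring_in_flight() and tx_ring_reusable() are hypothetical and not taken from
Tom's proposal; a real shared ring would also need memory barriers around the
index reads, which are left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TX ring bookkeeping; not the layout from the proposal. */
struct tx_ring {
    uint32_t size;      /* number of entries, assumed to be a power of two */
    uint32_t producer;  /* free-running index, advanced by the application */
    uint32_t consumer;  /* free-running index, advanced by the kernel */
};

/* Entries handed to the kernel that it has not yet released. */
static uint32_t tx_ring_in_flight(const struct tx_ring *r)
{
    /* Unsigned subtraction stays correct when the indices wrap. */
    return r->producer - r->consumer;
}

/*
 * Entries the application may overwrite.  This says only "reusable";
 * it deliberately makes no claim that the data was ACKed by the peer,
 * since the stack might have copied the data and released the entry early.
 */
static uint32_t tx_ring_reusable(const struct tx_ring *r)
{
    assert(r->size != 0 && (r->size & (r->size - 1)) == 0);
    return r->size - tx_ring_in_flight(r);
}
```

With the contract limited to "reusable", the stack stays free to copy in some
cases without breaking any application that follows the documentation.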

> But you do bring up an interesting point about TX buffer space sizing.
> 
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
> 
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all.  With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

In some ways, the stone ages were nicer :)

What if... :)  the stack had a way to communicate to the application that it 
wanted to change the effective socket buffer size?  If that is indeed 
sufficiently infrequent, perhaps a "signal the new size and the app does a fresh 
mmap()" mechanism would suffice. The app would, I presume, need to first wait for 
the existing ring to drain, which could cause some complications I suppose.  Is 
there a way to flip the sense and have the kernel allocate the ring(s) and 
communicate that to the application?
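
The "signal the new size, drain, then remap" idea can be sketched as a small
decision function.  This is only an illustration under assumed names: the
enum ring_action values and ring_next_action() are hypothetical, and how the
stack would actually signal the wanted size is left open, as it is above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical steps for an application asked to change its ring size. */
enum ring_action {
    RING_KEEP_SENDING,   /* no resize pending */
    RING_WAIT_FOR_DRAIN, /* stop queuing; entries are still in flight */
    RING_REMAP           /* ring is empty, so a fresh mmap() is safe */
};

/*
 * producer and consumer are free-running ring indices; wanted_size is the
 * size the stack has (somehow) asked for, cur_size the size now mapped.
 */
static enum ring_action ring_next_action(uint32_t producer, uint32_t consumer,
                                         uint32_t cur_size, uint32_t wanted_size)
{
    assert(wanted_size != 0);
    if (wanted_size == cur_size)
        return RING_KEEP_SENDING;
    if (producer != consumer)
        return RING_WAIT_FOR_DRAIN;
    return RING_REMAP;
}
```

The RING_WAIT_FOR_DRAIN state is where the complications mentioned above would
show up, since the application cannot queue new data while it waits.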

But doesn't the whole idea of having an explicitly mmap()ed area of memory fly 
in the face of autotuning to begin with?  (Mind you, I've not always been a fan 
of autotuning, as some of my previous "Why is it growing the window so large?!?" 
questions will attest :)  It is suggesting that the application has some "communications 
memory" (that it won't be itself copying to/from) and presumably knows or thinks 
it knows how much of that it needs.  For all we know, Tom is thinking that this 
mmap()ed region of memory will be rather larger than the maximum autotuned 
socket buffer sizes in the first place.  Going back to his initial email, I don't 
see anything that explicitly describes the relationship between the size of this 
mmap()ed region and the socket buffer sizes - I was just ass-u-me-ing it would 
set them.  Sure, it would have to be an effective upper bound for copy-less 
transmit and receive, but there is nothing that says the windows TCP is using 
have to be that large.

rick

Thread overview:
2010-11-19 20:04 Generalizing mmap'ed sockets Tom Herbert
2010-11-19 21:32 ` Rick Jones
2010-11-19 21:52   ` David Miller
2010-11-19 21:55     ` Tom Herbert
2010-11-19 21:58     ` Rick Jones
2010-11-19 22:08       ` David Miller
2010-11-19 22:47         ` Rick Jones [this message]
2010-11-19 22:49         ` Tom Herbert
2010-11-24 19:57           ` Michael S. Tsirkin
2010-11-19 22:10 ` Andrew Grover