From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rick Jones
Subject: Re: Generalizing mmap'ed sockets
Date: Fri, 19 Nov 2010 14:47:01 -0800
Message-ID: <4CE6FE65.9040302@hp.com>
References: <4CE6ED09.70602@hp.com> <20101119.135213.15239226.davem@davemloft.net> <4CE6F2FD.8080301@hp.com> <20101119.140818.242132853.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: therbert@google.com, netdev@vger.kernel.org
To: David Miller
Return-path:
Received: from g4t0016.houston.hp.com ([15.201.24.19]:26210 "EHLO g4t0016.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755371Ab0KSWrD (ORCPT ); Fri, 19 Nov 2010 17:47:03 -0500
In-Reply-To: <20101119.140818.242132853.davem@davemloft.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

David Miller wrote:
> From: Rick Jones
> Date: Fri, 19 Nov 2010 13:58:21 -0800
>
>
>>David Miller wrote:
>>
>>>From: Rick Jones
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote? For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring
>>>entries are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA have completed if it were UDP)?
>
>
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer. Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.

I'm not worried about the ACK/kfree_skb() moving the pointer. I'm simply worried about what the application should infer from the pointer's movement.
That is, if the design is documented as "Movement of the consumer pointer implies that the corresponding data has been ACKed by the remote TCP", that is locking the design into a semantic I don't know that it will always want to maintain, because there may end up being some cases where the stack might indeed want to copy and so not maintain that "pointer update means the remote TCP has the data" semantic.

> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all. With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

In some ways, the stone ages were nicer :)

What if... :) the stack had a way to communicate to the application that it wanted to change the effective socket buffer size? If that is indeed sufficiently infrequent, perhaps a "signal the new size and the app does a fresh mmap()" mechanism would suffice. The app would, I presume, need to first wait for the existing ring to drain, which could cause some complications I suppose.

Is there a way to flip the sense and have the kernel allocate the ring(s) and communicate that to the application?

But doesn't the whole idea of having an explicitly mmap()ed area of memory fly in the face of autotuning to begin with? (Mind you, I've not always been a fan of autotuning, as some of my previous "Why is it growing the window so large?!?" messages will attest :) It is suggesting that the application has some "communications memory" (that it won't be itself copying to/from) and presumably knows or thinks it knows how much of that it needs. For all we know, Tom is thinking that this mmap()ed region of memory will be rather larger than the maximum autotuned socket buffer sizes in the first place.
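To make the two concerns above a little more concrete, here is a minimal user-space sketch in C. Everything in it is hypothetical: the `tx_ring` layout, the `want_size` field standing in for "the stack signals a new size", and all function names are assumptions for illustration, not anything taken from the proposal being discussed. It models only the bookkeeping. The application treats movement of the consumer index as "these slots are reusable" and nothing stronger, and a resize request is honoured only once the ring has drained, which is the point where a real application would do its fresh mmap().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space view of a TX ring. This is NOT the layout from
 * the proposal under discussion; indices are free-running and wrap. */
struct tx_ring {
    uint32_t size;      /* number of slots, must be a power of two */
    uint32_t prod;      /* advanced by the application when it queues data */
    uint32_t cons;      /* advanced by the kernel when it releases slots */
    uint32_t want_size; /* assumed signal: nonzero means "stack wants this size" */
};

static inline void tx_ring_init(struct tx_ring *r, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    r->size = size;
    r->prod = 0;
    r->cons = 0;
    r->want_size = 0;
}

/* Slots queued by the application and not yet released by the kernel. */
static inline uint32_t tx_ring_in_flight(const struct tx_ring *r)
{
    return r->prod - r->cons;
}

/* Slots the application may fill. Consumer movement means "reusable" only;
 * the application does not infer that the remote TCP has ACKed the data. */
static inline uint32_t tx_ring_free(const struct tx_ring *r)
{
    return r->size - tx_ring_in_flight(r);
}

/* Queue one slot; returns false when the ring is full. */
static inline bool tx_ring_queue(struct tx_ring *r)
{
    if (tx_ring_free(r) == 0)
        return false;
    r->prod++;
    return true;
}

/* Stand-in for the kernel releasing n slots, whether because of an ACK,
 * a TX completion, or because the stack chose to copy the data out. */
static inline void tx_ring_release(struct tx_ring *r, uint32_t n)
{
    assert(n <= tx_ring_in_flight(r));
    r->cons += n;
}

/* Apply a pending resize, but only once the ring has drained. A real
 * application would unmap the old ring and mmap() a new one here. */
static inline bool tx_ring_try_resize(struct tx_ring *r)
{
    if (r->want_size == 0 || tx_ring_in_flight(r) != 0)
        return false;
    tx_ring_init(r, r->want_size);
    return true;
}
```

Note that `tx_ring_release()` deliberately does not record why a slot was released. That is the weaker contract argued for above: the application learns that a slot is free, and the stack stays free to copy when it wants to. The drain-before-resize check is also where the "complications" would show up, since a sender that keeps queueing never reaches an empty ring.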
Going back to his initial email, I don't see anything that explicitly describes the relationship between the size of this mmap()'ed region and the socket buffer sizes - I was just ass-u-me-ing it would set them. Sure, it would have to be an effective upper bound for copy-less transmit and receive, but there is nothing that says the windows TCP is using have to be that large.

rick