Generalizing mmap'ed sockets


* Generalizing mmap'ed sockets
@ 2010-11-19 20:04 Tom Herbert
  2010-11-19 21:32 ` Rick Jones
  2010-11-19 22:10 ` Andrew Grover
  0 siblings, 2 replies; 10+ messages in thread
From: Tom Herbert @ 2010-11-19 20:04 UTC (permalink / raw)
  To: Linux Netdev List

This is a project I'm contemplating.  If you have any comments or can
point me to prior work in this area that would be appreciated.

It seems like it should be fairly straightforward to extend the mmap
packet ring mechanisms to be used for arbitrary sockets (like TCP,
UDP, etc.). The idea is that we create a ring buffer for a socket
which is mmap'ed to share between user and kernel.  This can be done
for both transmit and receive side, and is basically modeled as a
consumer/producer queue.  There are semantic differences between
stream and datagram sockets that need to be considered, but I don't
think anything here is untenable.
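
A minimal sketch of the consumer/producer model described above, for a byte-stream ring. The field names follow the header proposed later in this mail; the helper names (ring_used, ring_free, ring_produce, ring_consume), the free-running 32-bit indices, and the power-of-two size are assumptions made for illustration, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared header, mirroring the one sketched in this mail. */
struct ring_hdr {
    uint32_t prod_ptr;      /* advanced only by the producer */
    uint32_t consumer_ptr;  /* advanced only by the consumer */
};

/* Bytes currently queued; unsigned subtraction survives 32-bit wraparound. */
static uint32_t ring_used(const struct ring_hdr *h)
{
    return h->prod_ptr - h->consumer_ptr;
}

/* Bytes the producer may still write. */
static uint32_t ring_free(const struct ring_hdr *h, uint32_t size)
{
    return size - ring_used(h);
}

/* Copy up to len bytes into the ring and publish them; returns bytes queued. */
static uint32_t ring_produce(struct ring_hdr *h, unsigned char *base,
                             uint32_t size, const void *src, uint32_t len)
{
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    uint32_t n = len < ring_free(h, size) ? len : ring_free(h, size);
    uint32_t off = h->prod_ptr & (size - 1);
    uint32_t first = n < size - off ? n : size - off;

    memcpy(base + off, src, first);                       /* up to the end */
    memcpy(base, (const unsigned char *)src + first, n - first); /* wrapped */
    h->prod_ptr += n;   /* a real shared ring would need a release barrier */
    return n;
}

/* Release n bytes back to the producer (the role an ACK would play on TX). */
static void ring_consume(struct ring_hdr *h, uint32_t n)
{
    assert(n <= ring_used(h));
    h->consumer_ptr += n;
}
```

Each side writes only its own index, so in this sketch no lock is needed between producer and consumer; memory ordering between user space and the kernel is left out.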

The expected benefits of this are:

TX:
 - Zero copy transmit (which is already supported by vmsplice(), but
this might be simpler)
 - One system call needed on transmit which can cover multiple
datagrams or what would have been multiple writes (the call is just to
kick kernel to start sending)

RX:
 - Zero system calls needed to do receive (determining data ready is
accomplished by polling)
 - Immediate data placement in kernel available all the time,
including OOO placement
 - Potential for true zero copy on receive with device support (like
per flow queues, UDP queues)

The userland use of this for TCP might look something like:

struct mmap_sock_hdr {
   __u32 prod_ptr;
   __u32 consumer_ptr;
};

int s, i;
size_t size;
struct mmap_sock_hdr *tx, *rx;
unsigned char *tx_base, *rx_base;

struct s_mmap_req {
   size_t size;
} mmap_req;

s = socket(AF_INET, SOCK_STREAM, 0);

/* Set up ring buffer on socket and mmap into user space for TX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, TX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
tx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
tx_base = (void *)&tx[1];   /* data area starts right after the header */

/* Now do same thing for RX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, RX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
rx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
rx_base = (void *)&rx[1];   /* data area starts right after the header */

bind(s, ...) /* Normal bind */
connect(s, ...) /* Normal connect */

/* Transmit */

/* Application fills some of the available buffer (up to consumer pointer) */
for (i = 0; i < 10000; i++)
   tx_base[tx->prod_ptr + i] = i % 256;

/* Advance producer pointer */
tx->prod_ptr += 10000;

send(s, NULL, 0, 0); /* Tells stack to send new data indicated by prod
                        pointer, just a trigger */

/* Polling for POLLOUT should work as expected */

/*********** Receive */

while (1) {
   poll(fds);
   if (s has POLLIN set) {
       /* Process data from rx_base[rx->consumer_ptr] to
          rx_base[rx->prod_ptr], modulo size of buffer of course */
       rx->consumer_ptr = rx->prod_ptr;    /* Gives back buffer space
                                              to the kernel */
   }
}
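
The "modulo size of buffer" step in the receive loop can be sketched as follows. This assumes free-running indices and a power-of-two ring size, neither of which the mail specifies; struct span and ring_readable() are hypothetical names. The readable region is returned as at most two contiguous spans, so the application can process data in place without copying across the wrap point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of readable bytes inside the ring. */
struct span {
    const unsigned char *ptr;
    size_t len;
};

/*
 * Split the readable region [cons, prod) of a ring of `size` bytes
 * (a power of two) into at most two contiguous spans.
 * Returns the number of spans written to out[].
 */
static int ring_readable(const unsigned char *base, uint32_t size,
                         uint32_t cons, uint32_t prod, struct span out[2])
{
    assert(size != 0 && (size & (size - 1)) == 0);
    uint32_t avail = prod - cons;           /* wrap-safe unsigned difference */
    assert(avail <= size);
    uint32_t off = cons & (size - 1);
    uint32_t first = avail < size - off ? avail : size - off;
    int n = 0;

    if (first) {                            /* run up to the end of the buffer */
        out[n].ptr = base + off;
        out[n].len = first;
        n++;
    }
    if (avail > first) {                    /* remainder wrapped to the start */
        out[n].ptr = base;
        out[n].len = avail - first;
        n++;
    }
    return n;
}
```

After processing both spans, the application would store the observed producer value into the consumer pointer, as in the loop above.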


* Re: Generalizing mmap'ed sockets
  2010-11-19 20:04 Generalizing mmap'ed sockets Tom Herbert
@ 2010-11-19 21:32 ` Rick Jones
  2010-11-19 21:52   ` David Miller
  2010-11-19 22:10 ` Andrew Grover
  1 sibling, 1 reply; 10+ messages in thread
From: Rick Jones @ 2010-11-19 21:32 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List

I suppose then one would be able to track the consumer pointer (on tx) to "know" 
that certain data had been ACKed by the remote?  For TCP anyway - and assuming 
there wouldn't be a case where TCP might copy the data out of the ring and 
assert "completion."

How would the mmap'ing interact with autotuning?  Particularly in the "future 
case" of HW support for true zero copy on receive.

Today I can be assured that data I receive is on a "nice" boundary in my memory 
by virtue of the buffer pointer I pass to the recv() call, but with the "rings" 
(they are simply some chunk of virtually contiguous data yes?) it will just be 
bumped up against the end of the last one - I'm wondering if that might not be a 
problem, especially for UDP datagrams, but even for TCP data?  It is one thing 
to have people pad structures in the memory of a system, but telling them to 
pad-out the messages they send across the network to maintain alignment is 
rather different.
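
For the datagram case, one possible answer is to pad inside the ring rather than on the wire, much as the existing packet ring rounds frame offsets with TPACKET_ALIGN. The sketch below is only an illustration of that arithmetic; RING_ALIGN, ring_align_up() and next_slot() are hypothetical names, and it does nothing for a TCP byte stream, where message boundaries are not visible to the stack.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ALIGN 16u  /* hypothetical slot alignment, must be a power of two */

/* Round off up to the next multiple of RING_ALIGN. */
static uint32_t ring_align_up(uint32_t off)
{
    return (off + RING_ALIGN - 1) & ~(RING_ALIGN - 1);
}

/* Offset where the next datagram would start, after one of len bytes at off. */
static uint32_t next_slot(uint32_t off, uint32_t len)
{
    assert(off % RING_ALIGN == 0);   /* every datagram starts aligned */
    return ring_align_up(off + len);
}
```

The cost is up to RING_ALIGN - 1 bytes of unused ring space per datagram; nothing extra is sent over the network.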

How do you differentiate between "there is data to send" and "I want to send a 
zero-length datagram" for a UDP socket?  I think you will have to do/overload 
something other than a send() call for the trigger.

rick jones


* Re: Generalizing mmap'ed sockets
  2010-11-19 21:32 ` Rick Jones
@ 2010-11-19 21:52   ` David Miller
  2010-11-19 21:55     ` Tom Herbert
  2010-11-19 21:58     ` Rick Jones
  0 siblings, 2 replies; 10+ messages in thread
From: David Miller @ 2010-11-19 21:52 UTC (permalink / raw)
  To: rick.jones2; +Cc: therbert, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:32:57 -0800

> I suppose then one would be able to track the consumer pointer (on tx)
> to "know" that certain data had been ACKed by the remote?  For TCP
> anyway - and assuming there wouldn't be a case where TCP might copy
> the data out of the ring and assert "completion."

Yes, that's implicit in his design, the kernel manages the consumer
pointer in the ring and this is how userspace can see when ring entries
are reusable.


* Re: Generalizing mmap'ed sockets
  2010-11-19 21:52   ` David Miller
@ 2010-11-19 21:55     ` Tom Herbert
  2010-11-19 21:58     ` Rick Jones
  1 sibling, 0 replies; 10+ messages in thread
From: Tom Herbert @ 2010-11-19 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev

On Fri, Nov 19, 2010 at 1:52 PM, David Miller <davem@davemloft.net> wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
>
>> I suppose then one would be able to track the consumer pointer (on tx)
>> to "know" that certain data had been ACKed by the remote?  For TCP
>> anyway - and assuming there wouldn't be a case where TCP might copy
>> the data out of the ring and assert "completion."
>
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.
>

And for stream sockets the ring would be one big contiguous buffer;
for datagram sockets it would be a packetized buffer, as with the
packet interface.
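
A packetized ring in the style of the packet interface might look roughly like the sketch below: fixed-size slots, each with a status word saying who owns it (the packet ring uses TP_STATUS_KERNEL and TP_STATUS_USER for this) and an explicit length. All names and sizes here are hypothetical, and slot_deliver() merely stands in for the kernel side. An explicit length field also makes a zero-length datagram representable as an ordinary slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_KERNEL = 0, SLOT_USER = 1 };   /* hypothetical ownership states */
#define SLOT_PAYLOAD 2048u                 /* hypothetical per-slot capacity */

struct slot {
    uint32_t status;                 /* who owns the slot right now */
    uint32_t len;                    /* datagram length; 0 is valid */
    unsigned char data[SLOT_PAYLOAD];
};

/* Stand-in for the kernel side: fill a slot and hand it to user space. */
static int slot_deliver(struct slot *s, const void *pkt, uint32_t len)
{
    if (s->status != SLOT_KERNEL || len > SLOT_PAYLOAD)
        return -1;                   /* slot still in use, or packet too big */
    memcpy(s->data, pkt, len);
    s->len = len;
    s->status = SLOT_USER;           /* publish last */
    return 0;
}

/* User side: return the slot at index next if it holds a datagram, else NULL. */
static struct slot *slot_peek(struct slot *ring, uint32_t nslots, uint32_t next)
{
    struct slot *s = &ring[next % nslots];
    return s->status == SLOT_USER ? s : NULL;
}

/* User side: give a processed slot back to the kernel. */
static void slot_release(struct slot *s)
{
    assert(s->status == SLOT_USER);
    s->status = SLOT_KERNEL;
}
```

The application reads the datagram in place between slot_peek() and slot_release(), so no copy is needed on the receive path in this sketch.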


* Re: Generalizing mmap'ed sockets
  2010-11-19 21:52   ` David Miller
  2010-11-19 21:55     ` Tom Herbert
@ 2010-11-19 21:58     ` Rick Jones
  2010-11-19 22:08       ` David Miller
  1 sibling, 1 reply; 10+ messages in thread
From: Rick Jones @ 2010-11-19 21:58 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
> 
>>I suppose then one would be able to track the consumer pointer (on tx)
>>to "know" that certain data had been ACKed by the remote?  For TCP
>>anyway - and assuming there wouldn't be a case where TCP might copy
>>the data out of the ring and assert "completion."
> 
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.

But does one really want to lock-in that the update to the consumer pointer 
means the data has been ACKed by the remote (or I suppose that DMA have 
completed if it were UDP)?  We can think of no case where the stack will want to 
copy out of the ring and assert completion to the user before it got ACKed by 
the remote?  Say when the stack wants to autotune the send socket buffer size to 
something larger than the tx ring?

rick


* Re: Generalizing mmap'ed sockets
  2010-11-19 21:58     ` Rick Jones
@ 2010-11-19 22:08       ` David Miller
  2010-11-19 22:47         ` Rick Jones
  2010-11-19 22:49         ` Tom Herbert
  0 siblings, 2 replies; 10+ messages in thread
From: David Miller @ 2010-11-19 22:08 UTC (permalink / raw)
  To: rick.jones2; +Cc: therbert, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:58:21 -0800

> David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com>
>> Date: Fri, 19 Nov 2010 13:32:57 -0800
>> 
>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>to "know" that certain data had been ACKed by the remote?  For TCP
>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>the data out of the ring and assert "completion."
>> Yes, that's implicit in his design, the kernel manages the consumer
>> pointer in the ring and this is how userspace can see when ring
>> entries
>> are reusable.
> 
> But does one really want to lock-in that the update to the consumer
> pointer means the data has been ACKed by the remote (or I suppose that
> DMA have completed if it were UDP)?

I think the ACK (or for UDP, the kfree_skb() after TX completes) should
move the consumer pointer.  Otherwise you have to copy, and the ACKs
do not clock the sender process properly.

But you do bring up an interesting point about TX buffer space sizing.

This whole scheme currently seems to completely ignore buffer size
auto-tuning done by TCP, and that won't fly I think. :-)

The whole point is to make it so that applications do not need to know
about that aspect of buffering at all.  With the current mmap design
we're back to the stone ages where the app essentially has to pick an
explicit send buffer size.


* Re: Generalizing mmap'ed sockets
  2010-11-19 20:04 Generalizing mmap'ed sockets Tom Herbert
  2010-11-19 21:32 ` Rick Jones
@ 2010-11-19 22:10 ` Andrew Grover
  1 sibling, 0 replies; 10+ messages in thread
From: Andrew Grover @ 2010-11-19 22:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List

On Fri, Nov 19, 2010 at 12:04 PM, Tom Herbert <therbert@google.com> wrote:
> This is a project I'm contemplating.  If you have any comments or can
> point me to prior work in this area that would be appreciated.

> TX:
>  - Zero copy transmit (which is already supported by vmsplice(), but
> this might be simpler)
>  - One system call needed on transmit which can cover multiple
> datagrams or what would have been multiple writes (the call is just to
> kick kernel to start sending)

I'd look at our existing recvmmsg syscall -- there was talk of doing a
sendmmsg, which sounds close to what you want.

> RX:
>  - Zero system calls needed to do receive (determining data ready is
> accomplished by polling)
>  - Immediate data placement in kernel available all the time,
> including OOO placement
>  - Potential for true zero copy on receive with device support (like
> per flow queues, UDP queues)

Mentioning zero-copy per-flow queues in userspace suggests Infiniband
is prior work in this area.

Regards -- Andy


* Re: Generalizing mmap'ed sockets
  2010-11-19 22:08       ` David Miller
@ 2010-11-19 22:47         ` Rick Jones
  2010-11-19 22:49         ` Tom Herbert
  1 sibling, 0 replies; 10+ messages in thread
From: Rick Jones @ 2010-11-19 22:47 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:58:21 -0800
> 
> 
>>David Miller wrote:
>>
>>>From: Rick Jones <rick.jones2@hp.com>
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote?  For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring
>>>entries
>>>are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA have completed if it were UDP)?
> 
> 
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer.  Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.

I'm not worried about the ACK/kfree_skb() moving the pointer.  I'm simply 
worried about what the application should infer from the pointer's movement. 
That is, if the design is documented  "Movement of the consumer pointer implies 
that the corresponding data has been ACKed by the remote TCP" that is locking 
the design into a semantic I don't know that it will always want to maintain, 
because there may end-up being some cases where the stack might indeed want to 
copy and so not maintain that "pointer update means the remote TCP has the data" 
semantic.

> But you do bring up an interesting point about TX buffer space sizing.
> 
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
> 
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all.  With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

In some ways, the stone ages were nicer :)

What if... :)  the stack had a way to communicate to the application that it 
wanted to change the effective socket buffer size?  If that is indeed 
sufficiently infrequent, perhaps a "signal the new size and the app does a fresh 
mmap()" mechanism would suffice. The app would, I presume, need to first wait for 
the existing ring to drain, which could cause some complications I suppose.  Is 
there a way to flip the sense and have the kernel allocate the ring(s) and 
communicate that to the application?

But doesn't the whole idea of having an explicitly mmap()ed area of memory fly 
in the face of autotuning to begin with?  (Mind you, I've not always been a fan 
of autotuning as some of my previous "Why is it growing the window so large?!?" 
will attest :)  It is suggesting that the application has some "communications 
memory" (that it won't be itself copying to/from) and presumably knows or thinks 
it knows how much of that it needs.  For all we know, Tom is thinking that this 
mmap()ed region of memory will be rather larger than the maximum autotuned 
socket buffer sizes in the first place.  Going back to his initial email I don't 
see anything that explicitly describes the relationship between the size of this 
mmap()'ed region and the socket buffer sizes - I was just ass-u-me-ing it would 
set them.  Sure, it would have to be an effective upper bound for copy-less 
transmit and receive, but there is nothing that says the windows TCP is using 
have to be that large.

rick


* Re: Generalizing mmap'ed sockets
  2010-11-19 22:08       ` David Miller
  2010-11-19 22:47         ` Rick Jones
@ 2010-11-19 22:49         ` Tom Herbert
  2010-11-24 19:57           ` Michael S. Tsirkin
  1 sibling, 1 reply; 10+ messages in thread
From: Tom Herbert @ 2010-11-19 22:49 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev

> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer.  Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.
>
Right, with the caveat that even ACK'ed data might still go out on
the wire, as was discussed in the vmsplice() related patches.  I
don't think this should make the problem any worse.

> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all.  With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

True, and I would never say that this is suitable replacement for all
TCP transmit.  However, there are specialized applications where this
could be applied.  Note that the buffer is not just a kernel buffer,
it is also an application visible buffer with more purpose than just
buffering data for transmit.  For instance, an application using RPC
could assemble its message directly into this buffer (which is cool).
 The obvious alternative would be to malloc a buffer and then just use
vmsplice(), either way the application will be allocating a send
buffer for its work.
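
Assembling a message directly in the ring could follow a reserve/commit pattern, sketched below. Everything here is hypothetical: struct tx_ring, tx_reserve(), tx_commit() and the toy struct rpc_hdr are illustration only, and the sketch simply refuses a reservation that would straddle the end of the buffer instead of handling the wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tx_ring {
    unsigned char *base;   /* start of the data area */
    uint32_t size;         /* data area size in bytes, a power of two */
    uint32_t prod;         /* free-running producer index */
    uint32_t cons;         /* free-running consumer index */
};

/* Reserve len contiguous bytes at the producer position, or return NULL if
 * the ring is too full or the bytes would run past the end of the buffer. */
static void *tx_reserve(struct tx_ring *r, uint32_t len)
{
    uint32_t used = r->prod - r->cons;
    uint32_t off = r->prod % r->size;

    if (len > r->size - used || len > r->size - off)
        return NULL;
    return r->base + off;
}

/* Publish len bytes that were built in place after tx_reserve(). */
static void tx_commit(struct tx_ring *r, uint32_t len)
{
    assert(len <= r->size - (r->prod - r->cons));
    r->prod += len;
}

/* Toy RPC header, for illustration only. */
struct rpc_hdr {
    uint32_t xid;
    uint32_t body_len;
};

/* Build header and body straight into the ring; returns 0 on success. */
static int rpc_build(struct tx_ring *r, uint32_t xid, const char *body)
{
    uint32_t body_len = (uint32_t)strlen(body);
    uint32_t total = (uint32_t)sizeof(struct rpc_hdr) + body_len;
    unsigned char *p = tx_reserve(r, total);
    struct rpc_hdr h = { xid, body_len };

    if (!p)
        return -1;
    memcpy(p, &h, sizeof h);               /* header lands in the ring */
    memcpy(p + sizeof h, body, body_len);  /* body follows it directly */
    tx_commit(r, total);
    return 0;
}
```

In the proposed design, the application would then issue the trigger call once, possibly after committing several messages.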

Tom


* Re: Generalizing mmap'ed sockets
  2010-11-19 22:49         ` Tom Herbert
@ 2010-11-24 19:57           ` Michael S. Tsirkin
  0 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 19:57 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, rick.jones2, netdev

On Fri, Nov 19, 2010 at 02:49:46PM -0800, Tom Herbert wrote:
> > I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> > move the consumer pointer.  Otherwise you have to copy, and the ACKs
> > do not clock the sender process properly.
> >
> Right, with the caveat that even ACK'ed data might still go out on
> the wire, as was discussed in the vmsplice() related patches.  I
> don't think this should make the problem any worse.

Or any better. Sigh. Any idea how to actually track pages
in question so we can either really know when the stack is no longer
referencing them, or force a copy if they hang around after ack?

-- 
MST

