* Generalizing mmap'ed sockets
From: Tom Herbert @ 2010-11-19 20:04 UTC
To: Linux Netdev List
This is a project I'm contemplating. If you have any comments or can
point me to prior work in this area that would be appreciated.
It seems like it should be fairly straightforward to extend the mmap
packet ring mechanisms to be used for arbitrary sockets (like TCP,
UDP, etc.). The idea is that we create a ring buffer for a socket
which is mmap'ed to share between user and kernel. This can be done
for both transmit and receive side, and is basically modeled as a
consumer/producer queue. There are semantic differences between
stream and datagram sockets that need to be considered, but I don't
think anything here is untenable.
The expected benefits of this are:
TX:
- Zero copy transmit (which is already supported by vmsplice(), but
this might be simpler)
- One system call needed on transmit which can cover multiple
datagrams or what would have been multiple writes (the call is just to
kick kernel to start sending)
RX:
- Zero system calls needed to do receive (determining data ready is
accomplished by polling)
- Immediate data placement in kernel available all the time,
including OOO placement
- Potential for true zero copy on receive with device support (like
per flow queues, UDP queues)
The userland use of this for TCP might look something like:
struct mmap_sock_hdr {
        __u32 prod_ptr;
        __u32 consumer_ptr;
};

int s, i;
size_t size;
struct mmap_sock_hdr *tx, *rx;
unsigned char *tx_base, *rx_base;
struct s_mmap_req {
        size_t size;
} mmap_req;

s = socket(AF_INET, SOCK_STREAM, 0);

/* Set up ring buffer on socket and mmap into user space for TX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, TX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
tx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
tx_base = (unsigned char *)&tx[1]; /* data area follows the header */

/* Now do same thing for RX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, RX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
rx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
rx_base = (unsigned char *)&rx[1];

bind(s, ...) /* Normal bind */
connect(s, ...) /* Normal connect */

/* Transmit */
/* Application fills some of the available buffer (up to consumer pointer) */
for (i = 0; i < 10000; i++)
        tx_base[tx->prod_ptr + i] = i % 256;
/* Advance producer pointer */
tx->prod_ptr += 10000;
send(s, NULL, 0, 0); /* Tells stack to send new data indicated by prod
                        pointer, just a trigger */
/* Polling for POLLOUT should work as expected */

/*********** Receive */
while (1) {
        poll(fds);
        if (s has POLLIN set) {
                /* Process data from rx_base[rx->consumer_ptr] to
                   rx_base[rx->prod_ptr], modulo size of buffer of course */
                rx->consumer_ptr = rx->prod_ptr; /* Gives back buffer space
                                                    to the kernel */
        }
}
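The producer/consumer handling above, including the "modulo size of buffer" step, can be sketched entirely in user space. This is only an illustration under assumptions: the struct ring layout, the power-of-two size and the ring_* helper names are hypothetical and not part of the proposal, and a ring really shared with the kernel would also need memory barriers around the index updates.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ring: free-running indices plus a power-of-two data area. */
struct ring {
        uint32_t prod;          /* advanced by the producer */
        uint32_t cons;          /* advanced by the consumer */
        uint32_t size;          /* data area size in bytes, power of two */
        unsigned char *data;
};

static uint32_t ring_used(const struct ring *r)
{
        return r->prod - r->cons;       /* correct across 32-bit wraparound */
}

static uint32_t ring_free(const struct ring *r)
{
        return r->size - ring_used(r);
}

/* Queue up to len bytes, splitting the copy at the wrap point.
 * Returns the number of bytes actually queued. */
static uint32_t ring_put(struct ring *r, const void *src, uint32_t len)
{
        uint32_t off, first;

        assert(r->size != 0 && (r->size & (r->size - 1)) == 0);
        if (len > ring_free(r))
                len = ring_free(r);
        off = r->prod & (r->size - 1);
        first = r->size - off;
        if (first > len)
                first = len;
        memcpy(r->data + off, src, first);
        memcpy(r->data, (const unsigned char *)src + first, len - first);
        r->prod += len;
        return len;
}

/* Dequeue up to len bytes; advancing cons hands the space back. */
static uint32_t ring_get(struct ring *r, void *dst, uint32_t len)
{
        uint32_t off, first;

        if (len > ring_used(r))
                len = ring_used(r);
        off = r->cons & (r->size - 1);
        first = r->size - off;
        if (first > len)
                first = len;
        memcpy(dst, r->data + off, first);
        memcpy((unsigned char *)dst + first, r->data, len - first);
        r->cons += len;
        return len;
}
```

Keeping the indices free-running and masking only when touching the data area means "full" and "empty" can be told apart without sacrificing a byte of the buffer.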
* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 21:32 UTC
To: Tom Herbert; +Cc: Linux Netdev List
I suppose then one would be able to track the consumer pointer (on tx) to "know"
that certain data had been ACKed by the remote? For TCP anyway - and assuming
there wouldn't be a case where TCP might copy the data out of the ring and
assert "completion."
How would the mmap'ing interact with autotuning? Particularly in the "future
case" of HW support for true zero copy on receive.
Today I can be assured that data I receive is on a "nice" boundary in my memory
by virtue of the buffer pointer I pass to the recv() call, but with the "rings"
(they are simply some chunk of virtually contiguous data yes?) it will just be
bumped up against the end of the last one - I'm wondering if that might not be a
problem, especially for UDP datagrams, but even for TCP data? It is one thing
to have people pad structures in the memory of a system, but telling them to
pad-out the messages they send across the network to maintain alignment is
rather different.
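To make the alignment concern concrete, here is a small sketch of where successive messages would start in a ring's data area, either packed back to back or with each start rounded up to a boundary. The helper names align_up and place_messages are hypothetical; nothing in the proposal says which placement the kernel would use.

```c
#include <assert.h>
#include <stddef.h>

/* Round off up to the next multiple of align (a power of two). */
static size_t align_up(size_t off, size_t align)
{
        assert(align != 0 && (align & (align - 1)) == 0);
        return (off + align - 1) & ~(align - 1);
}

/* Record in start[] the offset at which each of n messages would begin.
 * align == 1 packs messages back to back. Returns the end offset of the
 * last message, i.e. the ring space consumed. */
static size_t place_messages(const size_t *len, size_t n, size_t align,
                             size_t *start)
{
        size_t off = 0, i;

        for (i = 0; i < n; i++) {
                off = align_up(off, align);
                start[i] = off;
                off += len[i];
        }
        return off;
}
```

With message lengths 5, 12 and 3, back-to-back packing starts them at offsets 0, 5 and 17, while 8-byte alignment starts them at 0, 8 and 24 at the cost of 7 bytes of padding.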
How do you differentiate between "there is data to send" and "I want to send a
zero-length datagram" for a UDP socket? I think you will have to do/overload
something other than a send() call for the trigger.
rick jones
* Re: Generalizing mmap'ed sockets
From: David Miller @ 2010-11-19 21:52 UTC
To: rick.jones2; +Cc: therbert, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:32:57 -0800
> I suppose then one would be able to track the consumer pointer (on tx)
> to "know" that certain data had been ACKed by the remote? For TCP
> anyway - and assuming there wouldn't be a case where TCP might copy
> the data out of the ring and assert "completion."
Yes, that's implicit in his design, the kernel manages the consumer
pointer in the ring and this is how userspace can see when ring entries
are reusable.
* Re: Generalizing mmap'ed sockets
From: Tom Herbert @ 2010-11-19 21:55 UTC
To: David Miller; +Cc: rick.jones2, netdev
On Fri, Nov 19, 2010 at 1:52 PM, David Miller <davem@davemloft.net> wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
>
>> I suppose then one would be able to track the consumer pointer (on tx)
>> to "know" that certain data had been ACKed by the remote? For TCP
>> anyway - and assuming there wouldn't be a case where TCP might copy
>> the data out of the ring and assert "completion."
>
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.
>
And, for stream sockets the ring would be one big contiguous buffer;
for datagram sockets it would be a packetized buffer, like with the
packet interface.
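A rough sketch of what a packetized ring could look like, loosely in the spirit of the status word used by the packet mmap interface. The slot layout, the SLOT_* states and the dgram_* helpers are all hypothetical. Because each slot carries its own status and length, a queued zero-length datagram is distinguishable from an empty slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_PAYLOAD 64         /* hypothetical per-slot capacity */
#define NSLOTS 4                /* hypothetical number of slots */

enum { SLOT_FREE = 0, SLOT_READY = 1 };

struct dgram_slot {
        uint32_t status;        /* SLOT_FREE or SLOT_READY */
        uint32_t len;           /* datagram length, may be 0 */
        unsigned char payload[SLOT_PAYLOAD];
};

struct dgram_ring {
        uint32_t head;          /* next slot the producer fills */
        uint32_t tail;          /* next slot the consumer drains */
        struct dgram_slot slot[NSLOTS];
};

/* Queue one datagram. Returns 0, or -1 if it is too large or the ring
 * is full. */
static int dgram_put(struct dgram_ring *r, const void *buf, uint32_t len)
{
        struct dgram_slot *s = &r->slot[r->head % NSLOTS];

        if (len > SLOT_PAYLOAD || s->status != SLOT_FREE)
                return -1;
        if (len)
                memcpy(s->payload, buf, len);
        s->len = len;
        s->status = SLOT_READY; /* hand the slot to the consumer */
        r->head++;
        return 0;
}

/* Dequeue one datagram into buf (at least SLOT_PAYLOAD bytes).
 * Returns its length, or -1 if nothing is queued. */
static int dgram_get(struct dgram_ring *r, void *buf)
{
        struct dgram_slot *s = &r->slot[r->tail % NSLOTS];
        int len;

        if (s->status != SLOT_READY)
                return -1;
        len = (int)s->len;
        if (len)
                memcpy(buf, s->payload, (size_t)len);
        s->status = SLOT_FREE;  /* give the slot back to the producer */
        r->tail++;
        return len;
}
```

Fixed-size slots waste space for small datagrams; that trade-off is an assumption of this sketch, not something settled in the thread.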
* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 21:58 UTC
To: David Miller; +Cc: therbert, netdev
David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:32:57 -0800
>
>>I suppose then one would be able to track the consumer pointer (on tx)
>>to "know" that certain data had been ACKed by the remote? For TCP
>>anyway - and assuming there wouldn't be a case where TCP might copy
>>the data out of the ring and assert "completion."
>
> Yes, that's implicit in his design, the kernel manages the consumer
> pointer in the ring and this is how userspace can see when ring entries
> are reusable.
But does one really want to lock-in that the update to the consumer pointer
means the data has been ACKed by the remote (or I suppose that DMA has
completed if it were UDP)? Can we think of no case where the stack will want to
copy out of the ring and assert completion to the user before it got ACKed by
the remote? Say when the stack wants to autotune the send socket buffer size to
something larger than the tx ring?
rick
* Re: Generalizing mmap'ed sockets
From: David Miller @ 2010-11-19 22:08 UTC
To: rick.jones2; +Cc: therbert, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 19 Nov 2010 13:58:21 -0800
> David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com>
>> Date: Fri, 19 Nov 2010 13:32:57 -0800
>>
>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>to "know" that certain data had been ACKed by the remote? For TCP
>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>the data out of the ring and assert "completion."
>> Yes, that's implicit in his design, the kernel manages the consumer
>> pointer in the ring and this is how userspace can see when ring
>> entries
>> are reusable.
>
> But does one really want to lock-in that the update to the consumer
> pointer means the data has been ACKed by the remote (or I suppose that
> DMA have completed if it were UDP)?
I think the ACK (or for UDP, the kfree_skb() after TX completes) should
move the consumer pointer. Otherwise you have to copy, and the ACKs
do not clock the sender process properly.
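As a sketch of that accounting: if only an ACK moves the consumer pointer, then transmitting data frees no ring space, and the application is paced by ACK arrival. The snd_nxt index and the tx_* helpers below are hypothetical additions for illustration; the proposal itself only exposes producer and consumer pointers.

```c
#include <assert.h>
#include <stdint.h>

struct tx_ring {
        uint32_t size;          /* bytes in the data area */
        uint32_t prod;          /* advanced by the application */
        uint32_t snd_nxt;       /* hypothetical: advanced as data is sent */
        uint32_t cons;          /* advanced only when data is ACKed */
};

static uint32_t tx_writable(const struct tx_ring *r)
{
        return r->size - (r->prod - r->cons);
}

static uint32_t tx_unsent(const struct tx_ring *r)
{
        return r->prod - r->snd_nxt;
}

static uint32_t tx_inflight(const struct tx_ring *r)
{
        return r->snd_nxt - r->cons;
}

/* Application side: claim up to len bytes of ring space. */
static uint32_t tx_produce(struct tx_ring *r, uint32_t len)
{
        if (len > tx_writable(r))
                len = tx_writable(r);
        r->prod += len;
        return len;
}

/* Stack side: put up to len queued bytes on the wire. */
static uint32_t tx_transmit(struct tx_ring *r, uint32_t len)
{
        if (len > tx_unsent(r))
                len = tx_unsent(r);
        r->snd_nxt += len;
        return len;
}

/* Stack side: the peer ACKed up to len in-flight bytes. */
static uint32_t tx_ack(struct tx_ring *r, uint32_t len)
{
        if (len > tx_inflight(r))
                len = tx_inflight(r);
        r->cons += len;
        return len;
}
```

In this model a full ring stays full after tx_transmit() and only regains space on tx_ack(), which is the clocking behaviour described above.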
But you do bring up an interesting point about TX buffer space sizing.
This whole scheme currently seems to completely ignore buffer size
auto-tuning done by TCP, and that won't fly I think. :-)
The whole point is to make it so that applications do not need to know
about that aspect of buffering at all. With the current mmap design
we're back to the stone ages where the app essentially has to pick an
explicit send buffer size.
* Re: Generalizing mmap'ed sockets
From: Andrew Grover @ 2010-11-19 22:10 UTC
To: Tom Herbert; +Cc: Linux Netdev List
On Fri, Nov 19, 2010 at 12:04 PM, Tom Herbert <therbert@google.com> wrote:
> This is a project I'm contemplating. If you have any comments or can
> point me to prior work in this area that would be appreciated.
> TX:
> - Zero copy transmit (which is already supported by vmsplice(), but
> this might be simpler)
> - One system call needed on transmit which can cover multiple
> datagrams or what would have been multiple writes (the call is just to
> kick kernel to start sending)
I'd look at our existing recvmmsg syscall -- there was talk of doing a
sendmmsg, which sounds close to what you want.
> RX:
> - Zero system calls needed to do receive (determining data ready is
> accomplished by polling)
> - Immediate data placement in kernel available all the time,
> including OOO placement
> - Potential for true zero copy on receive with device support (like
> per flow queues, UDP queues)
Mentioning zero-copy per-flow queues in userspace suggests Infiniband
is prior work in this area.
Regards -- Andy
* Re: Generalizing mmap'ed sockets
From: Rick Jones @ 2010-11-19 22:47 UTC
To: David Miller; +Cc: therbert, netdev
David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 19 Nov 2010 13:58:21 -0800
>
>
>>David Miller wrote:
>>
>>>From: Rick Jones <rick.jones2@hp.com>
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote? For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring
>>>entries
>>>are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA have completed if it were UDP)?
>
>
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer. Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.
I'm not worried about the ACK/kfree_skb() moving the pointer. I'm simply
worried about what the application should infer from the pointer's movement.
That is, if the design is documented "Movement of the consumer pointer implies
that the corresponding data has been ACKed by the remote TCP" that is locking
the design into a semantic I don't know that it will always want to maintain,
because there may end up being some cases where the stack might indeed want to
copy and so not maintain that "pointer update means the remote TCP has the data"
semantic.
> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all. With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.
In some ways, the stone ages were nicer :)
What if... :) the stack had a way to communicate to the application that it
wanted to change the effective socket buffer size? If that is indeed
sufficiently infrequent, perhaps a "signal the new size and the app does a fresh
mmap()" mechanism would suffice. The app would, I presume, need to first wait for
the existing ring to drain, which could cause some complications I suppose. Is
there a way to flip the sense and have the kernel allocate the ring(s) and
communicate that to the application?
But doesn't the whole idea of having an explicitly mmap()ed area of memory fly
in the face of autotuning to begin with? (Mind you, I've not always been a fan
of autotuning as some of my previous "Why is it growing the window so large?!?"
will attest :) It is suggesting that the application has some "communications
memory" (that it won't be itself copying to/from) and presumably knows or thinks
it knows how much of that it needs. For all we know, Tom is thinking that this
mmap()ed region of memory will be rather larger than the maximum autotuned
socket buffer sizes in the first place. Going back to his initial email I don't
see anything that explicitly describes the relationship between the size of this
mmap()'ed region and the socket buffer sizes - I was just ass-u-me-ing it would
set them. Sure, it would have to be an effective upper bound for copy-less
transmit and receive, but there is nothing that says the windows TCP is using
have to be that large.
rick
* Re: Generalizing mmap'ed sockets
From: Tom Herbert @ 2010-11-19 22:49 UTC
To: David Miller; +Cc: rick.jones2, netdev
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer. Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.
>
Right, with the caveat that even ACK'ed data might still go out on
the wire, as was discussed in the vmsplice() related patches. I
don't think this should make the problem any worse.
> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all. With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.
True, and I would never say that this is a suitable replacement for all
TCP transmit. However, there are specialized applications where this
could be applied. Note that the buffer is not just a kernel buffer,
it is also an application visible buffer with more purpose than just
buffering data for transmit. For instance, an application using RPC
could assemble its message directly into this buffer (which is cool).
The obvious alternative would be to malloc a buffer and then just use
vmsplice(), either way the application will be allocating a send
buffer for its work.
Tom
* Re: Generalizing mmap'ed sockets
From: Michael S. Tsirkin @ 2010-11-24 19:57 UTC
On Fri, Nov 19, 2010 at 02:49:46PM -0800, Tom Herbert wrote:
> > I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> > move the consumer pointer. Otherwise you have to copy, and the ACKs
> > do not clock the sender process properly.
> >
> Right, with the caveat that even ACK'ed data might still go out on
> the wire, as was discussed in the vmsplice() related patches. I
> don't think this should make the problem any worse.
Or any better. Sigh. Any idea how to actually track pages
in question so we can either really know when the stack is no longer
referencing them, or force a copy if they hang around after ack?
--
MST