netdev.vger.kernel.org archive mirror
* [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
@ 2010-09-19 21:55 Tom Herbert
  2010-09-20  0:35 ` David Miller
  2010-09-20  3:22 ` Changli Gao
  0 siblings, 2 replies; 5+ messages in thread
From: Tom Herbert @ 2010-09-19 21:55 UTC (permalink / raw)
  To: netdev, davem; +Cc: sridharr

In this patch we propose to add a socket API to retrieve the
"transmit completion sequence number", essentially a byte counter
for the number of bytes that have been transmitted and will not be
retransmitted.  In the case of TCP, this should correspond to snd_una.

The purpose of this API is to provide information to userspace about
which buffers can be reclaimed when sending with vmsplice() on a
socket.

There are two methods for retrieving the completed sequence number:
through a simple getsockopt (implemented here for TCP), as well as
returning the value in the ancillary data of a recvmsg.
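
As an illustration of the recvmsg path, here is a minimal userspace
sketch of how a caller might pick the value out of the ancillary data.
The constants are copied from this patch and are not assumed to exist in
any installed headers; find_xmit_compl_seq() is a hypothetical helper
name, not part of the patch.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Values copied from this patch; not assumed to be in installed headers. */
#ifndef SO_XMIT_COMPL_SEQ
#define SO_XMIT_COMPL_SEQ	41
#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
#endif

/*
 * Hypothetical helper: scan the control messages of a msghdr filled in
 * by recvmsg() for SCM_XMIT_COMPL_SEQ.
 * Returns 1 and stores the 32-bit value in *seq if found, 0 otherwise.
 */
static int find_xmit_compl_seq(struct msghdr *msg, uint32_t *seq)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_XMIT_COMPL_SEQ &&
		    cmsg->cmsg_len >= CMSG_LEN(sizeof(*seq))) {
			/* CMSG_DATA may be unaligned for uint32_t; copy it out. */
			memcpy(seq, CMSG_DATA(cmsg), sizeof(*seq));
			return 1;
		}
	}
	return 0;
}
```

The caller would pass a msghdr whose msg_control buffer is at least
CMSG_SPACE(sizeof(uint32_t)) bytes, after enabling the option with
setsockopt as in the sock.c hunk of the patch.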

The expected flow would be something like:
   - Connection is created
   - Initial completion seq # is retrieved through the sockopt, and is
     stored in userspace "compl_seq" variable for the connection.
   - Whenever a send is done, compl_seq += # bytes sent.
   - When doing a vmsplice the completion sequence number is saved
     for each user space buffer, buffer_compl_seq = compl_seq.
   - When recvmsg returns with a completion sequence number in
     ancillary data, any buffers covered by that sequence number
     (where buffer_compl_seq < recvmsg_compl_seq) are reclaimed
     and can be written to again.
   - If no data is received on a connection (recvmsg does not
     return), a timeout can be used to call the getsockopt and
     reclaim buffers as a fallback.
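
The bookkeeping in the flow above can be sketched in userspace roughly
as follows. This is only a sketch under stated assumptions: the struct
and function names and the pool size are hypothetical, the per-buffer
value is recorded after the send is accounted for, and the comparison is
written to tolerate 32-bit wraparound, since snd_una is a 32-bit
sequence number.

```c
#include <stddef.h>
#include <stdint.h>

/* Wrap-safe "a is before b" for 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

#define MAX_BUFS 64	/* hypothetical pool size */

struct tx_buf {
	void *data;
	uint32_t compl_seq;	/* buffer_compl_seq from the flow above */
	int in_flight;		/* handed to vmsplice, not yet reclaimed */
};

struct tx_pool {
	struct tx_buf bufs[MAX_BUFS];
	uint32_t compl_seq;	/* userspace copy, seeded from the sockopt */
};

/* Record that 'len' bytes of bufs[idx] were just passed to vmsplice(). */
static void pool_mark_sent(struct tx_pool *p, size_t idx, uint32_t len)
{
	p->compl_seq += len;
	p->bufs[idx].compl_seq = p->compl_seq;
	p->bufs[idx].in_flight = 1;
}

/*
 * Reclaim every in-flight buffer with buffer_compl_seq < reported value,
 * as in the flow above. Returns how many buffers were released.
 */
static size_t pool_reclaim(struct tx_pool *p, uint32_t reported)
{
	size_t i, n = 0;

	for (i = 0; i < MAX_BUFS; i++) {
		struct tx_buf *b = &p->bufs[i];

		if (b->in_flight && seq_before(b->compl_seq, reported)) {
			b->in_flight = 0;
			n++;
		}
	}
	return n;
}
```

pool_reclaim() would be called with the value from the recvmsg
ancillary data, or with the getsockopt value when the timeout fires.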

Using recvmsg data in this manner is sort of a cheap way to get a
"callback" for when a vmspliced buffer is consumed.  It will work
well for a client where the response causes recvmsg to return.
On the server side it works well if there are a sufficient
number of requests coming on the connection (resorting to the
timeout if necessary as described above).

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..6dc1ed8 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,7 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e64f4c6..f044aff 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -106,6 +106,7 @@ enum {
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK         17      /* Fast retrans. after 1 dupack */
 #define TCP_USER_TIMEOUT	18	/* How long for loss retry before timeout */
+#define TCP_XMIT_COMPL_SEQ	19	/* Return current snd_una */
 
 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
diff --git a/include/net/sock.h b/include/net/sock.h
index 8ae97c4..e820e2b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -543,6 +543,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
+	SOCK_XMIT_COMPL_SEQ, /* SO_XMIT_COMPL_SEQ setting */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a06c4..7a10215 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -740,6 +740,12 @@ set_rcvbuf:
 		else
 			sock_reset_flag(sk, SOCK_RXQ_OVFL);
 		break;
+	case SO_XMIT_COMPL_SEQ:
+		if (valbool)
+			sock_set_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		else
+			sock_reset_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -961,6 +967,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
 		break;
 
+	case SO_XMIT_COMPL_SEQ:
+		v.val = !!sock_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e8a4db..5e30381 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1387,6 +1387,21 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 EXPORT_SYMBOL(tcp_read_sock);
 
 /*
+ * Copy the first unacked seq into the receive msg control part.
+ */
+static inline void tcp_sock_xmit_compl_seq(struct msghdr *msg,
+					   struct sock *sk)
+{
+	if (sock_flag(sk, SOCK_XMIT_COMPL_SEQ)) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		if (msg->msg_controllen >= sizeof(tp->snd_una)) {
+			put_cmsg(msg, SOL_SOCKET, SCM_XMIT_COMPL_SEQ,
+			    sizeof(tp->snd_una), &tp->snd_una);
+		}
+	}
+}
+
+/*
  *	This routine copies from a sock struct into the user buffer.
  *
  *	Technical note: in 2.3 we work on _locked_ socket, so that
@@ -1763,6 +1778,8 @@ skip_copy:
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
+	tcp_sock_xmit_compl_seq(msg, sk);
+
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
@@ -2617,6 +2634,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
+	case TCP_XMIT_COMPL_SEQ:
+		val = tp->snd_una;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}


* Re: [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
  2010-09-19 21:55 [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers Tom Herbert
@ 2010-09-20  0:35 ` David Miller
  2010-09-20  3:22 ` Changli Gao
  1 sibling, 0 replies; 5+ messages in thread
From: David Miller @ 2010-09-20  0:35 UTC (permalink / raw)
  To: therbert; +Cc: netdev, sridharr

From: Tom Herbert <therbert@google.com>
Date: Sun, 19 Sep 2010 14:55:40 -0700 (PDT)

> diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
> index 9a6115e..6dc1ed8 100644
> --- a/include/asm-generic/socket.h
> +++ b/include/asm-generic/socket.h
> @@ -64,4 +64,7 @@
>  #define SO_DOMAIN		39
>  
>  #define SO_RXQ_OVFL             40
> +
> +#define SO_XMIT_COMPL_SEQ	41
> +#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
>  #endif /* __ASM_GENERIC_SOCKET_H */

Tom, I just noticed that you're going to have to update all of
the arch specific arch/*/include/asm/socket.h files that don't
make use of asm-generic/socket.h before I can apply this.

Please fix this up and resubmit, thanks!


* Re: [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
  2010-09-19 21:55 [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers Tom Herbert
  2010-09-20  0:35 ` David Miller
@ 2010-09-20  3:22 ` Changli Gao
  2010-09-20 17:50   ` Tom Herbert
  1 sibling, 1 reply; 5+ messages in thread
From: Changli Gao @ 2010-09-20  3:22 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, davem, sridharr

On Mon, Sep 20, 2010 at 5:55 AM, Tom Herbert <therbert@google.com> wrote:
> In this patch we propose to add a socket API to retrieve the
> "transmit completion sequence number", essentially a byte counter
> for the number of bytes that have been transmitted and will not be
> retransmitted.  In the case of TCP, this should correspond to snd_una.
>
> The purpose of this API is to provide information to userspace about
> which buffers can be reclaimed when sending with vmsplice() on a
> socket.
>
> There are two methods for retrieving the completed sequence number:
> through a simple getsockopt (implemented here for TCP), as well as
> returning the value in the ancillary data of a recvmsg.
>
> The expected flow would be something like:
>   - Connection is created
>   - Initial completion seq # is retrieved through the sockopt, and is
>     stored in userspace "compl_seq" variable for the connection.
>   - Whenever a send is done, compl_seq += # bytes sent.
>   - When doing a vmsplice the completion sequence number is saved
>     for each user space buffer, buffer_compl_seq = compl_seq.
>   - When recvmsg returns with a completion sequence number in
>     ancillary data, any buffers covered by that sequence number
>     (where buffer_compl_seq < recvmsg_compl_seq) are reclaimed
>     and can be written to again.
>   - If no data is received on a connection (recvmsg does not
>     return), a timeout can be used to call the getsockopt and
>     reclaim buffers as a fallback.
>
> Using recvmsg data in this manner is sort of a cheap way to get a
> "callback" for when a vmspliced buffer is consumed.  It will work
> well for a client where the response causes recvmsg to return.

In this case, we don't need it. The response itself is sufficient to
indicate that the request has been sent out and that the corresponding
pages are available for other use.

> On the server side it works well if there are a sufficient
> number of requests coming on the connection (resorting to the
> timeout if necessary as described above).

If no response is sent out by the client, the timeout mechanism may
slow down the data transfer.

I think we should not mix RX and TX. Maybe we can add a new epoll
event to indicate that the send buffer has advanced.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
  2010-09-20  3:22 ` Changli Gao
@ 2010-09-20 17:50   ` Tom Herbert
  2010-09-21  0:14     ` Changli Gao
  0 siblings, 1 reply; 5+ messages in thread
From: Tom Herbert @ 2010-09-20 17:50 UTC (permalink / raw)
  To: Changli Gao; +Cc: netdev, davem, sridharr

>
> > Using recvmsg data in this manner is sort of a cheap way to get a
> > "callback" for when a vmspliced buffer is consumed.  It will work
> > well for a client where the response causes recvmsg to return.
>
> In this case, we don't need it. The response itself is sufficient to
> indicate that the request has been sent out and that the corresponding
> pages are available for other use.

The response might seem sufficient, but there is no guarantee this
works generally for every application protocol.  The specific
requirement for TCP is that the page cannot be reused before its data
has been acked; if this does not hold then data corruption becomes an
issue.  Given that danger, I would rather the mechanism be unambiguous
about when the data has been consumed, rather than each application
protocol trying to figure this out using its heuristics.

>
> > On the server side it works well if there are a sufficient
> > number of requests coming on the connection (resorting to the
> > timeout if necessary as described above).
>
> If no response is sent out by the client, the timeout mechanism may
> slow down the data transfer.
>
Why would that slow down a transfer?

>
> I think we should not mix RX and TX. Maybe we can add a new epoll
> event to indicate that the send buffer has advanced.
>
This is possible, but seems more heavyweight than needed.  I don't see
that it's going to be critical to reclaim buffers as soon as they have
been completed; some laziness in doing this seems like a reasonable
tradeoff, and we should still have a reasonable working set.

Tom

> --
> Regards,
> Changli Gao(xiaosuo@gmail.com)


* Re: [PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
  2010-09-20 17:50   ` Tom Herbert
@ 2010-09-21  0:14     ` Changli Gao
  0 siblings, 0 replies; 5+ messages in thread
From: Changli Gao @ 2010-09-21  0:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, davem, sridharr

On Tue, Sep 21, 2010 at 1:50 AM, Tom Herbert <therbert@google.com> wrote:
>
>>
>> > On the server side it works well if there are a sufficient
>> > number of requests coming on the connection (resorting to the
>> > timeout if necessary as described above).
>>
>> If no response is sent out by the client, the timeout mechanism may
>> slow down the data transfer.
>>
> Why would that slow down a transfer?


vmsplice(sock, page).
epoll(sock.read_event, timeo);
/* time out */
if_the_page_is_available.
  reclaim_page_and_send_the_next_data.

The page may be available before the timeout, so you may not send the
next data in time.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

