netdev.vger.kernel.org archive mirror
* [RFC PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
@ 2010-09-14 20:28 Tom Herbert
  2010-09-16  4:38 ` David Miller
  0 siblings, 1 reply; 3+ messages in thread
From: Tom Herbert @ 2010-09-14 20:28 UTC
  To: netdev, davem; +Cc: sridharr

This patch proposes adding a socket API to retrieve the
"transmit completion sequence number", essentially a byte counter
for the number of bytes that have been transmitted and will not be
retransmitted.  In the case of TCP, this should correspond to snd_una.

The purpose of this API is to provide information to userspace about
which buffers can be reclaimed when sending with vmsplice() on a
socket.

There are two methods for retrieving the completed sequence number:
a simple getsockopt (implemented here for TCP), and a value
returned in the ancillary data of a recvmsg.

The expected flow would be something like:
   - Connection is created
   - Initial completion seq # is retrieved through the sockopt, and is
     stored in userspace "compl_seq" variable for the connection.
   - Whenever a send is done, compl_seq += # bytes sent.
   - When doing a vmsplice the completion sequence number is saved
     for each user space buffer, buffer_compl_seq = compl_seq.
   - When recvmsg returns with a completion sequence number in
     ancillary data, any buffers covered by that sequence number
     (where buffer_compl_seq < recvmsg_compl_seq) are reclaimed
     and can be written to again.
   - If no data is received on a connection (recvmsg does not
     return), a timeout can be used to call the getsockopt and
     reclaim buffers as a fallback.
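
The bookkeeping in the flow above could be sketched in userspace roughly
as follows.  All names here (compl_tracker, tracker_sent,
tracker_reclaim, MAX_TRACKED) are hypothetical.  The sketch assumes each
buffer records the sequence number just past its last byte, and treats a
buffer as reclaimable once the completed sequence has reached that
value, using a wrap-safe comparison; the strict "<" in the description
above is slightly more conservative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64	/* hypothetical capacity of the pending FIFO */

struct tracked_buf {
	void    *base;		/* start of the vmspliced user buffer */
	uint32_t end_seq;	/* sequence just past the buffer's last byte */
};

struct compl_tracker {
	uint32_t compl_seq;	/* sequence after all bytes sent so far */
	struct tracked_buf pending[MAX_TRACKED];	/* oldest first */
	size_t head, count;
};

/* Wrap-safe "a is at or before b" for 32-bit sequence numbers. */
int seq_leq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* initial_seq is the value read once via the sockopt after connect. */
void tracker_init(struct compl_tracker *t, uint32_t initial_seq)
{
	t->compl_seq = initial_seq;
	t->head = 0;
	t->count = 0;
}

/* Call after a vmsplice() that queued len bytes starting at base. */
void tracker_sent(struct compl_tracker *t, void *base, uint32_t len)
{
	size_t tail;

	assert(t->count < MAX_TRACKED);
	t->compl_seq += len;
	tail = (t->head + t->count) % MAX_TRACKED;
	t->pending[tail].base = base;
	t->pending[tail].end_seq = t->compl_seq;
	t->count++;
}

/*
 * Release every buffer fully covered by acked_seq, oldest first.
 * Returns the number of buffers released; release may be NULL.
 */
size_t tracker_reclaim(struct compl_tracker *t, uint32_t acked_seq,
		       void (*release)(void *base))
{
	size_t freed = 0;

	while (t->count > 0 &&
	       seq_leq(t->pending[t->head].end_seq, acked_seq)) {
		if (release)
			release(t->pending[t->head].base);
		t->head = (t->head + 1) % MAX_TRACKED;
		t->count--;
		freed++;
	}
	return freed;
}
```

An application would call tracker_reclaim() with the value from each
recvmsg that carries a completion sequence, or with the getsockopt
value when the fallback timeout fires.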

Using recvmsg data in this manner is sort of a cheap way to get a
"callback" for when a vmspliced buffer is consumed.  It will work
well for a client where the response causes recvmsg to return.
On the server side it works well if a sufficient number of
requests arrive on the connection (resorting to the timeout
if necessary, as described above).

---
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..6dc1ed8 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,7 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+
+#define SO_XMIT_COMPL_SEQ	41
+#define SCM_XMIT_COMPL_SEQ	SO_XMIT_COMPL_SEQ
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e64f4c6..f044aff 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -106,6 +106,7 @@ enum {
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK         17      /* Fast retrans. after 1 dupack */
 #define TCP_USER_TIMEOUT	18	/* How long for loss retry before timeout */
+#define TCP_XMIT_COMPL_SEQ	19	/* Return current snd_una */
 
 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
diff --git a/include/net/sock.h b/include/net/sock.h
index 8ae97c4..e820e2b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -543,6 +543,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
+	SOCK_XMIT_COMPL_SEQ, /* SO_XMIT_COMPL_SEQ setting */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a06c4..7a10215 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -740,6 +740,12 @@ set_rcvbuf:
 		else
 			sock_reset_flag(sk, SOCK_RXQ_OVFL);
 		break;
+	case SO_XMIT_COMPL_SEQ:
+		if (valbool)
+			sock_set_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		else
+			sock_reset_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -961,6 +967,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
 		break;
 
+	case SO_XMIT_COMPL_SEQ:
+		v.val = !!sock_flag(sk, SOCK_XMIT_COMPL_SEQ);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e8a4db..5e30381 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1387,6 +1387,21 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 EXPORT_SYMBOL(tcp_read_sock);
 
 /*
+ * Copy the first unacked seq into the receive msg control part.
+ */
+static inline void tcp_sock_xmit_compl_seq(struct msghdr *msg,
+					   struct sock *sk)
+{
+	if (sock_flag(sk, SOCK_XMIT_COMPL_SEQ)) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		if (msg->msg_controllen >= sizeof(tp->snd_una)) {
+			put_cmsg(msg, SOL_SOCKET, SCM_XMIT_COMPL_SEQ,
+			    sizeof(tp->snd_una), &tp->snd_una);
+		}
+	}
+}
+
+/*
  *	This routine copies from a sock struct into the user buffer.
  *
  *	Technical note: in 2.3 we work on _locked_ socket, so that
@@ -1763,6 +1778,8 @@ skip_copy:
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
+	tcp_sock_xmit_compl_seq(msg, sk);
+
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
@@ -2617,6 +2634,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
+	case TCP_XMIT_COMPL_SEQ:
+		val = tp->snd_una;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
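
For reference, a userspace reader of the ancillary data added above
might look roughly like this.  It is a sketch under the assumption that
the kernel carries this patch and that SO_XMIT_COMPL_SEQ has been
enabled on the socket; SCM_XMIT_COMPL_SEQ_PROPOSED and
find_xmit_compl_seq() are illustrative names, and the value 41 is copied
from the patch rather than from any mainline header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* ASSUMPTION: cmsg type taken from this RFC patch, not from mainline. */
#define SCM_XMIT_COMPL_SEQ_PROPOSED 41

/*
 * Hypothetical helper: scan the control messages filled in by recvmsg()
 * for a transmit completion sequence.
 * Returns 1 and stores it in *seq if present, 0 otherwise.
 */
int find_xmit_compl_seq(struct msghdr *msg, uint32_t *seq)
{
	struct cmsghdr *c;

	assert(msg != NULL && seq != NULL);
	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		if (c->cmsg_level == SOL_SOCKET &&
		    c->cmsg_type == SCM_XMIT_COMPL_SEQ_PROPOSED &&
		    c->cmsg_len >= CMSG_LEN(sizeof(*seq))) {
			/* CMSG_DATA may be unaligned for uint32_t; copy out. */
			memcpy(seq, CMSG_DATA(c), sizeof(*seq));
			return 1;
		}
	}
	return 0;
}
```

The caller is expected to size msg_control with
CMSG_SPACE(sizeof(uint32_t)) or more before calling recvmsg().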

* Re: [RFC PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
  2010-09-14 20:28 [RFC PATCH] xmit_compl_seq: information to reclaim vmsplice buffers Tom Herbert
@ 2010-09-16  4:38 ` David Miller
       [not found]   ` <AANLkTimoHcbpyp95a23GXxWO3gQTxB6SJuP0WSJf1DB-@mail.gmail.com>
  0 siblings, 1 reply; 3+ messages in thread
From: David Miller @ 2010-09-16  4:38 UTC
  To: therbert; +Cc: netdev, sridharr


And SIOCOUTQ doesn't work because?

It tells you how much of the current queued data hasn't
been sequentially ACK'd yet, from which you can derive
which buffers are still in-use.
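
A minimal sketch of the SIOCOUTQ derivation described here, for Linux
TCP sockets.  outq_bytes() and completed_from_outq() are hypothetical
helper names, and total_written is assumed to be a counter the
application maintains itself for every byte it queues on the socket.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>	/* SIOCOUTQ */

/* Bytes queued on the socket that are not yet ACKed, or -1 on error. */
int outq_bytes(int fd)
{
	int pending = 0;

	if (ioctl(fd, SIOCOUTQ, &pending) < 0)
		return -1;
	return pending;
}

/*
 * Completed byte position = bytes ever written - bytes still outstanding.
 * total_written is application-maintained (modulo 2^32); pending must be
 * a successful outq_bytes() result sampled at the same instant.
 */
uint32_t completed_from_outq(uint32_t total_written, int pending)
{
	assert(pending >= 0);
	return total_written - (uint32_t)pending;
}
```

The result is only as accurate as the pairing of total_written with the
SIOCOUTQ sample, so a writer running concurrently with this query would
need to be synchronized with it.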

* Re: [RFC PATCH] xmit_compl_seq: information to reclaim vmsplice buffers
       [not found]   ` <AANLkTimoHcbpyp95a23GXxWO3gQTxB6SJuP0WSJf1DB-@mail.gmail.com>
@ 2010-09-19 18:35     ` David Miller
  0 siblings, 0 replies; 3+ messages in thread
From: David Miller @ 2010-09-19 18:35 UTC
  To: therbert; +Cc: netdev, sridharr

From: Tom Herbert <therbert@google.com>
Date: Thu, 16 Sep 2010 12:09:08 -0700

> We return the seq # as part of receive message since: 1) the socket is
> already being accessed in the recvmsg, so tacking on this data should be
> cheap; 2) the recvmsg may often coincide with an acknowledgment that would
> allow buffers to be reclaimed (esp. in response to a client request); 3) this
> could also be achieved by another system call after recvmsg, but then we're
> adding the cost of the system call.
> 
> The recvmsg and sockopt return the sequence number of the first unacknowledged
> data, as opposed to the number of bytes outstanding.  The sequence number is
> not a relative value for our purposes, but the other is.  Given just the
> number of bytes outstanding, we would also need the # bytes that have ever
> been written by the application at that instant to compute the completed byte
> number for reclaiming buffers, so we would need synchronization between
> the read and write paths in the application (lock needed).

Ok, I'm convinced, thanks for explaining.

Please address the other feedback you've received (if any) and formally
submit this for inclusion into net-next-2.6.

Thanks.
