Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: Yinghai Lu @ 2013-03-20 21:01 UTC (permalink / raw)
  To: David Miller, Zhang Rui; +Cc: yamato, netdev
In-Reply-To: <20130320.120705.58997943737455552.davem@davemloft.net>

On Wed, Mar 20, 2013 at 9:07 AM, David Miller <davem@davemloft.net> wrote:
> From: Masatake YAMATO <yamato@redhat.com>
> Date: Tue, 19 Mar 2013 20:47:27 +0900
>
>> Trigger BUG_ON if a group name is longer than GENL_NAMSIZ.
>>
>> Signed-off-by: Masatake YAMATO <yamato@redhat.com>
>
> Applied, thanks.

catch one in

drivers/thermal/thermal_sys.c::genetlink_init

        result = genl_register_mc_group(&thermal_event_genl_family,
                                        &thermal_event_mcgrp);

and

static struct genl_multicast_group thermal_event_mcgrp = {
        .name = THERMAL_GENL_MCAST_GROUP_NAME,
};

#define THERMAL_GENL_MCAST_GROUP_NAME           "thermal_mc_group"

that string len 16.

^ permalink raw reply

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: David Miller @ 2013-03-20 21:05 UTC (permalink / raw)
  To: yinghai; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <CAE9FiQURBR=iTbksUzb_i23g6N56Acg2J8bab9zr1UcwT5tWzA@mail.gmail.com>

From: Yinghai Lu <yinghai@kernel.org>
Date: Wed, 20 Mar 2013 14:01:09 -0700

> On Wed, Mar 20, 2013 at 9:07 AM, David Miller <davem@davemloft.net> wrote:
>> From: Masatake YAMATO <yamato@redhat.com>
>> Date: Tue, 19 Mar 2013 20:47:27 +0900
>>
>>> Trigger BUG_ON if a group name is longer than GENL_NAMSIZ.
>>>
>>> Signed-off-by: Masatake YAMATO <yamato@redhat.com>
>>
>> Applied, thanks.
> 
> catch one in

Seriously?

That's what his second patch in this series fixes, we know about it
already.

^ permalink raw reply

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: Yinghai Lu @ 2013-03-20 21:06 UTC (permalink / raw)
  To: David Miller; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <20130320.170511.343122156621608721.davem@davemloft.net>

On Wed, Mar 20, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote:
> From: Yinghai Lu <yinghai@kernel.org>
> Date: Wed, 20 Mar 2013 14:01:09 -0700
>
>> On Wed, Mar 20, 2013 at 9:07 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Masatake YAMATO <yamato@redhat.com>
>>> Date: Tue, 19 Mar 2013 20:47:27 +0900
>>>
>>>> Trigger BUG_ON if a group name is longer than GENL_NAMSIZ.
>>>>
>>>> Signed-off-by: Masatake YAMATO <yamato@redhat.com>
>>>
>>> Applied, thanks.
>>
>> catch one in
>
> Seriously?
>
> That's what his second patch in this series fixes, we know about it
> already.

[    6.047966] ------------[ cut here ]------------
[    6.048944] kernel BUG at net/netlink/genetlink.c:145!
[    6.049204] invalid opcode: 0000 [#1] SMP
[    6.049501] Modules linked in:
[    6.049728] CPU 2
[    6.050003] Pid: 1, comm: swapper/0 Not tainted
3.9.0-rc3-yh-00486-gb6d1e23-dirty #1371 Bochs Bochs
[    6.050299] RIP: 0010:[<ffffffff81f0f968>]  [<ffffffff81f0f968>]
genl_register_mc_group+0x38/0x270
[    6.050593] RSP: 0000:ffff880119e63dd8  EFLAGS: 00000246
[    6.050804] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff82b667c8
[    6.050931] RDX: ffffffff82b667c8 RSI: 0000000000000000 RDI: ffffffff82b667b8
[    6.050931] RBP: ffff880119e63e28 R08: 0000000000000000 R09: 0000000000000000
[    6.050931] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff82b667a0
[    6.050931] R13: ffffffff82b66720 R14: 00000001683fe47c R15: 0000000000000000
[    6.050931] FS:  0000000000000000(0000) GS:ffff88019f600000(0000)
knlGS:0000000000000000
[    6.050931] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    6.050931] CR2: 0000000000000000 CR3: 0000000002a14000 CR4: 00000000000006e0
[    6.050931] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.050931] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    6.050931] Process swapper/0 (pid: 1, threadinfo ffff880119e62000,
task ffff880119e68000)
[    6.050931] Stack:
[    6.050931]  ffff880119e63de8 ffffffff82b66720 ffffffff82dcb082
0000000000000000
[    6.050931]  ffff880119e63e28 ffffffff81f0f901 0000000000000000
ffffffff82dcb082
[    6.050931]  0000000000000000 00000001683fe47c ffff880119e63e48
ffffffff82dcb0f8
[    6.050931] Call Trace:
[    6.050931]  [<ffffffff82dcb082>] ? smsc47b397_init+0x19d/0x19d
[    6.050931]  [<ffffffff81f0f901>] ? genl_register_family+0x1b1/0x1e0
[    6.050931]  [<ffffffff82dcb082>] ? smsc47b397_init+0x19d/0x19d
[    6.050931]  [<ffffffff82dcb0f8>] thermal_init+0x76/0x8e
[    6.050931]  [<ffffffff81000254>] do_one_initcall+0x64/0x160
[    6.050931]  [<ffffffff82d8a321>] kernel_init_freeable+0x1bf/0x251
[    6.050931]  [<ffffffff82d89a71>] ? loglevel+0x31/0x31
[    6.050931]  [<ffffffff81ff0740>] ? rest_init+0xc0/0xc0
[    6.050931]  [<ffffffff81ff074e>] kernel_init+0xe/0xf0
[    6.050931]  [<ffffffff8202999c>] ret_from_fork+0x7c/0xb0
[    6.050931]  [<ffffffff81ff0740>] ? rest_init+0xc0/0xc0
[    6.050931] Code: 41 54 49 89 f4 53 48 83 ec 30 80 7e 18 00 75 03
0f 0b 90 49 89 fd 48 8d 7e 18 ba 10 00 00 00 31 f6 e8 3d 01 5e ff 48
85 c0 75 08 <0f> 0b 66 0f 1f 44 00 00 e8 7b f3 ff ff 49 81 fc a0 20 b8
82 74
[    6.050931] RIP  [<ffffffff81f0f968>] genl_register_mc_group+0x38/0x270
[    6.050931]  RSP <ffff880119e63dd8>
[    6.063404] ---[ end trace 60aaac53e4d8e418 ]---

^ permalink raw reply

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: David Miller @ 2013-03-20 21:13 UTC (permalink / raw)
  To: yinghai; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <CAE9FiQXgv8EZGQ142NyV5D1M2+HKSaSoafNzdXY2SBO+iQjdGw@mail.gmail.com>

From: Yinghai Lu <yinghai@kernel.org>
Date: Wed, 20 Mar 2013 14:06:06 -0700

> On Wed, Mar 20, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote:
>> From: Yinghai Lu <yinghai@kernel.org>
>> Date: Wed, 20 Mar 2013 14:01:09 -0700
>>
>>> On Wed, Mar 20, 2013 at 9:07 AM, David Miller <davem@davemloft.net> wrote:
>>>> From: Masatake YAMATO <yamato@redhat.com>
>>>> Date: Tue, 19 Mar 2013 20:47:27 +0900
>>>>
>>>>> Trigger BUG_ON if a group name is longer than GENL_NAMSIZ.
>>>>>
>>>>> Signed-off-by: Masatake YAMATO <yamato@redhat.com>
>>>>
>>>> Applied, thanks.
>>>
>>> catch one in
>>
>> Seriously?
>>
>> That's what his second patch in this series fixes, we know about it
>> already.
> 
> [    6.047966] ------------[ cut here ]------------
> [    6.048944] kernel BUG at net/netlink/genetlink.c:145!

For the second time, we know about this bug.

His second patch fixes the themal layer bug.

But I told him to send it to the thermal maintainers so that they
can integrate it.

Why are you posting what we know already over and over?

^ permalink raw reply

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: Yinghai Lu @ 2013-03-20 21:16 UTC (permalink / raw)
  To: David Miller; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <20130320.171305.1359445944793447383.davem@davemloft.net>

On Wed, Mar 20, 2013 at 2:13 PM, David Miller <davem@davemloft.net> wrote:
>>
>> [    6.047966] ------------[ cut here ]------------
>> [    6.048944] kernel BUG at net/netlink/genetlink.c:145!
>
> For the second time, we know about this bug.
>
> His second patch fixes the themal layer bug.
>
> But I told him to send it to the thermal maintainers so that they
> can integrate it.
>
> Why are you posting what we know already over and over?

so bisecting will be broken?

^ permalink raw reply

* [PATCH 0/3] Clean up and fix SCM_CREDENTIALS code
From: Andy Lutomirski @ 2013-03-20 21:38 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Andy Lutomirski

The code is incomprehensible, slow, and buggy.

Andy Lutomirski (3):
  net: Clean up SCM_CREDENTIALS code
  netlink: Remove an unused pointer in netlink_skb_parms
  net: Remove sock_iocb.scm

 include/linux/netlink.h  |  1 -
 include/net/af_unix.h    |  6 ++--
 include/net/scm.h        | 83 ++++++++++++++++++++++++++++++++----------------
 include/net/sock.h       |  1 -
 net/core/scm.c           | 49 ++++++++++------------------
 net/netlink/af_netlink.c | 31 ++++++++----------
 net/socket.c             |  2 --
 net/unix/af_unix.c       | 83 +++++++++++++++++-------------------------------
 8 files changed, 118 insertions(+), 138 deletions(-)

-- 
1.8.1.4

^ permalink raw reply

* [PATCH 1/3] net: Clean up SCM_CREDENTIALS code
From: Andy Lutomirski @ 2013-03-20 21:38 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Andy Lutomirski
In-Reply-To: <cover.1363815201.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

I was curious whether the uids, gids, and pids passed around worked
correctly in the presence of multiple namespaces.  I gave up trying
to figure it out: there are two copies of the pid (one of which has
type u32, which is odd), a struct cred * (!), and a separate kuid
and kgid.  IOW, all of the relevant data is stored twice, and it's
unclear which copy is used when.

I also wondered what prevented a SO_CREDENTIALS message from being
recieved when the credentials weren't filled out.  Answer: not very
much (and there have been serious security bugs here in the past).

So just rewrite the thing to store a pid_t relative to the init pid
ns, a kuid, and a kgid, and to explicitly track whether the data is
filled out.

I haven't played with the secid code.  I have no idea whether it has
similar problems.

I haven't benchmarked this, but it should be a respectable speedup
in the cases where the credentials are in use.

Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
---

Before, the program below printed this:

$ passcred
My pid = 19873
No SO_PASSCRED: uid=65534 gid=65534 pid=0  [this is a bug]
SO_PASSCRED: uid=1000 gid=1000 pid=19873
SO_PASSCRED, forked to pid 19874: uid=1000 gid=1000 pid=19874

# passcred
My pid = 19886
No SO_PASSCRED: uid=65534 gid=65534 pid=0
SO_PASSCRED: uid=0 gid=0 pid=19886
SO_PASSCRED, forked to pid 19887: uid=0 gid=0 pid=19887


After:

# passcred
My pid = 83
No creds received
SO_PASSCRED: uid=0 gid=0 pid=83
SO_PASSCRED, forked to pid 84: uid=0 gid=0 pid=0


#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

static void send_str(int fd, const char *msg)
{
	if (send(fd, msg, strlen(msg)+1, 0) < 0)
		err(1, "send");
}

int main()
{
	int fds[2];
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds))
		err(1, "socketpair");

	send_str(fds[1], "No SO_PASSCRED");

	int one = 1;
	if (setsockopt(fds[0], SOL_SOCKET, SO_PASSCRED, &one, sizeof(one)) != 0)
		err(1, "SO_PASSCRED");

	send_str(fds[1], "SO_PASSCRED");

	if (fork() == 0) {
		char msg[1024];
		sprintf(msg, "SO_PASSCRED, forked to pid %ld", (long)getpid());
		send_str(fds[1], msg);
		return 0;
	}

	int status;
	wait(&status);
	printf("My pid = %ld\n", getpid());

	for (int i = 0; i < 3; i++) {
		char buf[1024];
		char cbuf[CMSG_SPACE(sizeof(struct ucred))];
		struct iovec iov;
		iov.iov_base = &buf;
		iov.iov_len = sizeof(buf);
		struct msghdr hdr;
		memset(&hdr, 0, sizeof(hdr));
		hdr.msg_iov = &iov;
		hdr.msg_iovlen = 1;
		hdr.msg_control = cbuf;
		hdr.msg_controllen = sizeof(cbuf);
		ssize_t bytes = recvmsg(fds[0], &hdr, 0);
		if (bytes < 0)
			err(1, "recvmsg");

		if (hdr.msg_flags & (MSG_TRUNC | MSG_CTRUNC))
			errx(1, "truncated");

		struct ucred cred;
		bool ok = false;

		for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(&hdr); cmsg; cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
			if (cmsg->cmsg_level == SOL_SOCKET &&
			    cmsg->cmsg_type == SCM_CREDENTIALS) {
				cred = *((struct ucred *)CMSG_DATA(cmsg));
				ok = true;
				break;
			}
		}
		if (!ok) {
			printf("No creds received\n");
		} else {
			printf("%s: uid=%ld gid=%ld pid=%ld\n",
			       buf, (long)cred.uid, (long)cred.gid, (long)cred.pid);
		}
	}

	return 0;
}


 include/net/af_unix.h    |  6 ++--
 include/net/scm.h        | 83 ++++++++++++++++++++++++++++++++----------------
 net/core/scm.c           | 49 ++++++++++------------------
 net/netlink/af_netlink.c | 14 +++++---
 net/unix/af_unix.c       | 35 +++++++++-----------
 5 files changed, 99 insertions(+), 88 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 0a996a3..7874f3e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -27,12 +27,12 @@ struct unix_address {
 	struct sockaddr_un name[0];
 };
 
+/* This structure is identical to struct scm_cookie. */
 struct unix_skb_parms {
-	struct pid		*pid;		/* Skb credentials	*/
-	const struct cred	*cred;
+	struct scm_creds	creds;
 	struct scm_fp_list	*fp;		/* Passed files		*/
 #ifdef CONFIG_SECURITY_NETWORK
-	u32			secid;		/* Security ID		*/
+	u32			secid;		/* Passed security ID 	*/
 #endif
 };
 
diff --git a/include/net/scm.h b/include/net/scm.h
index 975cca0..f6f0626 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -13,9 +13,24 @@
 #define SCM_MAX_FD	253
 
 struct scm_creds {
-	u32	pid;
-	kuid_t	uid;
-	kgid_t	gid;
+	bool has_creds;
+
+	/*
+	 * Keeping reference counts (as to a struct pid *) in here is
+	 * annoying -- things like skb_set_owner_[rw] and skb_clone assume
+	 * that it's ok to memcpy skb->cb around.
+	 *
+	 * Fortunately (?) anything that uses the pid field in SCM_CREDENTIALS
+	 * is fundamentally racy, since the networking code certainly isn't
+	 * going to keep a reference alive *after* recvmsg.  So let's embrace
+	 * the race condition at the cost of an extra hash lookup on receive.
+	 *
+	 * (There's an added benefit here: this approach doesn't write to
+	 * any shared cachelines.)
+	 */
+	pid_t		init_ns_pid;
+	kuid_t		uid;
+	kgid_t		gid;
 };
 
 struct scm_fp_list {
@@ -25,15 +40,15 @@ struct scm_fp_list {
 };
 
 struct scm_cookie {
-	struct pid		*pid;		/* Skb credentials */
-	const struct cred	*cred;
+	struct scm_creds	creds;
 	struct scm_fp_list	*fp;		/* Passed files		*/
-	struct scm_creds	creds;		/* Skb credentials	*/
 #ifdef CONFIG_SECURITY_NETWORK
 	u32			secid;		/* Passed security ID 	*/
 #endif
 };
 
+#define SCM_COOKIE_INIT {}  /* All zeros is good. */
+
 extern void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm);
 extern void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm);
 extern int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm);
@@ -50,39 +65,47 @@ static __inline__ void unix_get_peersec_dgram(struct socket *sock, struct scm_co
 { }
 #endif /* CONFIG_SECURITY_NETWORK */
 
-static __inline__ void scm_set_cred(struct scm_cookie *scm,
-				    struct pid *pid, const struct cred *cred)
+static __inline__ bool scm_creds_equal(const struct scm_creds *a,
+                                       const struct scm_creds *b)
 {
-	scm->pid  = get_pid(pid);
-	scm->cred = cred ? get_cred(cred) : NULL;
-	scm->creds.pid = pid_vnr(pid);
-	scm->creds.uid = cred ? cred->euid : INVALID_UID;
-	scm->creds.gid = cred ? cred->egid : INVALID_GID;
+	if (a->has_creds)
+		return a->init_ns_pid == b->init_ns_pid &&
+			uid_eq(a->uid, b->uid) && gid_eq(a->gid, b->gid);
+	else
+		return !b->has_creds;
 }
 
-static __inline__ void scm_destroy_cred(struct scm_cookie *scm)
+static __inline__ void scm_creds_from_current(struct scm_creds *creds)
 {
-	put_pid(scm->pid);
-	scm->pid  = NULL;
+	const struct cred *cred = get_current_cred();
+	creds->has_creds = true;
+	creds->init_ns_pid = task_tgid_nr(current);
+	creds->uid = cred->uid;
+	creds->gid = cred->gid;
+}
 
-	if (scm->cred)
-		put_cred(scm->cred);
-	scm->cred = NULL;
+static __inline__ void scm_creds_from_kernel(struct scm_creds *creds)
+{
+	creds->has_creds = true;
+	creds->init_ns_pid = 0;
+	creds->uid = GLOBAL_ROOT_UID;
+	creds->gid = GLOBAL_ROOT_GID;
+}
+
+static __inline__ void scm_creds_wipe(struct scm_creds *creds)
+{
+	creds->has_creds = false;
 }
 
 static __inline__ void scm_destroy(struct scm_cookie *scm)
 {
-	scm_destroy_cred(scm);
 	if (scm->fp)
 		__scm_destroy(scm);
 }
 
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
-			       struct scm_cookie *scm, bool forcecreds)
+			       struct scm_cookie *scm)
 {
-	memset(scm, 0, sizeof(*scm));
-	if (forcecreds)
-		scm_set_cred(scm, task_tgid(current), current_cred());
 	unix_get_peersec_dgram(sock, scm);
 	if (msg->msg_controllen <= 0)
 		return 0;
@@ -120,18 +143,22 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 		return;
 	}
 
-	if (test_bit(SOCK_PASSCRED, &sock->flags)) {
+	if (test_bit(SOCK_PASSCRED, &sock->flags) && scm->creds.has_creds) {
 		struct user_namespace *current_ns = current_user_ns();
 		struct ucred ucreds = {
-			.pid = scm->creds.pid,
 			.uid = from_kuid_munged(current_ns, scm->creds.uid),
 			.gid = from_kgid_munged(current_ns, scm->creds.gid),
 		};
+
+		rcu_read_lock();
+		/* On error, this will result in pid 0. */
+		ucreds.pid = pid_vnr(find_pid_ns(scm->creds.init_ns_pid,
+						 &init_pid_ns));
+		rcu_read_unlock();
+
 		put_cmsg(msg, SOL_SOCKET, SCM_CREDENTIALS, sizeof(ucreds), &ucreds);
 	}
 
-	scm_destroy_cred(scm);
-
 	scm_passec(sock, msg, scm);
 
 	if (!scm->fp)
diff --git a/net/core/scm.c b/net/core/scm.c
index 905dcc6..7dd7534 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -157,8 +157,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 		case SCM_CREDENTIALS:
 		{
 			struct ucred creds;
-			kuid_t uid;
-			kgid_t gid;
+			struct pid *pid;
 			if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct ucred)))
 				goto error;
 			memcpy(&creds, CMSG_DATA(cmsg), sizeof(struct ucred));
@@ -166,41 +165,25 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 			if (err)
 				goto error;
 
-			p->creds.pid = creds.pid;
-			if (!p->pid || pid_vnr(p->pid) != creds.pid) {
-				struct pid *pid;
-				err = -ESRCH;
-				pid = find_get_pid(creds.pid);
-				if (!pid)
-					goto error;
-				put_pid(p->pid);
-				p->pid = pid;
-			}
-
 			err = -EINVAL;
-			uid = make_kuid(current_user_ns(), creds.uid);
-			gid = make_kgid(current_user_ns(), creds.gid);
-			if (!uid_valid(uid) || !gid_valid(gid))
+			p->creds.uid = make_kuid(current_user_ns(), creds.uid);
+			p->creds.gid = make_kgid(current_user_ns(), creds.gid);
+			if (!uid_valid(p->creds.uid) ||
+			    !gid_valid(p->creds.gid))
 				goto error;
 
-			p->creds.uid = uid;
-			p->creds.gid = gid;
-
-			if (!p->cred ||
-			    !uid_eq(p->cred->euid, uid) ||
-			    !gid_eq(p->cred->egid, gid)) {
-				struct cred *cred;
-				err = -ENOMEM;
-				cred = prepare_creds();
-				if (!cred)
-					goto error;
-
-				cred->uid = cred->euid = uid;
-				cred->gid = cred->egid = gid;
-				if (p->cred)
-					put_cred(p->cred);
-				p->cred = cred;
+			rcu_read_lock();
+			pid = find_vpid(creds.pid);
+			if (!pid) {
+				rcu_read_unlock();
+				err = -ESRCH;
+				goto error;
 			}
+			p->creds.init_ns_pid = pid_nr(pid);
+			rcu_read_unlock();
+
+			p->creds.has_creds = true;
+
 			break;
 		}
 		default:
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 1e3fd5b..8245f61 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -936,7 +936,6 @@ static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
 	if (nlk->netlink_rcv != NULL) {
 		ret = skb->len;
 		skb_set_owner_r(skb, sk);
-		NETLINK_CB(skb).ssk = ssk;
 		nlk->netlink_rcv(skb);
 		consume_skb(skb);
 	} else {
@@ -1368,7 +1367,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	u32 dst_group;
 	struct sk_buff *skb;
 	int err;
-	struct scm_cookie scm;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 
 	if (msg->msg_flags&MSG_OOB)
 		return -EOPNOTSUPP;
@@ -1376,7 +1375,8 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (NULL == siocb->scm)
 		siocb->scm = &scm;
 
-	err = scm_send(sock, msg, siocb->scm, true);
+	scm_creds_from_current(&siocb->scm->creds);
+	err = scm_send(sock, msg, siocb->scm);
 	if (err < 0)
 		return err;
 
@@ -1411,7 +1411,9 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 
 	NETLINK_CB(skb).portid	= nlk->portid;
 	NETLINK_CB(skb).dst_group = dst_group;
-	NETLINK_CB(skb).creds	= siocb->scm->creds;
+
+	/* This is mandatory. See netlink_recvmsg. */
+	NETLINK_CB(skb).creds = siocb->scm->creds;
 
 	err = -EFAULT;
 	if (memcpy_fromiovec(skb_put(skb, len), msg->msg_iov, len)) {
@@ -1504,7 +1506,11 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 		memset(&scm, 0, sizeof(scm));
 		siocb->scm = &scm;
 	}
+	/* skbs without creds are from the kernel. */
 	siocb->scm->creds = *NETLINK_CREDS(skb);
+	if (!siocb->scm->creds.has_creds)
+		scm_creds_from_kernel(&siocb->scm->creds);
+
 	if (flags & MSG_TRUNC)
 		copied = data_skb->len;
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 51be64f..0881739 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1338,10 +1338,8 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 
 static void unix_destruct_scm(struct sk_buff *skb)
 {
-	struct scm_cookie scm;
-	memset(&scm, 0, sizeof(scm));
-	scm.pid  = UNIXCB(skb).pid;
-	scm.cred = UNIXCB(skb).cred;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
+	scm.creds = UNIXCB(skb).creds;
 	if (UNIXCB(skb).fp)
 		unix_detach_fds(&scm, skb);
 
@@ -1391,9 +1389,7 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 {
 	int err = 0;
 
-	UNIXCB(skb).pid  = get_pid(scm->pid);
-	if (scm->cred)
-		UNIXCB(skb).cred = get_cred(scm->cred);
+	UNIXCB(skb).creds = scm->creds;
 	UNIXCB(skb).fp = NULL;
 	if (scm->fp && send_fds)
 		err = unix_attach_fds(scm, skb);
@@ -1410,13 +1406,12 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
 static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
 			    const struct sock *other)
 {
-	if (UNIXCB(skb).cred)
+	if (UNIXCB(skb).creds.has_creds)
 		return;
 	if (test_bit(SOCK_PASSCRED, &sock->flags) ||
 	    !other->sk_socket ||
 	    test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
-		UNIXCB(skb).pid  = get_pid(task_tgid(current));
-		UNIXCB(skb).cred = get_current_cred();
+		scm_creds_from_current(&UNIXCB(skb).creds);
 	}
 }
 
@@ -1438,14 +1433,14 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	unsigned int hash;
 	struct sk_buff *skb;
 	long timeo;
-	struct scm_cookie tmp_scm;
+	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
 	int max_level;
 	int data_len = 0;
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
 	wait_for_unix_gc();
-	err = scm_send(sock, msg, siocb->scm, false);
+	err = scm_send(sock, msg, siocb->scm);
 	if (err < 0)
 		return err;
 
@@ -1607,14 +1602,14 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	int err, size;
 	struct sk_buff *skb;
 	int sent = 0;
-	struct scm_cookie tmp_scm;
+	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
 	bool fds_sent = false;
 	int max_level;
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
 	wait_for_unix_gc();
-	err = scm_send(sock, msg, siocb->scm, false);
+	err = scm_send(sock, msg, siocb->scm);
 	if (err < 0)
 		return err;
 
@@ -1765,7 +1760,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 			      int flags)
 {
 	struct sock_iocb *siocb = kiocb_to_siocb(iocb);
-	struct scm_cookie tmp_scm;
+	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
 	int noblock = flags & MSG_DONTWAIT;
@@ -1820,7 +1815,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 		siocb->scm = &tmp_scm;
 		memset(&tmp_scm, 0, sizeof(tmp_scm));
 	}
-	scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+	siocb->scm->creds = UNIXCB(skb).creds;
 	unix_set_secdata(siocb->scm, skb);
 
 	if (!(flags & MSG_PEEK)) {
@@ -1898,7 +1893,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			       int flags)
 {
 	struct sock_iocb *siocb = kiocb_to_siocb(iocb);
-	struct scm_cookie tmp_scm;
+	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
 	struct sockaddr_un *sunaddr = msg->msg_name;
@@ -1991,12 +1986,12 @@ again:
 
 		if (check_creds) {
 			/* Never glue messages from different writers */
-			if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
-			    (UNIXCB(skb).cred != siocb->scm->cred))
+			if (!scm_creds_equal(&UNIXCB(skb).creds,
+			                     &siocb->scm->creds));
 				break;
 		} else {
 			/* Copy credentials */
-			scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+			siocb->scm->creds = UNIXCB(skb).creds;
 			check_creds = 1;
 		}
 
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 2/3] netlink: Remove an unused pointer in netlink_skb_parms
From: Andy Lutomirski @ 2013-03-20 21:38 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, Andy Lutomirski
In-Reply-To: <cover.1363815201.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
---
 include/linux/netlink.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index e0f746b..9ac1201 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -19,7 +19,6 @@ struct netlink_skb_parms {
 	struct scm_creds	creds;		/* Skb credentials	*/
 	__u32			portid;
 	__u32			dst_group;
-	struct sock		*ssk;
 };
 
 #define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
-- 
1.8.1.4

^ permalink raw reply related

* [PATCH 3/3] net: Remove sock_iocb.scm
From: Andy Lutomirski @ 2013-03-20 21:38 UTC (permalink / raw)
  To: netdev; +Cc: Eric W. Biederman, containers, Andy Lutomirski
In-Reply-To: <cover.1363815201.git.luto@amacapital.net>

I'm not sure why this was here.  Any use after sendmsg or recvmsg
returned was bogus anyway.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/net/sock.h       |  1 -
 net/netlink/af_netlink.c | 27 +++++++-------------
 net/socket.c             |  2 --
 net/unix/af_unix.c       | 66 ++++++++++++++++++------------------------------
 4 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 14f6e9d..87b134e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1299,7 +1299,6 @@ struct sock_iocb {
 	int			size;
 	struct socket		*sock;
 	struct sock		*sk;
-	struct scm_cookie	*scm;
 	struct msghdr		*msg, async_msg;
 	struct kiocb		*kiocb;
 };
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8245f61..874bc1f 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1359,7 +1359,6 @@ static void netlink_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
 static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 			   struct msghdr *msg, size_t len)
 {
-	struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
 	struct sockaddr_nl *addr = msg->msg_name;
@@ -1372,11 +1371,8 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (msg->msg_flags&MSG_OOB)
 		return -EOPNOTSUPP;
 
-	if (NULL == siocb->scm)
-		siocb->scm = &scm;
-
-	scm_creds_from_current(&siocb->scm->creds);
-	err = scm_send(sock, msg, siocb->scm);
+	scm_creds_from_current(&scm.creds);
+	err = scm_send(sock, msg, &scm);
 	if (err < 0)
 		return err;
 
@@ -1413,7 +1409,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	NETLINK_CB(skb).dst_group = dst_group;
 
 	/* This is mandatory. See netlink_recvmsg. */
-	NETLINK_CB(skb).creds = siocb->scm->creds;
+	NETLINK_CB(skb).creds = scm.creds;
 
 	err = -EFAULT;
 	if (memcpy_fromiovec(skb_put(skb, len), msg->msg_iov, len)) {
@@ -1434,7 +1430,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	err = netlink_unicast(sk, skb, dst_portid, msg->msg_flags&MSG_DONTWAIT);
 
 out:
-	scm_destroy(siocb->scm);
+	scm_destroy(&scm);
 	return err;
 }
 
@@ -1442,8 +1438,7 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 			   struct msghdr *msg, size_t len,
 			   int flags)
 {
-	struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
-	struct scm_cookie scm;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
 	int noblock = flags&MSG_DONTWAIT;
@@ -1502,14 +1497,10 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 	if (nlk->flags & NETLINK_RECV_PKTINFO)
 		netlink_cmsg_recv_pktinfo(msg, skb);
 
-	if (NULL == siocb->scm) {
-		memset(&scm, 0, sizeof(scm));
-		siocb->scm = &scm;
-	}
 	/* skbs without creds are from the kernel. */
-	siocb->scm->creds = *NETLINK_CREDS(skb);
-	if (!siocb->scm->creds.has_creds)
-		scm_creds_from_kernel(&siocb->scm->creds);
+	scm.creds = *NETLINK_CREDS(skb);
+	if (!scm.creds.has_creds)
+		scm_creds_from_kernel(&scm.creds);
 
 	if (flags & MSG_TRUNC)
 		copied = data_skb->len;
@@ -1524,7 +1515,7 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
 		}
 	}
 
-	scm_recv(sock, msg, siocb->scm, flags);
+	scm_recv(sock, msg, &scm, flags);
 out:
 	netlink_rcv_wake(sk);
 	return err ? : copied;
diff --git a/net/socket.c b/net/socket.c
index 88f759a..26a65b4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -619,7 +619,6 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
 	si->sock = sock;
-	si->scm = NULL;
 	si->msg = msg;
 	si->size = size;
 
@@ -781,7 +780,6 @@ static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
 	struct sock_iocb *si = kiocb_to_siocb(iocb);
 
 	si->sock = sock;
-	si->scm = NULL;
 	si->msg = msg;
 	si->size = size;
 	si->flags = flags;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0881739..e6541c1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1422,7 +1422,6 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
 static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 			      struct msghdr *msg, size_t len)
 {
-	struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
 	struct sock *sk = sock->sk;
 	struct net *net = sock_net(sk);
 	struct unix_sock *u = unix_sk(sk);
@@ -1433,14 +1432,12 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	unsigned int hash;
 	struct sk_buff *skb;
 	long timeo;
-	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 	int max_level;
 	int data_len = 0;
 
-	if (NULL == siocb->scm)
-		siocb->scm = &tmp_scm;
 	wait_for_unix_gc();
-	err = scm_send(sock, msg, siocb->scm);
+	err = scm_send(sock, msg, &scm);
 	if (err < 0)
 		return err;
 
@@ -1479,11 +1476,11 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
-	err = unix_scm_to_skb(siocb->scm, skb, true);
+	err = unix_scm_to_skb(&scm, skb, true);
 	if (err < 0)
 		goto out_free;
 	max_level = err + 1;
-	unix_get_secdata(siocb->scm, skb);
+	unix_get_secdata(&scm, skb);
 
 	skb_put(skb, len - data_len);
 	skb->data_len = data_len;
@@ -1578,7 +1575,7 @@ restart:
 	unix_state_unlock(other);
 	other->sk_data_ready(other, len);
 	sock_put(other);
-	scm_destroy(siocb->scm);
+	scm_destroy(&scm);
 	return len;
 
 out_unlock:
@@ -1588,7 +1585,7 @@ out_free:
 out:
 	if (other)
 		sock_put(other);
-	scm_destroy(siocb->scm);
+	scm_destroy(&scm);
 	return err;
 }
 
@@ -1596,20 +1593,17 @@ out:
 static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 			       struct msghdr *msg, size_t len)
 {
-	struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
 	struct sock *sk = sock->sk;
 	struct sock *other = NULL;
 	int err, size;
 	struct sk_buff *skb;
 	int sent = 0;
-	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 	bool fds_sent = false;
 	int max_level;
 
-	if (NULL == siocb->scm)
-		siocb->scm = &tmp_scm;
 	wait_for_unix_gc();
-	err = scm_send(sock, msg, siocb->scm);
+	err = scm_send(sock, msg, &scm);
 	if (err < 0)
 		return err;
 
@@ -1666,7 +1660,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 
 
 		/* Only send the fds in the first buffer */
-		err = unix_scm_to_skb(siocb->scm, skb, !fds_sent);
+		err = unix_scm_to_skb(&scm, skb, !fds_sent);
 		if (err < 0) {
 			kfree_skb(skb);
 			goto out_err;
@@ -1695,8 +1689,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		sent += size;
 	}
 
-	scm_destroy(siocb->scm);
-	siocb->scm = NULL;
+	scm_destroy(&scm);
 
 	return sent;
 
@@ -1708,8 +1701,7 @@ pipe_err:
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
-	scm_destroy(siocb->scm);
-	siocb->scm = NULL;
+	scm_destroy(&scm);
 	return sent ? : err;
 }
 
@@ -1760,7 +1752,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 			      int flags)
 {
 	struct sock_iocb *siocb = kiocb_to_siocb(iocb);
-	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
 	int noblock = flags & MSG_DONTWAIT;
@@ -1811,16 +1803,12 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 	if (sock_flag(sk, SOCK_RCVTSTAMP))
 		__sock_recv_timestamp(msg, sk, skb);
 
-	if (!siocb->scm) {
-		siocb->scm = &tmp_scm;
-		memset(&tmp_scm, 0, sizeof(tmp_scm));
-	}
-	siocb->scm->creds = UNIXCB(skb).creds;
-	unix_set_secdata(siocb->scm, skb);
+	scm.creds = UNIXCB(skb).creds;
+	unix_set_secdata(&scm, skb);
 
 	if (!(flags & MSG_PEEK)) {
 		if (UNIXCB(skb).fp)
-			unix_detach_fds(siocb->scm, skb);
+			unix_detach_fds(&scm, skb);
 
 		sk_peek_offset_bwd(sk, skb->len);
 	} else {
@@ -1840,11 +1828,11 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 		sk_peek_offset_fwd(sk, size);
 
 		if (UNIXCB(skb).fp)
-			siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+			scm.fp = scm_fp_dup(UNIXCB(skb).fp);
 	}
 	err = (flags & MSG_TRUNC) ? skb->len - skip : size;
 
-	scm_recv(sock, msg, siocb->scm, flags);
+	scm_recv(sock, msg, &scm, flags);
 
 out_free:
 	skb_free_datagram(sk, skb);
@@ -1892,8 +1880,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			       struct msghdr *msg, size_t size,
 			       int flags)
 {
-	struct sock_iocb *siocb = kiocb_to_siocb(iocb);
-	struct scm_cookie tmp_scm = SCM_COOKIE_INIT;
+	struct scm_cookie scm = SCM_COOKIE_INIT;
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
 	struct sockaddr_un *sunaddr = msg->msg_name;
@@ -1921,11 +1908,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 	 * while sleeps in memcpy_tomsg
 	 */
 
-	if (!siocb->scm) {
-		siocb->scm = &tmp_scm;
-		memset(&tmp_scm, 0, sizeof(tmp_scm));
-	}
-
 	err = mutex_lock_interruptible(&u->readlock);
 	if (err) {
 		err = sock_intr_errno(timeo);
@@ -1987,11 +1969,11 @@ again:
 		if (check_creds) {
 			/* Never glue messages from different writers */
 			if (!scm_creds_equal(&UNIXCB(skb).creds,
-			                     &siocb->scm->creds));
+			                     &scm.creds));
 				break;
 		} else {
 			/* Copy credentials */
-			siocb->scm->creds = UNIXCB(skb).creds;
+			scm.creds = UNIXCB(skb).creds;
 			check_creds = 1;
 		}
 
@@ -2017,7 +1999,7 @@ again:
 			sk_peek_offset_bwd(sk, chunk);
 
 			if (UNIXCB(skb).fp)
-				unix_detach_fds(siocb->scm, skb);
+				unix_detach_fds(&scm, skb);
 
 			if (skb->len)
 				break;
@@ -2025,13 +2007,13 @@ again:
 			skb_unlink(skb, &sk->sk_receive_queue);
 			consume_skb(skb);
 
-			if (siocb->scm->fp)
+			if (scm.fp)
 				break;
 		} else {
 			/* It is questionable, see note in unix_dgram_recvmsg.
 			 */
 			if (UNIXCB(skb).fp)
-				siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+				scm.fp = scm_fp_dup(UNIXCB(skb).fp);
 
 			sk_peek_offset_fwd(sk, chunk);
 
@@ -2040,7 +2022,7 @@ again:
 	} while (size);
 
 	mutex_unlock(&u->readlock);
-	scm_recv(sock, msg, siocb->scm, flags);
+	scm_recv(sock, msg, &scm, flags);
 out:
 	return copied ? : err;
 }
-- 
1.8.1.4

^ permalink raw reply related

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: David Miller @ 2013-03-20 22:10 UTC (permalink / raw)
  To: yinghai; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <CAE9FiQX+9NXfVO9P1ZELYjA+D35Tt7r0ngsGZfW9VaU1Q0WJvQ@mail.gmail.com>

From: Yinghai Lu <yinghai@kernel.org>
Date: Wed, 20 Mar 2013 14:16:22 -0700

> On Wed, Mar 20, 2013 at 2:13 PM, David Miller <davem@davemloft.net> wrote:
>>>
>>> [    6.047966] ------------[ cut here ]------------
>>> [    6.048944] kernel BUG at net/netlink/genetlink.c:145!
>>
>> For the second time, we know about this bug.
>>
>> His second patch fixes the themal layer bug.
>>
>> But I told him to send it to the thermal maintainers so that they
>> can integrate it.
>>
>> Why are you posting what we know already over and over?
> 
> so bisecting will be broken?

Good point, so I'll apply the thermal patch to my tree to minimize
the failure range.

Thanks.

^ permalink raw reply

* [PATCH net-next] filter: add minimal BPF JIT image disassembler
From: Daniel Borkmann @ 2013-03-20 22:11 UTC (permalink / raw)
  To: davem; +Cc: netdev

This is a minimal stand-alone user space helper, that allows for debugging or
verification of emitted BPF JIT images. This is in particular useful for
emitted opcode debugging, since minor bugs in the JIT compiler can be fatal.
The disassembler is architecture generic and uses libopcodes and libbfd.

How to get to the disassembly, example:

  1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
  2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
  3) Run e.g. `bpf_jit_disasm -o` to disassemble the most recent JIT code output

`bpf_jit_disasm -o` will display the related opcodes to a particular instruction
as well. Example for x86_64:

$ ./bpf_jit_disasm
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
   0:	push   %rbp
   1:	mov    %rsp,%rbp
   4:	sub    $0x60,%rsp
   8:	mov    %rbx,-0x8(%rbp)
   c:	mov    0x68(%rdi),%r9d
  10:	sub    0x6c(%rdi),%r9d
  14:	mov    0xe0(%rdi),%r8
  1b:	mov    $0xc,%esi
  20:	callq  0xffffffffe0d01b71
  25:	cmp    $0x86dd,%eax
  2a:	jne    0x000000000000003d
  2c:	mov    $0x14,%esi
  31:	callq  0xffffffffe0d01b8d
  36:	cmp    $0x6,%eax
[...]
  5c:	leaveq
  5d:	retq

$ ./bpf_jit_disasm -o
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
   0:	push   %rbp
	55
   1:	mov    %rsp,%rbp
	48 89 e5
   4:	sub    $0x60,%rsp
	48 83 ec 60
   8:	mov    %rbx,-0x8(%rbp)
	48 89 5d f8
   c:	mov    0x68(%rdi),%r9d
	44 8b 4f 68
  10:	sub    0x6c(%rdi),%r9d
	44 2b 4f 6c
[...]
  5c:	leaveq
	c9
  5d:	retq
	c3

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 I have added a 'net' target for current and future misc tools. I decided not
 to add an extra man-page, since it's just a very small helper, with an option
 -o, that is accessible through -h, nothing more. Let me know if that's okay
 for you, thanks.

 v1 -> v2: change location of file from scripts/ to tools/net/
 v2 -> v3: add a Makefile and 'net' target under tools

 tools/Makefile             |  11 +--
 tools/net/Makefile         |  15 ++++
 tools/net/bpf_jit_disasm.c | 199 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+), 5 deletions(-)
 create mode 100644 tools/net/Makefile
 create mode 100644 tools/net/bpf_jit_disasm.c

diff --git a/tools/Makefile b/tools/Makefile
index fa36565..c73c635 100644
--- a/tools/Makefile
+++ b/tools/Makefile
@@ -12,6 +12,7 @@ help:
 	@echo '  turbostat  - Intel CPU idle stats and freq reporting tool'
 	@echo '  usb        - USB testing tools'
 	@echo '  virtio     - vhost test module'
+	@echo '  net        - misc networking tools'
 	@echo '  vm         - misc vm tools'
 	@echo '  x86_energy_perf_policy - Intel energy policy tool'
 	@echo ''
@@ -34,7 +35,7 @@ help:
 cpupower: FORCE
 	$(call descend,power/$@)
 
-cgroup firewire lguest perf usb virtio vm: FORCE
+cgroup firewire lguest perf usb virtio vm net: FORCE
 	$(call descend,$@)
 
 selftests: FORCE
@@ -46,7 +47,7 @@ turbostat x86_energy_perf_policy: FORCE
 cpupower_install:
 	$(call descend,power/$(@:_install=),install)
 
-cgroup_install firewire_install lguest_install perf_install usb_install virtio_install vm_install:
+cgroup_install firewire_install lguest_install perf_install usb_install virtio_install vm_install net_install:
 	$(call descend,$(@:_install=),install)
 
 selftests_install:
@@ -57,12 +58,12 @@ turbostat_install x86_energy_perf_policy_install:
 
 install: cgroup_install cpupower_install firewire_install lguest_install \
 		perf_install selftests_install turbostat_install usb_install \
-		virtio_install vm_install x86_energy_perf_policy_install
+		virtio_install vm_install net_install x86_energy_perf_policy_install
 
 cpupower_clean:
 	$(call descend,power/cpupower,clean)
 
-cgroup_clean firewire_clean lguest_clean perf_clean usb_clean virtio_clean vm_clean:
+cgroup_clean firewire_clean lguest_clean perf_clean usb_clean virtio_clean vm_clean net_clean:
 	$(call descend,$(@:_clean=),clean)
 
 selftests_clean:
@@ -73,6 +74,6 @@ turbostat_clean x86_energy_perf_policy_clean:
 
 clean: cgroup_clean cpupower_clean firewire_clean lguest_clean perf_clean \
 		selftests_clean turbostat_clean usb_clean virtio_clean \
-		vm_clean x86_energy_perf_policy_clean
+		vm_clean net_clean x86_energy_perf_policy_clean
 
 .PHONY: FORCE
diff --git a/tools/net/Makefile b/tools/net/Makefile
new file mode 100644
index 0000000..b4444d5
--- /dev/null
+++ b/tools/net/Makefile
@@ -0,0 +1,15 @@
+prefix = /usr
+
+CC = gcc
+
+all : bpf_jit_disasm
+
+bpf_jit_disasm : CFLAGS = -Wall -O2
+bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
+bpf_jit_disasm : bpf_jit_disasm.o
+
+clean :
+	rm -rf *.o bpf_jit_disasm
+
+install :
+	install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
diff --git a/tools/net/bpf_jit_disasm.c b/tools/net/bpf_jit_disasm.c
new file mode 100644
index 0000000..cfe0cdc
--- /dev/null
+++ b/tools/net/bpf_jit_disasm.c
@@ -0,0 +1,199 @@
+/*
+ * Minimal BPF JIT image disassembler
+ *
+ * Disassembles BPF JIT compiler emitted opcodes back to asm insn's for
+ * debugging or verification purposes.
+ *
+ * To get the disassembly of the JIT code, do the following:
+ *
+ *  1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
+ *  2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
+ *  3) Run e.g. `bpf_jit_disasm -o` to read out the last JIT code
+ *
+ * Copyright 2013 Daniel Borkmann <borkmann@redhat.com>
+ * Licensed under the GNU General Public License, version 2.0 (GPLv2)
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <string.h>
+#include <bfd.h>
+#include <dis-asm.h>
+#include <sys/klog.h>
+#include <sys/types.h>
+#include <regex.h>
+
+static void get_exec_path(char *tpath, size_t size)
+{
+	char *path;
+	ssize_t len;
+
+	snprintf(tpath, size, "/proc/%d/exe", (int) getpid());
+	tpath[size - 1] = 0;
+
+	path = strdup(tpath);
+	assert(path);
+
+	len = readlink(path, tpath, size);
+	tpath[len] = 0;
+
+	free(path);
+}
+
+static void get_asm_insns(uint8_t *image, size_t len, unsigned long base,
+			  int opcodes)
+{
+	int count, i, pc = 0;
+	char tpath[256];
+	struct disassemble_info info;
+	disassembler_ftype disassemble;
+	bfd *bfdf;
+
+	memset(tpath, 0, sizeof(tpath));
+	get_exec_path(tpath, sizeof(tpath));
+
+	bfdf = bfd_openr(tpath, NULL);
+	assert(bfdf);
+	assert(bfd_check_format(bfdf, bfd_object));
+
+	init_disassemble_info(&info, stdout, (fprintf_ftype) fprintf);
+	info.arch = bfd_get_arch(bfdf);
+	info.mach = bfd_get_mach(bfdf);
+	info.buffer = image;
+	info.buffer_length = len;
+
+	disassemble_init_for_target(&info);
+
+	disassemble = disassembler(bfdf);
+	assert(disassemble);
+
+	do {
+		printf("%4x:\t", pc);
+
+		count = disassemble(pc, &info);
+
+		if (opcodes) {
+			printf("\n\t");
+			for (i = 0; i < count; ++i)
+				printf("%02x ", (uint8_t) image[pc + i]);
+		}
+		printf("\n");
+
+		pc += count;
+	} while(count > 0 && pc < len);
+
+	bfd_close(bfdf);
+}
+
+static char *get_klog_buff(int *klen)
+{
+	int ret, len = klogctl(10, NULL, 0);
+	char *buff = malloc(len);
+
+	assert(buff && klen);
+	ret = klogctl(3, buff, len);
+	assert(ret >= 0);
+	*klen = ret;
+
+	return buff;
+}
+
+static void put_klog_buff(char *buff)
+{
+	free(buff);
+}
+
+static int get_last_jit_image(char *haystack, size_t hlen,
+			      uint8_t *image, size_t ilen,
+			      unsigned long *base)
+{
+	char *ptr, *pptr, *tmp;
+	off_t off = 0;
+	int ret, flen, proglen, pass, ulen = 0;
+	regmatch_t pmatch[1];
+	regex_t regex;
+
+	if (hlen == 0)
+		return 0;
+
+	ret = regcomp(&regex, "flen=[[:alnum:]]+ proglen=[[:digit:]]+ "
+		      "pass=[[:digit:]]+ image=[[:xdigit:]]+", REG_EXTENDED);
+	assert(ret == 0);
+
+	ptr = haystack;
+	while (1) {
+		ret = regexec(&regex, ptr, 1, pmatch, 0);
+		if (ret == 0) {
+			ptr += pmatch[0].rm_eo;
+			off += pmatch[0].rm_eo;
+			assert(off < hlen);
+		} else
+			break;
+	}
+
+	ptr = haystack + off - (pmatch[0].rm_eo - pmatch[0].rm_so);
+	ret = sscanf(ptr, "flen=%d proglen=%d pass=%d image=%lx",
+		     &flen, &proglen, &pass, base);
+	if (ret != 4)
+		return 0;
+
+	tmp = ptr = haystack + off;
+	while ((ptr = strtok(tmp, "\n")) != NULL && ulen < ilen) {
+		tmp = NULL;
+		if (!strstr(ptr, "JIT code"))
+			continue;
+		pptr = ptr;
+		while ((ptr = strstr(pptr, ":")))
+			pptr = ptr + 1;
+		ptr = pptr;
+		do {
+			image[ulen++] = (uint8_t) strtoul(pptr, &pptr, 16);
+			if (ptr == pptr || ulen >= ilen) {
+				ulen--;
+				break;
+			}
+			ptr = pptr;
+		} while (1);
+	}
+
+	assert(ulen == proglen);
+	printf("%d bytes emitted from JIT compiler (pass:%d, flen:%d)\n",
+	       proglen, pass, flen);
+	printf("%lx + <x>:\n", *base);
+
+	regfree(&regex);
+	return ulen;
+}
+
+int main(int argc, char **argv)
+{
+	int len, klen, opcodes = 0;
+	char *kbuff;
+	unsigned long base;
+	uint8_t image[4096];
+
+	if (argc > 1) {
+		if (!strncmp("-o", argv[argc - 1], 2)) {
+			opcodes = 1;
+		} else {
+			printf("usage: bpf_jit_disasm [-o: show opcodes]\n");
+			exit(0);
+		}
+	}
+
+	bfd_init();
+	memset(image, 0, sizeof(image));
+
+	kbuff = get_klog_buff(&klen);
+
+	len = get_last_jit_image(kbuff, klen, image, sizeof(image), &base);
+	if (len > 0 && base > 0)
+		get_asm_insns(image, len, base, opcodes);
+
+	put_klog_buff(kbuff);
+
+	return 0;
+}
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: Yinghai Lu @ 2013-03-20 22:17 UTC (permalink / raw)
  To: David Miller; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <20130320.181005.1507999596274688798.davem@davemloft.net>

On Wed, Mar 20, 2013 at 3:10 PM, David Miller <davem@davemloft.net> wrote:

>> so bisecting will be broken?
>
> Good point, so I'll apply the thermal patch to my tree to minimize
> the failure range.

yes, that "second" patch should be applied before this one.

^ permalink raw reply

* Re: [PATCH 1/2] genetlink: trigger BUG_ON if a group name is too long
From: David Miller @ 2013-03-20 22:18 UTC (permalink / raw)
  To: yinghai; +Cc: rui.zhang, yamato, netdev
In-Reply-To: <CAE9FiQVJhAbvfxp152a0KbzFE1hVQZMNwdWkKnyii3HNWDBd=A@mail.gmail.com>

From: Yinghai Lu <yinghai@kernel.org>
Date: Wed, 20 Mar 2013 15:17:13 -0700

> On Wed, Mar 20, 2013 at 3:10 PM, David Miller <davem@davemloft.net> wrote:
> 
>>> so bisecting will be broken?
>>
>> Good point, so I'll apply the thermal patch to my tree to minimize
>> the failure range.
> 
> yes, that "second" patch should be applied before this one.

But it's too late as the first one is already commited to my public
GIT tree on kernel.org and I never rebase my tree, ever.

So this is the only option to minimize the damage at this point.

^ permalink raw reply

* Re: [PATCH] igb: fix PHC stopping on max freq
From: Vick, Matthew @ 2013-03-20 22:55 UTC (permalink / raw)
  To: Jiri Benc
  Cc: netdev@vger.kernel.org, e1000-devel@lists.sourceforge.net,
	Kirsher, Jeffrey T, Stefan Assmann, Miroslav Lichvar
In-Reply-To: <20130320201153.7fc5e844@griffin>

On 3/20/13 12:11 PM, "Jiri Benc" <jbenc@redhat.com> wrote:

>On Tue, 19 Mar 2013 21:17:25 +0000, Vick, Matthew wrote:
>> Good catch on this, Jiri! I know the math works out the same, but I'd
>> prefer it if you changed the max_adj value to 999999999, since that is
>> technically what we can accept before we have any issues. If you
>>re-submit
>> with this change, I'll add my ACK and we can run it through our internal
>> testing. Thanks!
>
>But the real maximum value is actually 999999881, as anything higher
>than that would be capped to 999999881 by the driver. I don't think the
>driver should advertise higher max_adj than it is able to fulfill,
>otherwise there would be no need for the field.

I prefer 999999999 as it's something that looks slightly less "magic
number"-y (plus looks like the other devices in igb) and is still
technically something that can be passed down without error. Ultimately
not a big deal and I can understand your argument, so I'm okay putting my
personal preference aside on this one.

^ permalink raw reply

* BUG: ipheth - NETDEV WATCHDOG: eth1 (ipheth): transmit queue 0 timed out
From: Joerg Mayer @ 2013-03-20 23:24 UTC (permalink / raw)
  To: gregkh, linux-usb, netdev, linux-kernel

Hello,

ipheth does not work for me at all with iPhone5 and iOS 6.

Setup:
Thinkpad T60 (32 bit) with openSUSE 12.1 (with current kernel repo)
iPhone5 with iOS 6.1.2
USB-Cable between laptop and iphone

In the moment I enable Hotspot with USB-cable connected to phone and
laptop, I get the following message:

--------------------------------------------------------
[Wed Mar 20 04:51:24 2013] usb 1-2: new high-speed USB device number 4 using ehci-pci
[Wed Mar 20 04:51:24 2013] usb 1-2: New USB device found, idVendor=05ac, idProduct=12a8
[Wed Mar 20 04:51:24 2013] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[Wed Mar 20 04:51:24 2013] usb 1-2: Product: iPhone
[Wed Mar 20 04:51:24 2013] usb 1-2: Manufacturer: Apple Inc.
[Wed Mar 20 04:51:24 2013] usb 1-2: SerialNumber: XXX
[Wed Mar 20 04:51:24 2013] ipheth 1-2:4.2: Apple iPhone USB Ethernet device attached
[Wed Mar 20 04:51:24 2013] usbcore: registered new interface driver ipheth
[Wed Mar 20 04:52:46 2013] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[Wed Mar 20 04:55:28 2013] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[Wed Mar 20 04:55:33 2013] ------------[ cut here ]------------
[Wed Mar 20 04:55:33 2013] WARNING: at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.8.3/linux-3.8/net/sched/sch_generic.c:254 dev_watchdog+0x1e0/0x1f0()
[Wed Mar 20 04:55:33 2013] Hardware name: 200763G
[Wed Mar 20 04:55:33 2013] NETDEV WATCHDOG: eth1 (ipheth): transmit queue 0 timed out
[Wed Mar 20 04:55:33 2013] Modules linked in: ipheth md4 md5 nls_utf8 cifs fscache ppdev parport_pc lp parport af_packet rfcomm bnep binfmt_misc tp_smapi(OF) thinkpad_ec(OF) cpufreq_conservative cpufreq_userspace cpufreq_powersave sha256_generic cbc dm_crypt dm_mod snd_hda_codec_analog snd_hda_intel btusb snd_hda_codec bluetooth acpi_cpufreq arc4 pcmcia iwl3945 mperf iwlegacy coretemp snd_hwdep snd_pcm mac80211 kvm_intel iTCO_wdt iTCO_vendor_support snd_timer kvm sg lpc_ich snd_page_alloc mfd_core yenta_socket cfg80211 thinkpad_acpi pcmcia_rsrc pcmcia_core sr_mod cdrom i2c_i801 rfkill snd joydev irda tpm_tis e1000e tpm tpm_bios crc_ccitt microcode video battery soundcore ac button autofs4 radeon ttm drm_kms_helper drm i2c_algo_bit scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh edd fan ata_generic ata_piix thermal processor thermal_sys
[Wed Mar 20 04:55:33 2013] Pid: 0, comm: swapper/0 Tainted: GF          O 3.8.3-1-desktop #1
[Wed Mar 20 04:55:33 2013] Call Trace:
[Wed Mar 20 04:55:33 2013]  [<c0205709>] try_stack_unwind+0x199/0x1b0
[Wed Mar 20 04:55:33 2013]  [<c0204417>] dump_trace+0x47/0xf0
[Wed Mar 20 04:55:33 2013]  [<c020576b>] show_trace_log_lvl+0x4b/0x60
[Wed Mar 20 04:55:33 2013]  [<c0205798>] show_trace+0x18/0x20
[Wed Mar 20 04:55:33 2013]  [<c071a6cb>] dump_stack+0x6d/0x72
[Wed Mar 20 04:55:33 2013]  [<c02391c8>] warn_slowpath_common+0x78/0xb0
[Wed Mar 20 04:55:33 2013]  [<c0239293>] warn_slowpath_fmt+0x33/0x40
[Wed Mar 20 04:55:33 2013]  [<c0666350>] dev_watchdog+0x1e0/0x1f0
[Wed Mar 20 04:55:33 2013]  [<c0247ed4>] call_timer_fn+0x24/0x120
[Wed Mar 20 04:55:33 2013]  [<c0248172>] run_timer_softirq+0x1a2/0x240
[Wed Mar 20 04:55:33 2013]  [<c0240fb9>] __do_softirq+0x99/0x1e0
[Wed Mar 20 04:55:33 2013]  [<c02042f6>] do_softirq+0x76/0xb0
[Wed Mar 20 04:55:33 2013]  [<0000000f>] 0xe
[Wed Mar 20 04:55:33 2013] ---[ end trace dd5a449765340b63 ]---
[Wed Mar 20 04:55:33 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:38 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:48 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:55:58 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:56:08 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
[Wed Mar 20 04:56:18 2013] ipheth 1-2:4.2: ipheth_tx_timeout: TX timeout
--------------------------------------------------------

The bug also occurs
- when I turn on hotpot first and plug in the cable afterwards.
- with the vanilla kernel from the same repo (and without loading
  the tp_smapi module which taints the kernel in the log above).
- for others (https://bugzilla.novell.com/show_bug.cgi?id=779643).
  That link also contains messages from another system with an earlier
  kernel other iphone model and earlier iOS.

What other information is required to get this problem resolved?

Thanks
    Jörg
-- 
Joerg Mayer                                           <jmayer@loplof.de>
We are stuck with technology when what we really want is just stuff that
works. Some say that should read Microsoft instead of technology.

^ permalink raw reply

* [PATCH v2 1/3 net-next] tcp: refactor F-RTO
From: Yuchung Cheng @ 2013-03-20 23:32 UTC (permalink / raw)
  To: davem, ncardwell, edumazet, nanditad; +Cc: ilpo.jarvinen, netdev, Yuchung Cheng

The patch series refactor the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently.  F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.

The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
---
 Documentation/networking/ip-sysctl.txt |  17 --
 include/linux/tcp.h                    |   6 +-
 include/net/tcp.h                      |   4 -
 net/ipv4/sysctl_net_ipv4.c             |   7 -
 net/ipv4/tcp_input.c                   | 375 +--------------------------------
 net/ipv4/tcp_minisocks.c               |   3 -
 net/ipv4/tcp_output.c                  |  11 +-
 net/ipv4/tcp_timer.c                   |   6 +-
 net/ipv4/tcp_westwood.c                |   2 +-
 9 files changed, 10 insertions(+), 421 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 17953e2..8a977a0 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -239,23 +239,6 @@ tcp_frto - INTEGER
 	interacts badly with the packet counting of the SACK enabled TCP
 	flow.
 
-tcp_frto_response - INTEGER
-	When F-RTO has detected that a TCP retransmission timeout was
-	spurious (i.e, the timeout would have been avoided had TCP set a
-	longer retransmission timeout), TCP has several options what to do
-	next. Possible values are:
-		0 Rate halving based; a smooth and conservative response,
-		  results in halved cwnd and ssthresh after one RTT
-		1 Very conservative response; not recommended because even
-		  though being valid, it interacts poorly with the rest of
-		  Linux TCP, halves cwnd and ssthresh immediately
-		2 Aggressive response; undoes congestion control measures
-		  that are now known to be unnecessary (ignoring the
-		  possibility of a lost retransmission that would require
-		  TCP to be more cautious), cwnd and ssthresh are restored
-		  to the values prior timeout
-	Default: 0 (rate halving based)
-
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
 	Default: 2hours.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ed6a745..f5f203b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -187,14 +187,12 @@ struct tcp_sock {
 	u32	window_clamp;	/* Maximal window to advertise		*/
 	u32	rcv_ssthresh;	/* Current window clamp			*/
 
-	u32	frto_highmark;	/* snd_nxt when RTO occurred */
 	u16	advmss;		/* Advertised MSS			*/
-	u8	frto_counter;	/* Number of new acks after RTO */
+	u8	unused;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
-		repair      : 1,
-		unused      : 1;
+		repair      : 1;
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
 		syn_data:1,	/* SYN includes data */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7f2f171..d1dcb59 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,7 +272,6 @@ extern int sysctl_tcp_app_win;
 extern int sysctl_tcp_adv_win_scale;
 extern int sysctl_tcp_tw_reuse;
 extern int sysctl_tcp_frto;
-extern int sysctl_tcp_frto_response;
 extern int sysctl_tcp_low_latency;
 extern int sysctl_tcp_dma_copybreak;
 extern int sysctl_tcp_nometrics_save;
@@ -424,8 +423,6 @@ extern struct sock * tcp_check_req(struct sock *sk,struct sk_buff *skb,
 				   bool fastopen);
 extern int tcp_child_process(struct sock *parent, struct sock *child,
 			     struct sk_buff *skb);
-extern bool tcp_use_frto(struct sock *sk);
-extern void tcp_enter_frto(struct sock *sk);
 extern void tcp_enter_loss(struct sock *sk, int how);
 extern void tcp_clear_retrans(struct tcp_sock *tp);
 extern void tcp_update_metrics(struct sock *sk);
@@ -756,7 +753,6 @@ enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
 	CA_EVENT_CWND_RESTART,	/* congestion window restart */
 	CA_EVENT_COMPLETE_CWR,	/* end of congestion recovery */
-	CA_EVENT_FRTO,		/* fast recovery timeout */
 	CA_EVENT_LOSS,		/* loss timeout */
 	CA_EVENT_FAST_ACK,	/* in sequence ack */
 	CA_EVENT_SLOW_ACK,	/* other ack */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index cb45062..fa2f63f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -592,13 +592,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_frto_response",
-		.data		= &sysctl_tcp_frto_response,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec
-	},
-	{
 		.procname	= "tcp_low_latency",
 		.data		= &sysctl_tcp_low_latency,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 19f0149..231c79f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -93,7 +93,6 @@ int sysctl_tcp_stdurg __read_mostly;
 int sysctl_tcp_rfc1337 __read_mostly;
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 int sysctl_tcp_frto __read_mostly = 2;
-int sysctl_tcp_frto_response __read_mostly;
 
 int sysctl_tcp_thin_dupack __read_mostly;
 
@@ -108,17 +107,14 @@ int sysctl_tcp_early_retrans __read_mostly = 3;
 #define FLAG_DATA_SACKED	0x20 /* New SACK.				*/
 #define FLAG_ECE		0x40 /* ECE in this ACK				*/
 #define FLAG_SLOWPATH		0x100 /* Do not skip RFC checks for window update.*/
-#define FLAG_ONLY_ORIG_SACKED	0x200 /* SACKs only non-rexmit sent before RTO */
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
 #define FLAG_DSACKING_ACK	0x800 /* SACK blocks contained D-SACK info */
-#define FLAG_NONHEAD_RETRANS_ACKED	0x1000 /* Non-head rexmitted data was ACKed */
 #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a sacked seq */
 
 #define FLAG_ACKED		(FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP		(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
 #define FLAG_CA_ALERT		(FLAG_DATA_SACKED|FLAG_ECE)
 #define FLAG_FORWARD_PROGRESS	(FLAG_ACKED|FLAG_DATA_SACKED)
-#define FLAG_ANY_PROGRESS	(FLAG_FORWARD_PROGRESS|FLAG_SND_UNA_ADVANCED)
 
 #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
 #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
@@ -1159,10 +1155,6 @@ static u8 tcp_sacktag_one(struct sock *sk,
 					   tcp_highest_sack_seq(tp)))
 					state->reord = min(fack_count,
 							   state->reord);
-
-				/* SACK enhanced F-RTO (RFC4138; Appendix B) */
-				if (!after(end_seq, tp->frto_highmark))
-					state->flag |= FLAG_ONLY_ORIG_SACKED;
 			}
 
 			if (sacked & TCPCB_LOST) {
@@ -1555,7 +1547,6 @@ static int
 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 			u32 prior_snd_una)
 {
-	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	const unsigned char *ptr = (skb_transport_header(ack_skb) +
 				    TCP_SKB_CB(ack_skb)->sacked);
@@ -1728,12 +1719,6 @@ walk:
 				       start_seq, end_seq, dup_sack);
 
 advance_sp:
-		/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
-		 * due to in-order walk
-		 */
-		if (after(end_seq, tp->frto_highmark))
-			state.flag &= ~FLAG_ONLY_ORIG_SACKED;
-
 		i++;
 	}
 
@@ -1750,8 +1735,7 @@ advance_sp:
 	tcp_verify_left_out(tp);
 
 	if ((state.reord < tp->fackets_out) &&
-	    ((icsk->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker) &&
-	    (!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark)))
+	    ((inet_csk(sk)->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker))
 		tcp_update_reordering(sk, tp->fackets_out - state.reord, 0);
 
 out:
@@ -1825,197 +1809,6 @@ static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 	tp->sacked_out = 0;
 }
 
-static int tcp_is_sackfrto(const struct tcp_sock *tp)
-{
-	return (sysctl_tcp_frto == 0x2) && !tcp_is_reno(tp);
-}
-
-/* F-RTO can only be used if TCP has never retransmitted anything other than
- * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
- */
-bool tcp_use_frto(struct sock *sk)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct sk_buff *skb;
-
-	if (!sysctl_tcp_frto)
-		return false;
-
-	/* MTU probe and F-RTO won't really play nicely along currently */
-	if (icsk->icsk_mtup.probe_size)
-		return false;
-
-	if (tcp_is_sackfrto(tp))
-		return true;
-
-	/* Avoid expensive walking of rexmit queue if possible */
-	if (tp->retrans_out > 1)
-		return false;
-
-	skb = tcp_write_queue_head(sk);
-	if (tcp_skb_is_last(sk, skb))
-		return true;
-	skb = tcp_write_queue_next(sk, skb);	/* Skips head */
-	tcp_for_write_queue_from(skb, sk) {
-		if (skb == tcp_send_head(sk))
-			break;
-		if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
-			return false;
-		/* Short-circuit when first non-SACKed skb has been checked */
-		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
-			break;
-	}
-	return true;
-}
-
-/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO
- * recovery a bit and use heuristics in tcp_process_frto() to detect if
- * the RTO was spurious. Only clear SACKED_RETRANS of the head here to
- * keep retrans_out counting accurate (with SACK F-RTO, other than head
- * may still have that bit set); TCPCB_LOST and remaining SACKED_RETRANS
- * bits are handled if the Loss state is really to be entered (in
- * tcp_enter_frto_loss).
- *
- * Do like tcp_enter_loss() would; when RTO expires the second time it
- * does:
- *  "Reduce ssthresh if it has not yet been made inside this window."
- */
-void tcp_enter_frto(struct sock *sk)
-{
-	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
-
-	if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) ||
-	    tp->snd_una == tp->high_seq ||
-	    ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) &&
-	     !icsk->icsk_retransmits)) {
-		tp->prior_ssthresh = tcp_current_ssthresh(sk);
-		/* Our state is too optimistic in ssthresh() call because cwnd
-		 * is not reduced until tcp_enter_frto_loss() when previous F-RTO
-		 * recovery has not yet completed. Pattern would be this: RTO,
-		 * Cumulative ACK, RTO (2xRTO for the same segment does not end
-		 * up here twice).
-		 * RFC4138 should be more specific on what to do, even though
-		 * RTO is quite unlikely to occur after the first Cumulative ACK
-		 * due to back-off and complexity of triggering events ...
-		 */
-		if (tp->frto_counter) {
-			u32 stored_cwnd;
-			stored_cwnd = tp->snd_cwnd;
-			tp->snd_cwnd = 2;
-			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-			tp->snd_cwnd = stored_cwnd;
-		} else {
-			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-		}
-		/* ... in theory, cong.control module could do "any tricks" in
-		 * ssthresh(), which means that ca_state, lost bits and lost_out
-		 * counter would have to be faked before the call occurs. We
-		 * consider that too expensive, unlikely and hacky, so modules
-		 * using these in ssthresh() must deal these incompatibility
-		 * issues if they receives CA_EVENT_FRTO and frto_counter != 0
-		 */
-		tcp_ca_event(sk, CA_EVENT_FRTO);
-	}
-
-	tp->undo_marker = tp->snd_una;
-	tp->undo_retrans = 0;
-
-	skb = tcp_write_queue_head(sk);
-	if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
-		tp->undo_marker = 0;
-	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) {
-		TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
-		tp->retrans_out -= tcp_skb_pcount(skb);
-	}
-	tcp_verify_left_out(tp);
-
-	/* Too bad if TCP was application limited */
-	tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1);
-
-	/* Earlier loss recovery underway (see RFC4138; Appendix B).
-	 * The last condition is necessary at least in tp->frto_counter case.
-	 */
-	if (tcp_is_sackfrto(tp) && (tp->frto_counter ||
-	    ((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) &&
-	    after(tp->high_seq, tp->snd_una)) {
-		tp->frto_highmark = tp->high_seq;
-	} else {
-		tp->frto_highmark = tp->snd_nxt;
-	}
-	tcp_set_ca_state(sk, TCP_CA_Disorder);
-	tp->high_seq = tp->snd_nxt;
-	tp->frto_counter = 1;
-}
-
-/* Enter Loss state after F-RTO was applied. Dupack arrived after RTO,
- * which indicates that we should follow the traditional RTO recovery,
- * i.e. mark everything lost and do go-back-N retransmission.
- */
-static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
-
-	tp->lost_out = 0;
-	tp->retrans_out = 0;
-	if (tcp_is_reno(tp))
-		tcp_reset_reno_sack(tp);
-
-	tcp_for_write_queue(skb, sk) {
-		if (skb == tcp_send_head(sk))
-			break;
-
-		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
-		/*
-		 * Count the retransmission made on RTO correctly (only when
-		 * waiting for the first ACK and did not get it)...
-		 */
-		if ((tp->frto_counter == 1) && !(flag & FLAG_DATA_ACKED)) {
-			/* For some reason this R-bit might get cleared? */
-			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)
-				tp->retrans_out += tcp_skb_pcount(skb);
-			/* ...enter this if branch just for the first segment */
-			flag |= FLAG_DATA_ACKED;
-		} else {
-			if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
-				tp->undo_marker = 0;
-			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
-		}
-
-		/* Marking forward transmissions that were made after RTO lost
-		 * can cause unnecessary retransmissions in some scenarios,
-		 * SACK blocks will mitigate that in some but not in all cases.
-		 * We used to not mark them but it was causing break-ups with
-		 * receivers that do only in-order receival.
-		 *
-		 * TODO: we could detect presence of such receiver and select
-		 * different behavior per flow.
-		 */
-		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
-			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-			tp->lost_out += tcp_skb_pcount(skb);
-			tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
-		}
-	}
-	tcp_verify_left_out(tp);
-
-	tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments;
-	tp->snd_cwnd_cnt = 0;
-	tp->snd_cwnd_stamp = tcp_time_stamp;
-	tp->frto_counter = 0;
-
-	tp->reordering = min_t(unsigned int, tp->reordering,
-			       sysctl_tcp_reordering);
-	tcp_set_ca_state(sk, TCP_CA_Loss);
-	tp->high_seq = tp->snd_nxt;
-	TCP_ECN_queue_cwr(tp);
-
-	tcp_clear_all_retrans_hints(tp);
-}
-
 static void tcp_clear_retrans_partial(struct tcp_sock *tp)
 {
 	tp->retrans_out = 0;
@@ -2090,8 +1883,6 @@ void tcp_enter_loss(struct sock *sk, int how)
 	tcp_set_ca_state(sk, TCP_CA_Loss);
 	tp->high_seq = tp->snd_nxt;
 	TCP_ECN_queue_cwr(tp);
-	/* Abort F-RTO algorithm if one is in progress */
-	tp->frto_counter = 0;
 }
 
 /* If ACK arrived pointing to a remembered SACK, it means that our
@@ -2275,10 +2066,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 packets_out;
 
-	/* Do not perform any recovery during F-RTO algorithm */
-	if (tp->frto_counter)
-		return false;
-
 	/* Trick#1: The loss is proven. */
 	if (tp->lost_out)
 		return true;
@@ -2760,7 +2547,7 @@ static void tcp_try_to_open(struct sock *sk, int flag, int newly_acked_sacked)
 
 	tcp_verify_left_out(tp);
 
-	if (!tp->frto_counter && !tcp_any_retrans_done(sk))
+	if (!tcp_any_retrans_done(sk))
 		tp->retrans_stamp = 0;
 
 	if (flag & FLAG_ECE)
@@ -3198,8 +2985,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			flag |= FLAG_RETRANS_DATA_ACKED;
 			ca_seq_rtt = -1;
 			seq_rtt = -1;
-			if ((flag & FLAG_DATA_ACKED) || (acked_pcount > 1))
-				flag |= FLAG_NONHEAD_RETRANS_ACKED;
 		} else {
 			ca_seq_rtt = now - scb->when;
 			last_ackt = skb->tstamp;
@@ -3408,150 +3193,6 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
 	return flag;
 }
 
-/* A very conservative spurious RTO response algorithm: reduce cwnd and
- * continue in congestion avoidance.
- */
-static void tcp_conservative_spur_to_response(struct tcp_sock *tp)
-{
-	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
-	tp->snd_cwnd_cnt = 0;
-	TCP_ECN_queue_cwr(tp);
-	tcp_moderate_cwnd(tp);
-}
-
-/* A conservative spurious RTO response algorithm: reduce cwnd using
- * PRR and continue in congestion avoidance.
- */
-static void tcp_cwr_spur_to_response(struct sock *sk)
-{
-	tcp_enter_cwr(sk, 0);
-}
-
-static void tcp_undo_spur_to_response(struct sock *sk, int flag)
-{
-	if (flag & FLAG_ECE)
-		tcp_cwr_spur_to_response(sk);
-	else
-		tcp_undo_cwr(sk, true);
-}
-
-/* F-RTO spurious RTO detection algorithm (RFC4138)
- *
- * F-RTO affects during two new ACKs following RTO (well, almost, see inline
- * comments). State (ACK number) is kept in frto_counter. When ACK advances
- * window (but not to or beyond highest sequence sent before RTO):
- *   On First ACK,  send two new segments out.
- *   On Second ACK, RTO was likely spurious. Do spurious response (response
- *                  algorithm is not part of the F-RTO detection algorithm
- *                  given in RFC4138 but can be selected separately).
- * Otherwise (basically on duplicate ACK), RTO was (likely) caused by a loss
- * and TCP falls back to conventional RTO recovery. F-RTO allows overriding
- * of Nagle, this is done using frto_counter states 2 and 3, when a new data
- * segment of any size sent during F-RTO, state 2 is upgraded to 3.
- *
- * Rationale: if the RTO was spurious, new ACKs should arrive from the
- * original window even after we transmit two new data segments.
- *
- * SACK version:
- *   on first step, wait until first cumulative ACK arrives, then move to
- *   the second step. In second step, the next ACK decides.
- *
- * F-RTO is implemented (mainly) in four functions:
- *   - tcp_use_frto() is used to determine if TCP is can use F-RTO
- *   - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is
- *     called when tcp_use_frto() showed green light
- *   - tcp_process_frto() handles incoming ACKs during F-RTO algorithm
- *   - tcp_enter_frto_loss() is called if there is not enough evidence
- *     to prove that the RTO is indeed spurious. It transfers the control
- *     from F-RTO to the conventional RTO recovery
- */
-static bool tcp_process_frto(struct sock *sk, int flag)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-
-	tcp_verify_left_out(tp);
-
-	/* Duplicate the behavior from Loss state (fastretrans_alert) */
-	if (flag & FLAG_DATA_ACKED)
-		inet_csk(sk)->icsk_retransmits = 0;
-
-	if ((flag & FLAG_NONHEAD_RETRANS_ACKED) ||
-	    ((tp->frto_counter >= 2) && (flag & FLAG_RETRANS_DATA_ACKED)))
-		tp->undo_marker = 0;
-
-	if (!before(tp->snd_una, tp->frto_highmark)) {
-		tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 2 : 3), flag);
-		return true;
-	}
-
-	if (!tcp_is_sackfrto(tp)) {
-		/* RFC4138 shortcoming in step 2; should also have case c):
-		 * ACK isn't duplicate nor advances window, e.g., opposite dir
-		 * data, winupdate
-		 */
-		if (!(flag & FLAG_ANY_PROGRESS) && (flag & FLAG_NOT_DUP))
-			return true;
-
-		if (!(flag & FLAG_DATA_ACKED)) {
-			tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3),
-					    flag);
-			return true;
-		}
-	} else {
-		if (!(flag & FLAG_DATA_ACKED) && (tp->frto_counter == 1)) {
-			if (!tcp_packets_in_flight(tp)) {
-				tcp_enter_frto_loss(sk, 2, flag);
-				return true;
-			}
-
-			/* Prevent sending of new data. */
-			tp->snd_cwnd = min(tp->snd_cwnd,
-					   tcp_packets_in_flight(tp));
-			return true;
-		}
-
-		if ((tp->frto_counter >= 2) &&
-		    (!(flag & FLAG_FORWARD_PROGRESS) ||
-		     ((flag & FLAG_DATA_SACKED) &&
-		      !(flag & FLAG_ONLY_ORIG_SACKED)))) {
-			/* RFC4138 shortcoming (see comment above) */
-			if (!(flag & FLAG_FORWARD_PROGRESS) &&
-			    (flag & FLAG_NOT_DUP))
-				return true;
-
-			tcp_enter_frto_loss(sk, 3, flag);
-			return true;
-		}
-	}
-
-	if (tp->frto_counter == 1) {
-		/* tcp_may_send_now needs to see updated state */
-		tp->snd_cwnd = tcp_packets_in_flight(tp) + 2;
-		tp->frto_counter = 2;
-
-		if (!tcp_may_send_now(sk))
-			tcp_enter_frto_loss(sk, 2, flag);
-
-		return true;
-	} else {
-		switch (sysctl_tcp_frto_response) {
-		case 2:
-			tcp_undo_spur_to_response(sk, flag);
-			break;
-		case 1:
-			tcp_conservative_spur_to_response(tp);
-			break;
-		default:
-			tcp_cwr_spur_to_response(sk);
-			break;
-		}
-		tp->frto_counter = 0;
-		tp->undo_marker = 0;
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSPURIOUSRTOS);
-	}
-	return false;
-}
-
 /* RFC 5961 7 [ACK Throttling] */
 static void tcp_send_challenge_ack(struct sock *sk)
 {
@@ -3616,7 +3257,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	int prior_packets;
 	int prior_sacked = tp->sacked_out;
 	int pkts_acked = 0;
-	bool frto_cwnd = false;
 
 	/* If the ack is older than previous acks
 	 * then we can probably ignore it.
@@ -3690,22 +3330,15 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	pkts_acked = prior_packets - tp->packets_out;
 
-	if (tp->frto_counter)
-		frto_cwnd = tcp_process_frto(sk, flag);
-	/* Guarantee sacktag reordering detection against wrap-arounds */
-	if (before(tp->frto_highmark, tp->snd_una))
-		tp->frto_highmark = 0;
-
 	if (tcp_ack_is_dubious(sk, flag)) {
 		/* Advance CWND, if state allows this. */
-		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
-		    tcp_may_raise_cwnd(sk, flag))
+		if ((flag & FLAG_DATA_ACKED) && tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
 		tcp_fastretrans_alert(sk, pkts_acked, prior_sacked,
 				      is_dupack, flag);
 	} else {
-		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
+		if (flag & FLAG_DATA_ACKED)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 	}
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8f0234f..05eaf89 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -422,9 +422,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_cwnd = TCP_INIT_CWND;
 		newtp->snd_cwnd_cnt = 0;
 
-		newtp->frto_counter = 0;
-		newtp->frto_highmark = 0;
-
 		if (newicsk->icsk_ca_ops != &tcp_init_congestion_ops &&
 		    !try_module_get(newicsk->icsk_ca_ops->owner))
 			newicsk->icsk_ca_ops = &tcp_init_congestion_ops;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e787ece..163cf5f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -78,10 +78,6 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 	tcp_advance_send_head(sk, skb);
 	tp->snd_nxt = TCP_SKB_CB(skb)->end_seq;
 
-	/* Don't override Nagle indefinitely with F-RTO */
-	if (tp->frto_counter == 2)
-		tp->frto_counter = 3;
-
 	tp->packets_out += tcp_skb_pcount(skb);
 	if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
 	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
@@ -1470,11 +1466,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
 	if (nonagle & TCP_NAGLE_PUSH)
 		return true;
 
-	/* Don't use the nagle rule for urgent data (or for the final FIN).
-	 * Nagle can be ignored during F-RTO too (see RFC4138).
-	 */
-	if (tcp_urg_mode(tp) || (tp->frto_counter == 2) ||
-	    (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
+	/* Don't use the nagle rule for urgent data (or for the final FIN). */
+	if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
 		return true;
 
 	if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index eeccf79..4b85e6f 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -416,11 +416,7 @@ void tcp_retransmit_timer(struct sock *sk)
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 	}
 
-	if (tcp_use_frto(sk)) {
-		tcp_enter_frto(sk);
-	} else {
-		tcp_enter_loss(sk, 0);
-	}
+	tcp_enter_loss(sk, 0);
 
 	if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
 		/* Retransmission failed because of local congestion,
diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
index 1b91bf4..76a1e23 100644
--- a/net/ipv4/tcp_westwood.c
+++ b/net/ipv4/tcp_westwood.c
@@ -236,7 +236,7 @@ static void tcp_westwood_event(struct sock *sk, enum tcp_ca_event event)
 		tp->snd_cwnd = tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
 		break;
 
-	case CA_EVENT_FRTO:
+	case CA_EVENT_LOSS:
 		tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
 		/* Update RTT_min when next ack arrives */
 		w->reset_rtt_min = 1;
-- 
1.8.1.3

^ permalink raw reply related

* [PATCH v2 2/3 net-next] tcp: refactor CA_Loss state processing
From: Yuchung Cheng @ 2013-03-20 23:32 UTC (permalink / raw)
  To: davem, ncardwell, edumazet, nanditad; +Cc: ilpo.jarvinen, netdev, Yuchung Cheng
In-Reply-To: <1363822380-16687-1-git-send-email-ycheng@google.com>

Consolidate all of TCP CA_Loss state processing in
tcp_fastretrans_alert() into a new function called tcp_process_loss().
This is to prepare the new F-RTO implementation in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 231c79f..8d821e4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2664,6 +2664,30 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
 	tcp_set_ca_state(sk, TCP_CA_Recovery);
 }
 
+/* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
+ * recovered or spurious. Otherwise retransmits more on partial ACKs.
+ */
+static void tcp_process_loss(struct sock *sk, int flag)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!before(tp->snd_una, tp->high_seq)) {
+		icsk->icsk_retransmits = 0;
+		tcp_try_undo_recovery(sk);
+		return;
+	}
+
+	if (flag & FLAG_DATA_ACKED)
+		icsk->icsk_retransmits = 0;
+	if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
+		tcp_reset_reno_sack(tp);
+	if (tcp_try_undo_loss(sk))
+		return;
+	tcp_moderate_cwnd(tp);
+	tcp_xmit_retransmit_queue(sk);
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2710,12 +2734,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		tp->retrans_stamp = 0;
 	} else if (!before(tp->snd_una, tp->high_seq)) {
 		switch (icsk->icsk_ca_state) {
-		case TCP_CA_Loss:
-			icsk->icsk_retransmits = 0;
-			if (tcp_try_undo_recovery(sk))
-				return;
-			break;
-
 		case TCP_CA_CWR:
 			/* CWR is to be held something *above* high_seq
 			 * is ACKed for CWR bit to reach receiver. */
@@ -2746,18 +2764,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		newly_acked_sacked = pkts_acked + tp->sacked_out - prior_sacked;
 		break;
 	case TCP_CA_Loss:
-		if (flag & FLAG_DATA_ACKED)
-			icsk->icsk_retransmits = 0;
-		if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
-			tcp_reset_reno_sack(tp);
-		if (!tcp_try_undo_loss(sk)) {
-			tcp_moderate_cwnd(tp);
-			tcp_xmit_retransmit_queue(sk);
-			return;
-		}
+		tcp_process_loss(sk, flag);
 		if (icsk->icsk_ca_state != TCP_CA_Open)
 			return;
-		/* Loss is undone; fall through to processing in Open state. */
+		/* Fall through to processing in Open state. */
 	default:
 		if (tcp_is_reno(tp)) {
 			if (flag & FLAG_SND_UNA_ADVANCED)
-- 
1.8.1.3

^ permalink raw reply related

* [PATCH v2 3/3 net-next] tcp: implement RFC5682 F-RTO
From: Yuchung Cheng @ 2013-03-20 23:33 UTC (permalink / raw)
  To: davem, ncardwell, edumazet, nanditad; +Cc: ilpo.jarvinen, netdev, Yuchung Cheng
In-Reply-To: <1363822380-16687-1-git-send-email-ycheng@google.com>

This patch implements F-RTO (foward RTO recovery):

When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted.  Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.

The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss().  The basic version is not supported
because SACK enhanced version also works for non-SACK connections.

The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.

Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - Removed extra while spaces
 - Allow F-RTO in sack reneging case to detect spurious retransmit(s)
 - Re-tested after merging with the recent TCP tail loss probe patch

 Documentation/networking/ip-sysctl.txt | 18 +++------
 include/linux/tcp.h                    |  3 +-
 net/ipv4/tcp_input.c                   | 73 ++++++++++++++++++++++++++++------
 3 files changed, 68 insertions(+), 26 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 8a977a0..f98ca63 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -225,19 +225,13 @@ tcp_fin_timeout - INTEGER
 	Default: 60 seconds
 
 tcp_frto - INTEGER
-	Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
+	Enables Forward RTO-Recovery (F-RTO) defined in RFC5682.
 	F-RTO is an enhanced recovery algorithm for TCP retransmission
-	timeouts.  It is particularly beneficial in wireless environments
-	where packet loss is typically due to random radio interference
-	rather than intermediate router congestion.  F-RTO is sender-side
-	only modification. Therefore it does not require any support from
-	the peer.
-
-	If set to 1, basic version is enabled.  2 enables SACK enhanced
-	F-RTO if flow uses SACK.  The basic version can be used also when
-	SACK is in use though scenario(s) with it exists where F-RTO
-	interacts badly with the packet counting of the SACK enabled TCP
-	flow.
+	timeouts.  It is particularly beneficial in networks where the
+	RTT fluctuates (e.g., wireless). F-RTO is sender-side only
+	modification. It does not require any support from the peer.
+
+	By default it's enabled with a non-zero value. 0 disables F-RTO.
 
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f5f203b..5adbc33 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -192,7 +192,8 @@ struct tcp_sock {
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
-		repair      : 1;
+		repair      : 1,
+		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
 		syn_data:1,	/* SYN includes data */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8d821e4..b2b3619 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -107,6 +107,7 @@ int sysctl_tcp_early_retrans __read_mostly = 3;
 #define FLAG_DATA_SACKED	0x20 /* New SACK.				*/
 #define FLAG_ECE		0x40 /* ECE in this ACK				*/
 #define FLAG_SLOWPATH		0x100 /* Do not skip RFC checks for window update.*/
+#define FLAG_ORIG_SACK_ACKED	0x200 /* Never retransmitted data are (s)acked	*/
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
 #define FLAG_DSACKING_ACK	0x800 /* SACK blocks contained D-SACK info */
 #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a sacked seq */
@@ -1155,6 +1156,8 @@ static u8 tcp_sacktag_one(struct sock *sk,
 					   tcp_highest_sack_seq(tp)))
 					state->reord = min(fack_count,
 							   state->reord);
+				if (!after(end_seq, tp->high_seq))
+					state->flag |= FLAG_ORIG_SACK_ACKED;
 			}
 
 			if (sacked & TCPCB_LOST) {
@@ -1835,10 +1838,13 @@ void tcp_enter_loss(struct sock *sk, int how)
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	bool new_recovery = false;
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
-	if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq ||
+	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
+	    !after(tp->high_seq, tp->snd_una) ||
 	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
+		new_recovery = true;
 		tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
 		tcp_ca_event(sk, CA_EVENT_LOSS);
@@ -1883,6 +1889,14 @@ void tcp_enter_loss(struct sock *sk, int how)
 	tcp_set_ca_state(sk, TCP_CA_Loss);
 	tp->high_seq = tp->snd_nxt;
 	TCP_ECN_queue_cwr(tp);
+
+	/* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous
+	 * loss recovery is underway except recurring timeout(s) on
+	 * the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing
+	 */
+	tp->frto = sysctl_tcp_frto &&
+		   (new_recovery || icsk->icsk_retransmits) &&
+		   !inet_csk(sk)->icsk_mtup.probe_size;
 }
 
 /* If ACK arrived pointing to a remembered SACK, it means that our
@@ -2426,12 +2440,12 @@ static int tcp_try_undo_partial(struct sock *sk, int acked)
 	return failed;
 }
 
-/* Undo during loss recovery after partial ACK. */
-static bool tcp_try_undo_loss(struct sock *sk)
+/* Undo during loss recovery after partial ACK or using F-RTO. */
+static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (frto_undo || tcp_may_undo(tp)) {
 		struct sk_buff *skb;
 		tcp_for_write_queue(skb, sk) {
 			if (skb == tcp_send_head(sk))
@@ -2445,9 +2459,12 @@ static bool tcp_try_undo_loss(struct sock *sk)
 		tp->lost_out = 0;
 		tcp_undo_cwr(sk, true);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
+		if (frto_undo)
+			NET_INC_STATS_BH(sock_net(sk),
+					 LINUX_MIB_TCPSPURIOUSRTOS);
 		inet_csk(sk)->icsk_retransmits = 0;
 		tp->undo_marker = 0;
-		if (tcp_is_sack(tp))
+		if (frto_undo || tcp_is_sack(tp))
 			tcp_set_ca_state(sk, TCP_CA_Open);
 		return true;
 	}
@@ -2667,24 +2684,52 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
 /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
-static void tcp_process_loss(struct sock *sk, int flag)
+static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool recovered = !before(tp->snd_una, tp->high_seq);
 
-	if (!before(tp->snd_una, tp->high_seq)) {
+	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
+		if (flag & FLAG_ORIG_SACK_ACKED) {
+			/* Step 3.b. A timeout is spurious if not all data are
+			 * lost, i.e., never-retransmitted data are (s)acked.
+			 */
+			tcp_try_undo_loss(sk, true);
+			return;
+		}
+		if (after(tp->snd_nxt, tp->high_seq) &&
+		    (flag & FLAG_DATA_SACKED || is_dupack)) {
+			tp->frto = 0; /* Loss was real: 2nd part of step 3.a */
+		} else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) {
+			tp->high_seq = tp->snd_nxt;
+			__tcp_push_pending_frames(sk, tcp_current_mss(sk),
+						  TCP_NAGLE_OFF);
+			if (after(tp->snd_nxt, tp->high_seq))
+				return; /* Step 2.b */
+			tp->frto = 0;
+		}
+	}
+
+	if (recovered) {
+		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
 		icsk->icsk_retransmits = 0;
 		tcp_try_undo_recovery(sk);
 		return;
 	}
-
 	if (flag & FLAG_DATA_ACKED)
 		icsk->icsk_retransmits = 0;
-	if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED)
-		tcp_reset_reno_sack(tp);
-	if (tcp_try_undo_loss(sk))
+	if (tcp_is_reno(tp)) {
+		/* A Reno DUPACK means new data in F-RTO step 2.b above are
+		 * delivered. Lower inflight to clock out (re)tranmissions.
+		 */
+		if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
+			tcp_add_reno_sack(sk);
+		else if (flag & FLAG_SND_UNA_ADVANCED)
+			tcp_reset_reno_sack(tp);
+	}
+	if (tcp_try_undo_loss(sk, false))
 		return;
-	tcp_moderate_cwnd(tp);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2764,7 +2809,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		newly_acked_sacked = pkts_acked + tp->sacked_out - prior_sacked;
 		break;
 	case TCP_CA_Loss:
-		tcp_process_loss(sk, flag);
+		tcp_process_loss(sk, flag, is_dupack);
 		if (icsk->icsk_ca_state != TCP_CA_Open)
 			return;
 		/* Fall through to processing in Open state. */
@@ -3003,6 +3048,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			}
 			if (!(sacked & TCPCB_SACKED_ACKED))
 				reord = min(pkts_acked, reord);
+			if (!after(scb->end_seq, tp->high_seq))
+				flag |= FLAG_ORIG_SACK_ACKED;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED)
-- 
1.8.1.3

^ permalink raw reply related

* Re: [PATCH v2 2/3 net-next] tcp: refactor CA_Loss state processing
From: Eric Dumazet @ 2013-03-20 23:56 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ncardwell, edumazet, nanditad, ilpo.jarvinen, netdev
In-Reply-To: <1363822380-16687-2-git-send-email-ycheng@google.com>

On Wed, 2013-03-20 at 16:32 -0700, Yuchung Cheng wrote:
> Consolidate all of TCP CA_Loss state processing in
> tcp_fastretrans_alert() into a new function called tcp_process_loss().
> This is to prepare the new F-RTO implementation in the next patch.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH v2 1/3 net-next] tcp: refactor F-RTO
From: Eric Dumazet @ 2013-03-20 23:55 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ncardwell, edumazet, nanditad, ilpo.jarvinen, netdev
In-Reply-To: <1363822380-16687-1-git-send-email-ycheng@google.com>

On Wed, 2013-03-20 at 16:32 -0700, Yuchung Cheng wrote:
> The patch series refactor the F-RTO feature (RFC4138/5682).
> 
> This is to simplify the loss recovery processing. Existing F-RTO
> was developed during the experimental stage (RFC4138) and has
> many experimental features.  It takes a separate code path from
> the traditional timeout processing by overloading CA_Disorder
> instead of using CA_Loss state. This complicates CA_Disorder state
> handling because it's also used for handling dubious ACKs and undos.
> While the algorithm in the RFC does not change the congestion control,
> the implementation intercepts congestion control in various places
> (e.g., frto_cwnd in tcp_ack()).
> 
> The new code implements newer F-RTO RFC5682 using CA_Loss processing
> path.  F-RTO becomes a small extension in the timeout processing
> and interfaces with congestion control and Eifel undo modules.
> It lets congestion control (module) determines how many to send
> independently.  F-RTO only chooses what to send in order to detect
> spurious retranmission. If timeout is found spurious it invokes
> existing Eifel undo algorithms like DSACK or TCP timestamp based
> detection.
> 
> The first patch removes all F-RTO code except the sysctl_tcp_frto is
> left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
> westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH -next] openvswitch: fix error return code in ovs_vport_cmd_set()
From: Jesse Gross @ 2013-03-21  0:06 UTC (permalink / raw)
  To: David Miller
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY,
	weiyj.lk-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20130320.122213.135318342578938888.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Wed, Mar 20, 2013 at 9:22 AM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> Date: Wed, 20 Mar 2013 08:57:53 -0700
>
>> On Wed, Mar 20, 2013 at 5:14 AM, Wei Yongjun <weiyj.lk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> From: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>
>>>
>>> Fix to return a negative error code from the error handling
>>> case instead of 0, as returned elsewhere in this function.
>>>
>>> Signed-off-by: Wei Yongjun <yongjun_wei-zrsr2BFq86L20UzCJQGyNP8+0UxHXcjY@public.gmane.org>
>>
>> This isn't right because at this point in the function the change has
>> already been made, only the notification allocation failed.  Therefore
>> we don't want to return an error since the caller will assume that
>> nothing has changed; instead an error is set on the Netlink socket.
>
> Then the way to fix this is to allocate the netlink SKB first, before
> any config changes occur, then pass that SKB into ovs_vport_cmd_build_info.
>
> We can still get errors, for -EMSGSIZE situations, but I'd say that'd
> be an implementation bug that should be logged.  If we aren't allocating
> large enough SKBs for the netlink reply, that's really an internal
> error.

That solves the problem here.  I'll take care of it, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: fix psock_fanout selftest hash collision
From: Willem de Bruijn @ 2013-03-21  0:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20130320.135934.156565403641352244.davem@davemloft.net>

On Wed, Mar 20, 2013 at 1:59 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 20 Mar 2013 12:33:44 -0400 (EDT)
>
>> From: Willem de Bruijn <willemb@google.com>
>> Date: Wed, 20 Mar 2013 02:42:44 -0400
>>
>>> Fix flaky results with PACKET_FANOUT_HASH depending on whether the
>>> two flows hash into the same packet socket or not.
>>>
>>> Also adds tests for PACKET_FANOUT_LB and PACKET_FANOUT_CPU and
>>> replaces the counting method with a packet ring.
>>>
>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>
>> Applied, thanks.  I'll retest on my sparc64 box later today.
>
> Unfortunately, it's still broken there:

This looks like a new problem. Now the counters all stay zero.

I am looking into it. I have not been able to reproduce this on my
x86_64 so far, so just brought a sparc32 up in qemu. Had less luck
with sparc64, but impressive that it works at all. Come to think of
it, is this a 64-bit kernel with 32-bit userland? Perhaps that
affects packet ring memory layout.


> --------------------
> running psock_fanout test
> --------------------
> test: control single socket
> test: control multiple sockets
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> info: trying alternate ports (4)
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> info: trying alternate ports (3)
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> info: trying alternate ports (2)
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> info: trying alternate ports (1)
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> info: trying alternate ports (0)
> test: datapath 0x0
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,5
> ERROR: incorrect queue lengths
> test: datapath 0x1000
> info: count=0,0, expect=0,0
> info: count=0,0, expect=15,5
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,15
> ERROR: incorrect queue lengths
> test: datapath 0x1
> info: count=0,0, expect=0,0
> info: count=0,0, expect=10,10
> ERROR: incorrect queue lengths
> info: count=0,0, expect=18,17
> ERROR: incorrect queue lengths
> test: datapath 0x3
> info: count=0,0, expect=0,0
> info: count=0,0, expect=20,0
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,15
> ERROR: incorrect queue lengths
> test: datapath 0x2
> info: count=0,0, expect=0,0
> info: count=0,0, expect=20,0
> ERROR: incorrect queue lengths
> info: count=0,0, expect=20,0
> ERROR: incorrect queue lengths
> test: datapath 0x2
> info: count=0,0, expect=0,0
> info: count=0,0, expect=0,20
> ERROR: incorrect queue lengths
> info: count=0,0, expect=0,20
> ERROR: incorrect queue lengths
> [FAIL]

^ permalink raw reply

* Re: Hello
From: Wenzhou Ourvis @ 2013-03-20 23:37 UTC (permalink / raw)
  To: Recipients

Dear E-mail User,


  This is to immediately inform you that your email address with Micros ID: IDJT-32238-EHI-676GI-8ED has won you 2,485,000.00 GBP and a brand new Range Rover SUV from the Wenzhou Ourvis yearly draw promo.

Use the details below to login and fill in the claim form to immediately begin your claims.



WEBSITE: www.wenzhourvis.cla.fr
USERNAME: ISO912PP
PASSWORD: XNO87PS


Regards,
Dr. Warren Payne,
Wenzhou Ourvis Company.
European Zonal Coordinator,
United Kingdom.

^ permalink raw reply

* Re: [PATCH net-next] net: fix psock_fanout selftest hash collision
From: David Miller @ 2013-03-21  1:16 UTC (permalink / raw)
  To: willemb; +Cc: netdev
In-Reply-To: <CA+FuTScnkvSZ1U=vmELNEpS4hU62iMV7B2ZDHT8KGAwxCxA-Ng@mail.gmail.com>

From: Willem de Bruijn <willemb@google.com>
Date: Wed, 20 Mar 2013 20:07:21 -0400

> Come to think of it, is this a 64-bit kernel with 32-bit userland?
> Perhaps that affects packet ring memory layout.

Yes, it is.

^ permalink raw reply

* JUMBO 2013 NEW YEAR DRAW
From: Mahgoub, Ahmed (NSN - EG/Cairo) @ 2013-03-20 23:32 UTC (permalink / raw)


Your Email Id has Won 1,000,000.00 Pounds From JUMBO 2013 NEW YEAR DRAW, Held on January 28th, 2013 attached with draw number "244". You are to contact our claim agent on this Email: (jumboclaims203@hotmail.co.uk<mailto:jumboclaims203@hotmail.co.uk> ) with the below details for claims.

Full Name:
Address:
Mobile Number:
Age:
Country:

Contact Person: Mr. John Carrick
Contact Email: jumboclaims203@hotmail.co.uk<mailto:jumboclaims203@hotmail.co.uk>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox