[RFC] Change socket checkpoint to retain DGRAM source addresses

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC] Change socket checkpoint to retain DGRAM source addresses
@ 2009-08-24 17:11 Dan Smith
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-24 17:11 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

This is a proposed change to the way sockets are checkpointed.
It makes the socket itself a proper objhash object, which can be
checkpointed or restored as part of reading the stream (like many
of the other first-class objects).  Thus, we worry about
checkpointing and restoring the socket-typed file, and read the
related socket object(s) as a matter of course.

By doing this, we are able to checkpoint sockets we find that
aren't attached to descriptors.  This is used in the final patch
to make sure that a socket buffer's owner socket has been
checkpointed, allowing us to use that socket to re-send the
buffer on restore (thus retaining the source address).

I've got a unit test for this that sets up three sockets, and
loads some in-flight buffers before checkpoint, verifying that
after checkpoint, recvfrom() sees them from the appropriate
source socket.

Does this approach seem reasonable?  

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] Set the CHECKPOINTED flag on objects before calling checkpoint
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-24 17:11   ` Dan Smith
       [not found]     ` <1251133918-8117-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-24 17:11   ` [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them Dan Smith
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-24 17:11 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

This helps prevent infinite recursion where the checkpoint() operation on an
object could potentially result in another checkpoint() of itself.

UNIX sockets would easily encounter this when we call checkpoint on all
related sockets, which would in turn try to checkpoint us again.


Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/objhash.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index dc0a10a..019077b 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -677,9 +677,8 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 
 		/* invoke callback to actually dump the state */
 		BUG_ON(!obj->ops->checkpoint);
-		ret = obj->ops->checkpoint(ctx, ptr);
-
 		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+		ret = obj->ops->checkpoint(ctx, ptr);
 	}
 
 	obj->flags |= CKPT_OBJ_VISITED;
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-24 17:11   ` [PATCH 1/3] Set the CHECKPOINTED flag on objects before calling checkpoint Dan Smith
@ 2009-08-24 17:11   ` Dan Smith
       [not found]     ` <1251133918-8117-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-24 17:11   ` [PATCH 3/3] Store socket owner with buffers Dan Smith
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-24 17:11 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

This changes the checkpoint/restart procedure for sockets a bit.  The
socket file header is now checkpointed separately from the socket itself,
which allows us to checkpoint a socket without arriving at it from a
file descriptor.  Thus, most sockets will be checkpointed as a result
of processing the file table, calling sock_file_checkpoint(fd), which
in turn calls checkpoint_obj(socket).

However, we may arrive at some sockets while checkpointing other objects,
such as the other end of an AF_UNIX socket with buffers in flight.  This
patch just opens that door, which is utilized by the next patch.

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/objhash.c           |    2 +
 include/linux/checkpoint_hdr.h |    4 +-
 include/net/sock.h             |    2 +
 net/checkpoint.c               |  116 ++++++++++++++++++++++++++++-----------
 net/unix/checkpoint.c          |    3 +-
 5 files changed, 91 insertions(+), 36 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 019077b..4f26e86 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -381,6 +381,8 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_SOCK,
 		.ref_drop = obj_sock_drop,
 		.ref_grab = obj_sock_grab,
+		.checkpoint = checkpoint_sock,
+		.restore = restore_sock,
 	},
 };
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 78f1f27..39b3cb4 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -68,6 +68,7 @@ enum {
 	CKPT_HDR_USER,
 	CKPT_HDR_GROUPINFO,
 	CKPT_HDR_TASK_CREDS,
+	CKPT_HDR_SOCKET,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -367,6 +368,7 @@ struct ckpt_hdr_file_pipe {
 
 /* socket */
 struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
 	struct { /* struct socket */
 		__u64 flags;
 		__u8 state;
@@ -425,7 +427,7 @@ struct ckpt_hdr_socket_unix {
 
 struct ckpt_hdr_file_socket {
 	struct ckpt_hdr_file common;
-	struct ckpt_hdr_socket socket;
+	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_utsns {
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e3b050..0db1ca3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1644,6 +1644,8 @@ extern __u32 sysctl_rmem_default;
 /* Checkpoint/Restart Functions */
 struct ckpt_ctx;
 struct ckpt_hdr_file;
+extern int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sock(struct ckpt_ctx *ctx);
 extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
 extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
 				      struct ckpt_hdr_file *h);
diff --git a/net/checkpoint.c b/net/checkpoint.c
index fdbf8e7..c84511e 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -411,31 +411,26 @@ static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
 		return 0;
 }
 
-int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+static int do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
 {
-	struct ckpt_hdr_file_socket *h;
-	struct socket *sock = file->private_data;
-	struct sock *sk = sock->sk;
 	int ret;
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
 
 	if (!sock->ops->checkpoint) {
 		ckpt_write_err(ctx, "socket (proto_ops: %pS)", sock->ops);
 		return -ENOSYS;
 	}
 
-	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
 	if (!h)
 		return -ENOMEM;
 
-	h->common.f_type = CKPT_FILE_SOCKET;
-
 	/* part I: common to all sockets */
-	ret = sock_cptrst(ctx, sk, &h->socket, CKPT_CPT);
-	if (ret < 0)
-		goto out;
-	ret = checkpoint_file_common(ctx, file, &h->common);
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
 	if (ret < 0)
 		goto out;
+
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (ret < 0)
 		goto out;
@@ -452,6 +447,42 @@ int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 			goto out;
 		ret = sock_write_buffers(ctx, &sk->sk_write_queue);
 	}
+
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_sock_checkpoint(ctx, (struct sock *)ptr);
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -511,35 +542,30 @@ static struct file *sock_alloc_attach_fd(struct socket *sock)
 	return file;
 }
 
-struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+static struct sock *do_sock_restore(struct ckpt_ctx *ctx)
 {
-	struct ckpt_hdr_file_socket *hh = (struct ckpt_hdr_file_socket *) ptr;
-	struct ckpt_hdr_socket *h = &hh->socket;
+	struct ckpt_hdr_socket *h;
 	struct socket *sock;
-	struct file *file;
 	int ret;
 
-	if (ptr->h.type != CKPT_HDR_FILE  ||
-	    ptr->h.len != sizeof(*hh) || ptr->f_type != CKPT_FILE_SOCKET)
-		return ERR_PTR(-EINVAL);
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
 
 	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
 	h->sock.type &= SOCK_TYPE_MASK;
 
 	ret = sock_create(h->sock_common.family, h->sock.type, 0, &sock);
 	if (ret < 0)
-		return ERR_PTR(ret);
+		goto err;
 
+	/* part II and III: per-protocol restore */
 	if (!sock->ops->restore) {
 		ckpt_debug("proto_ops lacks checkpoint: %pS\n", sock->ops);
 		ret = -EINVAL;
 		goto err;
 	}
 
-	/*
-	 * part II: per socket type state
-	 * (also takes care of part III: socket buffer)
-	 */
 	ret = sock->ops->restore(ctx, sock, h);
 	if (ret < 0)
 		goto err;
@@ -549,21 +575,45 @@ struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 	if (ret < 0)
 		goto err;
 
-	file = sock_alloc_attach_fd(sock);
-	if (IS_ERR(file)) {
-		ret = PTR_ERR(file);
-		goto err;
-	}
+	ckpt_hdr_put(ctx, h);
+
+	return sock->sk;
+ err:
+	ckpt_hdr_put(ctx, h);
+	sock_release(sock);
+
+	return ERR_PTR(ret);
+}
+
+void *restore_sock(struct ckpt_ctx *ctx)
+{
+	return do_sock_restore(ctx);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	file = sock_alloc_attach_fd(sk->sk_socket);
+	if (IS_ERR(file))
+		return file;
 
 	ret = restore_file_common(ctx, file, ptr);
 	if (ret < 0) {
 		fput(file);
-		file = ERR_PTR(ret);
+		return ERR_PTR(ret);
 	}
-	return file;
 
- err:
-	sock_release(sock);
-	return ERR_PTR(ret);
+	return file;
 }
 
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
index 366bc80..cda8434 100644
--- a/net/unix/checkpoint.c
+++ b/net/unix/checkpoint.c
@@ -57,7 +57,6 @@ static int unix_write_cwd(struct ckpt_ctx *ctx,
 int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
 {
 	struct unix_sock *sk = unix_sk(sock->sk);
-	struct unix_sock *pr = unix_sk(sk->peer);
 	struct ckpt_hdr_socket_unix *un;
 	int new;
 	int ret = -ENOMEM;
@@ -86,7 +85,7 @@ int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
 		goto out;
 
 	if (sk->peer)
-		un->peer = ckpt_obj_lookup_add(ctx, pr, CKPT_OBJ_SOCK, &new);
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
 	else
 		un->peer = 0;
 
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] Store socket owner with buffers
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-24 17:11   ` [PATCH 1/3] Set the CHECKPOINTED flag on objects before calling checkpoint Dan Smith
  2009-08-24 17:11   ` [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them Dan Smith
@ 2009-08-24 17:11   ` Dan Smith
       [not found]     ` <1251133918-8117-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-25  0:06   ` [RFC] Change socket checkpoint to retain DGRAM source addresses Serge E. Hallyn
  2009-08-25  4:33   ` Oren Laadan
  4 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-24 17:11 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

When we write out the buffers for a socket, store the objref of the owner
socket as well (which in the case of AF_UNIX is the sender).  On restore,
we can fetch the socket from the objhash (reading if necessary) and use
it to re-send the buffers with sendmsg(), thus retaining the identity of
the sender for use with recvmsg().

Note: This removes the code recently in place to set the buffer limits
to the current system maximum to handle the case where the application
set the limit below the size of their current incoming queue.  This is
a valid scenario, but we'll fail on that now because there won't be
enough space for the buffers after we set their limit.  Should we always
set the maximum and then use the deferqueue to replace the user's limit
after restart?

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/checkpoint.h     |    3 -
 include/linux/checkpoint_hdr.h |    6 ++
 net/checkpoint.c               |   76 ++++++++++++-------------
 net/unix/checkpoint.c          |  121 +++++++++++++++++++++++++---------------
 4 files changed, 118 insertions(+), 88 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 761cad5..88861a1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -84,9 +84,6 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 			      struct socket *socket,
 			      struct sockaddr *loc, unsigned *loc_len,
 			      struct sockaddr *rem, unsigned *rem_len);
-extern struct ckpt_hdr_socket_queue *
-ckpt_sock_read_buffer_hdr(struct ckpt_ctx *ctx,
-			  uint32_t *bufsize);
 
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 39b3cb4..5473a4f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -411,6 +411,12 @@ struct ckpt_hdr_socket_queue {
 	__u32 total_bytes;
 } __attribute__ ((aligned(8)));
 
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__s32 src_objref;
+	__u8 data[0];
+};
+
 #define CKPT_UNIX_LINKED 1
 struct ckpt_hdr_socket_unix {
 	struct ckpt_hdr h;
diff --git a/net/checkpoint.c b/net/checkpoint.c
index c84511e..a7e7db0 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -58,6 +58,7 @@ static int sock_copy_buffers(struct sk_buff_head *from,
 			break; /* The queue changed as we read it */
 
 		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
 		skb_queue_tail(to, skbs[i]);
 
 		*total_bytes += skb->len;
@@ -85,9 +86,11 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
 				struct sk_buff_head *queue)
 {
 	struct sk_buff *skb;
-	int ret = 0;
 
 	skb_queue_walk(queue, skb) {
+		struct ckpt_hdr_socket_buffer *h;
+		int ret = 0;
+
 		/* FIXME: This could be a false positive for non-unix
 		 *        buffers, so add a type check here in the
 		 *        future
@@ -104,9 +107,34 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
 		 * information contained in the skb.
 		 */
 
-		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+		if (!h)
+			return -ENOMEM;
+
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto end;
+		h->src_objref = ret;
+
+		/* To avoid a memory copy to compose the header and
+		 * the socket buffer data, write only the header for
+		 * the object plus the data length, followed by the
+		 * actual object and then the data itself.
+		 */
+		ret = ckpt_write_obj_type(ctx, NULL, sizeof(*h) + skb->len,
 					  CKPT_HDR_SOCKET_BUFFER);
-		if (ret)
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_kwrite(ctx, h, sizeof(*h));
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_kwrite(ctx, skb->data, skb->len);
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
 			return ret;
 	}
 
@@ -488,42 +516,6 @@ int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	return ret;
 }
 
-struct ckpt_hdr_socket_queue *ckpt_sock_read_buffer_hdr(struct ckpt_ctx *ctx,
-							uint32_t *bufsize)
-{
-	struct ckpt_hdr_socket_queue *h;
-	int err = 0;
-
-	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
-	if (IS_ERR(h))
-		return h;
-
-	if (!bufsize) {
-		if (h->total_bytes != 0) {
-			ckpt_debug("Expected empty buffer, got %u\n",
-				   h->total_bytes);
-			err = -EINVAL;
-		}
-	} else if (h->total_bytes > *bufsize) {
-		/* NB: We let CAP_NET_ADMIN override the system buffer limit
-		 *     as setsockopt() does
-		 */
-		if (capable(CAP_NET_ADMIN))
-			*bufsize = h->total_bytes;
-		else {
-			ckpt_debug("Buffer total %u exceeds limit %u\n",
-			   h->total_bytes, *bufsize);
-			err = -EINVAL;
-		}
-	}
-
-	if (err) {
-		ckpt_hdr_put(ctx, h);
-		return ERR_PTR(err);
-	} else
-		return h;
-}
-
 static struct file *sock_alloc_attach_fd(struct socket *sock)
 {
 	struct file *file;
@@ -614,6 +606,12 @@ struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 		return ERR_PTR(ret);
 	}
 
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in the
+	 * socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sk);
+
 	return file;
 }
 
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
index cda8434..4a94165 100644
--- a/net/unix/checkpoint.c
+++ b/net/unix/checkpoint.c
@@ -109,17 +109,24 @@ int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
 	return ret;
 }
 
-static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx, struct sock *sk)
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
 {
 	struct msghdr msg;
 	struct kvec kvec;
+	struct ckpt_hdr_socket_buffer h;
+	struct sock *sk;
+	uint8_t sock_shutdown;
 	void *buf;
 	int ret = 0;
+	int totlen;
 	int len;
 
 	memset(&msg, 0, sizeof(msg));
 
-	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_SOCKET_BUFFER);
+	totlen = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_SOCKET_BUFFER);
+	len = totlen - sizeof(h);
 	if (len < 0)
 		return len;
 
@@ -135,14 +142,33 @@ static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx, struct sock *sk)
 	if (!buf)
 		return -ENOMEM;
 
+	ret = ckpt_kread(ctx, &h, sizeof(h));
+	if (ret < 0)
+		goto out;
+
+	sk = ckpt_obj_fetch(ctx, h.src_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
 	ret = ckpt_kread(ctx, kvec.iov_base, len);
 	if (ret < 0)
 		goto out;
 
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If peer is shutdown, unshutdown it for this process */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
 	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, len);
-	ckpt_debug("kernel_sendmsg(%i): %i\n", len, ret);
+	ckpt_debug("kernel_sendmsg(%i,%i): %i\n", h.src_objref, len, ret);
 	if ((ret > 0) && (ret != len))
 		ret = -ENOMEM;
+
+	sk->sk_shutdown = sock_shutdown;
  out:
 	kfree(buf);
 
@@ -150,38 +176,35 @@ static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx, struct sock *sk)
 }
 
 static int unix_read_buffers(struct ckpt_ctx *ctx,
-			     struct sock *sk, uint32_t *bufsize)
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
 {
-	uint8_t sock_shutdown;
 	struct ckpt_hdr_socket_queue *h;
 	int ret = 0;
 	int i;
 
-	h = ckpt_sock_read_buffer_hdr(ctx, bufsize);
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	/* If peer is shutdown, unshutdown it for this process */
-	sock_shutdown = sk->sk_shutdown;
-	sk->sk_shutdown &= ~SHUTDOWN_MASK;
-
 	for (i = 0; i < h->skb_count; i++) {
-		ret = sock_read_buffer_sendmsg(ctx, sk);
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
 		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
 		if (ret < 0)
-			break;
+			goto out;
 
 		if (ret > h->total_bytes) {
 			ckpt_debug("Buffers exceeded claim");
 			ret = -EINVAL;
-			break;
+			goto out;
 		}
 
 		h->total_bytes -= ret;
 		ret = 0;
 	}
 
-	sk->sk_shutdown = sock_shutdown;
+	ret = h->skb_count;
+ out:
 	ckpt_hdr_put(ctx, h);
 
 	return ret;
@@ -243,8 +266,19 @@ static int unix_restore_connected(struct ckpt_ctx *ctx,
 	struct sock *this = ckpt_obj_fetch(ctx, un->this, CKPT_OBJ_SOCK);
 	struct sock *peer = ckpt_obj_fetch(ctx, un->peer, CKPT_OBJ_SOCK);
 	struct socket *tmp = NULL;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
 	int ret;
 
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
 	if (!IS_ERR(this) && !IS_ERR(peer)) {
 		/* We're last */
 		struct socket *old = this->sk_socket;
@@ -258,26 +292,29 @@ static int unix_restore_connected(struct ckpt_ctx *ctx,
 		int family = sock->sk->sk_family;
 		int type = sock->sk->sk_type;
 
-		ret = sock_create(family, type, 0, &tmp);
-		ckpt_debug("sock_create: %i\n", ret);
-		if (ret)
-			goto out;
-
 		this = sock->sk;
-		peer = tmp->sk;
 
 		ret = ckpt_obj_insert(ctx, this, un->this, CKPT_OBJ_SOCK);
 		if (ret < 0)
 			goto out;
 
-		ret = ckpt_obj_insert(ctx, peer, un->peer, CKPT_OBJ_SOCK);
-		if (ret < 0)
-			goto out;
-
-		ret = unix_join(ctx, this, peer, un);
-		ckpt_debug("unix_join: %i\n", ret);
-		if (ret)
-			goto out;
+		if (un->peer > 0) {
+			ret = sock_create(family, type, 0, &tmp);
+			ckpt_debug("sock_create: %i\n", ret);
+			if (ret)
+				goto out;
+
+			peer = tmp->sk;
+			ret = ckpt_obj_insert(ctx, peer, un->peer,
+					      CKPT_OBJ_SOCK);
+			if (ret < 0)
+				goto out;
+
+			ret = unix_join(ctx, this, peer, un);
+			ckpt_debug("unix_join: %i\n", ret);
+			if (ret)
+				goto out;
+		}
 
 	} else {
 		ckpt_debug("Order Error\n");
@@ -297,28 +334,20 @@ static int unix_restore_connected(struct ckpt_ctx *ctx,
 		return -1;
 	}
 
-	/* Prime the socket's buffer limit with the maximum.  These will be
-	 * overwritten with the values in the checkpoint stream in a later
-	 * phase.
-	 */
-	peer->sk_userlocks |= SOCK_SNDBUF_LOCK;
-	peer->sk_sndbuf = sysctl_wmem_max;
-
-	/* Read my buffers and sendmsg() them back to me via my peer */
-
-	/* TODO: handle the unconnected case, as well, as the case
-	 *       where sendto() has been used on some of the buffers
-	 */
-
-	ret = unix_read_buffers(ctx, peer, &peer->sk_sndbuf);
+	/* Read incoming buffers and sendmsg() them back to me */
+	ret = unix_read_buffers(ctx, addr, addrlen);
 	ckpt_debug("unix_read_buffers: %i\n", ret);
-	if (ret)
+	if (ret < 0)
 		goto out;
 
-	/* Read peer's buffers and expect 0 */
-	ret = unix_read_buffers(ctx, peer, NULL);
+	/* Read outgoing buffers and expect none */
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	if (ret > 0) {
+		printk("UNIX read buffers not empty\n");
+		ret = -EINVAL;
+	}
  out:
-	if (tmp && ret)
+	if (tmp && (ret < 0))
 		sock_release(tmp);
 
 	return ret;
-- 
1.6.2.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC] Change socket checkpoint to retain DGRAM source addresses
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-08-24 17:11   ` [PATCH 3/3] Store socket owner with buffers Dan Smith
@ 2009-08-25  0:06   ` Serge E. Hallyn
  2009-08-25  4:33   ` Oren Laadan
  4 siblings, 0 replies; 16+ messages in thread
From: Serge E. Hallyn @ 2009-08-25  0:06 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> This is a proposed change to the way sockets are checkpointed.
> It makes the socket itself a proper objhash object, which can be
> checkpointed or restored as part of reading the stream (like many
> of the other first-class objects).  Thus, we worry about
> checkpointing and restoring the socket-typed file, and read the
> related socket object(s) as a matter of course.
> 
> By doing this, we are able to checkpoint sockets we find that
> aren't attached to descriptors.  This is used in the final patch
> to make sure that a socket buffer's owner socket has been
> checkpointed, allowing us to use that socket to re-send the
> buffer on restore (thus retaining the source address).
> 
> I've got a unit test for this that sets up three sockets, and
> loads some in-flight buffers before checkpoint, verifying that
> after checkpoint, recvfrom() sees them from the appropriate
> source socket.
> 
> Does this approach seem reasonable?  

I think so...  and so far haven't found any problems in the
code itself.  Will look a little more, but so far

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

-serge

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Change socket checkpoint to retain DGRAM source addresses
       [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-08-25  0:06   ` [RFC] Change socket checkpoint to retain DGRAM source addresses Serge E. Hallyn
@ 2009-08-25  4:33   ` Oren Laadan
  4 siblings, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-08-25  4:33 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg



Dan Smith wrote:
> This is a proposed change to the way sockets are checkpointed.
> It makes the socket itself a proper objhash object, which can be
> checkpointed or restored as part of reading the stream (like many
> of the other first-class objects).  Thus, we worry about
> checkpointing and restoring the socket-typed file, and read the
> related socket object(s) as a matter of course.
> 
> By doing this, we are able to checkpoint sockets we find that
> aren't attached to descriptors.  This is used in the final patch
> to make sure that a socket buffer's owner socket has been
> checkpointed, allowing us to use that socket to re-send the
> buffer on restore (thus retaining the source address).

It will also be useful when restoring two other important cases:

1) A listening socket that with "pending" connections - that is,
established connections for which the app should call accept().

2) Sockets that are "dangling": have been closed by the owner
process but linger in the system because they are referenced
somehow (through skb, peer etc).

> 
> I've got a unit test for this that sets up three sockets, and
> loads some in-flight buffers before checkpoint, verifying that
> after checkpoint, recvfrom() sees them from the appropriate
> source socket.
> 
> Does this approach seem reasonable?  
> 

Yep.

Oren.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found]     ` <1251133918-8117-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-25  5:01       ` Oren Laadan
       [not found]         ` <4A937031.10300-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Laadan @ 2009-08-25  5:01 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

Hi Dan,

Dan Smith wrote:
> This changes the checkpoint/restart procedure for sockets a bit.  The
> socket file header is now checkpointed separately from the socket itself,
> which allows us to checkpoint a socket without arriving at it from a
> file descriptor.  Thus, most sockets will be checkpointed as a result
> of processing the file table, calling sock_file_checkpoint(fd), which
> in turn calls checkpoint_obj(socket).

It's perhaps more accurate to s/most sockets/some sockets/. It may
be more likely for a socket to be checkpointed as a peer of another
process, or as the sender of an skb.

The patch looks fine - see a few comment below.

Now that you made 'struct sock' a 1st class object, they deserve to
enjoy 1st class treatment :p  That also means proper collect() method
- probably starting with the f_op ...

I may be mistaken, but I suspect that the suggested implementation
cannot limit the depth of recursive calls to checkpoint_obj(). For
instance, consider a dgram socket that received data from another
dgram socket, that received data from another dgram, ad infinitum.

> 
> However, we may arrive at some sockets while checkpointing other objects,
> such as the other end of an AF_UNIX socket with buffers in flight.  This
> patch just opens that door, which is utilized by the next patch.
> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  checkpoint/objhash.c           |    2 +
>  include/linux/checkpoint_hdr.h |    4 +-
>  include/net/sock.h             |    2 +
>  net/checkpoint.c               |  116 ++++++++++++++++++++++++++++-----------
>  net/unix/checkpoint.c          |    3 +-
>  5 files changed, 91 insertions(+), 36 deletions(-)
> 
> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> index 019077b..4f26e86 100644
> --- a/checkpoint/objhash.c
> +++ b/checkpoint/objhash.c
> @@ -381,6 +381,8 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
>  		.obj_type = CKPT_OBJ_SOCK,
>  		.ref_drop = obj_sock_drop,
>  		.ref_grab = obj_sock_grab,
> +		.checkpoint = checkpoint_sock,
> +		.restore = restore_sock,
>  	},
>  };
>  
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 78f1f27..39b3cb4 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -68,6 +68,7 @@ enum {
>  	CKPT_HDR_USER,
>  	CKPT_HDR_GROUPINFO,
>  	CKPT_HDR_TASK_CREDS,
> +	CKPT_HDR_SOCKET,
>  
>  	/* 201-299: reserved for arch-dependent */
>  
> @@ -367,6 +368,7 @@ struct ckpt_hdr_file_pipe {
>  
>  /* socket */
>  struct ckpt_hdr_socket {
> +	struct ckpt_hdr h;
>  	struct { /* struct socket */
>  		__u64 flags;
>  		__u8 state;
> @@ -425,7 +427,7 @@ struct ckpt_hdr_socket_unix {
>  
>  struct ckpt_hdr_file_socket {
>  	struct ckpt_hdr_file common;
> -	struct ckpt_hdr_socket socket;
> +	__s32 sock_objref;
>  } __attribute__((aligned(8)));

I'm thinking about the two other use cases that I mentioned:
"dangling" (not-referenced by a file) and "pending" (not yet
accepted) sockets.

In both cases (well, at least with "pending"), the 'struct sock'
exist but the 'struct socket' does not exit until after the
socket is attached to a file descriptor. IIRC, the lifespan of
'struct socket' is coupled to that of the referencing file.

In that case, I guess it make more sense to leave the 'struct
socket' related data within ckpt_hdr_file_socket.

[...]

> +	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
> +	if (IS_ERR(sk))
> +		return ERR_PTR(PTR_ERR(sk));

Nit: I vaguely recall some disapproval of such construct...
How about '(struct file *) sk' ?

> +
> +	file = sock_alloc_attach_fd(sk->sk_socket);
> +	if (IS_ERR(file))
> +		return file;

[...]

Oren.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] Store socket owner with buffers
       [not found]     ` <1251133918-8117-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-25  5:33       ` Oren Laadan
  2009-08-26 17:31         ` Dan Smith
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Laadan @ 2009-08-25  5:33 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg



Dan Smith wrote:
> When we write out the buffers for a socket, store the objref of the owner
> socket as well (which in the case of AF_UNIX is the sender).  On restore,
> we can fetch the socket from the objhash (reading if necessary) and use
> it to re-send the buffers with sendmsg(), thus retaining the identity of
> the sender for use with recvmsg().
> 

I suspect this won't pass a test in which the sender of a dgram
has been closed already (and therefore is "dangling") ?

> Note: This removes the code recently in place to set the buffer limits
> to the current system maximum to handle the case where the application
> set the limit below the size of their current incoming queue.  This is
> a valid scenario, but we'll fail on that now because there won't be
> enough space for the buffers after we set their limit.  Should we always
> set the maximum and then use the deferqueue to replace the user's limit
> after restart?

I'm not sure why you removed that code completely.

It can be executed, for example, by the caller of unix_read_buffers(),
before and after that call ?

> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  include/linux/checkpoint.h     |    3 -
>  include/linux/checkpoint_hdr.h |    6 ++
>  net/checkpoint.c               |   76 ++++++++++++-------------
>  net/unix/checkpoint.c          |  121 +++++++++++++++++++++++++---------------
>  4 files changed, 118 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 761cad5..88861a1 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -84,9 +84,6 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
>  			      struct socket *socket,
>  			      struct sockaddr *loc, unsigned *loc_len,
>  			      struct sockaddr *rem, unsigned *rem_len);
> -extern struct ckpt_hdr_socket_queue *
> -ckpt_sock_read_buffer_hdr(struct ckpt_ctx *ctx,
> -			  uint32_t *bufsize);
>  
>  /* ckpt kflags */
>  #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 39b3cb4..5473a4f 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -411,6 +411,12 @@ struct ckpt_hdr_socket_queue {
>  	__u32 total_bytes;
>  } __attribute__ ((aligned(8)));
>  
> +struct ckpt_hdr_socket_buffer {
> +	struct ckpt_hdr h;
> +	__s32 src_objref;
> +	__u8 data[0];
> +};
> +
>  #define CKPT_UNIX_LINKED 1
>  struct ckpt_hdr_socket_unix {
>  	struct ckpt_hdr h;
> diff --git a/net/checkpoint.c b/net/checkpoint.c
> index c84511e..a7e7db0 100644
> --- a/net/checkpoint.c
> +++ b/net/checkpoint.c
> @@ -58,6 +58,7 @@ static int sock_copy_buffers(struct sk_buff_head *from,
>  			break; /* The queue changed as we read it */
>  
>  		skb_morph(skbs[i], skb);
> +		skbs[i]->sk = skb->sk;
>  		skb_queue_tail(to, skbs[i]);
>  
>  		*total_bytes += skb->len;
> @@ -85,9 +86,11 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
>  				struct sk_buff_head *queue)
>  {
>  	struct sk_buff *skb;
> -	int ret = 0;
>  
>  	skb_queue_walk(queue, skb) {
> +		struct ckpt_hdr_socket_buffer *h;
> +		int ret = 0;
> +
>  		/* FIXME: This could be a false positive for non-unix
>  		 *        buffers, so add a type check here in the
>  		 *        future
> @@ -104,9 +107,34 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
>  		 * information contained in the skb.
>  		 */
>  
> -		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
> +		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
> +		if (!h)
> +			return -ENOMEM;
> +
> +		BUG_ON(!skb->sk);
> +		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
> +		if (ret < 0)
> +			goto end;
> +		h->src_objref = ret;
> +
> +		/* To avoid a memory copy to compose the header and
> +		 * the socket buffer data, write only the header for
> +		 * the object plus the data length, followed by the
> +		 * actual object and then the data itself.
> +		 */
> +		ret = ckpt_write_obj_type(ctx, NULL, sizeof(*h) + skb->len,
>  					  CKPT_HDR_SOCKET_BUFFER);
> -		if (ret)
> +		if (ret < 0)
> +			goto end;
> +
> +		ret = ckpt_kwrite(ctx, h, sizeof(*h));

The 'struct ckpt_hdr' part of @h was already written in the call
to ckpt_write_obj_type() above.

> +		if (ret < 0)
> +			goto end;
> +
> +		ret = ckpt_kwrite(ctx, skb->data, skb->len);
> +	end:
> +		ckpt_hdr_put(ctx, h);
> +		if (ret < 0)
>  			return ret;
>  	}
>  
> @@ -488,42 +516,6 @@ int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  	return ret;
>  }
>  
> -struct ckpt_hdr_socket_queue *ckpt_sock_read_buffer_hdr(struct ckpt_ctx *ctx,
> -							uint32_t *bufsize)
> -{
> -	struct ckpt_hdr_socket_queue *h;
> -	int err = 0;
> -
> -	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
> -	if (IS_ERR(h))
> -		return h;
> -
> -	if (!bufsize) {
> -		if (h->total_bytes != 0) {
> -			ckpt_debug("Expected empty buffer, got %u\n",
> -				   h->total_bytes);
> -			err = -EINVAL;
> -		}
> -	} else if (h->total_bytes > *bufsize) {
> -		/* NB: We let CAP_NET_ADMIN override the system buffer limit
> -		 *     as setsockopt() does
> -		 */
> -		if (capable(CAP_NET_ADMIN))
> -			*bufsize = h->total_bytes;
> -		else {
> -			ckpt_debug("Buffer total %u exceeds limit %u\n",
> -			   h->total_bytes, *bufsize);
> -			err = -EINVAL;
> -		}
> -	}
> -
> -	if (err) {
> -		ckpt_hdr_put(ctx, h);
> -		return ERR_PTR(err);
> -	} else
> -		return h;
> -}
> -
>  static struct file *sock_alloc_attach_fd(struct socket *sock)
>  {
>  	struct file *file;
> @@ -614,6 +606,12 @@ struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
>  		return ERR_PTR(ret);
>  	}
>  
> +	/* Since objhash assumes the initial reference for a socket,
> +	 * we bump it here for this descriptor, unlike other places in the
> +	 * socket code which assume the descriptor is the owner.
> +	 */
> +	sock_hold(sk);
> +
>  	return file;
>  }

This piece belongs to the previous patch ...

>  
> diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
> index cda8434..4a94165 100644
> --- a/net/unix/checkpoint.c
> +++ b/net/unix/checkpoint.c
> @@ -109,17 +109,24 @@ int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
>  	return ret;
>  }
>  
> -static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx, struct sock *sk)
> +static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
> +				    struct sockaddr *addr,
> +				    unsigned int addrlen)
>  {
>  	struct msghdr msg;
>  	struct kvec kvec;
> +	struct ckpt_hdr_socket_buffer h;
> +	struct sock *sk;
> +	uint8_t sock_shutdown;
>  	void *buf;
>  	int ret = 0;
> +	int totlen;
>  	int len;
>  
>  	memset(&msg, 0, sizeof(msg));
>  
> -	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_SOCKET_BUFFER);
> +	totlen = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_SOCKET_BUFFER);
> +	len = totlen - sizeof(h);

Catch read errors:
	if (totlen < 0)
		return totlen;

>  	if (len < 0)
>  		return len;
		      ^^^^^^
		      -EINVAL;

[...]

> @@ -243,8 +266,19 @@ static int unix_restore_connected(struct ckpt_ctx *ctx,
>  	struct sock *this = ckpt_obj_fetch(ctx, un->this, CKPT_OBJ_SOCK);
>  	struct sock *peer = ckpt_obj_fetch(ctx, un->peer, CKPT_OBJ_SOCK);
>  	struct socket *tmp = NULL;
> +	struct sockaddr *addr = NULL;
> +	unsigned int addrlen = 0;
>  	int ret;
>  
> +	if (un->peer == 0) {
> +		/* These get propagated to the msghdr, so only set them
> +		 * if we're not connected to a peer, else we'll get an error
> +		 * when we sendmsg()
> +		 */
> +		addr = (struct sockaddr *)&un->laddr;
> +		addrlen = un->laddr_len;
> +	}
> +

I wonder if we need to consider this:
With unix domain sockets, when a connected, or unconnected dgram
socket is (re)connected or disconnected, IIRC, all the existing
data in the receive buffers is removed.

I suppose this will automagically be enforced, because when you
connect the socket, old buffers will be discarded already, and if
it is already connected, then sending from other sockets (not peer)
will fail.

So I think we're protected from such malicious data. Perhaps this
is worth a comment ?


>  	if (!IS_ERR(this) && !IS_ERR(peer)) {
>  		/* We're last */
>  		struct socket *old = this->sk_socket;
> @@ -258,26 +292,29 @@ static int unix_restore_connected(struct ckpt_ctx *ctx,
>  		int family = sock->sk->sk_family;
>  		int type = sock->sk->sk_type;
>  
> -		ret = sock_create(family, type, 0, &tmp);
> -		ckpt_debug("sock_create: %i\n", ret);
> -		if (ret)
> -			goto out;
> -
>  		this = sock->sk;
> -		peer = tmp->sk;
>  
>  		ret = ckpt_obj_insert(ctx, this, un->this, CKPT_OBJ_SOCK);
>  		if (ret < 0)
>  			goto out;
>  
> -		ret = ckpt_obj_insert(ctx, peer, un->peer, CKPT_OBJ_SOCK);
> -		if (ret < 0)
> -			goto out;
> -
> -		ret = unix_join(ctx, this, peer, un);
> -		ckpt_debug("unix_join: %i\n", ret);
> -		if (ret)
> -			goto out;
> +		if (un->peer > 0) {
> +			ret = sock_create(family, type, 0, &tmp);
> +			ckpt_debug("sock_create: %i\n", ret);
> +			if (ret)
> +				goto out;
> +
> +			peer = tmp->sk;
> +			ret = ckpt_obj_insert(ctx, peer, un->peer,
> +					      CKPT_OBJ_SOCK);

Hmmm... I suspect this is plain wrong. Objects that are treated
via checkpoint_obj() and restore_obj() should only be added to
the objhash via these two functions.

Because at checkpoint you used un->peer = checkpoint_obj(...), a
ckpt_obj_fetch() on @un->peer _must_ return the matching object,
or the image file data is invalid.

So sock_create() above should be ckpt_obj_fetch(), and the
ckpt_obj_insert() should go away.

Unfortunately, neither restore_obj() nor ckpt_obj_insert() will
complain about adding the same objref twice...

> +			if (ret < 0)
> +				goto out;
> +
> +			ret = unix_join(ctx, this, peer, un);
> +			ckpt_debug("unix_join: %i\n", ret);
> +			if (ret)
> +				goto out;
> +		}

[...]

Oren.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] Set the CHECKPOINTED flag on objects before calling checkpoint
       [not found]     ` <1251133918-8117-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-25  5:53       ` Oren Laadan
  0 siblings, 0 replies; 16+ messages in thread
From: Oren Laadan @ 2009-08-25  5:53 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg


Right, added.

Dan Smith wrote:
> This helps prevent infinite recursion where the checkpoint() operation on an
> object could potentially result in another checkpoint() of itself.
> 
> UNIX sockets would easily encounter this when we call checkpoint on all
> related sockets, which would in turn try to checkpoint us again.
> 
> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  checkpoint/objhash.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> index dc0a10a..019077b 100644
> --- a/checkpoint/objhash.c
> +++ b/checkpoint/objhash.c
> @@ -677,9 +677,8 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
>  
>  		/* invoke callback to actually dump the state */
>  		BUG_ON(!obj->ops->checkpoint);
> -		ret = obj->ops->checkpoint(ctx, ptr);
> -
>  		obj->flags |= CKPT_OBJ_CHECKPOINTED;
> +		ret = obj->ops->checkpoint(ctx, ptr);
>  	}
>  
>  	obj->flags |= CKPT_OBJ_VISITED;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found]         ` <4A937031.10300-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-08-25 14:52           ` Dan Smith
       [not found]             ` <87y6p8q728.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  2009-08-26  2:47           ` Matt Helsley
  1 sibling, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-25 14:52 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

OL> It's perhaps more accurate to s/most sockets/some sockets/. It may
OL> be more likely for a socket to be checkpointed as a peer of
OL> another process, or as the sender of an skb.

Um, how about "most of the time" ?  I definitely think that the
(overwhelmingly) common case is a pair of sockets each attached to a
file descriptor.

OL> Now that you made 'struct sock' a 1st class object, they deserve to
OL> enjoy 1st class treatment :p  That also means proper collect() method
OL> - probably starting with the f_op ...

Okay.

OL> I may be mistaken, but I suspect that the suggested implementation
OL> cannot limit the depth of recursive calls to checkpoint_obj(). For
OL> instance, consider a dgram socket that received data from another
OL> dgram socket, that received data from another dgram, ad infinitum.

At the very least, a single receive socket is limited in how many
skb's may be queued for it, which limits an attacker's ability to
reach the "ad infinitum" case, I'd say.  Do we need something more?

OL> I'm thinking about the two other use cases that I mentioned:
OL> "dangling" (not-referenced by a file) and "pending" (not yet
OL> accepted) sockets.

OL> In both cases (well, at least with "pending"), the 'struct sock'
OL> exist but the 'struct socket' does not exit until after the socket
OL> is attached to a file descriptor. IIRC, the lifespan of 'struct
OL> socket' is coupled to that of the referencing file.

OL> In that case, I guess it make more sense to leave the 'struct
OL> socket' related data within ckpt_hdr_file_socket.

Hmm, not by my reading.  From what I can tell, the accept operation
converts a pending socket into a regular one (for lack of a better
term) by doing a sock_graft().  In fact, it takes the 'struct socket'
of the pending one, and a 'struct socket' attached to the file handle
and grafts the former onto the latter.  So, the pending socket's
'struct socket' *does* exist, unless I'm missing something.

The dangling case isn't quite as clear, so I'll need to investigate
further.  While sock_close() prevents further ->ops calls, I don't see
that it (or ops->release()) actually dissociates the 'struct socket'
from the 'struct sock' until later.

>> +		return ERR_PTR(PTR_ERR(sk));

OL> Nit: I vaguely recall some disapproval of such construct...
OL> How about '(struct file *) sk' ?

Casting it to the wrong type seems less desirable to me.  I was
following the lead of:

  % fgrep -r 'ERR_PTR(PTR_ERR' . | wc -l
  36

and:

  % fgrep -r 'ERR_PTR(PTR_ERR' checkpoint
  checkpoint/namespace.c:		return ERR_PTR(PTR_ERR(h));
  checkpoint/signal.c:		return ERR_PTR(PTR_ERR(h));

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found]             ` <87y6p8q728.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-08-25 17:55               ` Oren Laadan
       [not found]                 ` <4A94257C.5060702-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Laadan @ 2009-08-25 17:55 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg



Dan Smith wrote:
> OL> It's perhaps more accurate to s/most sockets/some sockets/. It may
> OL> be more likely for a socket to be checkpointed as a peer of
> OL> another process, or as the sender of an skb.
> 
> Um, how about "most of the time" ?  I definitely think that the
> (overwhelmingly) common case is a pair of sockets each attached to a
> file descriptor.
> 
> OL> Now that you made 'struct sock' a 1st class object, they deserve to
> OL> enjoy 1st class treatment :p  That also means proper collect() method
> OL> - probably starting with the f_op ...
> 
> Okay.
> 
> OL> I may be mistaken, but I suspect that the suggested implementation
> OL> cannot limit the depth of recursive calls to checkpoint_obj(). For
> OL> instance, consider a dgram socket that received data from another
> OL> dgram socket, that received data from another dgram, ad infinitum.
> 
> At the very least, a single receive socket is limited in how many
> skb's may be queued for it, which limits an attacker's ability to
> reach the "ad infinitum" case, I'd say.  Do we need something more?

Multiple buffers adds iteration, and one level of recursion. I had in
mind a slightly different scenario: instead of many buffers for one
socket, many sockets "chained" -

Assume N sockets S_1...S_n, all dgram, none is connected. Each socket
S_i send one packet to S_i+1.  Suppose you first checkpoint S_n, then
you'll need to checkpoint S_n-1, for which you'll need to checkpoint
S_n-2 etc.

> OL> I'm thinking about the two other use cases that I mentioned:
> OL> "dangling" (not-referenced by a file) and "pending" (not yet
> OL> accepted) sockets.
> 
> OL> In both cases (well, at least with "pending"), the 'struct sock'
> OL> exist but the 'struct socket' does not exit until after the socket
> OL> is attached to a file descriptor. IIRC, the lifespan of 'struct
> OL> socket' is coupled to that of the referencing file.
> 
> OL> In that case, I guess it make more sense to leave the 'struct
> OL> socket' related data within ckpt_hdr_file_socket.
> 
> Hmm, not by my reading.  From what I can tell, the accept operation

You are right: sock_init_data() sets it up, and I believe it is
for the entire lifetime of the sock/socket.

>>> +		return ERR_PTR(PTR_ERR(sk));
> 
> OL> Nit: I vaguely recall some disapproval of such construct...
> OL> How about '(struct file *) sk' ?
> 
> Casting it to the wrong type seems less desirable to me.  I was
> following the lead of:
> 
>   % fgrep -r 'ERR_PTR(PTR_ERR' . | wc -l
>   36

Yep. That settles it then :)

> 
> and:
> 
>   % fgrep -r 'ERR_PTR(PTR_ERR' checkpoint
>   checkpoint/namespace.c:		return ERR_PTR(PTR_ERR(h));
>   checkpoint/signal.c:		return ERR_PTR(PTR_ERR(h));

(FWIW, this was criticized ...)

Oren.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found]         ` <4A937031.10300-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-08-25 14:52           ` Dan Smith
@ 2009-08-26  2:47           ` Matt Helsley
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Helsley @ 2009-08-26  2:47 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Dan Smith

On Tue, Aug 25, 2009 at 01:01:37AM -0400, Oren Laadan wrote:
> Hi Dan,
> 
> Dan Smith wrote:
> > This changes the checkpoint/restart procedure for sockets a bit.  The
> > socket file header is now checkpointed separately from the socket itself,
> > which allows us to checkpoint a socket without arriving at it from a
> > file descriptor.  Thus, most sockets will be checkpointed as a result
> > of processing the file table, calling sock_file_checkpoint(fd), which
> > in turn calls checkpoint_obj(socket).
> 
> It's perhaps more accurate to s/most sockets/some sockets/. It may
> be more likely for a socket to be checkpointed as a peer of another
> process, or as the sender of an skb.
> 
> The patch looks fine - see a few comment below.
> 
> Now that you made 'struct sock' a 1st class object, they deserve to
> enjoy 1st class treatment :p  That also means proper collect() method
> - probably starting with the f_op ...
> 
> I may be mistaken, but I suspect that the suggested implementation
> cannot limit the depth of recursive calls to checkpoint_obj(). For
> instance, consider a dgram socket that received data from another
> dgram socket, that received data from another dgram, ad infinitum.
> 
> > 
> > However, we may arrive at some sockets while checkpointing other objects,
> > such as the other end of an AF_UNIX socket with buffers in flight.  This
> > patch just opens that door, which is utilized by the next patch.
> > 
> > Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > ---
> >  checkpoint/objhash.c           |    2 +
> >  include/linux/checkpoint_hdr.h |    4 +-
> >  include/net/sock.h             |    2 +
> >  net/checkpoint.c               |  116 ++++++++++++++++++++++++++++-----------
> >  net/unix/checkpoint.c          |    3 +-
> >  5 files changed, 91 insertions(+), 36 deletions(-)
> > 
> > diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> > index 019077b..4f26e86 100644
> > --- a/checkpoint/objhash.c
> > +++ b/checkpoint/objhash.c
> > @@ -381,6 +381,8 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
> >  		.obj_type = CKPT_OBJ_SOCK,
> >  		.ref_drop = obj_sock_drop,
> >  		.ref_grab = obj_sock_grab,
> > +		.checkpoint = checkpoint_sock,
> > +		.restore = restore_sock,
> >  	},
> >  };
> >  
> > diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> > index 78f1f27..39b3cb4 100644
> > --- a/include/linux/checkpoint_hdr.h
> > +++ b/include/linux/checkpoint_hdr.h
> > @@ -68,6 +68,7 @@ enum {
> >  	CKPT_HDR_USER,
> >  	CKPT_HDR_GROUPINFO,
> >  	CKPT_HDR_TASK_CREDS,
> > +	CKPT_HDR_SOCKET,
> >  
> >  	/* 201-299: reserved for arch-dependent */
> >  
> > @@ -367,6 +368,7 @@ struct ckpt_hdr_file_pipe {
> >  
> >  /* socket */
> >  struct ckpt_hdr_socket {
> > +	struct ckpt_hdr h;
> >  	struct { /* struct socket */
> >  		__u64 flags;
> >  		__u8 state;
> > @@ -425,7 +427,7 @@ struct ckpt_hdr_socket_unix {
> >  
> >  struct ckpt_hdr_file_socket {
> >  	struct ckpt_hdr_file common;
> > -	struct ckpt_hdr_socket socket;
> > +	__s32 sock_objref;
> >  } __attribute__((aligned(8)));
> 
> I'm thinking about the two other use cases that I mentioned:
> "dangling" (not-referenced by a file) and "pending" (not yet
> accepted) sockets.
> 
> In both cases (well, at least with "pending"), the 'struct sock'
> exist but the 'struct socket' does not exit until after the
> socket is attached to a file descriptor. IIRC, the lifespan of
> 'struct socket' is coupled to that of the referencing file.
> 
> In that case, I guess it make more sense to leave the 'struct
> socket' related data within ckpt_hdr_file_socket.
> 
> [...]
> 
> > +	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
> > +	if (IS_ERR(sk))
> > +		return ERR_PTR(PTR_ERR(sk));
> 
> Nit: I vaguely recall some disapproval of such construct...
> How about '(struct file *) sk' ?

This pattern raised an objection from me. At first glance
ERR_PTR(PTR_ERR(x)) looks redundant. But it's just a way of
casting to another pointer type. I think we'd be better off with the
necessary cast or a new ERR macro in err.h specifically for this situation.
Maybe something like:

#define EPOINTER(x) ((void*)x)

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them
       [not found]                 ` <4A94257C.5060702-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-08-26  2:53                   ` Matt Helsley
  0 siblings, 0 replies; 16+ messages in thread
From: Matt Helsley @ 2009-08-26  2:53 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Dan Smith

On Tue, Aug 25, 2009 at 01:55:08PM -0400, Oren Laadan wrote:
> 
> 
> Dan Smith wrote:
> > OL> It's perhaps more accurate to s/most sockets/some sockets/. It may
> > OL> be more likely for a socket to be checkpointed as a peer of
> > OL> another process, or as the sender of an skb.
> > 
> > Um, how about "most of the time" ?  I definitely think that the
> > (overwhelmingly) common case is a pair of sockets each attached to a
> > file descriptor.
> > 
> > OL> Now that you made 'struct sock' a 1st class object, they deserve to
> > OL> enjoy 1st class treatment :p  That also means proper collect() method
> > OL> - probably starting with the f_op ...
> > 
> > Okay.
> > 
> > OL> I may be mistaken, but I suspect that the suggested implementation
> > OL> cannot limit the depth of recursive calls to checkpoint_obj(). For
> > OL> instance, consider a dgram socket that received data from another
> > OL> dgram socket, that received data from another dgram, ad infinitum.
> > 
> > At the very least, a single receive socket is limited in how many
> > skb's may be queued for it, which limits an attacker's ability to
> > reach the "ad infinitum" case, I'd say.  Do we need something more?
> 
> Multiple buffers adds iteration, and one level of recursion. I had in
> mind a slightly different scenario: instead of many buffers for one
> socket, many sockets "chained" -
> 
> Assume N sockets S_1...S_n, all dgram, none is connected. Each socket
> S_i send one packet to S_i+1.  Suppose you first checkpoint S_n, then
> you'll need to checkpoint S_n-1, for which you'll need to checkpoint
> S_n-2 etc.
> 
> > OL> I'm thinking about the two other use cases that I mentioned:
> > OL> "dangling" (not-referenced by a file) and "pending" (not yet
> > OL> accepted) sockets.
> > 
> > OL> In both cases (well, at least with "pending"), the 'struct sock'
> > OL> exist but the 'struct socket' does not exit until after the socket
> > OL> is attached to a file descriptor. IIRC, the lifespan of 'struct
> > OL> socket' is coupled to that of the referencing file.
> > 
> > OL> In that case, I guess it make more sense to leave the 'struct
> > OL> socket' related data within ckpt_hdr_file_socket.
> > 
> > Hmm, not by my reading.  From what I can tell, the accept operation
> 
> You are right: sock_init_data() sets it up, and I believe it is
> for the entire lifetime of the sock/socket.
> 
> >>> +		return ERR_PTR(PTR_ERR(sk));
> > 
> > OL> Nit: I vaguely recall some disapproval of such construct...
> > OL> How about '(struct file *) sk' ?
> > 
> > Casting it to the wrong type seems less desirable to me.  I was
> > following the lead of:
> > 
> >   % fgrep -r 'ERR_PTR(PTR_ERR' . | wc -l
> >   36
> 
> Yep. That settles it then :)

Hmm, OK. For some reason I thought that pattern only showed up in
checkpoint/*...

I still think it would be nice to see a macro specifically for this.
I can submit a patch for that myself though.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] Store socket owner with buffers
  2009-08-25  5:33       ` Oren Laadan
@ 2009-08-26 17:31         ` Dan Smith
       [not found]           ` <87bpm2qy6q.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Smith @ 2009-08-26 17:31 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

OL> I suspect this won't pass a test in which the sender of a dgram
OL> has been closed already (and therefore is "dangling") ?

I had a test case for this, but it was passing because I was closing
the wrong socket.  However, a few modifications to the code (and a
corrected test case) have it working now.  I'll post that in the next
go-round.

OL> I'm not sure why you removed that code completely.

OL> It can be executed, for example, by the caller of
OL> unix_read_buffers(), before and after that call ?

You're right, I was confusing the behavior of sk_sndbuf with
sk_wmem_alloc and thinking that I'd need to track it separately.

>> +		ret = ckpt_write_obj_type(ctx, NULL, sizeof(*h) + skb->len,
>> CKPT_HDR_SOCKET_BUFFER);
>> -		if (ret)
>> +		if (ret < 0)
>> +			goto end;
>> +
>> +		ret = ckpt_kwrite(ctx, h, sizeof(*h));

OL> The 'struct ckpt_hdr' part of @h was already written in the call
OL> to ckpt_write_obj_type() above.

I'll admit that the increased complexity of header and object writing
recently has me a bit unsure of what to do when, but I'm not sure how
it is not failing to run if you're correct.  Certainly restore would
get out of sync with the headers if there was an extra ckpt_hdr in the
stream, no?

Anyway, I've changed that bit to avoid the unnecessary appearance that
the buffer data is actually part of the ckpt_hdr_socket_buffer
structure.  It would never be optimal to read it that way anyway.

OL> I wonder if we need to consider this: With unix domain sockets,
OL> when a connected, or unconnected dgram socket is (re)connected or
OL> disconnected, IIRC, all the existing data in the receive buffers
OL> is removed.

OL> I suppose this will automagically be enforced, because when you
OL> connect the socket, old buffers will be discarded already, and if
OL> it is already connected, then sending from other sockets (not
OL> peer) will fail.

OL> So I think we're protected from such malicious data. Perhaps this
OL> is worth a comment ?

While swizzling things to handle SOCK_DEAD, I explicitly avoid writing
buffers for DEAD sockets (like LISTEN sockets).  That should make it
clear enough, right?

OL> Hmmm... I suspect this is plain wrong. Objects that are treated
OL> via checkpoint_obj() and restore_obj() should only be added to the
OL> objhash via these two functions.

OL> Because at checkpoint you used un->peer = checkpoint_obj(...), a
OL> ckpt_obj_fetch() on @un->peer _must_ return the matching object,
OL> or the image file data is invalid.

No, because the restore() may be a result of another restore() not yet
completed, right?  If I'm restoring A, which restores B as a
side-effect, the restore of B will expect to see its peer (A) as
already in the objhash or it will fail.

OL> Unfortunately, neither restore_obj() nor ckpt_obj_insert() will
OL> complain about adding the same objref twice...

That's going to be a problem for other scenarios where we have a
circular dependency, right?

Assume A and B have buffers outstanding for each other.  When we
restore A, we restore the first buffer and see that we need to restore
B.  While restoring B's first buffer, we see think that we need to
restore A because it's not in the objhash yet, and then we're stuck.

We could permit the restore function to insert itself (if necessary)
and then we could check before re-insert it after ops->restore(),
right?

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] Store socket owner with buffers
       [not found]           ` <87bpm2qy6q.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-08-26 20:34             ` Oren Laadan
       [not found]               ` <4A959C4A.6050803-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Oren Laadan @ 2009-08-26 20:34 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

Dan Smith wrote:
> OL> I suspect this won't pass a test in which the sender of a dgram
> OL> has been closed already (and therefore is "dangling") ?
> 
> I had a test case for this, but it was passing because I was closing
> the wrong socket.  However, a few modifications to the code (and a
> corrected test case) have it working now.  I'll post that in the next
> go-round.
> 
> OL> I'm not sure why you removed that code completely.
> 
> OL> It can be executed, for example, by the caller of
> OL> unix_read_buffers(), before and after that call ?
> 
> You're right, I was confusing the behavior of sk_sndbuf with
> sk_wmem_alloc and thinking that I'd need to track it separately.
> 
>>> +		ret = ckpt_write_obj_type(ctx, NULL, sizeof(*h) + skb->len,
>>> CKPT_HDR_SOCKET_BUFFER);
>>> -		if (ret)
>>> +		if (ret < 0)
>>> +			goto end;
>>> +
>>> +		ret = ckpt_kwrite(ctx, h, sizeof(*h));
> 
> OL> The 'struct ckpt_hdr' part of @h was already written in the call
> OL> to ckpt_write_obj_type() above.
> 
> I'll admit that the increased complexity of header and object writing
> recently has me a bit unsure of what to do when, but I'm not sure how
> it is not failing to run if you're correct.  Certainly restore would
> get out of sync with the headers if there was an extra ckpt_hdr in the
> stream, no?

Lol - well, it worked because you wrote the ckpt_hdr twice at checkpoint
and then read it twice at restart...

> 
> Anyway, I've changed that bit to avoid the unnecessary appearance that
> the buffer data is actually part of the ckpt_hdr_socket_buffer
> structure.  It would never be optimal to read it that way anyway.
> 
> OL> I wonder if we need to consider this: With unix domain sockets,
> OL> when a connected, or unconnected dgram socket is (re)connected or
> OL> disconnected, IIRC, all the existing data in the receive buffers
> OL> is removed.
> 
> OL> I suppose this will automagically be enforced, because when you
> OL> connect the socket, old buffers will be discarded already, and if
> OL> it is already connected, then sending from other sockets (not
> OL> peer) will fail.
> 
> OL> So I think we're protected from such malicious data. Perhaps this
> OL> is worth a comment ?
> 
> While swizzling things to handle SOCK_DEAD, I explicitly avoid writing
> buffers for DEAD sockets (like LISTEN sockets).  That should make it
> clear enough, right?

I'm confused...

The concern I raised above is about manipulated checkpoint data that
attempts to make a connected socket to have data from another (not
peer) socket.

Then I realized that since you use the usual send functions, this
_should_ be covered by the existing checks in the unix networking
code itself.

So I wondered if it's worth a comment somewhere, just to let readers
of the code know that this scenario was considered. (Of course, after
verifying that my statement is indeed correct :)

Regarding DEAD sockets - the may still be needed if they are the
source of skb's, no ?  (of course, whatever they had in receive
buffer should have been discarded).

> 
> OL> Hmmm... I suspect this is plain wrong. Objects that are treated
> OL> via checkpoint_obj() and restore_obj() should only be added to the
> OL> objhash via these two functions.
> 
> OL> Because at checkpoint you used un->peer = checkpoint_obj(...), a
> OL> ckpt_obj_fetch() on @un->peer _must_ return the matching object,
> OL> or the image file data is invalid.
> 
> No, because the restore() may be a result of another restore() not yet
> completed, right?  If I'm restoring A, which restores B as a
> side-effect, the restore of B will expect to see its peer (A) as
> already in the objhash or it will fail.

Maybe I need to read the code again. My impression was that when you
checkpoint socket A, and you need B, then you call checkpoint_obj()
on B, from within the processing of checkpoint_obj(A).

restart-obj() - it gets called automatically when ckpt_read_obj() sees
an object of type CKPT_HDR_OBJREF; there is never an explicit call.
That's why I assumed that it will be called for A and then, while
processing A, it will be called for B, since A will read in data and
the ckpt_obj_read() will see the shared object.

> 
> OL> Unfortunately, neither restore_obj() nor ckpt_obj_insert() will
> OL> complain about adding the same objref twice...
> 
> That's going to be a problem for other scenarios where we have a
> circular dependency, right?
> 
> Assume A and B have buffers outstanding for each other.  When we
> restore A, we restore the first buffer and see that we need to restore
> B.  While restoring B's first buffer, we see think that we need to
> restore A because it's not in the objhash yet, and then we're stuck.

I wonder what would the checkpoint image look like ?
Reading the code, I expect the following to occur during checkpoint:

1) find A through the file pointers, call checkpoint_obj(S_a).
2) while in there, find B as peer, call checkpoint_obj(S_b)
3) write B's state, including B's buffers which will reference A
4) (back to A) write A's state, and buffers which will reference B

And at restart:

1) See ckpt_hdr_socket of A (start restart of A)
2) See ckpt_hdr_socket of B (start restart of B)
3) Read B's state in, including buffers. A will not have completed
at this point, so it won't be in the hash either...
...

So I guess it works because you explicitly ckpt_obj_insert() the
sockets to the objhash;  Then later restart_obj() inserts them
(or something) again, but the second instance is never actually
found in a lookup/fetch - the first instance is always returned.

Or am I missing something... ?

> 
> We could permit the restore function to insert itself (if necessary)
> and then we could check before re-insert it after ops->restore(),
> right?
> 

Hmmm... I need to think about it.

Oren.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] Store socket owner with buffers
       [not found]               ` <4A959C4A.6050803-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-08-26 21:44                 ` Dan Smith
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Smith @ 2009-08-26 21:44 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

OL> The concern I raised above is about manipulated checkpoint data
OL> that attempts to make a connected socket to have data from another
OL> (not peer) socket.

OL> Then I realized that since you use the usual send functions, this
OL> _should_ be covered by the existing checks in the unix networking
OL> code itself.

OL> So I wondered if it's worth a comment somewhere, just to let
OL> readers of the code know that this scenario was considered. (Of
OL> course, after verifying that my statement is indeed correct :)

Correct, the checks in sendmsg() and recvmsg() should properly protect
us from sending buffers to sockets that aren't supposed to receive
them (i.e. ones that are connected elsewhere).

OL> Regarding DEAD sockets - the may still be needed if they are the
OL> source of skb's, no ?  (of course, whatever they had in receive
OL> buffer should have been discarded).

Right, that's what I'm saying I've updated since this post.

OL> Maybe I need to read the code again. My impression was that when
OL> you checkpoint socket A, and you need B, then you call
OL> checkpoint_obj() on B, from within the processing of
OL> checkpoint_obj(A).

...correct.

OL> restart-obj() - it gets called automatically when ckpt_read_obj()
OL> sees an object of type CKPT_HDR_OBJREF; there is never an explicit
OL> call.  That's why I assumed that it will be called for A and then,
OL> while processing A, it will be called for B, since A will read in
OL> data and the ckpt_obj_read() will see the shared object.

Correct, it's the circular dependency between the sockets and their
buffers that the problem arises.

OL> So I guess it works because you explicitly ckpt_obj_insert() the
OL> sockets to the objhash; Then later restart_obj() inserts them (or
OL> something) again, but the second instance is never actually found
OL> in a lookup/fetch - the first instance is always returned.

I think we've resolved this on IRC just now, or at least, are close to
it :)

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-08-26 21:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-24 17:11 [RFC] Change socket checkpoint to retain DGRAM source addresses Dan Smith
     [not found] ` <1251133918-8117-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-24 17:11   ` [PATCH 1/3] Set the CHECKPOINTED flag on objects before calling checkpoint Dan Smith
     [not found]     ` <1251133918-8117-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-25  5:53       ` Oren Laadan
2009-08-24 17:11   ` [PATCH 2/3] Make sockets proper objhash objects and use checkpoint_obj() on them Dan Smith
     [not found]     ` <1251133918-8117-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-25  5:01       ` Oren Laadan
     [not found]         ` <4A937031.10300-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-08-25 14:52           ` Dan Smith
     [not found]             ` <87y6p8q728.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-08-25 17:55               ` Oren Laadan
     [not found]                 ` <4A94257C.5060702-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-08-26  2:53                   ` Matt Helsley
2009-08-26  2:47           ` Matt Helsley
2009-08-24 17:11   ` [PATCH 3/3] Store socket owner with buffers Dan Smith
     [not found]     ` <1251133918-8117-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-25  5:33       ` Oren Laadan
2009-08-26 17:31         ` Dan Smith
     [not found]           ` <87bpm2qy6q.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-08-26 20:34             ` Oren Laadan
     [not found]               ` <4A959C4A.6050803-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-08-26 21:44                 ` Dan Smith
2009-08-25  0:06   ` [RFC] Change socket checkpoint to retain DGRAM source addresses Serge E. Hallyn
2009-08-25  4:33   ` Oren Laadan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.