* [PATCH 1/3] Add _ckpt_read_hdr_type() helper
@ 2009-07-01 18:20 Dan Smith
From: Dan Smith @ 2009-07-01 18:20 UTC
To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: Dan Smith
This helper function gives us a way to read a header object and the
subsequent payload from the checkpoint stream directly into our own
buffer. This is used by the net/checkpoint.c code to read skb's
without a memcpy().
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/restart.c       |   36 ++++++++++++++++++++++++++++++++++--
 include/linux/checkpoint.h |    4 ++++
 2 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 4155426..3bfee31 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -177,6 +177,38 @@ int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
 }
 
 /**
+ * _ckpt_read_hdr_type - read a header record and check the type
+ * @ctx: checkpoint context
+ * @h: provided header buffer (returned)
+ * @type: optional type, 0 to ignore
+ *
+ * Returns the size of the payload to follow or negative on error
+ */
+int _ckpt_read_hdr_type(struct ckpt_ctx *ctx, struct ckpt_hdr *h, int type)
+{
+	int ret;
+
+	ret = ckpt_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	else if (type && h->type != type)
+		return -EINVAL;
+	else
+		return h->len - sizeof(*h);
+}
+
+/**
+ * _ckpt_read_payload - read the payload associated with a recent header read
+ * @ctx: checkpoint context
+ * @h: header returned from _ckpt_read_hdr_type()
+ * @buffer: pre-allocated buffer to store payload
+ */
+int _ckpt_read_payload(struct ckpt_ctx *ctx, struct ckpt_hdr *h, void *buffer)
+{
+	return ckpt_kread(ctx, buffer, h->len - sizeof(*h));
+}
+
+/**
  * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
  * @ctx: checkpoint context
  * @h: object descriptor
@@ -192,7 +224,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 	int ret;
 
  again:
-	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	ret = _ckpt_read_hdr_type(ctx, &hh, 0);
 	if (ret < 0)
 		return ERR_PTR(ret);
 	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
@@ -217,7 +249,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 
 	*h = hh;	/* yay ! */
 
-	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	ret = _ckpt_read_payload(ctx, &hh, (h + 1));
 	if (ret < 0) {
 		ckpt_hdr_put(ctx, h);
 		h = ERR_PTR(ret);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2765c33..ccc4aab 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -63,6 +63,10 @@ extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
 extern void __ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...);
 extern int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...);
 
+extern int _ckpt_read_payload(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			      void *buffer);
+extern int _ckpt_read_hdr_type(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			       int type);
 extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
 			       void *ptr, int len, int type);
 extern int _ckpt_read_nbuffer(struct ckpt_ctx *ctx, void *ptr, int len);
--
1.6.2.2
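
For readers following along, the two-step read that this patch introduces (fixed-size header first, then the payload straight into a caller-supplied buffer) can be sketched in user space roughly as below. This is only an illustration under stated assumptions: struct rec_hdr, struct stream, stream_read(), read_hdr_type() and read_payload() are hypothetical stand-ins for struct ckpt_hdr, the checkpoint stream, ckpt_kread() and the two new helpers. The lower-bound check on len is an addition of the sketch and is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header, loosely modelled on struct ckpt_hdr. */
struct rec_hdr {
	uint32_t type;
	uint32_t len;	/* total record length, header included */
};

/* Hypothetical in-memory stream standing in for the checkpoint file. */
struct stream {
	const unsigned char *data;
	size_t size;
	size_t pos;
};

/* Copy exactly n bytes out of the stream, or fail without consuming. */
static int stream_read(struct stream *s, void *dst, size_t n)
{
	assert(s && dst);
	if (n > s->size - s->pos)
		return -EIO;
	memcpy(dst, s->data + s->pos, n);
	s->pos += n;
	return 0;
}

/* Step 1: read the header, optionally check its type, return payload size. */
static int read_hdr_type(struct stream *s, struct rec_hdr *h, uint32_t type)
{
	int ret = stream_read(s, h, sizeof(*h));

	if (ret < 0)
		return ret;
	if (type && h->type != type)
		return -EINVAL;
	/* Extra bound check: an assumption of this sketch, not in the patch. */
	if (h->len < sizeof(*h) || h->len - sizeof(*h) > INT32_MAX)
		return -EINVAL;
	return (int)(h->len - sizeof(*h));
}

/* Step 2: read the payload straight into the caller's buffer (no memcpy
 * through an intermediate object). */
static int read_payload(struct stream *s, const struct rec_hdr *h, void *buf)
{
	return stream_read(s, buf, h->len - sizeof(*h));
}
```

The caller sizes its own buffer from the value returned by step 1 and then passes it to step 2, which is how the socket-buffer restore code is meant to fill an skb directly.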
* [PATCH 2/3] c/r: Add AF_UNIX support (v4)
@ 2009-07-01 18:20 Dan Smith
From: Dan Smith @ 2009-07-01 18:20 UTC
To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: Dan Smith, Alexey Dobriyan

This patch adds basic checkpoint/restart support for AF_UNIX sockets. It
has been tested with single and multiple processes, and with data in
flight at the time of checkpoint. It supports socketpair()s, path-based,
and abstract sockets.

Changes in v4:
 - Changed the signedness of rcvlowat, rcvtimeo, sndtimeo, and backlog
   to match their struct sock definitions. This should avoid issues
   with sign extension.
 - Add a sock_cptrst_verify() function to be run at restore time to
   validate several of the values in the checkpoint image against
   limits, flag masks, etc.
 - Write an error string with ckpt_write_err() in the obscure cases
 - Don't write socket buffers for listen sockets
 - Sanity check address lengths before we agree to allocate memory
 - Check the result of inserting the peer object in the objhash on
   restart
 - Check return value of sock_cptrst() on restart
 - Change logic in remote getname() phase of checkpoint to not fail for
   closed (et al) sockets
 - Eliminate the memory copy while reading socket buffers on restart

Changes in v3:
 - Move sock_file_checkpoint() above sock_file_restore()
 - Change __sock_file_*() functions to do_sock_file_*()
 - Adjust some of the struct cr_hdr_socket alignment
 - Improve the sock_copy_buffers() algorithm to avoid locking the source
   queue for the entire operation
 - Fix alignment in the socket header struct(s)
 - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
   common socket header and read/write it separately
 - Fix missing call to sock_cptrst() in restore path
 - Break out the socket joining into another function
 - Fix failure to restore the socket address thus fixing getname()
 - Check the state values on restart
 - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
   properly connected (if appropriate) to their peer and maintain the
   sockaddr for getname() operation
 - Fix restoring a listening socket that has been unlink()'d
 - Fix checkpointing sockets with an in-flight FD-passing SKB. Fail
   with EBUSY.
 - Fix checkpointing listening sockets with an unaccepted connection.
   Fail with EBUSY.
 - Changed 'un' to 'unix' in function and structure names

Changes in v2:
 - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
   to be rather common in other uses of skb_copy())
 - Move the ckpt_hdr_socket structure definition to linux/socket.h
 - Fix whitespace issue
 - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Oren Laaden <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Cc: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   27 ++
 include/linux/checkpoint_hdr.h |   13 +
 include/linux/socket.h         |   59 +++++
 include/net/sock.h             |    9 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  545 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |   84 ++++++
 8 files changed, 746 insertions(+), 0 deletions(-)
 create mode 100644 net/checkpoint.c

diff --git a/checkpoint/files.c b/checkpoint/files.c
index c32b95b..176d3fd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 
 /**************************************************************************
@@ -519,6 +520,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_PIPE,
 		.restore = pipe_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index f604655..17686b5 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 struct ckpt_obj;
 struct ckpt_obj_ops;
@@ -264,6 +265,22 @@ static int obj_groupinfo_users(void *ptr)
 	return atomic_read(&((struct group_info *) ptr)->usage);
 }
 
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr)
+{
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -391,6 +408,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_groupinfo,
 		.restore = restore_groupinfo,
 	},
+	/* sock object */
+	{
+		.obj_name = "SOCKET",
+		.obj_type = CKPT_OBJ_SOCK,
+		.ref_drop = obj_sock_drop,
+		.ref_grab = obj_sock_grab,
+		.ref_users = obj_sock_users,
+		.checkpoint = sock_file_checkpoint,
+		.restore = sock_file_restore,
+	},
 };
 
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 37bae3d..f59b071 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,6 +88,12 @@ enum {
 
 	CKPT_HDR_SIGHAND = 601,
 
+	CKPT_HDR_FD_SOCKET = 601,
+	CKPT_HDR_SOCKET,
+	CKPT_HDR_SOCKET_BUFFERS,
+	CKPT_HDR_SOCKET_BUFFER,
+	CKPT_HDR_SOCKET_UNIX,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -121,6 +127,7 @@ enum obj_type {
 	CKPT_OBJ_CRED,
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
+	CKPT_OBJ_SOCK,
 	CKPT_OBJ_MAX
 };
 
@@ -316,6 +323,7 @@ enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
+	CKPT_FILE_SOCKET,
 	CKPT_FILE_MAX
 };
 
@@ -339,6 +347,11 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__u16 family;
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_file_pipe_state {
 	struct ckpt_hdr h;
 	__s32 pipe_len;
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 421afb4..6480c47 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -23,6 +23,7 @@ struct __kernel_sockaddr_storage {
 #include <linux/uio.h>			/* iovec support		*/
 #include <linux/types.h>		/* pid_t			*/
 #include <linux/compiler.h>		/* __user			*/
+#include <linux/checkpoint_hdr.h>	/* ckpt_hdr			*/
 
 #ifdef __KERNEL__
 # ifdef CONFIG_PROC_FS
@@ -323,5 +324,63 @@ extern int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr *ka
 extern int put_cmsg(struct msghdr*, int level, int type, int len, void *data);
 
 #endif
+
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__u32 this;
+	__u32 peer;
+	__u8 linked;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct ckpt_socket { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct ckpt_sock_common { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u16 family;
+		__u8 state;
+		__u8 reuse;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct ckpt_sock { /* struct sock */
+		__s64 rcvlowat;
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+		__u64 flags;
+		__u64 lingertime;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+	} sock __attribute__ ((aligned(8)));
+
+	/* common to all supported families */
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr laddr;
+	struct sockaddr raddr;
+
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+} __attribute__ ((aligned(8)));
+
 #endif /* not kernel and not glibc */
 #endif /* _LINUX_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..a8b6af1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1482,4 +1482,13 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int sock_file_checkpoint(struct ckpt_ctx *, void *);
+extern void *sock_file_restore(struct ckpt_ctx *);
+extern struct socket *do_sock_file_restore(struct ckpt_ctx *,
+					   struct ckpt_hdr_socket *);
+extern int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+
 #endif	/* _SOCK_H */
diff --git a/net/Makefile b/net/Makefile
index 9e00a55..c226ed1 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -65,3 +65,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..701d26c
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,545 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/namei.h>
+
+/* Size of an empty struct sockaddr_un */
+#define UNIX_LEN_EMPTY 2
+
+static int sock_copy_buffers(struct sk_buff_head *from, struct sk_buff_head *to)
+{
+	int count = 0;
+	struct sk_buff *skb;
+
+	skb_queue_walk(from, skb) {
+		struct sk_buff *tmp;
+
+		tmp = dev_alloc_skb(skb->len);
+		if (!tmp)
+			return -ENOMEM;
+
+		spin_lock(&from->lock);
+		skb_morph(tmp, skb);
+		spin_unlock(&from->lock);
+
+		skb_queue_tail(to, tmp);
+		count++;
+	}
+
+	return count;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue)
+{
+	struct sk_buff *skb;
+	int ret = 0;
+
+	skb_queue_walk(queue, skb) {
+		if (UNIXCB(skb).fp) {
+			ckpt_write_err(ctx, "fd-passing is not supported");
+			return -EBUSY;
+		}
+
+		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+					  CKPT_HDR_SOCKET_BUFFER);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFERS);
+	if (!h)
+		goto out;
+
+	skb_queue_head_init(&tmpq);
+
+	h->skb_count = sock_copy_buffers(queue, &tmpq);
+	if (h->skb_count < 0) {
+		ret = h->skb_count;
+		goto out;
+	}
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+static int sock_unix_checkpoint(struct ckpt_ctx *ctx,
+				struct sock *sock,
+				struct ckpt_hdr_socket *h)
+{
+	struct unix_sock *sk = unix_sk(sock);
+	struct unix_sock *pr = unix_sk(sk->peer);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk_receive_queue)) {
+		ckpt_write_err(ctx, "listening socket has unaccepted peers");
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		goto out;
+
+	un->linked = sk->dentry && (sk->dentry->d_inode->i_nlink > 0);
+
+	un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (un->this < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = ckpt_obj_lookup_add(ctx, pr, CKPT_OBJ_SOCK, &new);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask = SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+				 SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (h->sock.sndtimeo < 0)
+		return -EINVAL;
+	if (h->sock.rcvtimeo < 0)
+		return -EINVAL;
+	if ((h->sock.userlocks & SOCK_SNDBUF_LOCK) &&
+	    ((h->sock.sndbuf < SOCK_MIN_SNDBUF) ||
+	     (h->sock.sndbuf > sysctl_wmem_max)))
+		return -EINVAL;
+	if ((h->sock.userlocks & SOCK_RCVBUF_LOCK) &&
+	    ((h->sock.rcvbuf < SOCK_MIN_RCVBUF) ||
+	     (h->sock.rcvbuf > sysctl_rmem_max)))
+		return -EINVAL;
+	if ((h->sock.flags & SOCK_LINGER) &&
+	    (h->sock.lingertime > MAX_SCHEDULE_TIMEOUT))
+		return -EINVAL;
+	/* Current highest errno is ~530; this should provide some sanity */
+	if ((h->sock.err < 0) || (h->sock.err > 1024))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx,
+		       struct sock *sock,
+		       struct ckpt_hdr_socket *h,
+		       int op)
+{
+	if (sock->sk_socket) {
+		CKPT_COPY(op, h->socket.flags, sock->sk_socket->flags);
+		CKPT_COPY(op, h->socket.state, sock->sk_socket->state);
+	}
+
+	CKPT_COPY(op, h->sock_common.reuse, sock->sk_reuse);
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sock->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sock->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sock->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sock->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sock->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sock->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sock->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sock->sk_err_soft);
+	CKPT_COPY(op, h->sock.priority, sock->sk_priority);
+	CKPT_COPY(op, h->sock.rcvlowat, sock->sk_rcvlowat);
+	CKPT_COPY(op, h->sock.backlog, sock->sk_max_ack_backlog);
+	CKPT_COPY(op, h->sock.rcvtimeo, sock->sk_rcvtimeo);
+	CKPT_COPY(op, h->sock.sndtimeo, sock->sk_sndtimeo);
+	CKPT_COPY(op, h->sock.rcvbuf, sock->sk_rcvbuf);
+	CKPT_COPY(op, h->sock.sndbuf, sock->sk_sndbuf);
+	CKPT_COPY(op, h->sock.flags, sock->sk_flags);
+	CKPT_COPY(op, h->sock.lingertime, sock->sk_lingertime);
+	CKPT_COPY(op, h->sock.type, sock->sk_type);
+	CKPT_COPY(op, h->sock.state, sock->sk_state);
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_write_err(ctx, "socket/sock in inconsistent state: %i/%i",
+			       h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_write_err(ctx, "sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if ((h->socket.state < SS_FREE) ||
+		   (h->socket.state > SS_DISCONNECTING)) {
+		ckpt_write_err(ctx, "socket in invalid state: %i",
+			       h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *socket = file->private_data;
+	struct sock *sock = socket->sk;
+	struct ckpt_hdr_socket *h;
+	int ret = 0;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	h->laddr_len = sizeof(h->laddr);
+	h->raddr_len = sizeof(h->raddr);
+
+	if (socket->ops->getname(socket, &h->laddr, &h->laddr_len, 0)) {
+		ckpt_write_err(ctx, "Unable to getname of local");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (socket->ops->getname(socket, &h->raddr, &h->raddr_len, 1)) {
+		if ((sock->sk_type != SOCK_DGRAM) &&
+		    (sock->sk_state == TCP_ESTABLISHED)) {
+			ckpt_write_err(ctx, "Unable to getname of remote");
+			ret = -EINVAL;
+			goto out;
+		}
+		h->raddr_len = 0;
+	}
+
+	ret = sock_cptrst(ctx, sock, h, CKPT_CPT);
+	if (ret)
+		goto out;
+
+	if (sock->sk_family == AF_UNIX) {
+		ret = sock_unix_checkpoint(ctx, sock, h);
+		if (ret)
+			goto out;
+	} else {
+		ckpt_write_err(ctx, "unsupported socket family %i",
+			       sock->sk_family);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (sock->sk_state != TCP_LISTEN) {
+		ret = sock_write_buffers(ctx, &sock->sk_receive_queue);
+		if (ret)
+			goto out;
+
+		ret = sock_write_buffers(ctx, &sock->sk_write_queue);
+		if (ret)
+			goto out;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int sock_read_buffer(struct ckpt_ctx *ctx,
+			    struct sock *sock,
+			    struct sk_buff **skb)
+{
+	struct ckpt_hdr h;
+	int ret = 0;
+	int len;
+
+	len = _ckpt_read_hdr_type(ctx, &h, CKPT_HDR_SOCKET_BUFFER);
+	if (len < 0)
+		return len;
+
+	if (len > SKB_MAX_ALLOC) {
+		ckpt_debug("Socket buffer too big (%i > %lu)",
+			   len, SKB_MAX_ALLOC);
+		return -ENOSPC;
+	}
+
+	*skb = sock_alloc_send_skb(sock, len, MSG_DONTWAIT, &ret);
+	if (*skb == NULL)
+		return -ENOMEM;
+
+	ret = _ckpt_read_payload(ctx, &h, skb_put(*skb, len));
+
+	return ret;
+}
+
+static int sock_read_buffers(struct ckpt_ctx *ctx,
+			     struct sock *sock,
+			     struct sk_buff_head *queue)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFERS);
+	if (IS_ERR(h)) {
+		ret = PTR_ERR(h);
+		goto out;
+	}
+
+	for (i = 0; i < h->skb_count; i++) {
+		struct sk_buff *skb = NULL;
+
+		ret = sock_read_buffer(ctx, sock, &skb);
+		if (ret)
+			break;
+
+		skb_queue_tail(queue, skb);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static struct unix_address *sock_unix_makeaddr(struct sockaddr_un *sun_addr,
+					       unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > UNIX_PATH_MAX)
+		return ERR_PTR(-ENOSPC);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int sock_unix_join(struct sock *a,
+			  struct sock *b,
+			  struct ckpt_hdr_socket *h)
+{
+	struct unix_address *addr;
+
+	sock_hold(a);
+	sock_hold(b);
+
+	unix_sk(a)->peer = b;
+	unix_sk(b)->peer = a;
+
+	a->sk_peercred.pid = task_tgid_vnr(current);
+	current_euid_egid(&a->sk_peercred.uid,
+			  &a->sk_peercred.gid);
+
+	b->sk_peercred.pid = task_tgid_vnr(current);
+	current_euid_egid(&b->sk_peercred.uid,
+			  &b->sk_peercred.gid);
+
+	if (h->laddr_len == UNIX_LEN_EMPTY)
+		addr = sock_unix_makeaddr((struct sockaddr_un *)&h->raddr,
+					  h->raddr_len);
+	else
+		addr = sock_unix_makeaddr((struct sockaddr_un *)&h->laddr,
+					  h->laddr_len);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	atomic_inc(&addr->refcnt); /* Held by both ends */
+	unix_sk(a)->addr = unix_sk(b)->addr = addr;
+
+	return 0;
+}
+
+static int sock_unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+static int sock_unix_restart(struct ckpt_ctx *ctx,
+			     struct ckpt_hdr_socket *h,
+			     struct socket *socket)
+{
+	struct sock *peer;
+	struct ckpt_hdr_socket_unix *un;
+	int ret = 0;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+
+	if (un->peer < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	peer = ckpt_obj_fetch(ctx, un->peer, CKPT_OBJ_SOCK);
+
+	if ((h->sock.state == TCP_ESTABLISHED) ||
+	    (h->sock.state == TCP_CLOSE)) {
+		if (!IS_ERR(peer)) {
+			/* We're last, so join with peer */
+			struct sock *this = socket->sk;
+
+			ret = sock_unix_join(this, peer, h);
+		} else if (PTR_ERR(peer) == -EINVAL) {
+			/* We're first, so add our socket and wait for peer */
+			ret = ckpt_obj_insert(ctx, socket->sk, un->this,
+					      CKPT_OBJ_SOCK);
+			if (ret >= 0)
+				ret = 0;
+		} else {
+			ret = PTR_ERR(peer);
+		}
+
+	} else if (h->sock.state == TCP_LISTEN) {
+		ret = socket->ops->bind(socket,
+					(struct sockaddr *)&h->laddr,
+					h->laddr_len);
+		if (ret < 0)
+			goto out;
+
+		ret = socket->ops->listen(socket, h->sock.backlog);
+		if (ret < 0)
+			goto out;
+
+		/* We can unlink this blindly because we just created it
+		 * above and would have failed already without proper
+		 * permissions
+		 */
+		if (!un->linked) {
+			struct sockaddr_un *sun =
+				(struct sockaddr_un *)&h->laddr;
+			ret = sock_unix_unlink(sun->sun_path);
+		}
+	} else
+		ckpt_write_err(ctx, "unsupported UNIX socket state %i",
+			       h->sock.state);
+ out:
+	ckpt_hdr_put(ctx, un);
+	return ret;
+}
+
+struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
+				    struct ckpt_hdr_socket *h)
+{
+	struct socket *socket;
+	int ret;
+
+	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	if (h->sock_common.family == AF_UNIX) {
+		ret = sock_unix_restart(ctx, h, socket);
+		ckpt_debug("sock_unix_restart: %i\n", ret);
+	} else {
+		ckpt_write_err(ctx, "unsupported family %i\n",
+			       h->sock_common.family);
+		ret = -EINVAL;
+	}
+
+	if (ret)
+		goto out;
+
+	ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
+	if (ret)
+		goto out;
+
+	if (h->sock.state != TCP_LISTEN) {
+		struct sock *sk = socket->sk;
+
+		ret = sock_read_buffers(ctx, socket->sk, &sk->sk_receive_queue);
+		if (ret)
+			goto out;
+
+		ret = sock_read_buffers(ctx, socket->sk, &sk->sk_write_queue);
+		if (ret)
+			goto out;
+	}
+ out:
+	if (ret) {
+		sock_release(socket);
+		socket = ERR_PTR(ret);
+	}
+
+	return socket;
+}
+
diff --git a/net/socket.c b/net/socket.c
index 791d71a..be8c562 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -96,6 +96,9 @@
 #include <net/sock.h>
 #include <linux/netfilter.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
 static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
 			 unsigned long nr_segs, loff_t pos);
@@ -140,6 +143,9 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =	sock_file_checkpoint,
+#endif
 };
 
 /*
@@ -415,6 +421,84 @@ int sock_map_fd(struct socket *sock, int flags)
 	return fd;
 }
 
+int sock_file_checkpoint(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr_file_socket *h;
+	int ret;
+	struct file *file = ptr;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	ret = do_sock_file_checkpoint(ctx, file);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static struct file *sock_alloc_attach_fd(struct socket *socket)
+{
+	struct file *file;
+	int err;
+
+	file = get_empty_filp();
+	if (!file)
+		return ERR_PTR(-ENOMEM);
+
+	err = sock_attach_fd(socket, file, 0);
+	if (err < 0) {
+		put_filp(file);
+		file = ERR_PTR(err);
+	}
+
+	return file;
+}
+
+void *sock_file_restore(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h = NULL;
+	struct socket *socket = NULL;
+	struct file *file = NULL;
+	int err;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return h;
+
+	socket = do_sock_file_restore(ctx, h);
+	if (IS_ERR(socket)) {
+		err = PTR_ERR(socket);
+		goto err_put;
+	}
+
+	file = sock_alloc_attach_fd(socket);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_release;
+	}
+
+	ckpt_hdr_put(ctx, h);
+
+	return file;
+
+ err_release:
+	sock_release(socket);
+ err_put:
+	ckpt_hdr_put(ctx, h);
+
+	return ERR_PTR(err);
+}
+
 static struct socket *sock_from_file(struct file *file, int *err)
 {
 	if (file->f_op == &socket_file_ops)
-- 
1.6.2.2
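
The restore path in this patch hands errors back through ERR_PTR() and tests them with IS_ERR() and PTR_ERR(). As a reminder of why the errno passed to ERR_PTR() has to be negative, here is a small user-space model of those helpers. It is a sketch only: the definitions below assume the same MAX_ERRNO window as the kernel's include/linux/err.h, and make_thing() is a hypothetical stand-in for any function returning a pointer or an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* User-space model of the kernel's error-pointer helpers (assumption:
 * the top MAX_ERRNO addresses are reserved for encoded errors). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Only values in [-MAX_ERRNO, -1] land in the reserved window. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical constructor: returns 'ok' or an encoded -ENOMEM. */
static void *make_thing(int fail, void *ok)
{
	assert(ok);
	return fail ? ERR_PTR(-ENOMEM) : ok;
}
```

With these definitions, ERR_PTR(-ENOMEM) is recognised by IS_ERR(), while a positive value such as ERR_PTR(ENOMEM) is not: it looks like an ordinary small pointer, so a caller that only checks IS_ERR() would go on to dereference it.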
* [PATCH 3/3] Add AF_INET c/r support (v2)
@ 2009-07-01 18:20 Dan Smith
From: Dan Smith @ 2009-07-01 18:20 UTC
To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: Dan Smith, Alexey Dobriyan

This patch adds AF_INET c/r support based on the framework established in
my AF_UNIX patch. I've tested it by checkpointing a single app with a
pair of sockets connected over loopback.

A couple points about the operation:

 1. In order to properly hook up the established sockets with the matching
    listening parent socket, I added a new list to the ckpt_ctx and run the
    parent attachment in the deferqueue at the end of the restart process.
 2. I don't do anything to redirect or freeze traffic flowing to or from the
    remote system (to prevent a RST from breaking things). I expect that
    userspace will bring down a veth device or freeze traffic to the remote
    system to handle this case.

Changes in v2:
 - Check for data in the TCP out-of-order queue and fail if present
 - Fix a logic issue in sock_add_parent()
 - Add comment about holding a reference to sock for parent list
 - Write error messages to checkpoint stream where appropriate
 - Fix up checking of some return values in restart phase
 - Fix up restart logic to restore socket info for all states
 - Avoid running sk_proto->hash() for non-TCP sockets
 - Fix calling bind() for unconnected (i.e. DGRAM) sockets
 - Change 'in' to 'inet' in structure and function names

Cc: Oren Laaden <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Cc: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/sys.c                 |    2 +
 include/linux/checkpoint_hdr.h   |    1 +
 include/linux/checkpoint_types.h |    2 +
 include/linux/socket.h           |   95 +++++++++++
 net/checkpoint.c                 |  319 +++++++++++++++++++++++++++++++++++++-
 5 files changed, 414 insertions(+), 5 deletions(-)

diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 38a5299..b6f18ea 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -242,6 +242,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
 
+	INIT_LIST_HEAD(&ctx->listen_sockets);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f59b071..16e21ee 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -93,6 +93,7 @@ enum {
 	CKPT_HDR_SOCKET_BUFFERS,
 	CKPT_HDR_SOCKET_BUFFER,
 	CKPT_HDR_SOCKET_UNIX,
+	CKPT_HDR_SOCKET_INET,
 
 	CKPT_HDR_TAIL = 9001,
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 27fbe26..d7db190 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -60,6 +60,8 @@ struct ckpt_ctx {
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
+	struct list_head listen_sockets;/* listening parent sockets */
+
 	/* [multi-process checkpoint] */
 	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
 	int nr_tasks;			/* size of tasks array */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 6480c47..4fe5102 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -332,6 +332,101 @@ struct ckpt_hdr_socket_unix {
 	__u8 linked;
 } __attribute__ ((aligned(8)));
 
+struct ckpt_hdr_socket_inet {
+	struct ckpt_hdr h;
+
+	__u32 daddr;
+	__u32 rcv_saddr;
+	__u32 saddr;
+	__u16 dport;
+	__u16 num;
+	__u16 sport;
+	__s16 uc_ttl;
+	__u16 cmsg_flags;
+	__u16 __pad;
+
+	struct {
+		__u64 timeout;
+		__u32 ato;
+		__u32 lrcvtime;
+		__u16 last_seg_size;
+		__u16 rcv_mss;
+		__u8 pending;
+		__u8 quick;
+		__u8 pingpong;
+		__u8 blocked;
+	} icsk_ack __attribute__ ((aligned(8)));
+
+	/* FIXME: Skipped opt, tos, multicast, cork settings */
+
+	struct {
+		__u64 last_synq_overflow;
+
+		__u32 rcv_nxt;
+		__u32 copied_seq;
+		__u32 rcv_wup;
+		__u32 snd_nxt;
+		__u32 snd_una;
+		__u32 snd_sml;
+		__u32 rcv_tstamp;
+		__u32 lsndtime;
+
+		__u32 snd_wl1;
+		__u32 snd_wnd;
+		__u32 max_window;
+		__u32 mss_cache;
+		__u32 window_clamp;
+		__u32 rcv_ssthresh;
+		__u32 frto_highmark;
+
+		__u32 srtt;
+		__u32 mdev;
+		__u32 mdev_max;
+		__u32 rttvar;
+		__u32 rtt_seq;
+
+		__u32 packets_out;
+		__u32 retrans_out;
+
+		__u32 snd_up;
+		__u32 rcv_wnd;
+		__u32 write_seq;
+		__u32 pushed_seq;
+		__u32 lost_out;
+		__u32 sacked_out;
+		__u32 fackets_out;
+		__u32 tso_deferred;
+		__u32 bytes_acked;
+
+		__s32 lost_cnt_hint;
+		__u32 retransmit_high;
+
+		__u32 lost_retrans_low;
+
+		__u32 prior_ssthresh;
+		__u32 high_seq;
+
+		__u32 retrans_stamp;
+		__u32 undo_marker;
+		__s32 undo_retrans;
+		__u32 total_retrans;
+
+		__u32 urg_seq;
+		__u32 keepalive_time;
+		__u32 keepalive_intvl;
+
+		__u16 urg_data;
+		__u16 advmss;
+		__u8 frto_counter;
+		__u8 nonagle;
+
+		__u8 ecn_flags;
+		__u8 reordering;
+
+		__u8 keepalive_probes;
+	} tcp __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
 struct ckpt_hdr_socket {
 	struct ckpt_hdr h;
 
diff --git a/net/checkpoint.c b/net/checkpoint.c
index 701d26c..b3fb66d 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -14,11 +14,64 @@
 #include <linux/file.h>
 
 #include <net/af_unix.h>
+#include <net/tcp.h>
 #include <net/tcp_states.h>
+#include <linux/tcp.h>
+#include <linux/in.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/namei.h>
+#include <linux/deferqueue.h>
+
+struct ckpt_parent_sock {
+	struct sock *sock;
+	__u32 oref;
+	struct list_head list;
+};
+
+static int sock_add_parent(struct ckpt_ctx *ctx, struct sock *sock)
+{
+	struct ckpt_parent_sock *parent;
+	__u32 objref;
+	int new;
+
+	objref = ckpt_obj_lookup_add(ctx, sock, CKPT_OBJ_SOCK, &new);
+	if (objref < 0)
+		return objref;
+	else if (!new)
+		return 0;
+
+	parent = kmalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent)
+		return -ENOMEM;
+
+	/* The deferqueue is processed before the objhash is free()'d, thus
+	 * the objhash holds a reference to sock for us
+	 */
+	parent->sock = sock;
+	parent->oref = objref;
+	INIT_LIST_HEAD(&parent->list);
+
+	list_add(&parent->list, &ctx->listen_sockets);
+
+	return 0;
+}
+
+static struct sock *sock_get_parent(struct ckpt_ctx *ctx, struct sock *sock)
+{
+	struct ckpt_parent_sock *parent;
+	struct inet_sock *c = inet_sk(sock);
+
+	list_for_each_entry(parent, &ctx->listen_sockets, list) {
+		struct inet_sock *p = inet_sk(parent->sock);
+
+		if (c->sport == p->sport)
+			return parent->sock;
+	}
+
+	return NULL;
+}
 
 /* Size of an empty struct sockaddr_un */
 #define UNIX_LEN_EMPTY 2
@@ -47,17 +100,23 @@ static int sock_copy_buffers(struct sk_buff_head *from, struct sk_buff_head *to)
 }
 
 static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				uint16_t family,
 				struct sk_buff_head *queue)
 {
 	struct sk_buff *skb;
 	int ret = 0;
 
 	skb_queue_walk(queue, skb) {
-		if (UNIXCB(skb).fp) {
+		if ((family == AF_UNIX) && UNIXCB(skb).fp) {
 			ckpt_write_err(ctx, "fd-passing is not supported");
 			return -EBUSY;
 		}
 
+		if (skb_shinfo(skb)->nr_frags) {
+			ckpt_write_err(ctx, "socket has fragments in-flight");
+			return -EBUSY;
+		}
+
 		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
 					  CKPT_HDR_SOCKET_BUFFER);
 		if (ret)
@@ -67,7 +126,9 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
 	return 0;
 }
 
-static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      uint16_t family,
+			      struct sk_buff_head *queue)
 {
 	struct ckpt_hdr_socket_buffer *h;
 	struct sk_buff_head tmpq;
@@ -87,7 +148,7 @@ static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
 
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (!ret)
-		ret = __sock_write_buffers(ctx, &tmpq);
+		ret = __sock_write_buffers(ctx, family, &tmpq);
 
  out:
 	ckpt_hdr_put(ctx, h);
@@ -228,6 +289,152 @@ static int sock_cptrst(struct ckpt_ctx *ctx,
 		return 0;
 }
 
+static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
+				struct tcp_sock *sk,
+				struct ckpt_hdr_socket_inet *hh,
+				int op)
+{
+	CKPT_COPY(op, hh->tcp.rcv_nxt, sk->rcv_nxt);
+	CKPT_COPY(op, hh->tcp.copied_seq, sk->copied_seq);
+	CKPT_COPY(op, hh->tcp.rcv_wup, sk->rcv_wup);
+	CKPT_COPY(op, hh->tcp.snd_nxt, sk->snd_nxt);
+	CKPT_COPY(op, hh->tcp.snd_una, sk->snd_una);
+	CKPT_COPY(op, hh->tcp.snd_sml, sk->snd_sml);
+	CKPT_COPY(op, hh->tcp.rcv_tstamp, sk->rcv_tstamp);
+	CKPT_COPY(op, hh->tcp.lsndtime, sk->lsndtime);
+
+	CKPT_COPY(op, hh->tcp.snd_wl1, sk->snd_wl1);
+	CKPT_COPY(op, hh->tcp.snd_wnd, sk->snd_wnd);
+	CKPT_COPY(op, hh->tcp.max_window, sk->max_window);
+	CKPT_COPY(op, hh->tcp.mss_cache, sk->mss_cache);
+	CKPT_COPY(op, hh->tcp.window_clamp, sk->window_clamp);
+	CKPT_COPY(op, hh->tcp.rcv_ssthresh, sk->rcv_ssthresh);
+	CKPT_COPY(op, hh->tcp.frto_highmark, sk->frto_highmark);
+	CKPT_COPY(op, hh->tcp.advmss, sk->advmss);
+	CKPT_COPY(op, hh->tcp.frto_counter, sk->frto_counter);
+	CKPT_COPY(op, hh->tcp.nonagle, sk->nonagle);
+
+	CKPT_COPY(op, hh->tcp.srtt, sk->srtt);
+	CKPT_COPY(op, hh->tcp.mdev, sk->mdev);
+	CKPT_COPY(op, hh->tcp.mdev_max, sk->mdev_max);
+	CKPT_COPY(op, hh->tcp.rttvar, sk->rttvar);
+	CKPT_COPY(op, hh->tcp.rtt_seq, sk->rtt_seq);
+
+	CKPT_COPY(op, hh->tcp.packets_out, sk->packets_out);
+	CKPT_COPY(op, hh->tcp.retrans_out, sk->retrans_out);
+
+	CKPT_COPY(op, hh->tcp.urg_data, sk->urg_data);
+	CKPT_COPY(op,
hh->tcp.ecn_flags, sk->ecn_flags); + CKPT_COPY(op, hh->tcp.reordering, sk->reordering); + CKPT_COPY(op, hh->tcp.snd_up, sk->snd_up); + + CKPT_COPY(op, hh->tcp.keepalive_probes, sk->keepalive_probes); + + CKPT_COPY(op, hh->tcp.rcv_wnd, sk->rcv_wnd); + CKPT_COPY(op, hh->tcp.write_seq, sk->write_seq); + CKPT_COPY(op, hh->tcp.pushed_seq, sk->pushed_seq); + CKPT_COPY(op, hh->tcp.lost_out, sk->lost_out); + CKPT_COPY(op, hh->tcp.sacked_out, sk->sacked_out); + CKPT_COPY(op, hh->tcp.fackets_out, sk->fackets_out); + CKPT_COPY(op, hh->tcp.tso_deferred, sk->tso_deferred); + CKPT_COPY(op, hh->tcp.bytes_acked, sk->bytes_acked); + + CKPT_COPY(op, hh->tcp.lost_cnt_hint, sk->lost_cnt_hint); + CKPT_COPY(op, hh->tcp.retransmit_high, sk->retransmit_high); + + CKPT_COPY(op, hh->tcp.lost_retrans_low, sk->lost_retrans_low); + + CKPT_COPY(op, hh->tcp.prior_ssthresh, sk->prior_ssthresh); + CKPT_COPY(op, hh->tcp.high_seq, sk->high_seq); + + CKPT_COPY(op, hh->tcp.retrans_stamp, sk->retrans_stamp); + CKPT_COPY(op, hh->tcp.undo_marker, sk->undo_marker); + CKPT_COPY(op, hh->tcp.undo_retrans, sk->undo_retrans); + CKPT_COPY(op, hh->tcp.total_retrans, sk->total_retrans); + + CKPT_COPY(op, hh->tcp.urg_seq, sk->urg_seq); + CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time); + CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl); + + CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow); + + return 0; +} + +static int sock_inet_cptrst(struct ckpt_ctx *ctx, + struct sock *sock, + struct ckpt_hdr_socket_inet *hh, + int op) +{ + struct inet_sock *sk = inet_sk(sock); + struct inet_connection_sock *icsk = inet_csk(sock); + int ret; + + CKPT_COPY(op, hh->daddr, sk->daddr); + CKPT_COPY(op, hh->rcv_saddr, sk->rcv_saddr); + CKPT_COPY(op, hh->dport, sk->dport); + CKPT_COPY(op, hh->num, sk->num); + CKPT_COPY(op, hh->saddr, sk->saddr); + CKPT_COPY(op, hh->sport, sk->sport); + CKPT_COPY(op, hh->uc_ttl, sk->uc_ttl); + CKPT_COPY(op, hh->cmsg_flags, sk->cmsg_flags); + + CKPT_COPY(op, 
hh->icsk_ack.pending, icsk->icsk_ack.pending); + CKPT_COPY(op, hh->icsk_ack.quick, icsk->icsk_ack.quick); + CKPT_COPY(op, hh->icsk_ack.pingpong, icsk->icsk_ack.pingpong); + CKPT_COPY(op, hh->icsk_ack.blocked, icsk->icsk_ack.blocked); + CKPT_COPY(op, hh->icsk_ack.ato, icsk->icsk_ack.ato); + CKPT_COPY(op, hh->icsk_ack.timeout, icsk->icsk_ack.timeout); + CKPT_COPY(op, hh->icsk_ack.lrcvtime, icsk->icsk_ack.lrcvtime); + CKPT_COPY(op, + hh->icsk_ack.last_seg_size, icsk->icsk_ack.last_seg_size); + CKPT_COPY(op, hh->icsk_ack.rcv_mss, icsk->icsk_ack.rcv_mss); + + if (sock->sk_protocol == IPPROTO_TCP) + ret = sock_inet_tcp_cptrst(ctx, tcp_sk(sock), hh, op); + else if (sock->sk_protocol == IPPROTO_UDP) + ret = 0; + else { + ckpt_write_err(ctx, "unknown socket protocol %d", + sock->sk_protocol); + ret = -EINVAL; + } + + return ret; +} + +static int sock_inet_checkpoint(struct ckpt_ctx *ctx, + struct sock *sock, + struct ckpt_hdr_socket *h) +{ + int ret = -EINVAL; + struct ckpt_hdr_socket_inet *in; + + if (sock->sk_protocol == IPPROTO_TCP) { + struct tcp_sock *tsock = tcp_sk(sock); + if (!skb_queue_empty(&tsock->out_of_order_queue)) { + ckpt_write_err(ctx, "TCP socket has out-of-order data"); + return -EBUSY; + } + } + + in = ckpt_hdr_get_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET); + if (!in) + goto out; + + ret = sock_inet_cptrst(ctx, sock, in, CKPT_CPT); + if (ret < 0) + goto out; + + ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h); + if (ret < 0) + goto out; + + ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) in); + out: + return ret; +} + int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) { struct socket *socket = file->private_data; @@ -266,6 +473,10 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) ret = sock_unix_checkpoint(ctx, sock, h); if (ret) goto out; + } else if (sock->sk_family == AF_INET) { + ret = sock_inet_checkpoint(ctx, sock, h); + if (ret) + goto out; } else { ckpt_write_err(ctx, "unsupported socket family %i", 
sock->sk_family); @@ -274,11 +485,13 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) } if (sock->sk_state != TCP_LISTEN) { - ret = sock_write_buffers(ctx, &sock->sk_receive_queue); + uint16_t family = sock->sk_family; + + ret = sock_write_buffers(ctx, family, &sock->sk_receive_queue); if (ret) goto out; - ret = sock_write_buffers(ctx, &sock->sk_write_queue); + ret = sock_write_buffers(ctx, family, &sock->sk_write_queue); if (ret) goto out; } @@ -497,6 +710,99 @@ static int sock_unix_restart(struct ckpt_ctx *ctx, return ret; } +struct dq_sock { + struct sock *sock; + struct ckpt_ctx *ctx; +}; + +static int __sock_hash_parent(void *data) +{ + struct dq_sock *dq = (struct dq_sock *)data; + struct sock *parent; + + dq->sock->sk_prot->hash(dq->sock); + + parent = sock_get_parent(dq->ctx, dq->sock); + if (parent) { + inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport); + local_bh_disable(); + __inet_inherit_port(parent, dq->sock); + local_bh_enable(); + } else { + inet_sk(dq->sock)->num = 0; + inet_hash_connect(&tcp_death_row, dq->sock); + inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport); + } + + return 0; +} + +static int sock_defer_hash(struct ckpt_ctx *ctx, struct sock *sock) +{ + struct dq_sock dq; + + dq.sock = sock; + dq.ctx = ctx; + + return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq), + __sock_hash_parent, __sock_hash_parent); +} + +static int sock_inet_restart(struct ckpt_ctx *ctx, + struct ckpt_hdr_socket *h, + struct socket *socket) +{ + int ret; + struct ckpt_hdr_socket_inet *in; + struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr; + + in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET); + if (IS_ERR(in)) + return PTR_ERR(in); + + /* Listening sockets and those that are closed but have a local + * address need to call bind() + */ + if ((h->sock.state == TCP_LISTEN) || + ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) { + socket->sk->sk_reuse = 2; + inet_sk(socket->sk)->freebind = 1; + ret = 
socket->ops->bind(socket, + (struct sockaddr *)l, + h->laddr_len); + if (ret < 0) + goto out; + + if (h->sock.state == TCP_LISTEN) { + ret = socket->ops->listen(socket, h->sock.backlog); + if (ret < 0) + goto out; + + ret = sock_add_parent(ctx, socket->sk); + if (ret < 0) + goto out; + } + } else { + ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST); + if (ret) + goto out; + + ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST); + if (ret) + goto out; + + if ((h->sock.state == TCP_ESTABLISHED) && + (h->sock.protocol == IPPROTO_TCP)) + /* Delay hashing this sock until the end so we can + * hook it up with its parent (if appropriate) + */ + ret = sock_defer_hash(ctx, socket->sk); + } + + out: + return ret; + } + struct socket *do_sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_socket *h) { @@ -510,6 +816,9 @@ struct socket *do_sock_file_restore(struct ckpt_ctx *ctx, if (h->sock_common.family == AF_UNIX) { ret = sock_unix_restart(ctx, h, socket); ckpt_debug("sock_unix_restart: %i\n", ret); + } else if (h->sock_common.family == AF_INET) { + ret = sock_inet_restart(ctx, h, socket); + ckpt_debug("sock_inet_restart: %i\n", ret); } else { ckpt_write_err(ctx, "unsupported family %i\n", h->sock_common.family); -- 1.6.2.2 ^ permalink raw reply related [flat|nested] 8+ messages in thread
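For readers following along: the sock_inet_cptrst() and sock_inet_tcp_cptrst() functions in the patch above use a single field list for both directions, with the op argument selecting whether a field is saved into the header or restored from it. The following is only a user-space sketch of that idea. DEMO_COPY, DEMO_CPT, DEMO_RST and the demo_* structs are hypothetical stand-ins invented for this illustration; the real CKPT_COPY macro and the CKPT_CPT/CKPT_RST constants are defined elsewhere in the series and may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for CKPT_CPT / CKPT_RST. */
enum { DEMO_CPT = 1, DEMO_RST = 2 };

/* Hypothetical stand-in for CKPT_COPY: the direction depends on op. */
#define DEMO_COPY(op, saved, live)		\
	do {					\
		if ((op) == DEMO_CPT)		\
			(saved) = (live);	\
		else				\
			(live) = (saved);	\
	} while (0)

struct demo_live {	/* plays the role of struct tcp_sock */
	uint32_t rcv_nxt;
	uint32_t snd_una;
};

struct demo_image {	/* plays the role of the tcp part of the header */
	uint32_t rcv_nxt;
	uint32_t snd_una;
};

/* One field list serves both checkpoint and restart. */
static int demo_cptrst(struct demo_live *sk, struct demo_image *h, int op)
{
	DEMO_COPY(op, h->rcv_nxt, sk->rcv_nxt);
	DEMO_COPY(op, h->snd_una, sk->snd_una);
	return 0;
}

/* Usage: save one socket's state, then restore it into a fresh one. */
static int demo_roundtrip(void)
{
	struct demo_live before = { .rcv_nxt = 1000, .snd_una = 2000 };
	struct demo_live after = { 0, 0 };
	struct demo_image img = { 0, 0 };

	demo_cptrst(&before, &img, DEMO_CPT);	/* live -> image */
	demo_cptrst(&after, &img, DEMO_RST);	/* image -> live */

	return (after.rcv_nxt == 1000 && after.snd_una == 2000) ? 0 : -1;
}
```

The appeal of this layout is that a field added to the list is automatically handled in both directions, so checkpoint and restart cannot drift apart.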
[parent not found: <1246472414-23105-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 3/3] Add AF_INET c/r support (v2)
[not found] ` <1246472414-23105-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-07-01 20:37 ` Brian Haley
[not found] ` <4A4BC918.5030003-VXdhtT5mjnY@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Brian Haley @ 2009-07-01 20:37 UTC (permalink / raw)
To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Alexey Dobriyan

Dan Smith wrote:
> This patch adds AF_INET c/r support based on the framework established in
> my AF_UNIX patch. I've tested it by checkpointing a single app with a
> pair of sockets connected over loopback.

You've probably already mentioned it elsewhere, but having IPv6 support
before sending to netdev would probably be a good thing.

-Brian

^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <4A4BC918.5030003-VXdhtT5mjnY@public.gmane.org>]
* Re: [PATCH 3/3] Add AF_INET c/r support (v2)
[not found] ` <4A4BC918.5030003-VXdhtT5mjnY@public.gmane.org>
@ 2009-07-01 20:46 ` Dan Smith
0 siblings, 0 replies; 8+ messages in thread
From: Dan Smith @ 2009-07-01 20:46 UTC (permalink / raw)
To: Brian Haley; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Alexey Dobriyan

BH> You've probably already mentioned it elsewhere, but having IPv6
BH> support before sending to netdev would probably be a good thing.

Bummer :) I'll see what I can cook up...

Thanks :)

--
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/3] c/r: Add AF_UNIX support (v4)
[not found] ` <1246472414-23105-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-01 18:20 ` [PATCH 3/3] Add AF_INET c/r support (v2) Dan Smith
@ 2009-07-01 20:06 ` Brian Haley
[not found] ` <4A4BC1CC.5020900-VXdhtT5mjnY@public.gmane.org>
1 sibling, 1 reply; 8+ messages in thread
From: Brian Haley @ 2009-07-01 20:06 UTC (permalink / raw)
To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Alexey Dobriyan

Hi Dan,

> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1482,4 +1482,13 @@ extern int sysctl_optmem_max;
>  extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>
> +/* Checkpoint/Restart Functions */
> +struct ckpt_ctx;
> +struct ckpt_hdr_socket;
> +extern int sock_file_checkpoint(struct ckpt_ctx *, void *);
> +extern void *sock_file_restore(struct ckpt_ctx *);
> +extern struct socket *do_sock_file_restore(struct ckpt_ctx *,
> +					   struct ckpt_hdr_socket *);
> +extern int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);

Should this all be under #ifdef CONFIG_CHECKPOINT?

> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -96,6 +96,9 @@
>  #include <net/sock.h>
>  #include <linux/netfilter.h>
>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
>  static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
>  static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
>  			 unsigned long nr_segs, loff_t pos);
> @@ -140,6 +143,9 @@ static const struct file_operations socket_file_ops = {
>  	.sendpage = sock_sendpage,
>  	.splice_write = generic_splice_sendpage,
>  	.splice_read = sock_splice_read,
> +#ifdef CONFIG_CHECKPOINT
> +	.checkpoint = sock_file_checkpoint,
> +#endif
>  };
>
>  /*
> @@ -415,6 +421,84 @@ int sock_map_fd(struct socket *sock, int flags)
>  	return fd;
>  }
>
> +int sock_file_checkpoint(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct ckpt_hdr_file_socket *h;
> +	int ret;
> +	struct file *file = ptr;

<snip>

And the corresponding code in socket.c too, since you're only assigning
.checkpoint above in that case.

-Brian

^ permalink raw reply [flat|nested] 8+ messages in thread
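To make the review comment above concrete: if the .checkpoint hook is only assigned under CONFIG_CHECKPOINT, then the handler it points at (and its declaration) can sit under the same guard, so a kernel built without the option carries none of it. Below is a small user-space sketch of that arrangement. struct demo_file_ops, demo_release(), demo_sock_checkpoint() and demo_has_checkpoint() are hypothetical names made up for this illustration; they are not the kernel's struct file_operations or anything from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct ckpt_ctx;	/* opaque, as in the quoted sock.h hunk */

/* Hypothetical, cut-down ops table standing in for struct file_operations. */
struct demo_file_ops {
	int (*release)(void *file);
#ifdef CONFIG_CHECKPOINT
	int (*checkpoint)(struct ckpt_ctx *ctx, void *file);
#endif
};

static int demo_release(void *file)
{
	(void)file;
	return 0;
}

#ifdef CONFIG_CHECKPOINT
/* Only built when the feature is configured in. */
static int demo_sock_checkpoint(struct ckpt_ctx *ctx, void *file)
{
	(void)ctx;
	(void)file;
	return 0;
}
#endif

static const struct demo_file_ops demo_socket_ops = {
	.release = demo_release,
#ifdef CONFIG_CHECKPOINT
	.checkpoint = demo_sock_checkpoint,
#endif
};

/* Reports whether this build wired up a checkpoint hook. */
static int demo_has_checkpoint(void)
{
#ifdef CONFIG_CHECKPOINT
	return demo_socket_ops.checkpoint != NULL;
#else
	return 0;
#endif
}
```

Compiled without -DCONFIG_CHECKPOINT, the handler is not built at all and demo_has_checkpoint() returns 0; with the define, the hook is present in the table.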
[parent not found: <4A4BC1CC.5020900-VXdhtT5mjnY@public.gmane.org>]
* Re: [PATCH 2/3] c/r: Add AF_UNIX support (v4)
[not found] ` <4A4BC1CC.5020900-VXdhtT5mjnY@public.gmane.org>
@ 2009-07-01 20:45 ` Dan Smith
0 siblings, 0 replies; 8+ messages in thread
From: Dan Smith @ 2009-07-01 20:45 UTC (permalink / raw)
To: Brian Haley; +Cc: containers-qjLDD68F18O7TbgM5vRIOg, Alexey Dobriyan

BH> Should this all be under #ifdef CONFIG_CHECKPOINT?

Yeah, probably so :)

BH> And the corresponding code in socket.c too, since you're only
BH> assigning .checkpoint above in that case.

Yep, okay, thanks!

--
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/3] Add _ckpt_read_hdr_type() helper
[not found] ` <1246472414-23105-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-01 18:20 ` [PATCH 2/3] c/r: Add AF_UNIX support (v4) Dan Smith
@ 2009-07-01 18:28 ` Dan Smith
1 sibling, 0 replies; 8+ messages in thread
From: Dan Smith @ 2009-07-01 18:28 UTC (permalink / raw)
To: containers-qjLDD68F18O7TbgM5vRIOg

Ack, I meant to patchbomb these with some intro text.

This set of patches includes the helper that Oren suggested to make
reading in the socket buffers more efficient as well as updates to the
UNIX and INET patches as suggested by Oren and Matt. There are a few
things not addressed which I asked for clarification on earlier.

I think these are getting close to the point where we can solicit some
"comments" from the netdev folks. I'll let these fester here for the
rest of the week and then plan to post them to netdev for comments on
Monday. I will plan to procure some flame-retardant clothing over the
weekend.

Thanks!

--
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply [flat|nested] 8+ messages in thread
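As a rough illustration of why the helper mentioned above makes reading socket buffers cheaper: the header is read first, the caller learns the payload size from it, and the payload is then read straight into a buffer the caller owns, with no intermediate copy. The sketch below mimics that two-step flow in user space. struct demo_hdr, struct demo_stream and all demo_* functions are hypothetical; the real struct ckpt_hdr layout and ckpt_kread() are assumed, not reproduced, and the extra length sanity check is not part of the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the real struct ckpt_hdr may differ. */
struct demo_hdr {
	uint32_t type;
	uint32_t len;	/* total record length, header included */
};

struct demo_stream {	/* in-memory stand-in for the checkpoint stream */
	const unsigned char *buf;
	size_t size;
	size_t pos;
};

/* Stand-in for ckpt_kread(); memcpy is used only because the "stream"
 * lives in memory here. */
static int demo_read(struct demo_stream *s, void *dst, size_t n)
{
	if (n > s->size - s->pos)
		return -EIO;
	memcpy(dst, s->buf + s->pos, n);
	s->pos += n;
	return 0;
}

/* Step 1: read only the header; return payload size or a negative errno. */
static int demo_read_hdr_type(struct demo_stream *s, struct demo_hdr *h,
			      uint32_t type)
{
	int ret = demo_read(s, h, sizeof(*h));

	if (ret < 0)
		return ret;
	if (type && h->type != type)
		return -EINVAL;
	if (h->len < sizeof(*h))	/* extra sanity check, not in the patch */
		return -EINVAL;
	return (int)(h->len - sizeof(*h));
}

/* Step 2: read the payload directly into the caller's buffer. */
static int demo_read_payload(struct demo_stream *s, const struct demo_hdr *h,
			     void *buffer)
{
	return demo_read(s, buffer, h->len - sizeof(*h));
}

/* Usage: build one record, then read it back in two steps. */
static int demo_roundtrip(void)
{
	unsigned char raw[sizeof(struct demo_hdr) + 5];
	struct demo_hdr h = { .type = 7, .len = sizeof(raw) };
	struct demo_stream s = { raw, sizeof(raw), 0 };
	char out[5];

	memcpy(raw, &h, sizeof(h));
	memcpy(raw + sizeof(h), "hello", 5);

	if (demo_read_hdr_type(&s, &h, 7) != 5)
		return -1;
	if (demo_read_payload(&s, &h, out) < 0)
		return -1;
	return memcmp(out, "hello", 5) ? -1 : 0;
}
```

In the restart path the caller would presumably size its buffer (for example an skb) from the value returned by the first step before issuing the second.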
Thread overview: 8+ messages
2009-07-01 18:20 [PATCH 1/3] Add _ckpt_read_hdr_type() helper Dan Smith
     [not found] ` <1246472414-23105-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-01 18:20   ` [PATCH 2/3] c/r: Add AF_UNIX support (v4) Dan Smith
     [not found]     ` <1246472414-23105-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-01 18:20       ` [PATCH 3/3] Add AF_INET c/r support (v2) Dan Smith
     [not found]         ` <1246472414-23105-3-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-01 20:37           ` Brian Haley
     [not found]             ` <4A4BC918.5030003-VXdhtT5mjnY@public.gmane.org>
2009-07-01 20:46               ` Dan Smith
2009-07-01 20:06       ` [PATCH 2/3] c/r: Add AF_UNIX support (v4) Brian Haley
     [not found]         ` <4A4BC1CC.5020900-VXdhtT5mjnY@public.gmane.org>
2009-07-01 20:45           ` Dan Smith
2009-07-01 18:28   ` [PATCH 1/3] Add _ckpt_read_hdr_type() helper Dan Smith