* c/r: Add support for connected AF_INET sockets
@ 2009-10-20 21:06 Dan Smith
From: Dan Smith @ 2009-10-20 21:06 UTC
To: containers
This updated patch set fixes some issues identified in the second
patch, and adds some additional features and documentation.
It also brings a third patch that handles TCP timestamp adjustment
and a fourth that adds some information about sockets to the
checkpoint/readme.txt in Documentation/.
* [PATCH 1/4] Record and restore skb header marks
From: Dan Smith @ 2009-10-20 21:06 UTC
To: containers
Save this information when we checkpoint an skb and provide a mechanism
to restore that information on restart. This will be used in the
subsequent INET patch.
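
As a rough illustration of the idea (a user-space sketch, not the kernel
code), header positions can be stored as offsets from the start of the
buffer, so that they remain meaningful when the data lands in a different
buffer on restart. The toy_skb and toy_hdr types and the toy_record() and
toy_restore() helpers are hypothetical stand-ins for struct sk_buff,
struct ckpt_hdr_socket_buffer and the functions added by this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: headers are raw pointers. */
struct toy_skb {
	unsigned char *head;		/* start of the buffer */
	unsigned char *mac_header;
	unsigned char *network_header;
	unsigned char *transport_header;
	unsigned char cb[48];		/* per-protocol control block */
};

/* Hypothetical stand-in for struct ckpt_hdr_socket_buffer. */
struct toy_hdr {
	uint64_t mac_header;
	uint64_t network_header;
	uint64_t transport_header;
	uint8_t cb[48];
};

/* Checkpoint side: turn pointers into offsets from head, copy cb as-is. */
static void toy_record(const struct toy_skb *skb, struct toy_hdr *h)
{
	h->mac_header = (uint64_t)(skb->mac_header - skb->head);
	h->network_header = (uint64_t)(skb->network_header - skb->head);
	h->transport_header = (uint64_t)(skb->transport_header - skb->head);
	memcpy(h->cb, skb->cb, sizeof(skb->cb));
}

/* Restart side: rebuild the pointers against the new buffer's head. */
static void toy_restore(struct toy_skb *skb, const struct toy_hdr *h)
{
	skb->mac_header = skb->head + h->mac_header;
	skb->network_header = skb->head + h->network_header;
	skb->transport_header = skb->head + h->transport_header;
	memcpy(skb->cb, h->cb, sizeof(skb->cb));
}
```

Storing offsets rather than pointers is what lets the restored buffer live
at a different address from the one that was checkpointed.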
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
include/linux/checkpoint.h | 2 ++
include/linux/checkpoint_hdr.h | 7 +++++++
net/checkpoint.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4b61378..1da0b04 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -100,6 +100,8 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
struct socket *socket,
struct sockaddr *loc, unsigned *loc_len,
struct sockaddr *rem, unsigned *rem_len);
+void sock_restore_header_info(struct sk_buff *skb,
+ struct ckpt_hdr_socket_buffer *h);
/* ckpt kflags */
#define ckpt_set_ctx_kflag(__ctx, __kflag) \
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ca2500d..3e6cab1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -542,8 +542,15 @@ struct ckpt_hdr_socket_queue {
struct ckpt_hdr_socket_buffer {
struct ckpt_hdr h;
+ __u64 mac_len;
+ __u64 hdr_len;
+ __u64 transport_header;
+ __u64 network_header;
+ __u64 mac_header;
__s32 sk_objref;
__s32 pr_objref;
+ __u16 protocol;
+ __u8 cb[48];
};
#define CKPT_UNIX_LINKED 1
diff --git a/net/checkpoint.c b/net/checkpoint.c
index dd23efd..5ed2724 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -88,6 +88,38 @@ static int sock_copy_buffers(struct sk_buff_head *from,
return -EAGAIN;
}
+static void sock_record_header_info(struct sk_buff *skb,
+ struct ckpt_hdr_socket_buffer *h)
+{
+
+ h->mac_len = skb->mac_len;
+ h->hdr_len = skb->hdr_len;
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+ h->transport_header = skb->transport_header;
+ h->network_header = skb->network_header;
+ h->mac_header = skb->mac_header;
+#else
+ h->transport_header = skb->transport_header - skb->head;
+ h->network_header = skb->network_header - skb->head;
+ h->mac_header = skb->mac_header - skb->head;
+#endif
+
+ memcpy(h->cb, skb->cb, sizeof(skb->cb));
+}
+
+void sock_restore_header_info(struct sk_buff *skb,
+ struct ckpt_hdr_socket_buffer *h)
+{
+ skb->mac_len = h->mac_len;
+ skb->hdr_len = h->hdr_len;
+ skb_set_transport_header(skb, h->transport_header);
+ skb_set_network_header(skb, h->network_header);
+ skb_set_mac_header(skb, h->mac_header);
+
+ memcpy(skb->cb, h->cb, sizeof(skb->cb));
+}
+
static int __sock_write_buffers(struct ckpt_ctx *ctx,
struct sk_buff_head *queue,
int dst_objref)
@@ -123,6 +155,7 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
goto end;
h->sk_objref = ret;
h->pr_objref = dst_objref;
+ sock_record_header_info(skb, h);
ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
if (ret < 0)
--
1.6.2.5
* [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
From: Dan Smith @ 2009-10-20 21:06 UTC
To: containers; +Cc: netdev, Oren Laadan, John Dykstra
This patch adds basic support for C/R of open INET sockets. I think that
all the important bits of the TCP and ICSK socket structures are saved,
but I think there is still some additional IPv6 stuff that needs to be
handled.
With this patch applied, the following script can be used to demonstrate
the functionality:
https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
It shows that this enables migration of a sendmail process with open
connections from one machine to another without dropping them.
We still need comments from the netdev people about what sort of sanity
checking we need to do on the values in the ckpt_hdr_socket_inet
structure on restart.
Note that this doesn't address lingering sockets yet.
Changes in v2:
- Restore saddr, rcv_saddr, daddr, sport, and dport from the sockaddr
structure instead of saving them separately
- Fix 'sock' naming in sock_cptrst()
- Don't take the queue lock before skb_queue_tail() since it is
done for us
- Allow "listen only" restore behavior if RESTART_SOCK_LISTENONLY
flag is specified on sys_restart()
- Pull the implementation of the list of listening sockets back into
this patch
- Fix dangling printk
- Add some comments around the parent/child restore logic
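
A minimal user-space sketch of two of the decisions described above, under
stated assumptions: the "listen only" check that forces a non-listening
INET socket closed, and the lookup of a parent listening socket by source
port. Only RESTART_SOCK_LISTENONLY and its value come from this patch; the
toy_* names are hypothetical, and the TOY_* constants are assumed to mirror
the kernel's AF_* and TCP state numbers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESTART_SOCK_LISTENONLY 0x8	/* value defined by this patch */

/* Assumed to mirror AF_UNIX/AF_INET and the kernel TCP state numbers. */
enum { TOY_AF_UNIX = 1, TOY_AF_INET = 2 };
enum { TOY_TCP_ESTABLISHED = 1, TOY_TCP_LISTEN = 10 };

/* Hypothetical reduced socket: only the local port matters here. */
struct toy_sock {
	uint16_t sport;
};

/* True if restart should force this socket into a closed state. */
static bool toy_force_closed(int family, int state, unsigned long uflags)
{
	return family == TOY_AF_INET &&
	       state != TOY_TCP_LISTEN &&
	       (uflags & RESTART_SOCK_LISTENONLY) != 0;
}

/*
 * Find a listening socket with the same source port as the child.
 * A match suggests the child came from accept(); NULL suggests connect().
 */
static const struct toy_sock *toy_find_parent(const struct toy_sock *listeners,
					      size_t count,
					      const struct toy_sock *child)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (listeners[i].sport == child->sport)
			return &listeners[i];
	}
	return NULL;
}
```

In the patch itself the lookup is deferred until all sockets have been
restored, since a child may be read from the image before its parent.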
Cc: netdev@vger.kernel.org
Cc: Oren Laadan <orenl@librato.com>
Cc: John Dykstra <jdykstra72@gmail.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
checkpoint/sys.c | 4 +
include/linux/checkpoint.h | 5 +-
include/linux/checkpoint_hdr.h | 97 ++++++++++++++
include/linux/checkpoint_types.h | 2 +
net/checkpoint.c | 23 ++--
net/ipv4/checkpoint.c | 263 +++++++++++++++++++++++++++++++++++++-
6 files changed, 379 insertions(+), 15 deletions(-)
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 260a1ee..df00973 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -221,6 +221,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
kfree(ctx->pids_arr);
+ sock_listening_list_free(&ctx->listen_sockets);
+
kfree(ctx);
}
@@ -249,6 +251,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
spin_lock_init(&ctx->lock);
#endif
+ INIT_LIST_HEAD(&ctx->listen_sockets);
+
err = -EBADF;
ctx->file = fget(fd);
if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 1da0b04..73d1677 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,7 @@
#define RESTART_TASKSELF 0x1
#define RESTART_FROZEN 0x2
#define RESTART_GHOST 0x4
+#define RESTART_SOCK_LISTENONLY 0x8
#ifdef __KERNEL__
#ifdef CONFIG_CHECKPOINT
@@ -48,7 +49,8 @@
#define RESTART_USER_FLAGS \
(RESTART_TASKSELF | \
RESTART_FROZEN | \
- RESTART_GHOST)
+ RESTART_GHOST | \
+ RESTART_SOCK_LISTENONLY)
extern int walk_task_subtree(struct task_struct *task,
int (*func)(struct task_struct *, void *),
@@ -102,6 +104,7 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
struct sockaddr *rem, unsigned *rem_len);
void sock_restore_header_info(struct sk_buff *skb,
struct ckpt_hdr_socket_buffer *h);
+void sock_listening_list_free(struct list_head *head);
/* ckpt kflags */
#define ckpt_set_ctx_kflag(__ctx, __kflag) \
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3e6cab1..0c10657 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -20,6 +20,7 @@
#include <linux/socket.h>
#include <linux/un.h>
#include <linux/in.h>
+#include <linux/in6.h>
#else
#include <sys/socket.h>
#include <sys/un.h>
@@ -569,6 +570,102 @@ struct ckpt_hdr_socket_unix {
struct ckpt_hdr_socket_inet {
struct ckpt_hdr h;
+ __u32 daddr;
+ __u32 rcv_saddr;
+ __u32 saddr;
+ __u16 dport;
+ __u16 num;
+ __u16 sport;
+ __s16 uc_ttl;
+ __u16 cmsg_flags;
+
+ struct {
+ __u64 timeout;
+ __u32 ato;
+ __u32 lrcvtime;
+ __u16 last_seg_size;
+ __u16 rcv_mss;
+ __u8 pending;
+ __u8 quick;
+ __u8 pingpong;
+ __u8 blocked;
+ } icsk_ack __attribute__ ((aligned(8)));
+
+ /* FIXME: Skipped opt, tos, multicast, cork settings */
+
+ struct {
+ __u64 last_synq_overflow;
+
+ __u32 rcv_nxt;
+ __u32 copied_seq;
+ __u32 rcv_wup;
+ __u32 snd_nxt;
+ __u32 snd_una;
+ __u32 snd_sml;
+ __u32 rcv_tstamp;
+ __u32 lsndtime;
+
+ __u32 snd_wl1;
+ __u32 snd_wnd;
+ __u32 max_window;
+ __u32 mss_cache;
+ __u32 window_clamp;
+ __u32 rcv_ssthresh;
+ __u32 frto_highmark;
+
+ __u32 srtt;
+ __u32 mdev;
+ __u32 mdev_max;
+ __u32 rttvar;
+ __u32 rtt_seq;
+
+ __u32 packets_out;
+ __u32 retrans_out;
+
+ __u32 snd_up;
+ __u32 rcv_wnd;
+ __u32 write_seq;
+ __u32 pushed_seq;
+ __u32 lost_out;
+ __u32 sacked_out;
+ __u32 fackets_out;
+ __u32 tso_deferred;
+ __u32 bytes_acked;
+
+ __s32 lost_cnt_hint;
+ __u32 retransmit_high;
+
+ __u32 lost_retrans_low;
+
+ __u32 prior_ssthresh;
+ __u32 high_seq;
+
+ __u32 retrans_stamp;
+ __u32 undo_marker;
+ __s32 undo_retrans;
+ __u32 total_retrans;
+
+ __u32 urg_seq;
+ __u32 keepalive_time;
+ __u32 keepalive_intvl;
+
+ __u16 urg_data;
+ __u16 advmss;
+ __u8 frto_counter;
+ __u8 nonagle;
+
+ __u8 ecn_flags;
+ __u8 reordering;
+
+ __u8 keepalive_probes;
+ } tcp __attribute__ ((aligned(8)));
+
+ struct {
+ struct in6_addr saddr;
+ struct in6_addr rcv_saddr;
+ struct in6_addr daddr;
+ } inet6 __attribute__ ((aligned(8)));
+
__u32 laddr_len;
__u32 raddr_len;
struct sockaddr_in laddr;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index fa57cdc..91c141b 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -65,6 +65,8 @@ struct ckpt_ctx {
struct list_head pgarr_list; /* page array to dump VMA contents */
struct list_head pgarr_pool; /* pool of empty page arrays chain */
+ struct list_head listen_sockets;/* listening parent sockets */
+
/* [multi-process checkpoint] */
struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
int nr_tasks; /* size of tasks array */
diff --git a/net/checkpoint.c b/net/checkpoint.c
index 5ed2724..3e7574d 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -122,6 +122,7 @@ void sock_restore_header_info(struct sk_buff *skb,
static int __sock_write_buffers(struct ckpt_ctx *ctx,
struct sk_buff_head *queue,
+ uint16_t family,
int dst_objref)
{
struct sk_buff *skb;
@@ -130,11 +131,7 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
struct ckpt_hdr_socket_buffer *h;
int ret = 0;
- /* FIXME: This could be a false positive for non-unix
- * buffers, so add a type check here in the
- * future
- */
- if (UNIXCB(skb).fp) {
+ if ((family == AF_UNIX) && UNIXCB(skb).fp) {
ckpt_write_err(ctx, "TE", "af_unix: pass fd", -EBUSY);
return -EBUSY;
}
@@ -174,6 +171,7 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
static int sock_write_buffers(struct ckpt_ctx *ctx,
struct sk_buff_head *queue,
+ uint16_t family,
int dst_objref)
{
struct ckpt_hdr_socket_queue *h;
@@ -193,7 +191,7 @@ static int sock_write_buffers(struct ckpt_ctx *ctx,
h->skb_count = ret;
ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
if (!ret)
- ret = __sock_write_buffers(ctx, &tmpq, dst_objref);
+ ret = __sock_write_buffers(ctx, &tmpq, family, dst_objref);
out:
ckpt_hdr_put(ctx, h);
@@ -215,12 +213,14 @@ int sock_deferred_write_buffers(void *data)
return dst_objref;
}
- ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue, dst_objref);
+ ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue,
+ dq->sk->sk_family, dst_objref);
ckpt_debug("write recv buffers: %i\n", ret);
if (ret < 0)
return ret;
- ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue, dst_objref);
+ ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue,
+ dq->sk->sk_family, dst_objref);
ckpt_debug("write send buffers: %i\n", ret);
return ret;
@@ -745,10 +745,9 @@ struct sock *do_sock_restore(struct ckpt_ctx *ctx)
goto err;
if ((h->sock_common.family == AF_INET) &&
- (h->sock.state != TCP_LISTEN)) {
- /* Temporary hack to enable restore of TCP_LISTEN sockets
- * while forcing anything else to a closed state
- */
+ (h->sock.state != TCP_LISTEN) &&
+ (ctx->uflags & RESTART_SOCK_LISTENONLY)) {
+ ckpt_debug("Forcing open socket closed\n");
sock->sk->sk_state = TCP_CLOSE;
sock->state = SS_UNCONNECTED;
}
diff --git a/net/ipv4/checkpoint.c b/net/ipv4/checkpoint.c
index 9cbbf5e..5913652 100644
--- a/net/ipv4/checkpoint.c
+++ b/net/ipv4/checkpoint.c
@@ -17,6 +17,7 @@
#include <linux/deferqueue.h>
#include <net/tcp_states.h>
#include <net/tcp.h>
+#include <net/ipv6.h>
struct dq_sock {
struct ckpt_ctx *ctx;
@@ -28,6 +29,233 @@ struct dq_buffers {
struct sock *sk;
};
+struct listen_item {
+ struct sock *sk;
+ struct list_head list;
+};
+
+void sock_listening_list_free(struct list_head *head)
+{
+ struct listen_item *item, *tmp;
+
+ list_for_each_entry_safe(item, tmp, head, list) {
+ list_del(&item->list);
+ kfree(item);
+ }
+}
+
+static int sock_listening_list_add(struct ckpt_ctx *ctx, struct sock *sk)
+{
+ struct listen_item *item;
+
+ item = kmalloc(sizeof(*item), GFP_KERNEL);
+ if (!item)
+ return -ENOMEM;
+
+ item->sk = sk;
+ list_add(&item->list, &ctx->listen_sockets);
+
+ return 0;
+}
+
+static struct sock *sock_get_parent(struct ckpt_ctx *ctx, struct sock *sk)
+{
+ struct listen_item *item;
+
+ list_for_each_entry(item, &ctx->listen_sockets, list) {
+ if (inet_sk(sk)->sport == inet_sk(item->sk)->sport)
+ return item->sk;
+ }
+
+ return NULL;
+}
+
+static int sock_hash_parent(void *data)
+{
+ struct dq_sock *dq = (struct dq_sock *)data;
+ struct sock *parent;
+
+ ckpt_debug("INET post-restart hash\n");
+
+ dq->sk->sk_prot->hash(dq->sk);
+
+ /* If there is a listening socket with the same source port,
+ * then become a child of that socket [we are the result of an
+ * accept()]. Otherwise hash ourselves directly in [we are
+ * the result of a connect()]
+ */
+
+ parent = sock_get_parent(dq->ctx, dq->sk);
+ if (parent) {
+ inet_sk(dq->sk)->num = ntohs(inet_sk(dq->sk)->sport);
+ local_bh_disable();
+ __inet_inherit_port(parent, dq->sk);
+ local_bh_enable();
+ } else {
+ inet_sk(dq->sk)->num = 0;
+ inet_hash_connect(&tcp_death_row, dq->sk);
+ inet_sk(dq->sk)->num = ntohs(inet_sk(dq->sk)->sport);
+ }
+
+ return 0;
+}
+
+static int sock_defer_hash(struct ckpt_ctx *ctx, struct sock *sock)
+{
+ struct dq_sock dq;
+
+ dq.sk = sock;
+ dq.ctx = ctx;
+
+ return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+ sock_hash_parent, NULL);
+}
+
+static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
+ struct tcp_sock *sk,
+ struct ckpt_hdr_socket_inet *hh,
+ int op)
+{
+ CKPT_COPY(op, hh->tcp.rcv_nxt, sk->rcv_nxt);
+ CKPT_COPY(op, hh->tcp.copied_seq, sk->copied_seq);
+ CKPT_COPY(op, hh->tcp.rcv_wup, sk->rcv_wup);
+ CKPT_COPY(op, hh->tcp.snd_nxt, sk->snd_nxt);
+ CKPT_COPY(op, hh->tcp.snd_una, sk->snd_una);
+ CKPT_COPY(op, hh->tcp.snd_sml, sk->snd_sml);
+ CKPT_COPY(op, hh->tcp.rcv_tstamp, sk->rcv_tstamp);
+ CKPT_COPY(op, hh->tcp.lsndtime, sk->lsndtime);
+
+ CKPT_COPY(op, hh->tcp.snd_wl1, sk->snd_wl1);
+ CKPT_COPY(op, hh->tcp.snd_wnd, sk->snd_wnd);
+ CKPT_COPY(op, hh->tcp.max_window, sk->max_window);
+ CKPT_COPY(op, hh->tcp.mss_cache, sk->mss_cache);
+ CKPT_COPY(op, hh->tcp.window_clamp, sk->window_clamp);
+ CKPT_COPY(op, hh->tcp.rcv_ssthresh, sk->rcv_ssthresh);
+ CKPT_COPY(op, hh->tcp.frto_highmark, sk->frto_highmark);
+ CKPT_COPY(op, hh->tcp.advmss, sk->advmss);
+ CKPT_COPY(op, hh->tcp.frto_counter, sk->frto_counter);
+ CKPT_COPY(op, hh->tcp.nonagle, sk->nonagle);
+
+ CKPT_COPY(op, hh->tcp.srtt, sk->srtt);
+ CKPT_COPY(op, hh->tcp.mdev, sk->mdev);
+ CKPT_COPY(op, hh->tcp.mdev_max, sk->mdev_max);
+ CKPT_COPY(op, hh->tcp.rttvar, sk->rttvar);
+ CKPT_COPY(op, hh->tcp.rtt_seq, sk->rtt_seq);
+
+ CKPT_COPY(op, hh->tcp.packets_out, sk->packets_out);
+ CKPT_COPY(op, hh->tcp.retrans_out, sk->retrans_out);
+
+ CKPT_COPY(op, hh->tcp.urg_data, sk->urg_data);
+ CKPT_COPY(op, hh->tcp.ecn_flags, sk->ecn_flags);
+ CKPT_COPY(op, hh->tcp.reordering, sk->reordering);
+ CKPT_COPY(op, hh->tcp.snd_up, sk->snd_up);
+
+ CKPT_COPY(op, hh->tcp.keepalive_probes, sk->keepalive_probes);
+
+ CKPT_COPY(op, hh->tcp.rcv_wnd, sk->rcv_wnd);
+ CKPT_COPY(op, hh->tcp.write_seq, sk->write_seq);
+ CKPT_COPY(op, hh->tcp.pushed_seq, sk->pushed_seq);
+ CKPT_COPY(op, hh->tcp.lost_out, sk->lost_out);
+ CKPT_COPY(op, hh->tcp.sacked_out, sk->sacked_out);
+ CKPT_COPY(op, hh->tcp.fackets_out, sk->fackets_out);
+ CKPT_COPY(op, hh->tcp.tso_deferred, sk->tso_deferred);
+ CKPT_COPY(op, hh->tcp.bytes_acked, sk->bytes_acked);
+
+ CKPT_COPY(op, hh->tcp.lost_cnt_hint, sk->lost_cnt_hint);
+ CKPT_COPY(op, hh->tcp.retransmit_high, sk->retransmit_high);
+
+ CKPT_COPY(op, hh->tcp.lost_retrans_low, sk->lost_retrans_low);
+
+ CKPT_COPY(op, hh->tcp.prior_ssthresh, sk->prior_ssthresh);
+ CKPT_COPY(op, hh->tcp.high_seq, sk->high_seq);
+
+ CKPT_COPY(op, hh->tcp.retrans_stamp, sk->retrans_stamp);
+ CKPT_COPY(op, hh->tcp.undo_marker, sk->undo_marker);
+ CKPT_COPY(op, hh->tcp.undo_retrans, sk->undo_retrans);
+ CKPT_COPY(op, hh->tcp.total_retrans, sk->total_retrans);
+
+ CKPT_COPY(op, hh->tcp.urg_seq, sk->urg_seq);
+ CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time);
+ CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl);
+
+ return 0;
+}
+
+static int sock_inet_restore_addrs(struct inet_sock *inet,
+ struct ckpt_hdr_socket_inet *hh)
+{
+ inet->daddr = hh->raddr.sin_addr.s_addr;
+ inet->saddr = hh->laddr.sin_addr.s_addr;
+ inet->rcv_saddr = inet->saddr;
+
+ inet->dport = hh->raddr.sin_port;
+ inet->sport = hh->laddr.sin_port;
+
+ return 0;
+}
+
+static int sock_inet_cptrst(struct ckpt_ctx *ctx,
+ struct sock *sk,
+ struct ckpt_hdr_socket_inet *hh,
+ int op)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ int ret;
+
+ if (op == CKPT_CPT) {
+ CKPT_COPY(op, hh->daddr, inet->daddr);
+ CKPT_COPY(op, hh->rcv_saddr, inet->rcv_saddr);
+ CKPT_COPY(op, hh->dport, inet->dport);
+ CKPT_COPY(op, hh->saddr, inet->saddr);
+ CKPT_COPY(op, hh->sport, inet->sport);
+ } else {
+ ret = sock_inet_restore_addrs(inet, hh);
+ if (ret)
+ return ret;
+ }
+
+ CKPT_COPY(op, hh->num, inet->num);
+ CKPT_COPY(op, hh->uc_ttl, inet->uc_ttl);
+ CKPT_COPY(op, hh->cmsg_flags, inet->cmsg_flags);
+
+ CKPT_COPY(op, hh->icsk_ack.pending, icsk->icsk_ack.pending);
+ CKPT_COPY(op, hh->icsk_ack.quick, icsk->icsk_ack.quick);
+ CKPT_COPY(op, hh->icsk_ack.pingpong, icsk->icsk_ack.pingpong);
+ CKPT_COPY(op, hh->icsk_ack.blocked, icsk->icsk_ack.blocked);
+ CKPT_COPY(op, hh->icsk_ack.ato, icsk->icsk_ack.ato);
+ CKPT_COPY(op, hh->icsk_ack.timeout, icsk->icsk_ack.timeout);
+ CKPT_COPY(op, hh->icsk_ack.lrcvtime, icsk->icsk_ack.lrcvtime);
+ CKPT_COPY(op,
+ hh->icsk_ack.last_seg_size, icsk->icsk_ack.last_seg_size);
+ CKPT_COPY(op, hh->icsk_ack.rcv_mss, icsk->icsk_ack.rcv_mss);
+
+ if (sk->sk_protocol == IPPROTO_TCP)
+ ret = sock_inet_tcp_cptrst(ctx, tcp_sk(sk), hh, op);
+ else if (sk->sk_protocol == IPPROTO_UDP)
+ ret = 0;
+ else {
+ ckpt_write_err(ctx, "T", "unknown socket protocol %d",
+ sk->sk_protocol);
+ ret = -EINVAL;
+ }
+
+ if (sk->sk_family == AF_INET6) {
+ struct ipv6_pinfo *inet6 = inet6_sk(sk);
+ if (op == CKPT_CPT) {
+ ipv6_addr_copy(&hh->inet6.saddr, &inet6->saddr);
+ ipv6_addr_copy(&hh->inet6.rcv_saddr, &inet6->rcv_saddr);
+ ipv6_addr_copy(&hh->inet6.daddr, &inet6->daddr);
+ } else {
+ ipv6_addr_copy(&inet6->saddr, &hh->inet6.saddr);
+ ipv6_addr_copy(&inet6->rcv_saddr, &hh->inet6.rcv_saddr);
+ ipv6_addr_copy(&inet6->daddr, &hh->inet6.daddr);
+ }
+ }
+
+ return ret;
+}
+
int inet_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
{
struct ckpt_hdr_socket_inet *in;
@@ -43,6 +271,10 @@ int inet_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
if (ret)
goto out;
+ ret = sock_inet_cptrst(ctx, sock->sk, in, CKPT_CPT);
+ if (ret < 0)
+ goto out;
+
ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) in);
out:
ckpt_hdr_put(ctx, in);
@@ -87,9 +319,9 @@ static int inet_read_buffer(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
if (ret < 0)
goto out;
- spin_lock(&queue->lock);
+ sock_restore_header_info(skb, h);
+
skb_queue_tail(queue, skb);
- spin_unlock(&queue->lock);
out:
ckpt_hdr_put(ctx, h);
@@ -209,8 +441,35 @@ int inet_restore(struct ckpt_ctx *ctx,
ckpt_debug("inet listen: %i\n", ret);
if (ret < 0)
goto out;
+
+ /* We are a listening socket, so add ourselves
+ * to the list of parent sockets. This will
+ * allow our children to find us later and
+ * link up
+ */
+
+ ret = sock_listening_list_add(ctx, sock->sk);
+ if (ret < 0)
+ goto out;
}
} else {
+ ret = sock_inet_cptrst(ctx, sock->sk, in, CKPT_RST);
+ if (ret)
+ goto out;
+
+ if ((h->sock.state == TCP_ESTABLISHED) &&
+ (h->sock.protocol == IPPROTO_TCP)) {
+ /* A connected socket that was spawned from an
+ * accept() needs to be hashed with its parent
+ * listening socket in order to receive
+ * traffic on the original port. Since we may
+ * not have restarted the parent yet, we defer
+ * this until later when we know we have all
+ * the listening sockets accounted for.
+ */
+ ret = sock_defer_hash(ctx, sock->sk);
+ }
+
if (!sock_flag(sock->sk, SOCK_DEAD))
ret = inet_defer_restore_buffers(ctx, sock->sk);
}
--
1.6.2.5
* [PATCH 3/4] Adjust TCP timestamp values by a scalar value
From: Dan Smith @ 2009-10-20 21:06 UTC
To: containers
Adjust the sent and received TCP timestamp value by a scalar value
in the tcp_sock structure. This will be zero most of the time, except
when the socket has been migrated with c/r. If a socket is re-migrated,
we take the new adjusted value as the saved value so that on restart it
can be re-adjusted. Also, copy this into the timewait sock so that
timestamps can continue to be adjusted in timewait state in the
minisocks code.
Note that TCP timestamps are just a jiffies stamp, which means they
have no relation to wall-clock time and thus a simple correction
factor should be enough to ensure correctness.
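
The arithmetic can be sketched in user space as follows. This is an
illustration under assumptions rather than the kernel code: the toy_*
helpers are hypothetical, jiffies values are passed in explicitly, and the
conversion from uint32_t to int32_t is assumed to wrap in the usual
two's-complement way:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Timestamp as the peer sees it: local jiffies plus the adjustment.
 * Used both for outgoing segments and for the value saved at checkpoint.
 */
static uint32_t toy_adjusted_ts(uint32_t jiffies_now, int32_t ts_adjust)
{
	return jiffies_now + (uint32_t)ts_adjust;
}

/*
 * Restart: pick the adjustment that maps the new host's jiffies onto
 * the saved timestamp, so the peer sees no discontinuity.
 */
static int32_t toy_restore_adjust(uint32_t saved_ts, uint32_t jiffies_now)
{
	return (int32_t)(saved_ts - jiffies_now);
}
```

For example, a socket checkpointed at jiffies 1000000 with no adjustment
and restarted on a host whose jiffies counter reads 500 gets an adjustment
of 999500, so a segment sent ten ticks later carries timestamp 1000010.
Saving the already-adjusted value is what makes a second migration work
the same way.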
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
include/linux/checkpoint_hdr.h | 2 ++
include/linux/tcp.h | 3 +++
include/net/tcp.h | 3 ++-
net/ipv4/checkpoint.c | 8 ++++++++
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 14 +++++++-------
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 8 ++++++--
net/ipv4/tcp_output.c | 20 ++++++++++----------
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 2 +-
11 files changed, 42 insertions(+), 24 deletions(-)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0c10657..9c2f13d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -649,6 +649,8 @@ struct ckpt_hdr_socket_inet {
__u32 keepalive_time;
__u32 keepalive_intvl;
+ __s32 tcp_ts;
+
__u16 urg_data;
__u16 advmss;
__u8 frto_counter;
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8afac76..b845e21 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -399,6 +399,8 @@ struct tcp_sock {
u32 probe_seq_end;
} mtu_probe;
+ s32 ts_adjust; /* tcp_time_stamp adjustment factor */
+
#ifdef CONFIG_TCP_MD5SIG
/* TCP AF-Specific parts; only used by MD5 Signature support so far */
struct tcp_sock_af_ops *af_specific;
@@ -420,6 +422,7 @@ struct tcp_timewait_sock {
u32 tw_rcv_wnd;
u32 tw_ts_recent;
long tw_ts_recent_stamp;
+ s32 tw_ts_adjust;
#ifdef CONFIG_TCP_MD5SIG
u16 tw_md5_keylen;
u8 tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 88af843..96b4b27 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,7 +409,8 @@ extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
extern void tcp_parse_options(struct sk_buff *skb,
struct tcp_options_received *opt_rx,
- int estab);
+ int estab,
+ s32 ts_adjust);
extern u8 *tcp_parse_md5sig_option(struct tcphdr *th);
diff --git a/net/ipv4/checkpoint.c b/net/ipv4/checkpoint.c
index 5913652..f858dbc 100644
--- a/net/ipv4/checkpoint.c
+++ b/net/ipv4/checkpoint.c
@@ -178,6 +178,14 @@ static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time);
CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl);
+ if (op == CKPT_CPT)
+ hh->tcp.tcp_ts = tcp_time_stamp + sk->ts_adjust;
+ else
+ sk->ts_adjust = hh->tcp.tcp_ts - tcp_time_stamp;
+
+ ckpt_debug("TCP tcp_ts %i ts_adjust %i\n",
+ hh->tcp.tcp_ts, sk->ts_adjust);
+
return 0;
}
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index cd2b97f..31eafef 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -277,7 +277,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, 0);
+ tcp_parse_options(skb, &tcp_opt, 0, tp->ts_adjust);
if (tcp_opt.saw_tstamp)
cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bdb0da..63cac78 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3699,7 +3699,7 @@ old_ack:
* the fast version below fails.
*/
void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
- int estab)
+ int estab, s32 ts_adjust)
{
unsigned char *ptr;
struct tcphdr *th = tcp_hdr(skb);
@@ -3756,8 +3756,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
((estab && opt_rx->tstamp_ok) ||
(!estab && sysctl_tcp_timestamps))) {
opt_rx->saw_tstamp = 1;
- opt_rx->rcv_tsval = get_unaligned_be32(ptr);
- opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
+ opt_rx->rcv_tsval = get_unaligned_be32(ptr) + ts_adjust;
+ opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4) + ts_adjust;
}
break;
case TCPOPT_SACK_PERM:
@@ -3799,9 +3799,9 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
| (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP)) {
tp->rx_opt.saw_tstamp = 1;
++ptr;
- tp->rx_opt.rcv_tsval = ntohl(*ptr);
+ tp->rx_opt.rcv_tsval = ntohl(*ptr) + tp->ts_adjust;
++ptr;
- tp->rx_opt.rcv_tsecr = ntohl(*ptr);
+ tp->rx_opt.rcv_tsecr = ntohl(*ptr) + tp->ts_adjust;
return 1;
}
return 0;
@@ -3821,7 +3821,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
if (tcp_parse_aligned_timestamp(tp, th))
return 1;
}
- tcp_parse_options(skb, &tp->rx_opt, 1);
+ tcp_parse_options(skb, &tp->rx_opt, 1, tp->ts_adjust);
return 1;
}
@@ -5366,7 +5366,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
struct inet_connection_sock *icsk = inet_csk(sk);
int saved_clamp = tp->rx_opt.mss_clamp;
- tcp_parse_options(skb, &tp->rx_opt, 0);
+ tcp_parse_options(skb, &tp->rx_opt, 0, tp->ts_adjust);
if (th->ack) {
/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6d88219..e8efe7f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1222,7 +1222,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
tmp_opt.mss_clamp = 536;
tmp_opt.user_mss = tcp_sk(sk)->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, 0, 0);
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f8d67cc..4c72954 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -102,7 +102,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
tmp_opt.saw_tstamp = 0;
if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, 0, tcptw->tw_ts_adjust);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = tcptw->tw_ts_recent;
@@ -292,6 +292,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tcptw->tw_snd_nxt = tp->snd_nxt;
tcptw->tw_rcv_wnd = tcp_receive_window(tp);
tcptw->tw_ts_recent = tp->rx_opt.ts_recent;
+ tcptw->tw_ts_adjust = tp->ts_adjust;
tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -503,7 +504,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
tmp_opt.saw_tstamp = 0;
if (th->doff > (sizeof(struct tcphdr)>>2)) {
- tcp_parse_options(skb, &tmp_opt, 0);
+ /* C/R doesn't support request sockets yet, so we
+ * don't need to worry about passing a ts_adjust here
+ */
+ tcp_parse_options(skb, &tmp_opt, 0, 0);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = req->ts_recent;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bd62712..38c165e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1487,7 +1487,7 @@ static int tcp_mtu_probe(struct sock *sk)
/* We're ready to send. If this fails, the probe will
* be resegmented into mss-sized pieces by tcp_write_xmit(). */
- TCP_SKB_CB(nskb)->when = tcp_time_stamp;
+ TCP_SKB_CB(nskb)->when = tcp_time_stamp + tp->ts_adjust;
if (!tcp_transmit_skb(sk, nskb, 1, GFP_ATOMIC)) {
/* Decrement cwnd here because we are sending
* effectively two packets. */
@@ -1568,7 +1568,7 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unlikely(tso_fragment(sk, skb, limit, mss_now)))
break;
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tp->ts_adjust;
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
break;
@@ -1922,7 +1922,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
/* Make a copy, if the first transmission SKB clone we made
* is still in somebody's hands, else make a clone.
*/
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tp->ts_adjust;
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
@@ -2138,7 +2138,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk),
TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
/* Send it off. */
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tcp_sk(sk)->ts_adjust;
if (tcp_transmit_skb(sk, skb, 0, priority))
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
@@ -2176,7 +2176,7 @@ int tcp_send_synack(struct sock *sk)
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_ACK;
TCP_ECN_send_synack(tcp_sk(sk), skb);
}
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tcp_sk(sk)->ts_adjust;
return tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}
@@ -2229,7 +2229,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
TCP_SKB_CB(skb)->when = cookie_init_timestamp(req);
else
#endif
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tp->ts_adjust;
tcp_header_size = tcp_synack_options(sk, req, mss,
skb, &opts, &md5) +
sizeof(struct tcphdr);
@@ -2352,7 +2352,7 @@ int tcp_connect(struct sock *sk)
TCP_ECN_send_syn(sk, buff);
/* Send it off. */
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ TCP_SKB_CB(buff)->when = tcp_time_stamp + tp->ts_adjust;
tp->retrans_stamp = TCP_SKB_CB(buff)->when;
skb_header_release(buff);
__tcp_add_write_queue_tail(sk, buff);
@@ -2457,7 +2457,7 @@ void tcp_send_ack(struct sock *sk)
tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPCB_FLAG_ACK);
/* Send it off, this clears delayed acks for us. */
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ TCP_SKB_CB(buff)->when = tcp_time_stamp + tcp_sk(sk)->ts_adjust;
tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
}
@@ -2489,7 +2489,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
* send it.
*/
tcp_init_nondata_skb(skb, tp->snd_una - !urgent, TCPCB_FLAG_ACK);
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tp->ts_adjust;
return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC);
}
@@ -2524,7 +2524,7 @@ int tcp_write_wakeup(struct sock *sk)
tcp_set_skb_tso_segs(sk, skb, mss);
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ TCP_SKB_CB(skb)->when = tcp_time_stamp + tp->ts_adjust;
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
if (!err)
tcp_event_new_data_sent(sk, skb);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 8c25139..9337ec6 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -185,7 +185,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, 0);
+ tcp_parse_options(skb, &tcp_opt, 0, tp->ts_adjust);
if (tcp_opt.saw_tstamp)
cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d849dd5..3a83570 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1202,7 +1202,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
tmp_opt.user_mss = tp->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, 0, 0);
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
--
1.6.2.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/4] Add some content to the readme.txt for socket c/r
[not found] ` <1256072803-3518-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-20 21:06 ` [PATCH 1/4] Record and restore skb header marks Dan Smith
2009-10-20 21:06 ` [PATCH 3/4] Adjust TCP timestamp values by a scalar value Dan Smith
@ 2009-10-20 21:06 ` Dan Smith
[not found] ` <1256072803-3518-5-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2 siblings, 1 reply; 13+ messages in thread
From: Dan Smith @ 2009-10-20 21:06 UTC (permalink / raw)
To: containers-qjLDD68F18O7TbgM5vRIOg
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
Documentation/checkpoint/readme.txt | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
index 571c469..e6c173d 100644
--- a/Documentation/checkpoint/readme.txt
+++ b/Documentation/checkpoint/readme.txt
@@ -334,6 +334,27 @@ we will be forced to more carefully review each of those features.
However, this can be controlled with a sysctl-variable.
+Sockets
+=======
+
+For AF_UNIX sockets, both endpoints must be within the checkpointed
+task set to maintain a connected state after restart. UNIX sockets
+that are in the process of passing a descriptor will cause the
+checkpoint to fail with -EBUSY indicating a transient state that
+cannot be checkpointed. Listening sockets with an unaccepted peer
+will also cause an -EBUSY result.
+
+AF_INET sockets with endpoints outside the checkpointed task set may
+remain open if care is taken to avoid TCP timeouts and resets.
+Careful use of a virtual IP address can help avoid emission of an RST
+to the non-checkpointed endpoint. If desired, the
+RESTART_SOCK_LISTENONLY flag may be passed to the restart syscall
+which will cause all connected AF_INET sockets to be closed during the
+restore process. Listening sockets will still be restored to their
+original state, which makes this mode a candidate for something like
+an HTTP server.
+
+
Kernel interfaces
=================
--
1.6.2.5
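
The policy the readme text describes can be sketched in a few lines of C. This is only an illustration under stated assumptions: the flag value, the enums, and both function names are hypothetical stand-ins, not the definitions used by the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit value; the real RESTART_SOCK_LISTENONLY flag is
 * defined by the c/r patch set, not here. */
#define SKETCH_RESTART_SOCK_LISTENONLY 0x1UL

enum sketch_state  { SKETCH_LISTEN, SKETCH_CONNECTED };
enum sketch_action { SKETCH_RESTORE, SKETCH_CLOSE };

/* Restart side: decide what happens to an AF_INET socket from the image.
 * Connected sockets are closed only in "listen only" mode; listening
 * sockets are always restored. */
static enum sketch_action
sketch_inet_restart_action(enum sketch_state state, unsigned long restart_flags)
{
	if (state == SKETCH_CONNECTED &&
	    (restart_flags & SKETCH_RESTART_SOCK_LISTENONLY))
		return SKETCH_CLOSE;
	return SKETCH_RESTORE;
}

/* Checkpoint side: the two transient AF_UNIX states named in the readme
 * refuse the checkpoint with -EBUSY. */
static int sketch_unix_checkpoint_check(bool passing_fd, bool unaccepted_peer)
{
	if (passing_fd || unaccepted_peer)
		return -EBUSY;
	return 0;
}
```

With these definitions, a connected socket restarted with the listen-only flag maps to SKETCH_CLOSE, while a listening socket maps to SKETCH_RESTORE regardless of the flag.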
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] Record and restore skb header marks
[not found] ` <1256072803-3518-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 15:52 ` Serge E. Hallyn
[not found] ` <20091021155201.GA15402-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 15:52 UTC (permalink / raw)
To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Save this information when we checkpoint an skb and provide a mechanism
> to restore that information on restart. This will be used in the
> subsequent INET patch.
>
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/checkpoint.h | 2 ++
> include/linux/checkpoint_hdr.h | 7 +++++++
> net/checkpoint.c | 33 +++++++++++++++++++++++++++++++++
> 3 files changed, 42 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 4b61378..1da0b04 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -100,6 +100,8 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
> struct socket *socket,
> struct sockaddr *loc, unsigned *loc_len,
> struct sockaddr *rem, unsigned *rem_len);
> +void sock_restore_header_info(struct sk_buff *skb,
> + struct ckpt_hdr_socket_buffer *h);
>
> /* ckpt kflags */
> #define ckpt_set_ctx_kflag(__ctx, __kflag) \
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index ca2500d..3e6cab1 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -542,8 +542,15 @@ struct ckpt_hdr_socket_queue {
>
> struct ckpt_hdr_socket_buffer {
> struct ckpt_hdr h;
> + __u64 mac_len;
> + __u64 hdr_len;
> + __u64 transport_header;
> + __u64 network_header;
> + __u64 mac_header;
> __s32 sk_objref;
> __s32 pr_objref;
> + __u16 protocol;
> + __u8 cb[48];
> };
>
> #define CKPT_UNIX_LINKED 1
> diff --git a/net/checkpoint.c b/net/checkpoint.c
> index dd23efd..5ed2724 100644
> --- a/net/checkpoint.c
> +++ b/net/checkpoint.c
> @@ -88,6 +88,38 @@ static int sock_copy_buffers(struct sk_buff_head *from,
> return -EAGAIN;
> }
>
> +static void sock_record_header_info(struct sk_buff *skb,
> + struct ckpt_hdr_socket_buffer *h)
> +{
> +
> + h->mac_len = skb->mac_len;
> + h->hdr_len = skb->hdr_len;
> +
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> + h->transport_header = skb->transport_header;
> + h->network_header = skb->network_header;
> + h->mac_header = skb->mac_header;
> +#else
> + h->transport_header = skb->transport_header - skb->head;
> + h->network_header = skb->network_header - skb->head;
> + h->mac_header = skb->mac_header - skb->head;
> +#endif
> +
> + memcpy(h->cb, skb->cb, sizeof(skb->cb));
> +}
> +
> +void sock_restore_header_info(struct sk_buff *skb,
> + struct ckpt_hdr_socket_buffer *h)
> +{
> + skb->mac_len = h->mac_len;
> + skb->hdr_len = h->hdr_len;
> + skb_set_transport_header(skb, h->transport_header);
> + skb_set_network_header(skb, h->network_header);
> + skb_set_mac_header(skb, h->mac_header);
Should you verify that each of these new headers is located
inside the skb?
> +
> + memcpy(skb->cb, h->cb, sizeof(skb->cb));
Verify that (h->h.len - (h->cb - h->h)) > sizeof(skb->cb)?
> +}
> +
> static int __sock_write_buffers(struct ckpt_ctx *ctx,
> struct sk_buff_head *queue,
> int dst_objref)
> @@ -123,6 +155,7 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
> goto end;
> h->sk_objref = ret;
> h->pr_objref = dst_objref;
> + sock_record_header_info(skb, h);
>
> ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> if (ret < 0)
> --
> 1.6.2.5
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] Record and restore skb header marks
[not found] ` <20091021155201.GA15402-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 15:57 ` Dan Smith
0 siblings, 0 replies; 13+ messages in thread
From: Dan Smith @ 2009-10-21 15:57 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
SH> Should you verify that each of these new headers is located
SH> inside the skb?
<Ahem> Um. Yes.... :)
--
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org
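
A minimal userspace sketch of the two checks raised in this review, for illustration only. The struct is a simplified stand-in whose field names follow ckpt_hdr_socket_buffer; the function names are hypothetical, and whether the real bound should be the head..end or the data..tail range of the skb is left to the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the saved header marks. */
struct sketch_hdr_marks {
	uint64_t mac_len;
	uint64_t hdr_len;
	uint64_t transport_header;
	uint64_t network_header;
	uint64_t mac_header;
};

/* True only if every saved offset and length fits in a buffer of
 * buf_len bytes. Values come from the checkpoint image, so they are
 * untrusted input on restart. */
static bool sketch_marks_in_bounds(const struct sketch_hdr_marks *m,
				   uint64_t buf_len)
{
	return m->transport_header <= buf_len &&
	       m->network_header <= buf_len &&
	       m->mac_header <= buf_len &&
	       m->mac_len <= buf_len &&
	       m->hdr_len <= buf_len;
}

/* True if a record of record_len bytes is long enough to hold cb_size
 * bytes of control-block data starting at byte offset cb_offset. */
static bool sketch_cb_fits(size_t record_len, size_t cb_offset, size_t cb_size)
{
	if (record_len < cb_offset)
		return false;
	return record_len - cb_offset >= cb_size;
}
```

Both helpers compare without subtracting past zero, so a short or corrupt record cannot wrap around and pass the check.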
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
2009-10-20 21:06 ` [PATCH 2/4] [RFC] Add c/r support for connected INET sockets Dan Smith
@ 2009-10-21 17:56 ` Serge E. Hallyn
2009-10-21 18:05 ` Dan Smith
2009-10-23 19:37 ` Oren Laadan
1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 17:56 UTC (permalink / raw)
To: Dan Smith; +Cc: containers, John Dykstra, netdev
Quoting Dan Smith (danms@us.ibm.com):
> This patch adds basic support for C/R of open INET sockets. I think that
> all the important bits of the TCP and ICSK socket structures are saved,
> but I think there is still some additional IPv6 stuff that needs to be
> handled.
>
> With this patch applied, the following script can be used to demonstrate
> the functionality:
>
> https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
>
> It shows that this enables migration of a sendmail process with open
> connections from one machine to another without dropping.
>
> We still need comments from the netdev people about what sort of sanity
> checking we need to do on the values in the ckpt_hdr_socket_inet
> structure on restart.
>
> Note that this still doesn't address lingering sockets yet.
>
> Changes in v2:
> - Restore saddr, rcv_saddr, daddr, sport, and dport from the sockaddr
> structure instead of saving them separately
> - Fix 'sock' naming in sock_cptrst()
> - Don't take the queue lock before skb_queue_tail() since it is
> done for us
> - Allow "listen only" restore behavior if RESTART_SOCK_LISTENONLY
> flag is specified on sys_restart()
> - Pull the implementation of the list of listening sockets back into
> this patch
> - Fix dangling printk
> - Add some comments around the parent/child restore logic
>
> Cc: netdev@vger.kernel.org
> Cc: Oren Laadan <orenl@librato.com>
> Cc: John Dykstra <jdykstra72@gmail.com>
> Signed-off-by: Dan Smith <danms@us.ibm.com>
fwiw,
Acked-by: Serge Hallyn <serue@us.ibm.com>
except
> +static int sock_inet_restore_addrs(struct inet_sock *inet,
> + struct ckpt_hdr_socket_inet *hh)
> +{
> + inet->daddr = hh->raddr.sin_addr.s_addr;
> + inet->saddr = hh->laddr.sin_addr.s_addr;
> + inet->rcv_saddr = inet->saddr;
> +
> + inet->dport = hh->raddr.sin_port;
> + inet->sport = hh->laddr.sin_port;
Sorry, I think we've discussed this before but can't recall - does
setting sport here allow an unpriv user to bypass CAP_NET_BIND_SERVICE?
-serge
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
2009-10-21 17:56 ` Serge E. Hallyn
@ 2009-10-21 18:05 ` Dan Smith
0 siblings, 0 replies; 13+ messages in thread
From: Dan Smith @ 2009-10-21 18:05 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers, John Dykstra, netdev
SH> Sorry, I think we've discussed this before but can't recall - does
SH> setting sport here allow an unpriv user to bypass
SH> CAP_NET_BIND_SERVICE?
Yes, it does. I was kinda considering that part of the input sanity
checking that I officially punted on. However, as far as I know,
we'll just need to check that capability before we bind() in the
listen/closed case and hash in the connected case.
--
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com
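
A rough sketch of the capability check discussed above, written as plain userspace C. The function name and the boolean standing in for capable(CAP_NET_BIND_SERVICE) are assumptions for illustration; the 1024 threshold mirrors the kernel's PROT_SOCK limit for privileged ports.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Ports below this value are privileged on Linux (PROT_SOCK). */
#define SKETCH_PROT_SOCK 1024

/* Decide whether a source port read from the checkpoint image may be
 * restored. sport_be is in network byte order, as sin_port is.
 * has_net_bind_service stands in for a capable() check on the
 * restarting task. Port 0 means "no port" and is always allowed. */
static bool sketch_sport_allowed(uint16_t sport_be, bool has_net_bind_service)
{
	uint16_t port = ntohs(sport_be);

	if (port != 0 && port < SKETCH_PROT_SOCK && !has_net_bind_service)
		return false;
	return true;
}
```

The same predicate would apply before bind() in the listen/closed case and before hashing in the connected case, since both paths take sport from the image.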
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/4] Adjust TCP timestamp values by a scalar value
[not found] ` <1256072803-3518-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 18:06 ` Serge E. Hallyn
[not found] ` <20091021180638.GA24465-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-10-21 18:06 UTC (permalink / raw)
To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Adjust the sent and received TCP timestamp value by a scalar value
> in the tcp_sock structure. This will be zero most of the time, except
> when the socket has been migrated with c/r. If a socket is re-migrated,
> we take the new adjusted value as the saved value so that on restart it
> can be re-adjusted. Also, copy this into the timewait sock so that
> timestamps can continue to be adjusted in timewait state in the
> minisocks code.
>
> Note that TCP timestamps are just a jiffies stamp, which means they
> have no relation to wall-clock time and thus a simple correction
> factor should be enough to ensure correctness.
>
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
...
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 8afac76..b845e21 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -399,6 +399,8 @@ struct tcp_sock {
> u32 probe_seq_end;
> } mtu_probe;
>
> + s32 ts_adjust; /* tcp_time_stamp adjustment factor */
> +
> #ifdef CONFIG_TCP_MD5SIG
> /* TCP AF-Specific parts; only used by MD5 Signature support so far */
> struct tcp_sock_af_ops *af_specific;
> @@ -420,6 +422,7 @@ struct tcp_timewait_sock {
> u32 tw_rcv_wnd;
> u32 tw_ts_recent;
> long tw_ts_recent_stamp;
> + s32 tw_ts_adjust;
> #ifdef CONFIG_TCP_MD5SIG
> u16 tw_md5_keylen;
> u8 tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
I think this definitely needs to go by netdev to see if they object
to the extra fields, and the (negligible?) extra processing in
frequent paths like tcp_send_ack.
-serge
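
To make the correction factor concrete, here is a small sketch of the arithmetic, assuming the adjustment is computed at restart as the difference between the last timestamp saved in the image and the new host's clock. The function names are hypothetical; only the "clock + ts_adjust" form on the send side is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Adjustment chosen at restart so that (now + adjust) continues from the
 * timestamp the peer last saw. Unsigned subtraction wraps modulo 2^32;
 * the cast to a signed 32-bit value assumes two's complement. */
static int32_t sketch_ts_adjust(uint32_t saved_ts, uint32_t now_on_new_host)
{
	return (int32_t)(saved_ts - now_on_new_host);
}

/* Send side: the value placed in the TCP timestamp option. */
static uint32_t sketch_ts_out(uint32_t now, int32_t adjust)
{
	return now + (uint32_t)adjust;
}

/* Receive side: map an echoed timestamp back onto the local clock. */
static uint32_t sketch_ts_echo_in(uint32_t tsecr, int32_t adjust)
{
	return tsecr - (uint32_t)adjust;
}
```

Because everything is done modulo 2^32, the same code works whether the new host's clock is ahead of or behind the saved value.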
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/4] Adjust TCP timestamp values by a scalar value
[not found] ` <20091021180638.GA24465-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-21 18:10 ` Dan Smith
0 siblings, 0 replies; 13+ messages in thread
From: Dan Smith @ 2009-10-21 18:10 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
SH> I think this definitely needs to go by netdev to see if they
SH> object to the extra fields, and the (negligible?) extra processing
SH> in frequent paths like tcp_send_ack.
Indeed, I thought I had that Cc header in there to send it their way.
Oops.
--
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
2009-10-20 21:06 ` [PATCH 2/4] [RFC] Add c/r support for connected INET sockets Dan Smith
2009-10-21 17:56 ` Serge E. Hallyn
@ 2009-10-23 19:37 ` Oren Laadan
1 sibling, 0 replies; 13+ messages in thread
From: Oren Laadan @ 2009-10-23 19:37 UTC (permalink / raw)
To: Dan Smith; +Cc: containers, netdev, John Dykstra
Dan Smith wrote:
> This patch adds basic support for C/R of open INET sockets. I think that
> all the important bits of the TCP and ICSK socket structures are saved,
> but I think there is still some additional IPv6 stuff that needs to be
> handled.
>
> With this patch applied, the following script can be used to demonstrate
> the functionality:
>
> https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
>
> It shows that this enables migration of a sendmail process with open
> connections from one machine to another without dropping.
>
> We still need comments from the netdev people about what sort of sanity
> checking we need to do on the values in the ckpt_hdr_socket_inet
> structure on restart.
>
> Note that this still doesn't address lingering sockets yet.
>
[...]
> diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
> index fa57cdc..91c141b 100644
> --- a/include/linux/checkpoint_types.h
> +++ b/include/linux/checkpoint_types.h
> @@ -65,6 +65,8 @@ struct ckpt_ctx {
> struct list_head pgarr_list; /* page array to dump VMA contents */
> struct list_head pgarr_pool; /* pool of empty page arrays chain */
>
> + struct list_head listen_sockets;/* listening parent sockets */
> +
Nit: maybe move under the comment "multi-process restart" ?
[...]
Otherwise (and pending comments from netdev people on sanity checks):
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/4] Add some content to the readme.txt for socket c/r
[not found] ` <1256072803-3518-5-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-23 19:41 ` Oren Laadan
0 siblings, 0 replies; 13+ messages in thread
From: Oren Laadan @ 2009-10-23 19:41 UTC (permalink / raw)
To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
Dan Smith wrote:
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Nicely put !
> ---
> Documentation/checkpoint/readme.txt | 21 +++++++++++++++++++++
> 1 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
> index 571c469..e6c173d 100644
> --- a/Documentation/checkpoint/readme.txt
> +++ b/Documentation/checkpoint/readme.txt
> @@ -334,6 +334,27 @@ we will be forced to more carefully review each of those features.
> However, this can be controlled with a sysctl-variable.
>
>
> +Sockets
> +=======
> +
> +For AF_UNIX sockets, both endpoints must be within the checkpointed
> +task set to maintain a connected state after restart. UNIX sockets
> +that are in the process of passing a descriptor will cause the
> +checkpoint to fail with -EBUSY indicating a transient state that
> +cannot be checkpointed. Listening sockets with an unaccepted peer
> +will also cause an -EBUSY result.
> +
> +AF_INET sockets with endpoints outside the checkpointed task set may
> +remain open if care is taken to avoid TCP timeouts and resets.
> +Careful use of a virtual IP address can help avoid emission of an RST
> +to the non-checkpointed endpoint. If desired, the
> +RESTART_SOCK_LISTENONLY flag may be passed to the restart syscall
> +which will cause all connected AF_INET sockets to be closed during the
> +restore process. Listening sockets will still be restored to their
> +original state, which makes this mode a candidate for something like
> +an HTTP server.
> +
> +
> Kernel interfaces
> =================
>
^ permalink raw reply [flat|nested] 13+ messages in thread