Network namespace and device support

* Network namespace and device support
@ 2010-01-20 15:01 Dan Smith
       [not found] ` <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Smith @ 2010-01-20 15:01 UTC
  To: containers-qjLDD68F18O7TbgM5vRIOg

This set adds support for checkpointing network namespaces and devices,
with in-kernel restart.

Basic support includes veth pair and loopback adapters and allows me to
checkpoint and restart a containerized copy of sendmail with a veth
bridged to a physical network.  Follow-on work is detailed in the header
of the last patch.

[parent not found: <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]

* [PATCH 1/3] Expose rtnl_link_ops_get()
       [not found] ` <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-01-20 15:01   ` Dan Smith
  2010-01-20 15:01   ` [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions Dan Smith
  2010-01-20 15:01   ` [PATCH 3/3] C/R: Basic support for network namespaces and devices Dan Smith
  2 siblings, 0 replies; 12+ messages in thread
From: Dan Smith @ 2010-01-20 15:01 UTC
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/net/rtnetlink.h |    1 +
 net/core/rtnetlink.c    |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index c3aa044..a165769 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -82,6 +82,7 @@ extern void	rtnl_kill_links(struct net *net, struct rtnl_link_ops *ops);
 extern int	rtnl_link_register(struct rtnl_link_ops *ops);
 extern void	rtnl_link_unregister(struct rtnl_link_ops *ops);
 
+extern const struct rtnl_link_ops *rtnl_link_ops_get(const char *kind);
 extern struct net_device *rtnl_create_link(struct net *net, char *ifname,
 		const struct rtnl_link_ops *ops, struct nlattr *tb[]);
 extern const struct nla_policy ifla_policy[IFLA_MAX+1];
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index eb42873..2b2150c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -325,7 +325,7 @@ void rtnl_link_unregister(struct rtnl_link_ops *ops)
 
 EXPORT_SYMBOL_GPL(rtnl_link_unregister);
 
-static const struct rtnl_link_ops *rtnl_link_ops_get(const char *kind)
+const struct rtnl_link_ops *rtnl_link_ops_get(const char *kind)
 {
 	const struct rtnl_link_ops *ops;
 
-- 
1.6.2.5
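
For readers without the kernel tree at hand, here is a minimal user-space sketch of what a lookup by "kind" amounts to. It is not the kernel implementation (the diff above shows only the first lines of the function body); `struct link_ops`, `link_ops_register()` and `link_ops_get()` are hypothetical stand-ins invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct rtnl_link_ops. */
struct link_ops {
	const char *kind;       /* device kind, e.g. "veth" */
	struct link_ops *next;  /* next entry in the registry */
};

/* Head of a singly linked registry of known link types. */
static struct link_ops *link_ops_head;

/* Add ops to the front of the registry. */
void link_ops_register(struct link_ops *ops)
{
	ops->next = link_ops_head;
	link_ops_head = ops;
}

/* Return the ops registered under kind, or NULL if none match. */
const struct link_ops *link_ops_get(const char *kind)
{
	const struct link_ops *ops;

	for (ops = link_ops_head; ops; ops = ops->next)
		if (strcmp(ops->kind, kind) == 0)
			return ops;
	return NULL;
}
```

A caller that wants to create a device of a given type would first call `link_ops_get("veth")` and treat a NULL result as "type not available".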

* [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions
       [not found] ` <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-01-20 15:01   ` [PATCH 1/3] Expose rtnl_link_ops_get() Dan Smith
@ 2010-01-20 15:01   ` Dan Smith
  2010-01-21  9:24     ` David Miller
  2010-01-20 15:01   ` [PATCH 3/3] C/R: Basic support for network namespaces and devices Dan Smith
  2 siblings, 1 reply; 12+ messages in thread
From: Dan Smith @ 2010-01-20 15:01 UTC
  To: containers-qjLDD68F18O7TbgM5vRIOg; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA

This is needed for C/R.  For checkpoint, we need to be able to follow
the link from one veth device to its peer.  For restart, we need to be
able to link two veth devices we've restored to re-establish their
relationship.

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 drivers/net/veth.c   |   14 ++++++++++++++
 include/linux/veth.h |    5 +++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ade5b34..eb47080 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -454,6 +454,20 @@ static void veth_dellink(struct net_device *dev)
 	unregister_netdevice(peer);
 }
 
+struct net_device *veth_get_peer(struct net_device *veth)
+{
+	struct veth_priv *priv = netdev_priv(veth);
+
+	return priv->peer;
+}
+
+void veth_set_peer(struct net_device *veth, struct net_device *peer)
+{
+	struct veth_priv *priv = netdev_priv(veth);
+
+	priv->peer = peer;
+}
+
 static const struct nla_policy veth_policy[VETH_INFO_MAX + 1];
 
 static struct rtnl_link_ops veth_link_ops = {
diff --git a/include/linux/veth.h b/include/linux/veth.h
index 3354c1e..eee059d 100644
--- a/include/linux/veth.h
+++ b/include/linux/veth.h
@@ -9,4 +9,9 @@ enum {
 #define VETH_INFO_MAX	(__VETH_INFO_MAX - 1)
 };
 
+#ifdef __KERNEL__
+struct net_device *veth_get_peer(struct net_device *veth);
+void veth_set_peer(struct net_device *veth, struct net_device *peer);
+#endif
+
 #endif
-- 
1.6.2.5
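
The two uses described in the commit message (following the link at checkpoint, re-linking at restart) can be modeled in user space as below. This is only a sketch: `struct fake_veth` and the `fake_veth_*()` helpers are hypothetical names, and the model ignores locking and reference counting, which a real caller would have to consider.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for one half of a veth pair. */
struct fake_veth {
	const char *name;
	struct fake_veth *peer;  /* other half, or NULL if unlinked */
};

/* Checkpoint side: follow the link from one half to the other. */
struct fake_veth *fake_veth_get_peer(struct fake_veth *veth)
{
	return veth->peer;
}

/* Restart side: point one half at its (restored) peer. */
void fake_veth_set_peer(struct fake_veth *veth, struct fake_veth *peer)
{
	veth->peer = peer;
}

/* Re-establish the relationship in both directions. */
void fake_veth_link(struct fake_veth *a, struct fake_veth *b)
{
	fake_veth_set_peer(a, b);
	fake_veth_set_peer(b, a);
}
```

Note that the setter only updates one direction, so a restart path has to call it once per half, as `fake_veth_link()` does here.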

* Re: [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions
  2010-01-20 15:01   ` [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions Dan Smith
@ 2010-01-21  9:24     ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2010-01-21  9:24 UTC
  To: danms; +Cc: containers, orenl, netdev

If you're going to CC: netdev on your first two patches
that expose infrastructure, you should CC: us on the
third patch too so we can see how in the world you're
using this stuff.

* [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found] ` <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-01-20 15:01   ` [PATCH 1/3] Expose rtnl_link_ops_get() Dan Smith
  2010-01-20 15:01   ` [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions Dan Smith
@ 2010-01-20 15:01   ` Dan Smith
       [not found]     ` <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Dan Smith @ 2010-01-20 15:01 UTC
  To: containers-qjLDD68F18O7TbgM5vRIOg

When checkpointing a task tree with network namespaces, we hook into
do_checkpoint_ns() alongside the UTS and IPC namespaces.  The devices in
each namespace are checkpointed sequentially (including the peer, in the
case of veth).  For each network device we save its list of protocol
addresses, as well as other information, such as the hardware address.

This patch supports veth pairs, as well as the loopback adapter.  The
loopback support is there to make sure that any additional addresses and
state (such as up/down) are copied to the loopback adapter that we are
given in the new network namespace.

On restart, we instantiate new network namespaces and veth pairs as
necessary.  Any device we encounter that isn't in a network namespace
that was checkpointed as part of a task is left in the namespace of the
restarting process.  This will be the case for a veth half that exists
in the init netns to provide network access to a container.

Still to do are:

  1. Routes
  2. Netfilter rules
  3. IPv6 addresses
  4. Other virtual device types (e.g. bridges)

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/checkpoint.c          |   21 ++-
 checkpoint/objhash.c             |   48 ++++
 checkpoint/restart.c             |    8 +
 include/linux/checkpoint.h       |    5 +
 include/linux/checkpoint_hdr.h   |   50 ++++
 include/linux/checkpoint_types.h |    1 +
 kernel/nsproxy.c                 |   21 ++-
 net/Makefile                     |    1 +
 net/checkpoint_dev.c             |  518 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 665 insertions(+), 8 deletions(-)
 create mode 100644 net/checkpoint_dev.c
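
The namespace handling relies on the object hash mapping kernel pointers to small integer references: checkpoint records the ref of the init netns, and restart binds that same ref to the namespace of the restarting process so that devices pointing at it are left in place. The following user-space sketch illustrates that idea only. `struct obj_table`, `obj_lookup_add()`, `obj_insert()` and `obj_fetch()` are hypothetical simplifications, not the checkpoint/objhash.c API.

```c
#include <stddef.h>

#define OBJ_MAX 16

/* Hypothetical table: slot i holds the object with ref i + 1. */
struct obj_table {
	const void *ptr[OBJ_MAX];
	int used;  /* highest ref handed out or inserted so far */
};

/*
 * Checkpoint side: return the ref for ptr, adding it on first sight.
 * *is_new reports whether the object was added.  Returns -1 when full.
 */
int obj_lookup_add(struct obj_table *t, const void *ptr, int *is_new)
{
	int i;

	for (i = 0; i < t->used; i++) {
		if (t->ptr[i] == ptr) {
			*is_new = 0;
			return i + 1;
		}
	}
	if (t->used == OBJ_MAX)
		return -1;
	t->ptr[t->used++] = ptr;
	*is_new = 1;
	return t->used;
}

/* Restart side: bind a ref read from the image to a live object. */
int obj_insert(struct obj_table *t, int ref, const void *ptr)
{
	if (ref < 1 || ref > OBJ_MAX || t->ptr[ref - 1])
		return -1;  /* out of range or already bound */
	t->ptr[ref - 1] = ptr;
	if (ref > t->used)
		t->used = ref;
	return 0;
}

/* Restart side: resolve a ref, or NULL if nothing is bound to it. */
const void *obj_fetch(const struct obj_table *t, int ref)
{
	if (ref < 1 || ref > OBJ_MAX)
		return NULL;
	return t->ptr[ref - 1];
}
```

In this model, restart would call `obj_insert()` with the saved init netns ref and the current namespace before reading any devices, so a later `obj_fetch()` of that ref yields the namespace the device should stay in.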

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c345773..eb3aba5 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -180,16 +180,23 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 static int checkpoint_container(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_container *h;
+	int new;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CONTAINER);
 	if (!h)
 		return -ENOMEM;
-	ret = ckpt_write_obj(ctx, &h->h);
-	ckpt_hdr_put(ctx, h);
 
+	ret = ckpt_obj_lookup_add(ctx, current->nsproxy->net_ns,
+				  CKPT_OBJ_NET_NS, &new);
 	if (ret < 0)
-		return ret;
+		goto out;
+
+	ctx->init_netns_ref = h->init_netns_ref = ret;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
 
 	memset(ctx->lsm_name, 0, CHECKPOINT_LSM_NAME_MAX + 1);
 	strlcpy(ctx->lsm_name, security_get_lsm_name(),
@@ -197,9 +204,13 @@ static int checkpoint_container(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, ctx->lsm_name,
 				CHECKPOINT_LSM_NAME_MAX + 1);
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	return security_checkpoint_header(ctx);
+	ret = security_checkpoint_header(ctx);
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
 }
 
 /* write the checkpoint trailer */
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 782661d..1c8be0f 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -307,6 +307,36 @@ static void lsm_string_drop(void *ptr, int lastref)
 	kref_put(&s->kref, lsm_string_free);
 }
 
+static int netns_grab(void *ptr)
+{
+	struct net *net = ptr;
+
+	get_net(net);
+	return 0;
+}
+
+static void netns_drop(void *ptr, int lastref)
+{
+	struct net *net = ptr;
+
+	put_net(net);
+}
+
+static int netdev_grab(void *ptr)
+{
+	struct net_device *dev = ptr;
+
+	dev_hold(dev);
+	return 0;
+}
+
+static void netdev_drop(void *ptr, int lastref)
+{
+	struct net_device *dev = ptr;
+
+	dev_put(dev);
+}
+
 /* security context strings */
 static int checkpoint_lsm_string(struct ckpt_ctx *ctx, void *ptr);
 static struct ckpt_lsm_string *restore_lsm_string(struct ckpt_ctx *ctx);
@@ -491,6 +521,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_lsm_string,
 		.restore = restore_lsm_string_wrap,
 	},
+	/* Network Namespace Object */
+	{
+		.obj_name = "NET_NS",
+		.obj_type = CKPT_OBJ_NET_NS,
+		.ref_grab = netns_grab,
+		.ref_drop = netns_drop,
+		.checkpoint = checkpoint_netns,
+		.restore = restore_netns,
+	},
+	/* Network Device Object */
+	{
+		.obj_name = "NET_DEV",
+		.obj_type = CKPT_OBJ_NETDEV,
+		.ref_grab = netdev_grab,
+		.ref_drop = netdev_drop,
+		.checkpoint = checkpoint_netdev,
+		.restore = restore_netdev,
+	},
 };
 
 
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 88d791b..d5e88d8 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -689,6 +689,14 @@ static int restore_container(struct ckpt_ctx *ctx)
 		return PTR_ERR(h);
 	ckpt_hdr_put(ctx, h);
 
+	/* Store the ref of the init netns so we know to leave its
+	 * devices where they fall */
+	ctx->init_netns_ref = h->init_netns_ref;
+	ret = ckpt_obj_insert(ctx, current->nsproxy->net_ns,
+			      ctx->init_netns_ref, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		return ret;
+
 	/* read the LSM name and info which follow ("are a part of")
 	 * the ckpt_hdr_container */
 	ret = restore_lsm(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 1f85162..b9d337c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -113,6 +113,11 @@ extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 extern struct sk_buff *sock_restore_skb(struct ckpt_ctx *ctx, struct sock *sk);
 extern void sock_listening_list_free(struct list_head *head);
 
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr);
+void *restore_netns(struct ckpt_ctx *ctx);
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr);
+void *restore_netdev(struct ckpt_ctx *ctx);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4e57d37..bc0b4ac 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -176,6 +176,12 @@ enum {
 #define CKPT_HDR_SOCKET_UNIX CKPT_HDR_SOCKET_UNIX
 	CKPT_HDR_SOCKET_INET,
 #define CKPT_HDR_SOCKET_INET CKPT_HDR_SOCKET_INET
+	CKPT_HDR_NET_NS,
+#define CKPT_HDR_NET_NS CKPT_HDR_NET_NS
+	CKPT_HDR_NETDEV,
+#define CKPT_HDR_NETDEV CKPT_HDR_NETDEV
+	CKPT_HDR_NETDEV_ADDR,
+#define CKPT_HDR_NETDEV_ADDR CKPT_HDR_NETDEV_ADDR
 
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
@@ -242,6 +248,10 @@ enum obj_type {
 #define CKPT_OBJ_SECURITY_PTR CKPT_OBJ_SECURITY_PTR
 	CKPT_OBJ_SECURITY,
 #define CKPT_OBJ_SECURITY CKPT_OBJ_SECURITY
+	CKPT_OBJ_NET_NS,
+#define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
+	CKPT_OBJ_NETDEV,
+#define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -302,6 +312,7 @@ struct ckpt_hdr_tail {
 /* container configuration section header */
 struct ckpt_hdr_container {
 	struct ckpt_hdr h;
+	__s32 init_netns_ref;
 	/*
 	 * the header is followed by the string:
 	 *   char lsm_name[SECURITY_NAME_MAX + 1]
@@ -423,6 +434,7 @@ struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
 	__u32 ipc_objref;
+	__s32 net_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -740,6 +752,44 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_netns {
+	struct ckpt_hdr h;
+	__s32 this_ref;
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_types {
+	CKPT_NETDEV_LO,
+	CKPT_NETDEV_VETH,
+};
+
+struct ckpt_hdr_netdev {
+	struct ckpt_hdr h;
+	__s32 netns_ref;
+	__s32 this_ref;     /* veth only */
+	__s32 peer_ref;     /* veth only */
+	__u32 inet4_addrs;
+	__u16 type;
+	__u16 flags;
+	__u8 hwaddr[6];
+} __attribute__((aligned(8)));
+
+enum ckpt_netdev_addr_types {
+	CKPT_NETDEV_ADDR_IPV4,
+};
+
+struct ckpt_hdr_netdev_addr {
+	struct ckpt_hdr h;
+	__u16 type;
+	union {
+		struct {
+			__u32 inet4_local;
+			__u32 inet4_address;
+			__u32 inet4_mask;
+			__u32 inet4_broadcast;
+		};
+	};
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_eventpoll_items {
 	struct ckpt_hdr h;
 	__s32  epfile_objref;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f95c3ff..9d2a4ca 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -85,6 +85,7 @@ struct ckpt_ctx {
 	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 	struct list_head listen_sockets;/* listening parent sockets */
+	int init_netns_ref;             /* Objref of root net namespace */
 
 	struct ckpt_stats stats;	/* statistics */
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e7aaa00..88c48a0 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,9 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+	ret = ckpt_obj_collect(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 	if (ret < 0)
 		goto out;
@@ -281,6 +284,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret < 0)
 		goto out;
 	h->ipc_objref = ret;
+	ret = checkpoint_obj(ctx, nsproxy->net_ns, CKPT_OBJ_NET_NS);
+	if (ret < 0)
+		goto out;
+	h->net_objref = ret;
 
 	/* TODO: Write other namespaces here */
 
@@ -302,6 +309,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	struct nsproxy *nsproxy = NULL;
 	struct uts_namespace *uts_ns;
 	struct ipc_namespace *ipc_ns;
+	struct net *net_ns;
 	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -310,7 +318,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 
 	ret = -EINVAL;
 	if (h->uts_objref <= 0 ||
-	    h->ipc_objref <= 0)
+	    h->ipc_objref <= 0 ||
+	    h->net_objref <= 0)
 		goto out;
 
 	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
@@ -323,6 +332,11 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(ipc_ns);
 		goto out;
 	}
+	net_ns = ckpt_obj_fetch(ctx, h->net_objref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net_ns)) {
+		ret = PTR_ERR(net_ns);
+		goto out;
+	}
 
 #if defined(COFNIG_UTS_NS) || defined(CONFIG_IPC_NS)
 	ret = -ENOMEM;
@@ -334,19 +348,20 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	nsproxy->uts_ns = uts_ns;
 	get_ipc_ns(ipc_ns);
 	nsproxy->ipc_ns = ipc_ns;
+	get_net(net_ns);
+	nsproxy->net_ns = net_ns;
 
 	get_pid_ns(current->nsproxy->pid_ns);
 	nsproxy->pid_ns = current->nsproxy->pid_ns;
 	get_mnt_ns(current->nsproxy->mnt_ns);
 	nsproxy->mnt_ns = current->nsproxy->mnt_ns;
-	get_net(current->nsproxy->net_ns);
-	nsproxy->net_ns = current->nsproxy->net_ns;
 #else
 	nsproxy = current->nsproxy;
 	get_nsproxy(nsproxy);
 
 	BUG_ON(nsproxy->uts_ns != uts_ns);
 	BUG_ON(nsproxy->ipc_ns != ipc_ns);
+	BUG_ON(nsproxy->net_ns != net_ns);
 #endif
 
 	/* TODO: add more namespaces here */
diff --git a/net/Makefile b/net/Makefile
index 74b038f..9a9a6b8 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -67,3 +67,4 @@ endif
 obj-$(CONFIG_WIMAX)		+= wimax/
 
 obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint_dev.o
diff --git a/net/checkpoint_dev.c b/net/checkpoint_dev.c
new file mode 100644
index 0000000..380fb62
--- /dev/null
+++ b/net/checkpoint_dev.c
@@ -0,0 +1,518 @@
+/*
+ *  Copyright 2010 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/sched.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/inetdevice.h>
+#include <linux/veth.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include <net/net_namespace.h>
+#include <net/sch_generic.h>
+
+static int __kern_devinet_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = devinet_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+static int __kern_dev_ioctl(struct net *net, unsigned int cmd, void *arg)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = dev_ioctl(net, cmd, arg);
+	set_fs(fs);
+
+	return ret;
+}
+
+/*
+ * Determine if an interface should be checkpointed, skipped, or
+ * if it makes us uncheckpointable.  This needs to be improved
+ * dramatically, but works for the moment.
+ *
+ * Return 1 for yes, 0 for skip, -ERRNO for error
+ */
+static int should_checkpoint_netdev(struct net_device *dev)
+{
+	struct ethtool_drvinfo drvinfo;
+
+	if (strcmp(dev->name, "sit0") == 0) {
+		return 0;                                    /* Skip sit0 */
+	} else if (dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) {
+		dev->ethtool_ops->get_drvinfo(dev, &drvinfo);
+		if (strcmp(drvinfo.driver, "veth") == 0)
+			return 1;                            /* vethX is okay */
+	} else if (strcmp(dev->name, "lo") == 0)
+		return 1;                                    /* lo is okay */
+
+	return -EINVAL;
+}
+
+static int dev_in_init_netns(struct ckpt_ctx *ctx, struct net_device *dev)
+{
+	struct net *net = dev->nd_net;
+	int ref;
+
+	ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	return ref == ctx->init_netns_ref;
+}
+
+static int count_inet4_addrs(struct in_device *indev)
+{
+	int count = 0;
+	struct in_ifaddr *addr;
+
+	for (addr = indev->ifa_list; addr; addr = addr->ifa_next)
+		count++;
+
+	return count;
+}
+
+static int checkpoint_in_addrs(struct ckpt_ctx *ctx, struct in_device *indev)
+{
+	struct ckpt_hdr_netdev_addr *h;
+	struct in_ifaddr *addr = indev->ifa_list;
+	int ret = 0;
+	int count = 0;
+
+	while (addr) {
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV_ADDR);
+		if (!h)
+			return -ENOMEM;
+
+		h->type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 right now */
+
+		h->inet4_local = addr->ifa_local;
+		h->inet4_address = addr->ifa_address;
+		h->inet4_mask = addr->ifa_mask;
+		h->inet4_broadcast = addr->ifa_broadcast;
+
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			break;
+
+		addr = addr->ifa_next;
+
+		count++;
+	}
+
+	return ret < 0 ? ret : count;
+}
+
+static int add_veth_refs(struct ckpt_ctx *ctx, struct ckpt_hdr_netdev *h,
+			 struct net_device *dev, struct net_device *peer)
+{
+	int new;
+
+	h->this_ref = ckpt_obj_lookup_add(ctx, dev, CKPT_OBJ_NETDEV, &new);
+	if (h->this_ref < 0)
+		return h->this_ref;
+
+	h->peer_ref = ckpt_obj_lookup_add(ctx, peer, CKPT_OBJ_NETDEV, &new);
+	if (h->peer_ref < 0)
+		return h->peer_ref;
+
+	ckpt_debug("netdev %s has peer %i addrs %i\n",
+		   dev->name, h->peer_ref, h->inet4_addrs);
+
+	return 0;
+}
+
+int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = ptr;
+	struct net_device *peer = NULL;
+	struct net *net = dev->nd_net;
+	int ret = 0;
+	struct ifreq req;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (!h)
+		return -ENOMEM;
+
+	if (strcmp(dev->name, "lo") == 0)
+		h->type = CKPT_NETDEV_LO;
+	else {
+		h->type = CKPT_NETDEV_VETH;
+		peer = veth_get_peer(dev);
+	}
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
+	h->flags = req.ifr_flags;
+	if (ret < 0)
+		goto out;
+
+	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
+	if (ret < 0)
+		goto out;
+	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
+
+	h->netns_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	if (!h->netns_ref) {
+		ret = -EINVAL;
+		ckpt_err(ctx, ret, "Found netdev with no netns");
+		goto out;
+	}
+
+	h->inet4_addrs = count_inet4_addrs(dev->ip_ptr);
+
+	if (h->type == CKPT_NETDEV_VETH) {
+		ret = add_veth_refs(ctx, h, dev, peer);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	if (h->type == CKPT_NETDEV_VETH) {
+		ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
+		if (ret < 0)
+			goto out;
+
+		ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = checkpoint_in_addrs(ctx, dev->ip_ptr);
+	if ((ret >= 0) && (ret != h->inet4_addrs)) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret,
+			 "Addresses on interface %s changed\n", dev->name);
+		goto out;
+	}
+	ret = 0;
+
+	if (peer && dev_in_init_netns(ctx, peer))
+		ret = checkpoint_obj(ctx, peer, CKPT_OBJ_NETDEV);
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct net *net = ptr;
+	struct net_device *dev;
+	struct ckpt_hdr_netns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (!h)
+		return -ENOMEM;
+
+	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
+	if (h->this_ref == 0) {
+		/* This shouldn't happen because we're called from
+		 * checkpoint_obj() which should have already put
+		 * us in the hash
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	for_each_netdev(net, dev) {
+		ret = should_checkpoint_netdev(dev);
+		if (ret > 0)
+			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+static int restore_in_addrs(struct ckpt_ctx *ctx,
+			    __u32 addrs,
+			    struct net *net,
+			    struct net_device *dev)
+{
+	__u32 i;
+	int ret = 0;
+
+	for (i = 0; i < addrs; i++) {
+		struct ckpt_hdr_netdev_addr *h;
+		struct ifreq req;
+		struct sockaddr_in *inaddr;
+
+		h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV_ADDR);
+		if (IS_ERR(h)) {
+			ckpt_err(ctx, PTR_ERR(h), "failed to read addr\n");
+			ret = PTR_ERR(h);
+			break;
+		}
+
+		if (h->type != CKPT_NETDEV_ADDR_IPV4) {
+			ret = -EINVAL;
+			ckpt_err(ctx, ret, "Unsupported netdev addr type %i\n",
+				 h->type);
+			goto end;
+		}
+
+		ckpt_debug("restoring %s: %x/%x/%x\n", dev->name,
+			   h->inet4_address, h->inet4_mask, h->inet4_broadcast);
+
+		memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = h->inet4_address;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set address\n");
+			goto end;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = h->inet4_mask;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFNETMASK, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set netmask\n");
+			goto end;
+		}
+
+		inaddr = (struct sockaddr_in *)&req.ifr_addr;
+		inaddr->sin_addr.s_addr = h->inet4_broadcast;
+		inaddr->sin_family = AF_INET;
+		ret = __kern_devinet_ioctl(net, SIOCSIFBRDADDR, &req);
+		if (ret < 0) {
+			ckpt_err(ctx, ret, "Failed to set broadcast\n");
+			goto end;
+		}
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret)
+			break;
+
+	}
+
+	return ret;
+}
+
+static void cleanup_veth(struct net_device *dev)
+{
+	struct net_device *peer = veth_get_peer(dev);
+
+	unregister_netdev(peer);
+	free_netdev(peer);
+	unregister_netdev(dev);
+	free_netdev(dev);
+}
+
+static struct net_device *new_veth_pair(char *this_name, char *peer_name)
+{
+	int ret;
+	struct nlattr **tb;
+	struct net_device *this;
+	struct net_device *peer;
+	const struct rtnl_link_ops *ops = rtnl_link_ops_get("veth");
+
+	tb = kcalloc(IFLA_MAX+1, sizeof(struct nlattr *), GFP_KERNEL);
+	if (!tb)
+		return ERR_PTR(-ENOMEM);
+
+	this = rtnl_create_link(current->nsproxy->net_ns, this_name, ops, tb);
+	if (IS_ERR(this)) {
+		ret = PTR_ERR(this);
+		goto err1;
+	}
+
+	peer = rtnl_create_link(current->nsproxy->net_ns, peer_name, ops, tb);
+	if (IS_ERR(peer)) {
+		ret = PTR_ERR(peer);
+		goto err2;
+	}
+
+	ret = register_netdev(this);
+	if (ret < 0)
+		goto err3;
+
+	ret = register_netdev(peer);
+	if (ret < 0)
+		goto err4;
+
+	dev_hold(this);
+	dev_hold(peer);
+
+	veth_set_peer(this, peer);
+	veth_set_peer(peer, this);
+
+	netif_carrier_on(this);
+	netif_carrier_on(peer);
+
+	kfree(tb);
+
+	return this;
+ err4:
+	unregister_netdevice(this);
+ err3:
+	free_netdev(peer);
+ err2:
+	free_netdev(this);
+ err1:
+	kfree(tb);
+
+	return ERR_PTR(ret);
+}
+
+static struct net_device *restore_veth(struct ckpt_ctx *ctx,
+				       struct ckpt_hdr_netdev *h,
+				       struct net *net)
+{
+	int ret;
+	char this_name[IFNAMSIZ];
+	char peer_name[IFNAMSIZ];
+	struct net_device *dev;
+	struct net_device *peer;
+
+	ret = _ckpt_read_buffer(ctx, this_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ret = _ckpt_read_buffer(ctx, peer_name, IFNAMSIZ);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	ckpt_debug("restored veth netdev %s:%s\n", this_name, peer_name);
+
+	peer = ckpt_obj_try_fetch(ctx, h->peer_ref, CKPT_OBJ_NETDEV);
+	if (IS_ERR(peer)) {       /* We're first: allocate the veth pair */
+		dev = new_veth_pair(this_name, peer_name);
+		if (IS_ERR(dev))
+			return dev;
+		peer = veth_get_peer(dev);
+	} else                    /* We're second: get our dev from our peer */
+		dev = veth_get_peer(peer);
+
+	/* Move to our new netns */
+	rtnl_lock();
+	ret = dev_change_net_namespace(dev, net, dev->name);
+	rtnl_unlock();
+	if (ret) {
+		cleanup_veth(dev);
+		dev = ERR_PTR(ret);
+	}
+
+	return dev;
+}
+
+void *restore_netdev(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netdev *h;
+	struct net_device *dev = NULL;
+	struct ifreq req;
+	struct net *net;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netdev\n");
+		return h;
+	}
+
+	net = ckpt_obj_try_fetch(ctx, h->netns_ref, CKPT_OBJ_NET_NS);
+	if (IS_ERR(net)) {
+		ret = PTR_ERR(net);
+		goto out;
+	}
+
+	if (h->type == CKPT_NETDEV_VETH)
+		dev = restore_veth(ctx, h, net);
+	else if (h->type == CKPT_NETDEV_LO)
+		dev = dev_get_by_name(net, "lo");
+	else
+		dev = ERR_PTR(-EINVAL);
+
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		ckpt_err(ctx, ret, "Netdev type %i not supported\n", h->type);
+		goto out;
+	}
+
+	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
+
+	if (h->type != CKPT_NETDEV_LO) {
+		/* Restore MAC address */
+		memcpy(req.ifr_hwaddr.sa_data, h->hwaddr, sizeof(h->hwaddr));
+		req.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+		ret = __kern_dev_ioctl(net, SIOCSIFHWADDR, &req);
+		if (ret < 0)
+			goto out;
+	}
+
+	/* Restore flags (which will likely bring the interface up) */
+	req.ifr_flags = h->flags;
+	ret = __kern_dev_ioctl(net, SIOCSIFFLAGS, &req);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_in_addrs(ctx, h->inet4_addrs, net, dev);
+ out:
+	if (ret) {
+		ckpt_err(ctx, ret, "Failed to restore netdevice %s\n",
+			 (dev && !IS_ERR(dev)) ? dev->name : "(unknown)");
+		if (dev && !IS_ERR(dev) && h->type == CKPT_NETDEV_VETH)
+			cleanup_veth(dev);
+		dev = ERR_PTR(ret);
+	}
+	ckpt_hdr_put(ctx, h);
+
+	return dev;
+}
+
+void *restore_netns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_netns *h;
+	struct net *net;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
+	if (IS_ERR(h)) {
+		ckpt_err(ctx, PTR_ERR(h), "failed to read netns\n");
+		return h;
+	}
+
+	if (h->this_ref != ctx->init_netns_ref) {
+		net = copy_net_ns(CLONE_NEWNET, current->nsproxy->net_ns);
+		if (IS_ERR(net))
+			goto out;
+	} else
+		net = current->nsproxy->net_ns;
+ out:
+	ckpt_hdr_put(ctx, h);
+
+	return net;
+}
-- 
1.6.2.5
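
The "we're first / we're second" logic in restore_veth() above can be hard to follow inside the diff, so here is a small user-space model of it. It is a sketch under simplifying assumptions: `struct fake_dev`, `new_pair()` and `restore_half()` are hypothetical, refs index a fixed array instead of the object hash, and no namespace move is modeled.

```c
#include <stddef.h>
#include <stdlib.h>

#define REF_MAX 8

/* Hypothetical stand-in for one half of a veth pair. */
struct fake_dev {
	struct fake_dev *peer;
};

/* Restored devices indexed by checkpoint ref; NULL = not restored yet. */
static struct fake_dev *restored[REF_MAX];

/* Allocate both halves and cross-link them; NULL on allocation failure. */
static struct fake_dev *new_pair(void)
{
	struct fake_dev *a = calloc(1, sizeof(*a));
	struct fake_dev *b = calloc(1, sizeof(*b));

	if (!a || !b) {
		free(a);
		free(b);
		return NULL;
	}
	a->peer = b;
	b->peer = a;
	return a;
}

/* Restore the half saved as this_ref whose peer was saved as peer_ref. */
struct fake_dev *restore_half(int this_ref, int peer_ref)
{
	struct fake_dev *dev;

	if (this_ref < 0 || this_ref >= REF_MAX ||
	    peer_ref < 0 || peer_ref >= REF_MAX)
		return NULL;

	if (restored[peer_ref])
		dev = restored[peer_ref]->peer;  /* second: pair exists */
	else
		dev = new_pair();                /* first: create both */

	if (dev)
		restored[this_ref] = dev;
	return dev;
}
```

Whichever half appears first in the image creates the pair, and the other half is later found through its peer, so the order of the two records does not matter.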

[parent not found: <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]

* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]     ` <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-01-20 17:36       ` Serge E. Hallyn
  2010-01-20 21:26       ` Oren Laadan
  2010-01-20 22:21       ` Brian Haley
  2 siblings, 0 replies; 12+ messages in thread
From: Serge E. Hallyn @ 2010-01-20 17:36 UTC
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> When checkpointing a task tree with network namespaces, we hook into
> do_checkpoint_ns() along with the others.  Any devices in a given namespace
> are checkpointed (including their peer, in the case of veth) sequentially.
> Each network device stores a list of protocol addresses, as well as other
> information, such as hardware address.
> 
> This patch supports veth pairs, as well as the loopback adapter.  The
> loopback support is there to make sure that any additional addresses and
> state (such as up/down) are copied to the loopback adapter that we are
> given in the new network namespace.
> 
> On restart, we instantiate new network namespaces and veth pairs as
> necessary.  Any device we encounter that isn't in a network namespace
> that was checkpointed as part of a task is left in the namespace of the
> restarting process.  This will be the case for a veth half that exists
> in the init netns to provide network access to a container.
> 
> Still to do are:
> 
>   1. Routes
>   2. Netfilter rules
>   3. IPv6 addresses
>   4. Other virtual device types (e.g. bridges)
> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Cool - I don't see any issues in the patchset.

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]     ` <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-01-20 17:36       ` Serge E. Hallyn
@ 2010-01-20 21:26       ` Oren Laadan
  2010-01-21 15:38         ` Dan Smith
  2010-01-20 22:21       ` Brian Haley
  2 siblings, 1 reply; 12+ messages in thread
From: Oren Laadan @ 2010-01-20 21:26 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg


Cool - looks good !

Would this compile without CONFIG_NET ?  without CONFIG_NET_NS ?

How can a user ask to not checkpoint the network-ns ?  (e.g. in
a subtree checkpoint)

And a few minor comments inline...


Dan Smith wrote:
> When checkpointing a task tree with network namespaces, we hook into
> do_checkpoint_ns() along with the others.  Any devices in a given namespace
> are checkpointed (including their peer, in the case of veth) sequentially.
> Each network device stores a list of protocol addresses, as well as other
> information, such as hardware address.
> 
> This patch supports veth pairs, as well as the loopback adapter.  The
> loopback support is there to make sure that any additional addresses and
> state (such as up/down) is copied to the loopback adapter that we are
> given in the new network namespace.
> 
> On restart, we instantiate new network namespaces and veth pairs as
> necessary.  Any device we encounter that isn't in a network namespace
> that was checkpointed as part of a task is left in the namespace of the
> restarting process.  This will be the case for a veth half that exists
> in the init netns to provide network access to a container.
> 
> Still to do are:
> 
>   1. Routes
>   2. Netfilter rules
>   3. IPv6 addresses
>   4. Other virtual device types (e.g. bridges)
> 
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

[...]

> +/*
> + * Determine if an interface should be checkpointed, skipped, or
> + * if it makes us uncheckpointable.  This needs to be improved
> + * dramatically, but works for the moment.

Maybe be a bit more verbose about what's missing ?

> + *
> + * Return 1 for yes, 0 for skip, -ERRNO for error
> + */

[...]

> +static int count_inet4_addrs(struct in_device *indev)
> +{
> +	int count = 0;
> +	struct in_ifaddr *addr;
> +
> +	for (addr = indev->ifa_list; addr; addr = addr->ifa_next)
> +		count++;
> +
> +	return count;
> +}
> +
> +static int checkpoint_in_addrs(struct ckpt_ctx *ctx, struct in_device *indev)
> +{
> +	struct ckpt_hdr_netdev_addr *h;
> +	struct in_ifaddr *addr = indev->ifa_list;
> +	int ret;
> +	int count = 0;
> +

Is there a reason not to collect all addresses into one buffer (can
there be more than a page worth of them ?) and write in one go ?

> +	while (addr) {
> +		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV_ADDR);
> +		if (!h)
> +			return -ENOMEM;
> +
> +		h->type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 right now */
> +
> +		h->inet4_local = addr->ifa_local;
> +		h->inet4_address = addr->ifa_address;
> +		h->inet4_mask = addr->ifa_mask;
> +		h->inet4_broadcast = addr->ifa_broadcast;
> +
> +		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +		ckpt_hdr_put(ctx, h);
> +		if (ret < 0)
> +			break;
> +
> +		addr = addr->ifa_next;
> +
> +		count++;
> +	}
> +
> +	return ret < 0 ? ret : count;
> +}

[...]

> +
> +int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct ckpt_hdr_netdev *h;
> +	struct net_device *dev = ptr;
> +	struct net_device *peer = NULL;
> +	struct net *net = dev->nd_net;
> +	int ret = 0;
> +	struct ifreq req;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	if (strcmp(dev->name, "lo") == 0)
> +		h->type = CKPT_NETDEV_LO;
> +	else {

While this is correct, perhaps be more verbose and change to:
	else if (strncmp(dev->name, "veth", 4) == 0)

> +		h->type = CKPT_NETDEV_VETH;
> +		peer = veth_get_peer(dev);
> +	}

and then
	} else {
		/* error */
	}
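
A minimal userspace sketch of the three-way classification suggested
above. The enum values and the function name classify_netdev() are
hypothetical and not taken from the patch; only the "lo" comparison and
the "veth" prefix test come from this thread. Matching on the name is
only a heuristic, since interfaces can be renamed.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical type codes; the patch uses CKPT_NETDEV_LO / CKPT_NETDEV_VETH. */
enum demo_netdev_type {
	DEMO_NETDEV_LO = 1,
	DEMO_NETDEV_VETH = 2,
};

/*
 * Classify a device by name: loopback, veth, or unsupported.
 * Returns a positive type code, or -EINVAL for anything else.
 */
int classify_netdev(const char *name)
{
	assert(name != NULL);

	if (strcmp(name, "lo") == 0)
		return DEMO_NETDEV_LO;
	else if (strncmp(name, "veth", 4) == 0)
		return DEMO_NETDEV_VETH;
	else
		return -EINVAL;	/* unknown device type: refuse, do not guess */
}
```

With this shape, a device that is neither loopback nor veth is rejected
instead of being silently treated as a veth.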

> +
> +	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
> +	h->flags = req.ifr_flags;
> +	if (ret < 0)
> +		goto out;
> +

[...]

> +
> +int checkpoint_netns(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct net *net = ptr;
> +	struct net_device *dev;
> +	struct ckpt_hdr_netns *h;
> +	int ret;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NET_NS);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	h->this_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
> +	if (h->this_ref == 0) {
> +		/* This shouldn't happen because we're called from
> +		 * checkpoint_obj() which should have already put
> +		 * us in the hash
> +		 */

If this can only happen due to a bug, then BUG_ON(). Otherwise,
maybe use ckpt_err() ?

> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	for_each_netdev(net, dev) {
> +		ret = should_checkpoint_netdev(dev);
> +		if (ret > 0)
> +			ret = checkpoint_obj(ctx, dev, CKPT_OBJ_NETDEV);
> +		if (ret < 0)
> +			break;
> +	}
> + out:
> +	ckpt_hdr_put(ctx, h);
> +
> +	return ret;
> +}
> +

[...]
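
For reference, the tri-state convention used by the loop above
(1 = checkpoint, 0 = skip, -ERRNO = error) can be sketched in plain C
as follows. This is an illustration under assumptions: struct demo_dev,
its verdict field, and demo_walk_devices() are hypothetical stand-ins,
and the real policy in should_checkpoint_netdev() is not shown in this
thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device record; verdict stands in for the real policy. */
struct demo_dev {
	const char *name;
	int verdict;	/* 1 = checkpoint, 0 = skip, -errno = uncheckpointable */
};

static int demo_should_checkpoint(const struct demo_dev *dev)
{
	return dev->verdict;
}

/*
 * Walk all devices. Returns 0 on success or the first negative errno;
 * *saved is set to the number of devices that would have been written.
 */
int demo_walk_devices(const struct demo_dev *devs, size_t n, size_t *saved)
{
	int ret = 0;
	size_t i;

	assert(saved != NULL);
	*saved = 0;

	for (i = 0; i < n; i++) {
		ret = demo_should_checkpoint(&devs[i]);
		if (ret > 0) {
			(*saved)++;	/* stand-in for writing the device */
			ret = 0;
		}
		if (ret < 0)
			break;		/* one bad device aborts the whole walk */
	}
	return ret;
}
```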

Thanks,

Oren.


* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
  2010-01-20 21:26       ` Oren Laadan
@ 2010-01-21 15:38         ` Dan Smith
       [not found]           ` <878wbrpix6.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Smith @ 2010-01-21 15:38 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

OL> Would this compile without CONFIG_NET ?  without CONFIG_NET_NS ?

Nope, but it does depend on CONFIG_CHECKPOINT, of course.  I'll add
some Kconfig magic to try to straighten that out.

OL> How can a user ask to not checkpoint the network-ns ?  (e.g. in
OL> a subtree checkpoint)

Do we really want to start adding fine-grained control over everything
that we checkpoint?  Can you ask it to not checkpoint the ipc_ns,
uts_ns, etc?  The previous example of this was connected sockets,
which I think is different, given the potential for a long delay
before restart ensuring the sockets will be dead anyway.

OL> Is there a reason not to collect all addresses into one buffer
OL> (can there be more than a page worth of them ?) and write in one
OL> go ?

I don't really see anywhere that the list is bounded.  I'd say that in
most cases each interface will only have one address anyway.

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org


[parent not found: <878wbrpix6.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>]

* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]           ` <878wbrpix6.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2010-01-21 16:14             ` Oren Laadan
  0 siblings, 0 replies; 12+ messages in thread
From: Oren Laadan @ 2010-01-21 16:14 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

On Thu, 21 Jan 2010, Dan Smith wrote:

> OL> Would this compile without CONFIG_NET ?  without CONFIG_NET_NS ?
> 
> Nope, but it does depend on CONFIG_CHECKPOINT, of course.  I'll add
> some Kconfig magic to try to straighten that out.
> 
> OL> How can a user ask to not checkpoint the network-ns ?  (e.g. in
> OL> a subtree checkpoint)
> 
> Do we really want to start adding fine-grained control over everything
> that we checkpoint?  Can you ask it to not checkpoint the ipc_ns, 
> uts_ns, etc?

Basically yes. So far we didn't address how to do that.

With network-ns it's a bit different, because, iiuc, it will require root 
for a restart even of a subtree checkpoint (that doesn't care about pids), 
just by being there.

> The previous example of this was connected sockets,
> which I think is different, given the potential for a long delay
> before restart ensuring the sockets will be dead anyway.

Agreed.

Oren.


* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]     ` <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-01-20 17:36       ` Serge E. Hallyn
  2010-01-20 21:26       ` Oren Laadan
@ 2010-01-20 22:21       ` Brian Haley
       [not found]         ` <4B5781DF.6050106-VXdhtT5mjnY@public.gmane.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Brian Haley @ 2010-01-20 22:21 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

Dan Smith wrote:
> When checkpointing a task tree with network namespaces, we hook into
> do_checkpoint_ns() along with the others.  Any devices in a given namespace
> are checkpointed (including their peer, in the case of veth) sequentially.
> Each network device stores a list of protocol addresses, as well as other
> information, such as hardware address.
> 
> This patch supports veth pairs, as well as the loopback adapter.  The
> loopback support is there to make sure that any additional addresses and
> state (such as up/down) is copied to the loopback adapter that we are
> given in the new network namespace.
> 
> On restart, we instantiate new network namespaces and veth pairs as
> necessary.  Any device we encounter that isn't in a network namespace
> that was checkpointed as part of a task is left in the namespace of the
> restarting process.  This will be the case for a veth half that exists
> in the init netns to provide network access to a container.
> 
> Still to do are:
> 
>   1. Routes
>   2. Netfilter rules
>   3. IPv6 addresses
>   4. Other virtual device types (e.g. bridges)

What about:

    1. Multicast
    2. Device config info (ipv4_devconf)

> +static int checkpoint_in_addrs(struct ckpt_ctx *ctx, struct in_device *indev)
> +{
> +	struct ckpt_hdr_netdev_addr *h;
> +	struct in_ifaddr *addr = indev->ifa_list;
> +	int ret;
> +	int count = 0;
> +
> +	while (addr) {
> +		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV_ADDR);
> +		if (!h)
> +			return -ENOMEM;
> +
> +		h->type = CKPT_NETDEV_ADDR_IPV4; /* Only IPv4 right now */
> +
> +		h->inet4_local = addr->ifa_local;
> +		h->inet4_address = addr->ifa_address;
> +		h->inet4_mask = addr->ifa_mask;
> +		h->inet4_broadcast = addr->ifa_broadcast;

What about addr->ifa_flags and all the other elements like prefixlen, scope and label?

> +int checkpoint_netdev(struct ckpt_ctx *ctx, void *ptr)
> +{
> +	struct ckpt_hdr_netdev *h;
> +	struct net_device *dev = ptr;
> +	struct net_device *peer = NULL;
> +	struct net *net = dev->nd_net;
> +	int ret = 0;
> +	struct ifreq req;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NETDEV);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	if (strcmp(dev->name, "lo") == 0)
> +		h->type = CKPT_NETDEV_LO;
> +	else {
> +		h->type = CKPT_NETDEV_VETH;
> +		peer = veth_get_peer(dev);
> +	}
> +
> +	memcpy(req.ifr_name, dev->name, IFNAMSIZ);
> +	ret = __kern_dev_ioctl(net, SIOCGIFFLAGS, &req);
> +	h->flags = req.ifr_flags;
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = __kern_dev_ioctl(net, SIOCGIFHWADDR, &req);
> +	if (ret < 0)
> +		goto out;
> +	memcpy(h->hwaddr, req.ifr_hwaddr.sa_data, sizeof(h->hwaddr));
> +
> +	h->netns_ref = ckpt_obj_lookup(ctx, net, CKPT_OBJ_NET_NS);
> +	if (!h->netns_ref) {
> +		ret = -EINVAL;
> +		ckpt_err(ctx, ret, "Found netdev with no netns");
> +		goto out;
> +	}
> +
> +	h->inet4_addrs = count_inet4_addrs(dev->ip_ptr);
> +
> +	if (h->type == CKPT_NETDEV_VETH) {
> +		ret = add_veth_refs(ctx, h, dev, peer);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (h->type == CKPT_NETDEV_VETH) {
> +		ret = ckpt_write_buffer(ctx, dev->name, IFNAMSIZ);
> +		if (ret < 0)
> +			goto out;
> +
> +		ret = ckpt_write_buffer(ctx, peer->name, IFNAMSIZ);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = checkpoint_in_addrs(ctx, dev->ip_ptr);
> +	if ((ret >= 0) && (ret != h->inet4_addrs)) {
> +		ret = -EBUSY;
> +		ckpt_err(ctx, ret,
> +			 "Addresses on interface %s changed\n", dev->name);
> +		goto out;
> +	}

This isn't guaranteed to catch every change to the address list, just that
the number of addresses is the same, is there no way to hold a lock the whole
time?
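
To make the concern concrete, here is a small userspace sketch. The
struct and function names are hypothetical simplifications, not kernel
code. It shows that replacing one address with another leaves the count
unchanged, so a count-only comparison reports no change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct in_ifaddr. */
struct demo_ifaddr {
	uint32_t local;
	struct demo_ifaddr *next;
};

/* The kind of check the patch relies on: only the list length. */
size_t demo_count(const struct demo_ifaddr *a)
{
	size_t n = 0;

	for (; a; a = a->next)
		n++;
	return n;
}

/* A stricter check: same length and same addresses in the same order. */
int demo_same_addrs(const struct demo_ifaddr *a, const struct demo_ifaddr *b)
{
	while (a && b) {
		if (a->local != b->local)
			return 0;
		a = a->next;
		b = b->next;
	}
	return a == NULL && b == NULL;
}

/* Scenario: one address is replaced between the count and the write. */
void demo_replace_goes_unnoticed(void)
{
	struct demo_ifaddr before = { 0x0a000001, NULL };	/* 10.0.0.1 */
	struct demo_ifaddr after  = { 0x0a000002, NULL };	/* 10.0.0.2 */

	assert(demo_count(&before) == demo_count(&after));	/* count check passes */
	assert(!demo_same_addrs(&before, &after));		/* yet the list changed */
}
```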

-Brian


[parent not found: <4B5781DF.6050106-VXdhtT5mjnY@public.gmane.org>]

* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]         ` <4B5781DF.6050106-VXdhtT5mjnY@public.gmane.org>
@ 2010-01-21 15:37           ` Dan Smith
       [not found]             ` <87fx5zpiy7.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Smith @ 2010-01-21 15:37 UTC (permalink / raw)
  To: Brian Haley; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

BH> What about:

BH>     1. Multicast
BH>     2. Device config info (ipv4_devconf)

<snip>

BH> What about addr->ifa_flags and all the other elements like
BH> prefixlen, scope and label?

I thought I was covered by calling it "Basic support..." :)

I've added these to the list of TODOs and will cook those up once the
basic bits are in.

>> +	if ((ret >= 0) && (ret != h->inet4_addrs)) {
>> +		ret = -EBUSY;
>> +		ckpt_err(ctx, ret,
>> +			 "Addresses on interface %s changed\n", dev->name);
>> +		goto out;
>> +	}

BH> This isn't guaranteed to catch every change to the address list,
BH> just that the number of addresses is the same, is there no way to
BH> hold a lock the whole time?

Yeah, there probably is.  I'll take a look.

Thanks!

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org


[parent not found: <87fx5zpiy7.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>]

* Re: [PATCH 3/3] C/R: Basic support for network namespaces and devices
       [not found]             ` <87fx5zpiy7.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2010-01-21 16:08               ` Oren Laadan
  0 siblings, 0 replies; 12+ messages in thread
From: Oren Laadan @ 2010-01-21 16:08 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg

On Thu, 21 Jan 2010, Dan Smith wrote:

> BH> What about:
> 
> BH>     1. Multicast
> BH>     2. Device config info (ipv4_devconf)
> 
> <snip>
> 
> BH> What about addr->ifa_flags and all the other elements like
> BH> prefixlen, scope and label?
> 
> I thought I was covered by calling it "Basic support..." :)
> 
> I've added these to the list of TODOs and will cook those up once the
> basic bits are in.
> 
> >> +	if ((ret >= 0) && (ret != h->inet4_addrs)) {
> >> +		ret = -EBUSY;
> >> +		ckpt_err(ctx, ret,
> >> +			 "Addresses on interface %s changed\n", dev->name);
> >> +		goto out;
> >> +	}
> 
> BH> This isn't guaranteed to catch every change to the address list,
> BH> just that the number of addresses is the same, is there no way to
> BH> hold a lock the whole time?
> 
> Yeah, there probably is.  I'll take a look.

Quoting from your reply to my email:

> OL> Is there a reason not to collect all addresses into one buffer
> OL> (can there be more than a page worth of them ?) and write in one
> OL> go ?
>
> I don't really see anywhere that the list is bounded.  I'd say that in
> most cases each interface will only have one address anyway.

So collecting them in one buffer (if one page isn't enough then realloc
and retry) would solve this too.
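
A minimal userspace sketch of the collect-then-write idea, assuming a
simplified address list. struct demo_addr_node, struct demo_addr_rec,
and both function names are hypothetical, and malloc() stands in for
whatever allocator the kernel code would use. The point is that the
count and the contents come from the same single pass over the list, so
they cannot disagree; in the kernel, a lock would presumably be held
only around that pass.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's in_ifaddr list. */
struct demo_addr_node {
	uint32_t local, address, mask, broadcast;
	struct demo_addr_node *next;
};

/* Hypothetical flat record, one per address, written out in one go. */
struct demo_addr_rec {
	uint32_t local, address, mask, broadcast;
};

/* Copy at most cap entries; return the full list length (may exceed cap). */
static size_t demo_fill(const struct demo_addr_node *head,
			struct demo_addr_rec *buf, size_t cap)
{
	size_t n = 0;

	for (; head; head = head->next, n++) {
		if (n < cap) {
			buf[n].local = head->local;
			buf[n].address = head->address;
			buf[n].mask = head->mask;
			buf[n].broadcast = head->broadcast;
		}
	}
	return n;
}

/*
 * Snapshot the whole list into one buffer. Returns the number of
 * records, or -1 if allocation fails. The caller frees *out.
 */
long demo_snapshot_addrs(const struct demo_addr_node *head,
			 struct demo_addr_rec **out)
{
	size_t cap = 4;	/* small on purpose, to exercise the retry path */

	assert(out != NULL);

	for (;;) {
		struct demo_addr_rec *buf = malloc(cap * sizeof(*buf));
		size_t n;

		if (!buf)
			return -1;

		n = demo_fill(head, buf, cap);	/* the only pass over the list */
		if (n <= cap) {
			*out = buf;
			return (long)n;
		}

		/* Buffer too small: drop it and retry with enough room. */
		free(buf);
		cap = n;
	}
}
```

The retry loop matters more in the kernel than here: the list may grow
between attempts, so the size learned from one pass is only a hint for
the next.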

Oren


end of thread, other threads:[~2010-01-21 16:14 UTC | newest]

Thread overview: 12+ messages
2010-01-20 15:01 Network namespace and device support Dan Smith
     [not found] ` <1263999673-11279-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-01-20 15:01   ` [PATCH 1/3] Expose rtnl_link_ops_get() Dan Smith
2010-01-20 15:01   ` [PATCH 2/3] Add a veth_get_peer() and veth_set_peer() functions Dan Smith
2010-01-21  9:24     ` David Miller
2010-01-20 15:01   ` [PATCH 3/3] C/R: Basic support for network namespaces and devices Dan Smith
     [not found]     ` <1263999673-11279-4-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-01-20 17:36       ` Serge E. Hallyn
2010-01-20 21:26       ` Oren Laadan
2010-01-21 15:38         ` Dan Smith
     [not found]           ` <878wbrpix6.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2010-01-21 16:14             ` Oren Laadan
2010-01-20 22:21       ` Brian Haley
     [not found]         ` <4B5781DF.6050106-VXdhtT5mjnY@public.gmane.org>
2010-01-21 15:37           ` Dan Smith
     [not found]             ` <87fx5zpiy7.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2010-01-21 16:08               ` Oren Laadan
