From: Oren Laadan <orenl@librato.com>
To: Dan Smith <danms@us.ibm.com>
Cc: John Dykstra <john.dykstra1@gmail.com>,
containers@lists.osdl.org, netdev@vger.kernel.org,
Alexey Dobriyan <adobriyan@gmail.com>
Subject: Re: [PATCH 2/2] c/r: Add AF_INET support (v3)
Date: Tue, 28 Jul 2009 13:07:54 -0400 [thread overview]
Message-ID: <4A6F306A.40303@librato.com> (raw)
In-Reply-To: <874oswer21.fsf@caffeine.danplanet.com>
Dan Smith wrote:
> JD> I don't know whether this would affect palatibility (sic) for the
> JD> mainline, but it bothers me... socket.h (and sock.h) are included
> JD> all over the place in the kernel, and this definition is only
> JD> needed in very limited locations. Can it be placed in a .h used
> JD> only by checkpoint code?
>
> This was moved here based on previous comments on the patch.
> Personally, I think socket.h is a little too wide. While iterating on
> this patch locally, I've discovered that socket.h won't really work
> because in6.h includes it early and thus I'm unable to include some of
> the address structures as a result. I think going back to a
> checkpoint-specific header would be helpful.
>
> JD> There's an interesting design choice re TIME_WAIT and similar
> JD> states. In a migration scenario, do those sks migrate? If the
> JD> tree isn't going to be restored for a while., you want the
> JD> original host to continue to respond to those four-tuples If the
> JD> IP address of the original is immediately migrated to another
> JD> machine, you want the TIME_WAIT sks to migrate too.
>
> Well, as far as I'm concerned, the only sane migration scenario is one
> where you migrate the IP address and virtual interface with the
> task. When you destroy the virtual interface after (or during) the
> migration, the TIME_WAIT socket will disappear from the sending host.
Note that such sockets are unlikely to be referenced by _any_ process
because they will have been closed already. So the checkpoint code
will need to find such sockets not through file descriptors.
>
> I think that we need to increment timers like this on restore anyway,
> which should make sure that the TIME_WAIT timers run for the same
> amount of wall-clock time regardless of how much time was spent in the
> process of migration, right?
>
> Since checkpoint is not aware of a potential migration, a regular
> (i.e. not intended for migration) checkpoint operation on a task
> running in the init netns will leave the TIME_WAIT socket in place
> until the timer expires.
>
>>> +static int sock_inet_restart(struct ckpt_ctx *ctx,
>>> + struct ckpt_hdr_socket *h,
>>> + struct socket *socket)
>>> +{
>>> + int ret;
>>> + struct ckpt_hdr_socket_inet *in;
>>> + struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
>>> +
>>> + in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
>>> + if (IS_ERR(in))
>>> + return PTR_ERR(in);
>>> +
>>> + /* Listening sockets and those that are closed but have a local
>>> + * address need to call bind()
>>> + */
>>> + if ((h->sock.state == TCP_LISTEN) ||
>>> + ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
>>> + socket->sk->sk_reuse = 2;
>>> + inet_sk(socket->sk)->freebind = 1;
>>> + ret = socket->ops->bind(socket,
>>> + (struct sockaddr *)l,
>>> + h->laddr_len);
>>> + ckpt_debug("inet bind: %i", ret);
>>> + if (ret < 0)
>>> + goto out;
>>> +
>>> + if (h->sock.state == TCP_LISTEN) {
>>> + ret = socket->ops->listen(socket, h->sock.backlog);
>>> + ckpt_debug("inet listen: %i", ret);
>>> + if (ret < 0)
>>> + goto out;
>>> +
>>> + ret = sock_add_parent(ctx, socket->sk);
>>> + if (ret < 0)
>>> + goto out;
>
> JD> So this is just dummied off as a proof-of-concept for LISTEN?
>
> I'm not sure what you mean...
>
>>> + }
>>> + } else {
>>> + ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
>>> + if (ret)
>>> + goto out;
>
> JD> At a minimum, you'll want to start the TCP retransmit timer if there is
> JD> unacknowledged data outstanding. And other timers for other states, as
> JD> they're supported.
>
> JD> And you probably do want to do slow-start again--disregard my babbling
> JD> from yesterday.
>
> Heh, okay :)
>
> JD> This is part of your AF_UNIX patch:
>
> JD> struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
> JD> struct ckpt_hdr_socket *h)
> JD> {
> JD> struct socket *socket;
> JD> int ret;
>
> JD> ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
> JD> if (ret < 0)
> JD> return ERR_PTR(ret);
>
>
> JD> You _really_ want to pass the actual protocol number to sock_create().
> JD> The size of the sk it creates depends on this. You'll quickly be in
> JD> memory corruption hell without it.
>
> You mean I need to verify that the protocol is one of IPPROTO_TCP,
> IPPROTO_UDP, or PF_UNIX, right?
>
> JD> Also from the AF_UNIX patch:
>
> JD> static int obj_sock_users(void *ptr)
> JD> {
> JD> return atomic_read(&((struct sock *) ptr)->sk_refcnt);
> JD> }
>
> JD> Network sockets also use sk->sk_wmem_alloc to track references to
> JD> the sock from egress skb's
>
> IIUC, this function is part of the framework's leak detection. That
> code looks to make sure objects don't gain any additional users
> between the start and end of the checkpoint operation. I think the
> sk_refcnt satisfies that requirement here, no?
As I commented on patch 5/5 AF_UNIX -- IMO socket objects don't require
leak-detection, since they are 1-to-1 with files, which are already
accounted for. sockets can't be otherwise cloned/inherited/transferred
between processes.
>
> JD> And:
>
> JD> static int sock_copy_buffers(struct sk_buff_head *from, struct
> JD> sk_buff_head *to)
> JD> {
> JD> int count = 0;
> JD> struct sk_buff *skb;
>
> JD> skb_queue_walk(from, skb) {
> JD> struct sk_buff *tmp;
>
> JD> tmp = dev_alloc_skb(skb->len);
> JD> if (!tmp)
> JD> return -ENOMEM;
>
> JD> spin_lock(&from->lock);
> JD> skb_morph(tmp, skb);
> JD> spin_unlock(&from->lock);
>
> JD> Why not just clone the skb?
>
> Okay, good call.
>
> Thanks!
>
Oren.
next prev parent reply other threads:[~2009-07-28 17:07 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-07-07 19:26 [RFC] Add Checkpoint/Restart support for UNIX and INET sockets Dan Smith
[not found] ` <1246994776-1882-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-07 19:26 ` [PATCH 1/2] c/r: Add AF_UNIX support (v5) Dan Smith
[not found] ` <1246994776-1882-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-08 6:32 ` Oren Laadan
2009-07-08 14:01 ` Serge E. Hallyn
2009-07-08 19:27 ` Dan Smith
2009-07-08 22:01 ` Serge E. Hallyn
[not found] ` <4A543D82.5080408-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-07-08 15:23 ` Dan Smith
2009-07-08 16:44 ` Oren Laadan
[not found] ` <4A54CCDB.1090602-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-07-08 16:55 ` Dan Smith
2009-07-08 18:16 ` Oren Laadan
2009-07-07 19:26 ` [PATCH 2/2] c/r: Add AF_INET support (v3) Dan Smith
2009-07-08 1:23 ` Brian Haley
[not found] ` <4A53F50D.30001-VXdhtT5mjnY@public.gmane.org>
2009-07-08 1:31 ` Dan Smith
2009-07-08 13:58 ` Oren Laadan
2009-07-08 15:30 ` Dan Smith
2009-07-13 19:02 ` John Dykstra
2009-07-13 19:10 ` Dan Smith
2009-07-24 20:44 ` John Dykstra
2009-07-28 17:22 ` Oren Laadan
2009-07-25 21:02 ` John Dykstra
2009-07-28 16:00 ` Dan Smith
2009-07-28 17:07 ` Oren Laadan [this message]
[not found] ` <4A6F306A.40303-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-07-29 22:10 ` John Dykstra
2009-07-29 0:28 ` John Dykstra
2009-07-31 19:35 ` John Dykstra
2009-07-31 19:40 ` Dan Smith
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A6F306A.40303@librato.com \
--to=orenl@librato.com \
--cc=adobriyan@gmail.com \
--cc=containers@lists.osdl.org \
--cc=danms@us.ibm.com \
--cc=john.dykstra1@gmail.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.