netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Oren Laadan <orenl@librato.com>
To: Dan Smith <danms@us.ibm.com>
Cc: John Dykstra <john.dykstra1@gmail.com>,
	containers@lists.osdl.org, netdev@vger.kernel.org,
	Alexey Dobriyan <adobriyan@gmail.com>
Subject: Re: [PATCH 2/2] c/r: Add AF_INET support (v3)
Date: Tue, 28 Jul 2009 13:07:54 -0400	[thread overview]
Message-ID: <4A6F306A.40303@librato.com> (raw)
In-Reply-To: <874oswer21.fsf@caffeine.danplanet.com>



Dan Smith wrote:
> JD> I don't know whether this would affect palatibility (sic) for the
> JD> mainline, but it bothers me...  socket.h (and sock.h) are included
> JD> all over the place in the kernel, and this definition is only
> JD> needed in very limited locations.  Can it be placed in a .h used
> JD> only by checkpoint code?
> 
> This was moved here based on previous comments on the patch.
> Personally, I think socket.h is a little too wide.  While iterating on
> this patch locally, I've discovered that socket.h won't really work
> because in6.h includes it early and thus I'm unable to include some of
> the address structures as a result.  I think going back to a
> checkpoint-specific header would be helpful.
> 
> JD> There's an interesting design choice re TIME_WAIT and similar
> JD> states.  In a migration scenario, do those sks migrate?  If the
> JD> tree isn't going to be restored for a while., you want the
> JD> original host to continue to respond to those four-tuples If the
> JD> IP address of the original is immediately migrated to another
> JD> machine, you want the TIME_WAIT sks to migrate too.
> 
> Well, as far as I'm concerned, the only sane migration scenario is one
> where you migrate the IP address and virtual interface with the
> task.  When you destroy the virtual interface after (or during) the
> migration, the TIME_WAIT socket will disappear from the sending host.

Note that such sockets are unlikely to be referenced by _any_ process
because they will have been closed already. So the checkpoint code
will need to find such sockets not through file descriptors.

> 
> I think that we need to increment timers like this on restore anyway,
> which should make sure that the TIME_WAIT timers run for the same
> amount of wall-clock time regardless of how much time was spent in the
> process of migration, right?
> 
> Since checkpoint is not aware of a potential migration, a regular
> (i.e. not intended for migration) checkpoint operation on a task
> running in the init netns will leave the TIME_WAIT socket in place
> until the timer expires.
> 
>>> +static int sock_inet_restart(struct ckpt_ctx *ctx,
>>> +			     struct ckpt_hdr_socket *h,
>>> +			     struct socket *socket)
>>> +{
>>> +	int ret;
>>> +	struct ckpt_hdr_socket_inet *in;
>>> +	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
>>> +
>>> +	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
>>> +	if (IS_ERR(in))
>>> +		return PTR_ERR(in);
>>> +
>>> +	/* Listening sockets and those that are closed but have a local
>>> +	 * address need to call bind()
>>> +	 */
>>> +	if ((h->sock.state == TCP_LISTEN) ||
>>> +	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
>>> +		socket->sk->sk_reuse = 2;
>>> +		inet_sk(socket->sk)->freebind = 1;
>>> +		ret = socket->ops->bind(socket,
>>> +					(struct sockaddr *)l,
>>> +					h->laddr_len);
>>> +		ckpt_debug("inet bind: %i", ret);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +
>>> +		if (h->sock.state == TCP_LISTEN) {
>>> +			ret = socket->ops->listen(socket, h->sock.backlog);
>>> +			ckpt_debug("inet listen: %i", ret);
>>> +			if (ret < 0)
>>> +				goto out;
>>> +
>>> +			ret = sock_add_parent(ctx, socket->sk);
>>> +			if (ret < 0)
>>> +				goto out;
> 
> JD> So this is just dummied off as a proof-of-concept for LISTEN?
> 
> I'm not sure what you mean...
> 
>>> +		}
>>> +	} else {
>>> +		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
>>> +		if (ret)
>>> +			goto out;
> 
> JD> At a minimum, you'll want to start the TCP retransmit timer if there is
> JD> unacknowledged data outstanding.  And other timers for other states, as
> JD> they're supported.
> 
> JD> And you probably do want to do slow-start again--disregard my babbling
> JD> from yesterday. 
> 
> Heh, okay :)
> 
> JD> This is part of your AF_UNIX patch:
> 
> JD>         struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
> JD>         				    struct ckpt_hdr_socket *h)
> JD>         {
> JD>         	struct socket *socket;
> JD>         	int ret;
>         
> JD>         	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
> JD>         	if (ret < 0)
> JD>         		return ERR_PTR(ret);
>         
> 
> JD> You _really_ want to pass the actual protocol number to sock_create().
> JD> The size of the sk it creates depends on this.  You'll quickly be in
> JD> memory corruption hell without it.
> 
> You mean I need to verify that the protocol is one of IPPROTO_TCP,
> IPPROTO_UDP, or PF_UNIX, right?
> 
> JD> Also from the AF_UNIX patch:
>         
> JD>         static int obj_sock_users(void *ptr)
> JD>         {
> JD>         	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
> JD>         }
>         
> JD> Network sockets also use sk->sk_wmem_alloc to track references to
> JD> the sock from egress skb's
> 
> IIUC, this function is part of the framework's leak detection.  That
> code looks to make sure objects don't gain any additional users
> between the start and end of the checkpoint operation.  I think the
> sk_refcnt satisfies that requirement here, no?

As I commented on patch 5/5 AF_UNIX -- IMO socket objects don't require
leak-detection, since they are 1-to-1 with files, which are already
accounted for. sockets can't be otherwise cloned/inherited/transferred
between processes.

> 
> JD> And:
> 
> JD>         static int sock_copy_buffers(struct sk_buff_head *from, struct
> JD>         sk_buff_head *to)
> JD>         {
> JD>         	int count = 0;
> JD>         	struct sk_buff *skb;
>         
> JD>         	skb_queue_walk(from, skb) {
> JD>         		struct sk_buff *tmp;
>         
> JD>         		tmp = dev_alloc_skb(skb->len);
> JD>         		if (!tmp)
> JD>         			return -ENOMEM;
>         
> JD>         		spin_lock(&from->lock);
> JD>         		skb_morph(tmp, skb);
> JD>         		spin_unlock(&from->lock);
> 
> JD> Why not just clone the skb?
> 
> Okay, good call.
> 
> Thanks!
> 

Oren.

  reply	other threads:[~2009-07-28 17:08 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-07-07 19:26 [RFC] Add Checkpoint/Restart support for UNIX and INET sockets Dan Smith
     [not found] ` <1246994776-1882-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-07 19:26   ` [PATCH 1/2] c/r: Add AF_UNIX support (v5) Dan Smith
     [not found]     ` <1246994776-1882-2-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-07-08  6:32       ` Oren Laadan
2009-07-08 14:01         ` Serge E. Hallyn
2009-07-08 19:27           ` Dan Smith
2009-07-08 22:01             ` Serge E. Hallyn
     [not found]         ` <4A543D82.5080408-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-07-08 15:23           ` Dan Smith
2009-07-08 16:44             ` Oren Laadan
     [not found]               ` <4A54CCDB.1090602-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-07-08 16:55                 ` Dan Smith
2009-07-08 18:16                   ` Oren Laadan
2009-07-07 19:26   ` [PATCH 2/2] c/r: Add AF_INET support (v3) Dan Smith
2009-07-08  1:23     ` Brian Haley
     [not found]       ` <4A53F50D.30001-VXdhtT5mjnY@public.gmane.org>
2009-07-08  1:31         ` Dan Smith
2009-07-08 13:58     ` Oren Laadan
2009-07-08 15:30       ` Dan Smith
2009-07-13 19:02     ` John Dykstra
2009-07-13 19:10       ` Dan Smith
2009-07-24 20:44     ` John Dykstra
2009-07-28 17:22       ` Oren Laadan
2009-07-25 21:02     ` John Dykstra
2009-07-28 16:00       ` Dan Smith
2009-07-28 17:07         ` Oren Laadan [this message]
     [not found]           ` <4A6F306A.40303-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-07-29 22:10             ` John Dykstra
2009-07-29  0:28         ` John Dykstra
2009-07-31 19:35     ` John Dykstra
2009-07-31 19:40       ` Dan Smith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A6F306A.40303@librato.com \
    --to=orenl@librato.com \
    --cc=adobriyan@gmail.com \
    --cc=containers@lists.osdl.org \
    --cc=danms@us.ibm.com \
    --cc=john.dykstra1@gmail.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).