Linux Container Development
From: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>,
	Cyrill Gorcunov
	<gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Linux Containers
	<containers-qjLDD68F18O7TbgM5vRIOg@public.gmane.org>,
	Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>,
	Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>,
	Daniel Lezcano <dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
Date: Sat, 23 Jul 2011 12:39:24 +0400
Message-ID: <4E2A88BC.5010804@parallels.com>
In-Reply-To: <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>

On 07/23/2011 04:25 AM, Matt Helsley wrote:
> On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
>> Hi guys!
>>
>> There have already been many attempts to add checkpoint/restore functionality
>> to Linux, but as far as I can see there is still no final solution that suits most of
>> the interested people. The main concern about the previous approaches, as I see it, was
>> that all that stuff was supposed to sit in the kernel, thus creating various problems.
>>
>> I'd like to bring this subject back again, proposing a way to implement c/r
>> mostly in userspace with reasonable help from the kernel.
>>
>>
>> That said, I propose to start with a very basic set of objects to c/r that can work with
>>
>> * x86_64 tasks (subtree) which includes
>>    - registers
>>    - TLS
>>    - memory of all kinds (file and anon both shared and private)
> 
> Do mixes of 32 and 64-bit tasks present any problems with this
> method?

In theory, no. But in practice I haven't written the 32-bit support yet.

>> * open regular files
>> * pipes (with data in it)
>>
>> Core idea:
>>
>> The core idea of the restore process is to implement a binary format handler that can execve-ute
>> image files, recreating the register and memory state of a task. Restoring the process
> 
> I suspect this can be done with Oren's patches too using binfmt-misc -- without any binfmt
> kernel code.
> 
>> tree and opening files is done completely in the user space, i.e. when restoring the subtree
>> of processes I first fork all the tasks in respective order, then open required files and 
> 
> OK. Oren's code also forked all the tasks in userspace prior to completing the restart.
> 
>> then call execve() to restore registers and memory.
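
A rough sketch of the restore order described above (fork the tasks, then open files, then execve()), written in Python for brevity. The tree description format and the name `restore_tree` are made up for illustration, and the final execve() of the image is replaced by a plain exit, since the image handler is part of the proposed patches and not something a stock kernel provides.

```python
import os

def restore_tree(node, report_fd):
    """Fork every child of `node` first, then open this task's files.

    `node` is a hypothetical task description:
    {"name": str, "files": [path, ...], "children": [node, ...]}.
    """
    children = []
    for child in node["children"]:
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                restore_tree(child, report_fd)
                status = 0
            finally:
                os._exit(status)  # stand-in for the final execve()
        children.append(pid)

    # This task's children exist now; open its files.
    fds = [os.open(path, os.O_RDONLY) for path in node["files"]]
    os.write(report_fd, f"{node['name']} {len(fds)}\n".encode())

    # A real restorer would call os.execve() on this task's image here,
    # which restores registers and memory. The sketch only cleans up.
    for fd in fds:
        os.close(fd)
    for pid in children:
        os.waitpid(pid, 0)
```

Each task writes one short line to `report_fd`, so a caller can check that the whole subtree was created and that every task opened its files.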
> 
> That's kind of neat, but won't this interfere with restoring O_CLOEXEC
> flags? (I also asked this in a reply to the TOOLS email)
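
To make the O_CLOEXEC concern concrete, here is a small Python sketch (not part of the posted tools; the helper names are hypothetical). It shows that a descriptor with close-on-exec set does not survive execve(), so a restorer that opens files first and calls execve() afterwards cannot simply set the flag before the exec.

```python
import fcntl
import os
import subprocess
import sys

# Child program: print which of the fd numbers given in argv are still open.
PROBE = """
import os, sys
alive = []
for arg in sys.argv[1:]:
    try:
        os.fstat(int(arg))
        alive.append(arg)
    except OSError:
        pass
print(" ".join(alive))
"""

def fds_surviving_exec(fds):
    """Return the subset of `fds` that is still open after fork + execve."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE, *map(str, fds)],
        close_fds=False,  # let execve() alone decide what gets closed
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return {int(tok) for tok in result.stdout.split()}

def demo():
    r, w = os.pipe()
    # Duplicate onto high fd numbers so the child does not reuse them by accident.
    plain = fcntl.fcntl(r, fcntl.F_DUPFD, 100)            # close-on-exec clear
    cloexec = fcntl.fcntl(r, fcntl.F_DUPFD_CLOEXEC, 100)  # close-on-exec set
    try:
        return plain, cloexec, fds_surviving_exec([plain, cloexec])
    finally:
        for fd in (r, w, plain, cloexec):
            os.close(fd)
```

Calling `demo()` returns both descriptor numbers and the set that survived; only the one without close-on-exec is expected to be in it.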
> 
>>
>> The checkpointing process is quite simple: everything we need to know about processes can be read from /proc
>> except for a few things: registers and private memory. In the current implementation, to get
> 
> I put this to Tejun as well: What about stuff like epoll sets? Sure, you
> can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
> which fds are in it. Worse, even if you got the fds from the epoll items
> via /proc, the way epoll holds onto them does not guarantee they'll refer
> to the files the set would actually wait on.
> 
> As best I can tell you can't reliably checkpoint epoll sets from userspace.

With the existing interfaces, yes. My aim was to start a discussion on whether we can
extend the kernel APIs to make it possible.
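
For reference, a short Python sketch of what the fd symlink exposes for an epoll descriptor. The function name is hypothetical, and it assumes a Linux /proc mounted at the usual place.

```python
import os
import select

def describe_epoll_fd():
    """Return what /proc/self/fd reports for an epoll fd watching one pipe."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        # The link names the object type only; it carries nothing about
        # the descriptors registered in the set.
        return os.readlink(f"/proc/self/fd/{ep.fileno()}")
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```

On Linux the returned string is expected to identify an eventpoll anonymous inode, with no trace of the registered pipe.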

> Then there's the matter of unlinked files. How do you plan to deal
> with those without kernel code?

You will have the same problem even with c/r in the kernel. Frankly, I don't see
much difference in where this one gets solved. Can you elaborate?

>> them I introduce the /proc/<pid>/dump file, which produces a file that can be executed by the
>> binfmt described above. Additionally, I introduce the /proc/<pid>/mfd/ dir with info about
>> mappings. It is populated with symbolic links whose names equal vma->vm_start and which point to
>> the mapped files (including anon shared ones, which are tmpfs files). Thus we can open some task's
>> /proc/<pid>/mfd/<address> link and find out the mapped file's inode (to check for sharing) and,
>> if required, map it and read the contents of anon shared memory.
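
A sketch of how a checkpoint tool might use such a directory to detect sharing, in Python. It assumes only what is described above: a directory of symbolic links, one per mapping, each pointing to the mapped file. /proc/<pid>/mfd/ is a proposed interface, so the functions take a directory path as an argument, and their names are hypothetical.

```python
import os
from collections import defaultdict

def group_mappings_by_file(mfd_dir):
    """Group the links in `mfd_dir` by the (device, inode) of their target."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(mfd_dir)):
        st = os.stat(os.path.join(mfd_dir, name))  # os.stat follows the link
        groups[(st.st_dev, st.st_ino)].append(name)
    return dict(groups)

def shared_mappings(mfd_dir_a, mfd_dir_b):
    """Return the files that are mapped by both tasks, keyed by (dev, inode).

    Each value is a pair: the link names in task A and in task B.
    """
    a = group_mappings_by_file(mfd_dir_a)
    b = group_mappings_by_file(mfd_dir_b)
    return {key: (a[key], b[key]) for key in a.keys() & b.keys()}
```

Two tasks that map the same file show up under the same (device, inode) key, which is the sharing check mentioned above.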
> 
> Finally, I think there's substantial room here for quiet and subtle
> races to corrupt checkpoint images. If we add /proc interfaces only to
> find they're racy will we need to add yet more /proc interfaces to
> maintain backward compatibility yet fix the races? To get the locking
> that ensures a consistent subset of information with this /proc-based
> approach I think we'll frequently need to change the contents of
> existing /proc files.
> 
> Imagine trusting the output of top to exactly represent the state of
> your system's cpu usage. That's the sort of thing a piecemeal /proc
> interface gets us. You're asking us to trust that frequent checkpoints
> (say once every five minutes) of large, multiprocess, month-long
> program runs won't quietly get corrupted and will leave plenty of
> performance to not interfere with the throughput of the work.
> 
> A kernel syscall interface has a better chance of allowing us to fix
> races without changing the interface. We've fixed a few races with
> Oren's tree and none of them required us to change the output format.

If we all decide that we do want checkpoint/restart as an all-in-kernel approach,
then OK. But my impression is that the community is not happy with it.

> Cheers,
> 	-Matt Helsley
> .
> 

