From: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>,
    Cyrill Gorcunov <gorcunov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
    Linux Containers <containers-qjLDD68F18O7TbgM5vRIOg@public.gmane.org>,
    Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>,
    Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
    Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>,
    Daniel Lezcano <dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC][PATCH 0/7 + tools] Checkpoint/restore mostly in the userspace
Date: Sat, 23 Jul 2011 12:39:24 +0400
Message-ID: <4E2A88BC.5010804@parallels.com>
In-Reply-To: <20110723002558.GE16940-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>

On 07/23/2011 04:25 AM, Matt Helsley wrote:
> On Fri, Jul 15, 2011 at 05:45:10PM +0400, Pavel Emelyanov wrote:
>> Hi guys!
>>
>> There have already been many attempts to add checkpoint/restore functionality
>> to Linux, but as far as I can see there is still no final solution that suits most of
>> the interested people. The main concern about the previous approaches, as I see it, was
>> that all that stuff was supposed to sit in the kernel, thus creating various problems.
>>
>> I'd like to bring this subject back again by proposing a way to implement c/r
>> mostly in userspace with reasonable help from the kernel.
>>
>>
>> That said, I propose to start with very basic set of objects to c/r that can work with
>>
>> * x86_64 tasks (subtree) which includes
>> - registers
>> - TLS
>> - memory of all kinds (file and anon both shared and private)
>
> Do mixes of 32 and 64-bit tasks present any problems with this
> method?
In theory, no. But in practice I haven't written the 32-bit support yet.
>> * open regular files
>> * pipes (with data in it)
>>
>> Core idea:
>>
>> The core idea of the restore process is to implement the binary handler that can execve-ute
>> image files recreating the register and the memory state of a task. Restoring the process
>
> I suspect this can be done with Oren's patches too using binfmt-misc -- without any binfmt
> kernel code.
>
>> tree and opening files is done completely in the user space, i.e. when restoring the subtree
>> of processes I first fork all the tasks in respective order, then open required files and
>
> OK. Oren's code also forked all the tasks in userspace prior to completing the restart.
>
>> then call execve() to restore registers and memory.
>
> That's kind of neat, but won't this interfere with restoring O_CLOEXEC
> flags? (I also asked this in a reply to the TOOLS email)
>
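To make the O_CLOEXEC concern concrete, here is a minimal sketch, assuming Linux and Python 3. It is not taken from the patch set, and `fd_survives_exec` is a hypothetical helper. It shows that a descriptor with FD_CLOEXEC set is closed by the exec, so a restorer that opens files before calling execve() has to keep the flag cleared until the exec and re-apply it some other way afterwards:

```python
import fcntl
import os
import subprocess
import sys

def fd_survives_exec(fd):
    """Return True if `fd` is still open in a freshly exec'ed child."""
    probe = "import os, sys; os.fstat(int(sys.argv[1]))"
    child = subprocess.run(
        [sys.executable, "-c", probe, str(fd)],
        close_fds=False,            # leave the decision entirely to FD_CLOEXEC
        stderr=subprocess.DEVNULL,  # hide the child's traceback on failure
    )
    return child.returncode == 0

r, w = os.pipe()  # Python creates these descriptors with FD_CLOEXEC set

cloexec_set = bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
survives_with_flag = fd_survives_exec(r)

os.set_inheritable(r, True)  # clears FD_CLOEXEC on r
survives_without_flag = fd_survives_exec(r)

print(cloexec_set, survives_with_flag, survives_without_flag)
```

How the flag would be restored after the exec is left open here; the sketch only demonstrates why the ordering matters.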
>>
>> The checkpointing process is quite simple - all we need about processes can be read from /proc
>> except for several things - registers and private memory. In current implementation to get
>
> I put this to Tejun as well: What about stuff like epoll sets? Sure, you
> can see the epoll fd in /proc/<pid>/fd, but you can't read it to tell
> which fds are in it. Worse, even if you got the fds from the epoll items
> via /proc, the way epoll holds onto them does not guarantee they'll refer
> to the files the set would actually wait on.
>
> As best I can tell you can't reliably checkpoint epoll sets from userspace.
With the existing interfaces, yes. My aim was to start a discussion on whether we can
extend the kernel APIs to make that possible.
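For illustration, a minimal sketch of what the existing /proc/<pid>/fd interface reveals about an epoll descriptor, assuming Linux and Python 3 (`describe_fds` is a hypothetical helper, not part of the patch set). The link target names the kind of object, but nothing in it says which fds are registered in the set:

```python
import os
import select

def describe_fds(pid="self"):
    """Map each open fd number of a task to its /proc symlink target."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink().
            continue
    return result

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)  # the set now watches the pipe's read end

fds = describe_fds()
epoll_target = fds[ep.fileno()]  # identifies an eventpoll object, no members
pipe_target = fds[r]             # identifies the pipe by inode number

print(epoll_target)
print(pipe_target)

ep.close()
os.close(r)
os.close(w)
```

The pipe link at least carries an inode number that can be compared across tasks to detect sharing; the epoll link carries no equivalent information about its members.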
> Then there's the matter of unlinked files. How do you plan to deal
> with those without kernel code?
You will have the same problem even with c/r in the kernel. Frankly, I don't see
much difference in where this one gets solved. Can you elaborate?
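As a concrete picture of the unlinked-file case, here is a minimal sketch, assuming Linux, Python 3 and a local filesystem for the temporary directory. It shows what userspace can observe today: the link count drops to zero, the /proc link is marked as deleted, and the data is reachable only through the still-open descriptor:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"state worth saving")
os.unlink(path)  # the name is gone, the open file is not

link_count = os.fstat(fd).st_nlink           # no directory entry refers to it
target = os.readlink(f"/proc/self/fd/{fd}")  # old path plus a deleted marker
data = os.pread(fd, 64, 0)                   # contents are still readable

print(link_count, target, data)
os.close(fd)
```

Detecting the situation and copying the contents is therefore possible from userspace; recreating an open-but-unlinked file at restore time is the part that needs a decision, wherever the c/r code lives.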
>> them I introduce the /proc/<pid>/dump file which produces the file that can be executed by the
>> described above binfmt. Additionally I introduce the /proc/<pid>/mfd/ dir with info about
>> mappings. It is populated with symbolic links with names equal to vma->vm_start and pointing to
>> mapped files (including anon shared which are tmpfs ones). Thus we can open some task's
>> /proc/<pid>/mfd/<address> link and find out the mapped file inode (to check for sharing) and
>> if required map one and read the contents of anon shared memory.
>
> Finally, I think there's substantial room here for quiet and subtle
> races to corrupt checkpoint images. If we add /proc interfaces only to
> find they're racy will we need to add yet more /proc interfaces to
> maintain backward compatibility yet fix the races? To get the locking
> that ensures a consistent subset of information with this /proc-based
> approach I think we'll frequently need to change the contents of
> existing /proc files.
>
> Imagine trusting the output of top to exactly represent the state of
> your system's cpu usage. That's the sort of thing a piecemeal /proc
> interface gets us. You're asking us to trust that frequent checkpoints
> (say once every five minutes) of large, multiprocess, month-long
> program runs won't quietly get corrupted and will leave plenty of
> performance to not interfere with the throughput of the work.
>
> A kernel syscall interface has a better chance of allowing us to fix
> races without changing the interface. We've fixed a few races with
> Oren's tree and none of them required us to change the output format.
If we all decide that we do want checkpoint/restart as an all-in-kernel approach,
then OK. But my impression is that the community is not happy with it.
> Cheers,
> -Matt Helsley
> .
>