From: Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
To: Sukadev Bhattiprolu
<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: sqazi-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
Containers
<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>,
Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Subject: Re: [LPC] Notes from Checkpoint/Restart BOF
Date: Mon, 12 Oct 2009 14:52:38 -0400
Message-ID: <4AD37AF6.8010903@librato.com>
In-Reply-To: <20090929001754.GA19933-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Hi,
Thanks for posting the notes. I placed a (modified) summary of the BOF
on the linux-c/r wiki:
http://ckpt.wiki.kernel.org/index.php/LPC2009
Oren.
Sukadev Bhattiprolu wrote:
>
> Notes from Checkpoint/Restart BOF at Linux Plumbers Conference, Sep 24, 2009.
>
> (I am missing some details and a couple of names. They said they were on
> Containers mailing list though. If you have any other topics that we
> discussed or have any details, please add to this mail).
>
> ---
>
> Attendees:
> Oren Laadan, Joseph Ruscio, <One more person> (Librato)
> Pavel Emelyanov, <One more person ?> (OpenVZ)
> Ying Han, Salman Qazi (Google)
> Dan Smith, Matt Helsley, Sukadev Bhattiprolu (IBM)
>
> 1. Pavel: A few months ago there were discussions about making a "dry-run"
> to see if checkpoint of an application will succeed. What is the
> current status of that ?
>
> The answer was there is no dry-run - user should just try the
> actual C/R. If application is using an uncheckpointable resource
> the C/R will fail cleanly without side-effects.
> The dry-run may not mean anything unless we freeze the application
> during the check and leave it frozen until the checkpoint is done.
> IOW, the dry-run does not guarantee that application is checkpointable
> unless the application is frozen.
>
> 2. Pavel: Alexey Dobriyan had earlier submitted some code for leak-detection. Do
> we still have that ?
>
> The answer was that most of the code was used and we also added reverse
> detection.
>
> 3. Do we have a config-option to make a process checkpointable ?
>
> <Missed the context of this question> We have CONFIG_CHECKPOINT.
>
> 4. Checkpointing network connections:
>
> We quickly reviewed the status (AF_UNIX done, AF_INET done in a
> prototype and needs to be forward ported). Checkpoint of one-end
> of a network connection can cause the connection to be reset.
>
> 5. Briefly discussed distinction between Live migration and static migration
>
> 6. Do we need a pre-check during restart to ensure that the application can
> be restarted ? Eg: if the application used a specific math co-processor
> or futex at checkpoint and that resource is not available at restart,
> the restart may encounter some undefined behavior. Should we encode the
> hardware/OS capabilities in the checkpoint image and check these
> capabilities during restart (before actual restart). Reason for this
> check being the restart may not fail cleanly if the resource is missing.
>
> Conclusion was that there could be too many such capabilities that
> we would have to track and even so there may be some unexpected
> difference between checkpoint machine and restart machine.
>
> For now, let the restart fail and/or deal with it in user-space.
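The pre-check raised in item 6 can be pictured as a comparison between the capabilities recorded at checkpoint and those found on the restart machine. This is a minimal user-space sketch only: the function names and the capability strings are hypothetical and not part of any linux-c/r interface, and it assumes the image could carry such a list at all.

```python
def missing_capabilities(required, available):
    """Return the sorted capabilities the image needs but the host lacks."""
    return sorted(set(required) - set(available))


def precheck_restart(required, available):
    """Hypothetical user-space pre-check: True only if nothing is missing."""
    return not missing_capabilities(required, available)
```

As the conclusion above notes, no such list can be exhaustive, so a passing pre-check would not guarantee a clean restart.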
>
> 7. Briefly discussed clone2() aka clone_with_pids().
>
> Everyone seemed to agree that restoring process-tree even in user-space
> will work and can be used.
>
> 8. Oren: Error reporting during restart
>
> We currently fail the system call with an error code and if we want
> more information on the failure, we have to add debug messages to
> the code. We discussed a few options for error reporting on restart:
> - log detailed message(s) to console (risk wrapping dmesg buf)
> - pass an extra-buffer to the system call and have kernel
> fill-in more detailed error message (would need two new
> parameters, one pointer to the buf, one size of the buf).
>
> - Pass-in an extra 'log_fd' parameter to system call and have
> kernel write detailed messages to that log_fd (unless log_fd
> is -1). This seemed more flexible than the other two.
>
> We agreed that the format of the log messages can be free-format
> and that there is no guarantee that the format of the log
> messages will not change.
>
> But it was not clear (at least to me) if the log file should
> contain all log messages relating to the C/R or just the
> last (few) error messages.
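The 'log_fd' option in item 8 can be modelled in user space as follows. This is a sketch of the described behavior, not kernel code: `restart_log` is a hypothetical name, and the only semantics taken from the notes are that messages are free-format and that a log_fd of -1 disables logging.

```python
import os


def restart_log(log_fd, message):
    """Write one free-format line to log_fd; do nothing when log_fd is -1."""
    if log_fd == -1:
        return 0
    data = (message + "\n").encode("utf-8")
    written = 0
    # os.write may write fewer bytes than requested, so loop until done.
    while written < len(data):
        written += os.write(log_fd, data[written:])
    return written
```

Because the caller supplies the descriptor, the same call can target a file, a pipe, or a socket, which is one reason this option looked more flexible than a fixed-size buffer.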
>
> 9. Any application to summarize the checkpoint ?
>
> We have a 'ckptinfo' that could summarize the contents of a checkpoint.
>
> 10. Ying Han: Is there a performance difference between the original instance
> of the application and the restarted instance ? (Eg: on NUMA if application
> was on one node at checkpoint and after restart, ended up on another node).
>
> Not sure if there was a conclusion to this point.
>
> 11. Discussed that devices like tty, /dev/rtc etc must be virtualized before
> we can checkpoint them.
>
> 12. Oren: Checkpointing/Restoring mount namespaces
>
> Bind mounts are restored in container.
>
> NFS: at least on OpenVZ, since network is frozen, reopening files over
> NFS is not possible until restart is complete. OpenVZ creates fake
> dentries to allow the open to proceed.
>
> Loopback devices - cannot open them in a container since they can
> lockup system with huge memory footprint ??
>
> We should disable shared-mount propagation at least for now.
>
> 13. Oren: cradvise()
>
> Use a single system call to optimize the checkpoint/restart ?
> Eg: If an fd refers to /dev/tty1 in the checkpoint-image and that tty
> is not available on restart, user-space could open another tty and
> teach the kernel to use a different tty, /dev/tty2, during
> restart. Another example is if an application has several megs of
> "scratch" memory that does not need to be checkpointed, they could
> use the cradvise() system call to optimize the checkpoint or restart.
>
> The conclusion was it would be hard to get acceptance from the community
> for a new variant of an ioctl/fcntl call. So, we should instead try to
> add the necessary features to existing system calls like fcntl(),
> shmctl() or madvise().
>
> 14. Oren: Unlinked files/directories
>
> May need to copy the contents of the deleted file to the
> checkpoint image (only on ext4?). Create a fake hard link to the
> file so the file still exists in the filesystem snapshot and remove
> the link during restart.
>
> There is a good paper discussing snapshot/restore of unlinked files
> on Xen. The same concept could be used in C/R too ?
>
> (If you have links to the paper, please add)
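The first option in item 14, copying the contents of a deleted file into the image, depends on being able to tell that an open file has no remaining name. This is a minimal sketch of that detection and copy, assuming a POSIX system where the link count of an unlinked but still open file reads as zero. The helper names are hypothetical, and the fake hard link variant is not shown.

```python
import os
import tempfile


def is_unlinked(fd):
    """True if the open file behind fd no longer has any name (link count 0)."""
    return os.fstat(fd).st_nlink == 0


def save_unlinked_contents(fd):
    """Read the whole file through fd without moving its current offset."""
    chunks = []
    offset = 0
    while True:
        chunk = os.pread(fd, 65536, offset)
        if not chunk:
            break
        chunks.append(chunk)
        offset += len(chunk)
    return b"".join(chunks)
```

Using os.pread() keeps the file offset untouched, so the checkpointed task would see the same position after the copy.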
>
> 15. Network namespaces
>
> Restore namespaces in user-space, restore sockets in-kernel.
>
> Cannot create devices in user-space unless we know the index for
> the network device ?
>
> (Missed details on this discussion)
>
> 16. Time
>
> Will need some policies on restart like:
> - use absolute time or relative time
> - do new children inherit the policy ?
> - do we gradually adjust from relative to absolute time ?
>
> If not cradvise(), maybe timectl() :-p
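The time policies listed in item 16 can be compared with a small model. Everything here is an assumption made for illustration: the policy names, the `slew_window` parameter, and the linear convergence are hypothetical, and the notes do not say how a gradual adjustment would actually work.

```python
def restored_time(policy, checkpoint_time, restart_time, now, slew_window=60.0):
    """Time a restarted application would observe at wall-clock time `now`.

    "absolute": the real wall-clock time.
    "relative": time continues from the checkpoint, hiding the downtime.
    "gradual":  starts as relative and converges linearly to absolute
                over `slew_window` seconds (hypothetical parameter).
    """
    downtime = restart_time - checkpoint_time
    elapsed = now - restart_time
    if policy == "absolute":
        return now
    if policy == "relative":
        return now - downtime
    if policy == "gradual":
        remaining = max(0.0, 1.0 - elapsed / slew_window)
        return now - downtime * remaining
    raise ValueError("unknown policy: %r" % (policy,))
```

For a checkpoint at t=100 and a restart at t=400, the relative policy reports 130 at wall-clock 430, while the absolute policy reports 430.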
>
> 17. VDSO
>
> (Missed details on this discussion)
>
> 18. Async I/O
>
> Getting a lockdep report during checkpoint ?
> OpenVZ flushes I/O, waits for pending I/O and then retries checkpoint
> We may need to do the same for mmap I/O ?
>
> 19. Checkpoint data structures:
>
> - Try to keep extensions to existing data structures minimal
> - If necessary, add to end of data structures
> - But do not get locked down to an ABI at this point. i.e. even after
> entering mainline, format of checkpoint image may change for a while
> before stabilizing.
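The "add to end of data structures" guideline in item 19 lets a reader that only knows the older layout keep working on newer records. This is a minimal sketch with made-up record layouts: the field names and formats are hypothetical and do not reflect the actual checkpoint image format.

```python
import struct

# Hypothetical record layouts: v2 appends one field to the end of v1.
V1_FORMAT = "<II"    # pid, flags
V2_FORMAT = "<IIQ"   # pid, flags, start_time (new, appended at the end)


def read_v1_fields(record):
    """Read only the v1 prefix, ignoring anything appended by newer writers."""
    pid, flags = struct.unpack_from(V1_FORMAT, record, 0)
    return {"pid": pid, "flags": flags}
```

Inserting the new field in the middle instead would shift the offset of `flags` and break such a reader.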
>
> 20. Test suite:
>
> OpenVZ has some test cases that have various applications go to specific
> states and wait for a checkpoint. After that and after restart they
> check that nothing has changed unexpectedly.
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers